diff mbox series

runqemu: lock tap creation process

Message ID GV1PR07MB91203A774F08D3110C579514A89C2@GV1PR07MB9120.eurprd07.prod.outlook.com
State New
Headers show
Series runqemu: lock tap creation process | expand

Commit Message

Konrad Weihmann Sept. 4, 2024, 12:21 p.m. UTC
in case of running two or more runqemu instances in parallel
with no previously setup tap devices, the following happens:

instance A probes for tap devices, but doesn't find
any, proceeds to generating tap devices with the sudo call,
resulting in tap0.
instance B starts to probes, finds tap0.
Both will lock tap0.

tap0 will be then forwarded to qemu.
Instance A reporting

"Using preconfigured tap device tap0"

but then failing with

qemu-system... could not configure /dev/net/tun (tap0): Device or resource busy

To fix that, lock the entire tap creation process with
an exclusive file (blocking) lock, so only a single instance can
perform the non-atomic changes.

Signed-off-by: Konrad Weihmann <kweihmann@outlook.com>
---
 scripts/runqemu | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Richard Purdie Sept. 5, 2024, 9:56 a.m. UTC | #1
On Wed, 2024-09-04 at 12:21 +0000, Konrad Weihmann via lists.openembedded.org wrote:
> in case of running two or more runqemu instances in parallel
> with no previously setup tap devices, the following happens:
> 
> instance A probes for tap devices, but doesn't find
> any, proceeds to generating tap devices with the sudo call,
> resulting in tap0.
> instance B starts to probes, finds tap0.
> Both will lock tap0.
> 
> tap0 will be then forwarded to qemu.
> Instance A reporting
> 
> "Using preconfigured tap device tap0"
> 
> but then failing with
> 
> qemu-system... could not configure /dev/net/tun (tap0): Device or resource busy
> 
> To fix that, lock the entire tap creation process with
> an exclusive file (blocking) lock, so only a single instance can
> perform the non-atomic changes.
> 
> Signed-off-by: Konrad Weihmann <kweihmann@outlook.com>
> ---
>  scripts/runqemu | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/scripts/runqemu b/scripts/runqemu
> index 2817acb19f..e5db06be37 100755
> --- a/scripts/runqemu
> +++ b/scripts/runqemu
> @@ -1163,6 +1163,9 @@ to your build configuration.
>  
>          self.make_lock_dir(lockdir)
>  
> +        tap_setup_lock = open(os.path.join(lockdir, '_tap_creation_lock'), 'w')
> +        fcntl.flock(tap_setup_lock, fcntl.LOCK_EX)
> +
>          cmd = (ip, 'link')
>          logger.debug('Running %s...' % str(cmd))
>          ip_link = subprocess.check_output(cmd).decode('utf-8')
> @@ -1187,6 +1190,8 @@ to your build configuration.
>  
>          if not tap:
>              if os.path.exists(nosudo_flag):
> +                fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
> +                tap_setup_lock.close()
>                  logger.error("Error: There are no available tap devices to use for networking,")
>                  logger.error("and I see %s exists, so I am not going to try creating" % nosudo_flag)
>                  raise RunQemuError("a new one with sudo.")

This doesn't make sense to me. Surely you should be taking this lock
*within* the "if not tap" code block since that is where the ifup is
called out to with sudo? You appear to be dropping the lock there
instead?

It appears you're putting the locking around the search for pre-
configured tap devices which is already atomic and working well from a
race perspective?

> @@ -1198,6 +1203,8 @@ to your build configuration.
>              try:
>                  tap = subprocess.check_output(cmd).decode('utf-8').strip()
>              except subprocess.CalledProcessError as e:
> +                fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
> +                tap_setup_lock.close()
>                  logger.error('Setting up tap device failed:\n%s\nRun runqemu-gen-tapdevs to manually create one.' % str(e))
>                  sys.exit(1)
>              lockfile = os.path.join(lockdir, tap)
> @@ -1206,6 +1213,9 @@ to your build configuration.
>              self.cleantap = True
>              logger.debug('Created tap: %s' % tap)
>  
> +        fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
> +        tap_setup_lock.close()

For locks like this, they should really be in a try/finally or with clause...

Cheers,

Richard
Konrad Weihmann Sept. 6, 2024, 8:18 a.m. UTC | #2
On 05.09.24 11:56, Richard Purdie wrote:
> On Wed, 2024-09-04 at 12:21 +0000, Konrad Weihmann via lists.openembedded.org wrote:
>> in case of running two or more runqemu instances in parallel
>> with no previously setup tap devices, the following happens:
>>
>> instance A probes for tap devices, but doesn't find
>> any, proceeds to generating tap devices with the sudo call,
>> resulting in tap0.
>> instance B starts to probes, finds tap0.
>> Both will lock tap0.
>>
>> tap0 will be then forwarded to qemu.
>> Instance A reporting
>>
>> "Using preconfigured tap device tap0"
>>
>> but then failing with
>>
>> qemu-system... could not configure /dev/net/tun (tap0): Device or resource busy
>>
>> To fix that, lock the entire tap creation process with
>> an exclusive file (blocking) lock, so only a single instance can
>> perform the non-atomic changes.
>>
>> Signed-off-by: Konrad Weihmann <kweihmann@outlook.com>
>> ---
>>   scripts/runqemu | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/scripts/runqemu b/scripts/runqemu
>> index 2817acb19f..e5db06be37 100755
>> --- a/scripts/runqemu
>> +++ b/scripts/runqemu
>> @@ -1163,6 +1163,9 @@ to your build configuration.
>>   
>>           self.make_lock_dir(lockdir)
>>   
>> +        tap_setup_lock = open(os.path.join(lockdir, '_tap_creation_lock'), 'w')
>> +        fcntl.flock(tap_setup_lock, fcntl.LOCK_EX)
>> +
>>           cmd = (ip, 'link')
>>           logger.debug('Running %s...' % str(cmd))
>>           ip_link = subprocess.check_output(cmd).decode('utf-8')
>> @@ -1187,6 +1190,8 @@ to your build configuration.
>>   
>>           if not tap:
>>               if os.path.exists(nosudo_flag):
>> +                fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
>> +                tap_setup_lock.close()
>>                   logger.error("Error: There are no available tap devices to use for networking,")
>>                   logger.error("and I see %s exists, so I am not going to try creating" % nosudo_flag)
>>                   raise RunQemuError("a new one with sudo.")
> 
> This doesn't make sense to me. Surely you should be taking this lock
> *within* the "if not tap" code block since that is where the ifup is
> called out to with sudo? You appear to be dropping the lock there
> instead?

Right before the raise of an exception the lock should be released, 
shouldn't it?
Otherwise we would rely on the fact that none of the calling functions 
does any exception handling and
the entire process fails to release the lock.

> 
> It appears you're putting the locking around the search for pre-
> configured tap devices which is already atomic and working well from a
> race perspective?

That is the actual race condition here.
On the on hand you have the search function, which in itself is atomic 
and save yes but then you have the setup function below

1198            try:
1199                tap = 
subprocess.check_output(cmd).decode('utf-8').strip()
1200            except subprocess.CalledProcessError as e:
1201                logger.error('Setting up tap device failed:\n%s\nRun 
runqemu-gen-tapdevs to manually create one.' % str(e))
1202                sys.exit(1)
1203            lockfile = os.path.join(lockdir, tap)
1204            self.taplock = lockfile + '.lock'
1205            self.acquire_taplock()

the section from line 1199 till 1205 is completely unguarded from a lock 
perspective.
Meaning (and I think that is what is actually happening) that the new 
tap device is created as the result of line 1199 by process A but locked 
by process B, as the search function finds it and acquires the lock.
So in my opinion the entirely function needs to allow only a single 
instance to work on it (at least that is the most straight forward 
solution I can imagine).
The only thing I couldn't figure out in the hundreds of test runs is, 
why the 1205 line lock isn't failing with an error, nevertheless this 
should be fixed and a function wide exclusive lock fixed the issue for me.

> 
>> @@ -1198,6 +1203,8 @@ to your build configuration.
>>               try:
>>                   tap = subprocess.check_output(cmd).decode('utf-8').strip()
>>               except subprocess.CalledProcessError as e:
>> +                fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
>> +                tap_setup_lock.close()
>>                   logger.error('Setting up tap device failed:\n%s\nRun runqemu-gen-tapdevs to manually create one.' % str(e))
>>                   sys.exit(1)
>>               lockfile = os.path.join(lockdir, tap)
>> @@ -1206,6 +1213,9 @@ to your build configuration.
>>               self.cleantap = True
>>               logger.debug('Created tap: %s' % tap)
>>   
>> +        fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
>> +        tap_setup_lock.close()
> 
> For locks like this, they should really be in a try/finally or with clause...

So you want to wrap up everything into a with block, is that what you meant?

> 
> Cheers,
> 
> Richard
Richard Purdie Sept. 6, 2024, 12:01 p.m. UTC | #3
On Fri, 2024-09-06 at 10:18 +0200, Konrad Weihmann wrote:
> On 05.09.24 11:56, Richard Purdie wrote:
> > On Wed, 2024-09-04 at 12:21 +0000, Konrad Weihmann via lists.openembedded.org wrote:
> > > in case of running two or more runqemu instances in parallel
> > > with no previously setup tap devices, the following happens:
> > > 
> > > instance A probes for tap devices, but doesn't find
> > > any, proceeds to generating tap devices with the sudo call,
> > > resulting in tap0.
> > > instance B starts to probes, finds tap0.
> > > Both will lock tap0.
> > > 
> > > tap0 will be then forwarded to qemu.
> > > Instance A reporting
> > > 
> > > "Using preconfigured tap device tap0"
> > > 
> > > but then failing with
> > > 
> > > qemu-system... could not configure /dev/net/tun (tap0): Device or resource busy
> > > 
> > > To fix that, lock the entire tap creation process with
> > > an exclusive file (blocking) lock, so only a single instance can
> > > perform the non-atomic changes.
> > > 
> > > Signed-off-by: Konrad Weihmann <kweihmann@outlook.com>
> > > ---
> > >   scripts/runqemu | 10 ++++++++++
> > >   1 file changed, 10 insertions(+)
> > > 
> > > diff --git a/scripts/runqemu b/scripts/runqemu
> > > index 2817acb19f..e5db06be37 100755
> > > --- a/scripts/runqemu
> > > +++ b/scripts/runqemu
> > > @@ -1163,6 +1163,9 @@ to your build configuration.
> > >   
> > >           self.make_lock_dir(lockdir)
> > >   
> > > +        tap_setup_lock = open(os.path.join(lockdir, '_tap_creation_lock'), 'w')
> > > +        fcntl.flock(tap_setup_lock, fcntl.LOCK_EX)
> > > +
> > >           cmd = (ip, 'link')
> > >           logger.debug('Running %s...' % str(cmd))
> > >           ip_link = subprocess.check_output(cmd).decode('utf-8')
> > > @@ -1187,6 +1190,8 @@ to your build configuration.
> > >   
> > >           if not tap:
> > >               if os.path.exists(nosudo_flag):
> > > +                fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
> > > +                tap_setup_lock.close()
> > >                   logger.error("Error: There are no available tap devices to use for networking,")
> > >                   logger.error("and I see %s exists, so I am not going to try creating" % nosudo_flag)
> > >                   raise RunQemuError("a new one with sudo.")
> > 
> > This doesn't make sense to me. Surely you should be taking this lock
> > *within* the "if not tap" code block since that is where the ifup is
> > called out to with sudo? You appear to be dropping the lock there
> > instead?
> 
> Right before the raise of an exception the lock should be released, 
> shouldn't it?

Ok, I see why you've done it this way. I'd much prefer to move the lock
to just be in that conditional though. I don't believe it is needed in
the earlier search code - that code works fine today and is only used
if the devices are pre-setup.

> Otherwise we would rely on the fact that none of the calling functions 
> does any exception handling and
> the entire process fails to release the lock.
> 
> > 
> > It appears you're putting the locking around the search for pre-
> > configured tap devices which is already atomic and working well from a
> > race perspective?
> 
> That is the actual race condition here.
> On the on hand you have the search function, which in itself is atomic 
> and save yes but then you have the setup function below
> 
> 1198            try:
> 1199                tap = 
> subprocess.check_output(cmd).decode('utf-8').strip()
> 1200            except subprocess.CalledProcessError as e:
> 1201                logger.error('Setting up tap device failed:\n%s\nRun 
> runqemu-gen-tapdevs to manually create one.' % str(e))
> 1202                sys.exit(1)
> 1203            lockfile = os.path.join(lockdir, tap)
> 1204            self.taplock = lockfile + '.lock'
> 1205            self.acquire_taplock()
> 
> the section from line 1199 till 1205 is completely unguarded from a lock 
> perspective.
> Meaning (and I think that is what is actually happening) that the new
> tap device is created as the result of line 1199 by process A but locked 
> by process B, as the search function finds it and acquires the lock.
> So in my opinion the entirely function needs to allow only a single 
> instance to work on it (at least that is the most straight forward 
> solution I can imagine).
> The only thing I couldn't figure out in the hundreds of test runs is,
> why the 1205 line lock isn't failing with an error, nevertheless this
> should be fixed and a function wide exclusive lock fixed the issue for me.

Looking at acquire_taplock(), it will print an error instead of an info
message and merrily continue on as the return value isn't checked. That
is perhaps the first key issue to fix as even with your change, that is
still a problem.

I think rather than add potentially fragile locking, we should try and
acquire the lock, if we can't, we should try and add yet another tap
device and continue until something hard errors...

Cheers,

Richard
diff mbox series

Patch

diff --git a/scripts/runqemu b/scripts/runqemu
index 2817acb19f..e5db06be37 100755
--- a/scripts/runqemu
+++ b/scripts/runqemu
@@ -1163,6 +1163,9 @@  to your build configuration.
 
         self.make_lock_dir(lockdir)
 
+        tap_setup_lock = open(os.path.join(lockdir, '_tap_creation_lock'), 'w')
+        fcntl.flock(tap_setup_lock, fcntl.LOCK_EX)
+
         cmd = (ip, 'link')
         logger.debug('Running %s...' % str(cmd))
         ip_link = subprocess.check_output(cmd).decode('utf-8')
@@ -1187,6 +1190,8 @@  to your build configuration.
 
         if not tap:
             if os.path.exists(nosudo_flag):
+                fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
+                tap_setup_lock.close()
                 logger.error("Error: There are no available tap devices to use for networking,")
                 logger.error("and I see %s exists, so I am not going to try creating" % nosudo_flag)
                 raise RunQemuError("a new one with sudo.")
@@ -1198,6 +1203,8 @@  to your build configuration.
             try:
                 tap = subprocess.check_output(cmd).decode('utf-8').strip()
             except subprocess.CalledProcessError as e:
+                fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
+                tap_setup_lock.close()
                 logger.error('Setting up tap device failed:\n%s\nRun runqemu-gen-tapdevs to manually create one.' % str(e))
                 sys.exit(1)
             lockfile = os.path.join(lockdir, tap)
@@ -1206,6 +1213,9 @@  to your build configuration.
             self.cleantap = True
             logger.debug('Created tap: %s' % tap)
 
+        fcntl.flock(tap_setup_lock, fcntl.LOCK_UN)
+        tap_setup_lock.close()
+
         if not tap:
             logger.error("Failed to setup tap device. Run runqemu-gen-tapdevs to manually create.")
             sys.exit(1)