diff mbox series

[bitbake-devel,v2] siggen: Fix sigtask data not being renamed atomically

Message ID 20220802131405.318201-1-JPEWhacker@gmail.com
State New
Headers show
Series [bitbake-devel,v2] siggen: Fix sigtask data not being renamed atomically | expand

Commit Message

Joshua Watt Aug. 2, 2022, 1:14 p.m. UTC
Signature generation uses mkstemp() to get a file descriptor to a unique
file and then write the signature into it, however it closed the file
before doing the chmod() and rename() operations. Closing the file means
that other mkstemp() could potentially open the same file and race with
the chmod() and rename(), causing a error. While it may not sound like
this would be very likely, glibc (at least) generates the filename for
mkstemp() using the system clock, meaning that it is much more likely
for highly parallel builds sharing sstate over NFS to encounter the race
condition.

To fix the problem, perform the chmod() and rename() while the file is
still open, since this prevents other mkstemp() calls from being able to
open the file (due to the O_EXCL flag).

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 bitbake/lib/bb/siggen.py | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Comments

Rasmus Villemoes Aug. 3, 2022, 11:40 a.m. UTC | #1
On 02/08/2022 15.14, Joshua Watt via lists.openembedded.org wrote:
> Signature generation uses mkstemp() to get a file descriptor to a unique
> file and then write the signature into it, however it closed the file
> before doing the chmod() and rename() operations. Closing the file means
> that other mkstemp() could potentially open the same file and race with
> the chmod() and rename(), causing a error. While it may not sound like
> this would be very likely, glibc (at least) generates the filename for
> mkstemp() using the system clock, meaning that it is much more likely
> for highly parallel builds sharing sstate over NFS to encounter the race
> condition.
> 
> To fix the problem, perform the chmod() and rename() while the file is
> still open, since this prevents other mkstemp() calls from being able to
> open the file (due to the O_EXCL flag).

Eh, O_CREAT|O_EXCL should prevent opening an already existing file,
whether somebody has it open or not, so I don't see how another
mkstemp() could end up opening "our" file. Unless O_EXCL semantics are
broken on NFS and only work by accident when the local client does have
the file open - but what would then prevent other clients with,
presumably, the exact same system time from clobbering our file, whether
or not the chmod+rename is done with the fd still held?

The patch is probably fine (as in, shouldn't make anything worse), but I
think it would be nice to understand just what the problem actually is,
and if this fixes it, how.

Rasmus
Joshua Watt Aug. 3, 2022, 1:21 p.m. UTC | #2
On 8/3/22 06:40, Rasmus Villemoes wrote:
> On 02/08/2022 15.14, Joshua Watt via lists.openembedded.org wrote:
>> Signature generation uses mkstemp() to get a file descriptor to a unique
>> file and then write the signature into it, however it closed the file
>> before doing the chmod() and rename() operations. Closing the file means
>> that other mkstemp() could potentially open the same file and race with
>> the chmod() and rename(), causing a error. While it may not sound like
>> this would be very likely, glibc (at least) generates the filename for
>> mkstemp() using the system clock, meaning that it is much more likely
>> for highly parallel builds sharing sstate over NFS to encounter the race
>> condition.
>>
>> To fix the problem, perform the chmod() and rename() while the file is
>> still open, since this prevents other mkstemp() calls from being able to
>> open the file (due to the O_EXCL flag).
> Eh, O_CREAT|O_EXCL should prevent opening an already existing file,
> whether somebody has it open or not, so I don't see how another
> mkstemp() could end up opening "our" file. Unless O_EXCL semantics are
> broken on NFS and only work by accident when the local client does have
> the file open - but what would then prevent other clients with,
> presumably, the exact same system time from clobbering our file, whether
> or not the chmod+rename is done with the fd still held?

Yes, for some reason I had it my head that O_EXCL only affects open 
files; so I'm not sure. Maybe NFS is the culprit :/

>
> The patch is probably fine (as in, shouldn't make anything worse), but I
> think it would be nice to understand just what the problem actually is,
> and if this fixes it, how.

Ya, we are definetely getting errors like:


FileNotFoundError: [Errno 2] No such file or directory: 
'/mnt/releases/sstate/99/sigtask.wq6gkrm7' -> 
'/mnt/releases/sstate/99/sstate:hdparm::9.48:r0::3:990d8b39c15752a2dd68369cb3609283_unpack.tgz.siginfo'

With a backtrace to the os.rename() under heavy load. I don't understand 
how the chmod() can succeed and the rename fail unless there is a race. 
It's very hard to diagnose what went wrong post-mortem.

My other option was to manually add more entropy to the file names by 
prepending some random characters after "sigtask", since the 
"randomeness" of mkstemp() isn't great all tasks use the same sigtask 
file prefix (which increases the likely hood of a collision).
>
> Rasmus
Joshua Watt Aug. 3, 2022, 2:04 p.m. UTC | #3
On Wed, Aug 3, 2022 at 8:21 AM Joshua Watt <jpewhacker@gmail.com> wrote:
>
>
> On 8/3/22 06:40, Rasmus Villemoes wrote:
>
> On 02/08/2022 15.14, Joshua Watt via lists.openembedded.org wrote:
>
> Signature generation uses mkstemp() to get a file descriptor to a unique
> file and then write the signature into it, however it closed the file
> before doing the chmod() and rename() operations. Closing the file means
> that other mkstemp() could potentially open the same file and race with
> the chmod() and rename(), causing a error. While it may not sound like
> this would be very likely, glibc (at least) generates the filename for
> mkstemp() using the system clock, meaning that it is much more likely
> for highly parallel builds sharing sstate over NFS to encounter the race
> condition.
>
> To fix the problem, perform the chmod() and rename() while the file is
> still open, since this prevents other mkstemp() calls from being able to
> open the file (due to the O_EXCL flag).
>
> Eh, O_CREAT|O_EXCL should prevent opening an already existing file,
> whether somebody has it open or not, so I don't see how another
> mkstemp() could end up opening "our" file. Unless O_EXCL semantics are
> broken on NFS and only work by accident when the local client does have
> the file open - but what would then prevent other clients with,
> presumably, the exact same system time from clobbering our file, whether
> or not the chmod+rename is done with the fd still held?
>
> Yes, for some reason I had it my head that O_EXCL only affects open files; so I'm not sure. Maybe NFS is the culprit :/
>
> The patch is probably fine (as in, shouldn't make anything worse), but I
> think it would be nice to understand just what the problem actually is,
> and if this fixes it, how.
>
> Ya, we are definetely getting errors like:
>
>
> FileNotFoundError: [Errno 2] No such file or directory: '/mnt/releases/sstate/99/sigtask.wq6gkrm7' -> '/mnt/releases/sstate/99/sstate:hdparm::9.48:r0::3:990d8b39c15752a2dd68369cb3609283_unpack.tgz.siginfo'
>
> With a backtrace to the os.rename() under heavy load. I don't understand how the chmod() can succeed and the rename fail unless there is a race. It's very hard to diagnose what went wrong post-mortem.
>
> My other option was to manually add more entropy to the file names by prepending some random characters after "sigtask", since the "randomeness" of mkstemp() isn't great all tasks use the same sigtask file prefix (which increases the likely hood of a collision).

Ok, after further investigation, I realized that our NFS server (which
I don't have any control over) is not Linux, so I have the sneaking
suspicion that it doesn't implement O_EXCL correctly, or perhaps not
correctly across multiple clients. In light of that, this patch isn't
going to do what I want anyway, and I've sent a new patch that
increases the entropy to mkstemp() so it's not time based which should
solve the problem in all cases

>
> Rasmus
diff mbox series

Patch

diff --git a/bitbake/lib/bb/siggen.py b/bitbake/lib/bb/siggen.py
index 3f3d6df54d..f3c8ff796f 100644
--- a/bitbake/lib/bb/siggen.py
+++ b/bitbake/lib/bb/siggen.py
@@ -426,18 +426,18 @@  class SignatureGeneratorBasic(SignatureGenerator):
                 sigfile = sigfile.replace(self.taskhash[tid], computed_taskhash)
 
         fd, tmpfile = tempfile.mkstemp(dir=os.path.dirname(sigfile), prefix="sigtask.")
-        try:
-            with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f:
+        with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f:
+            try:
                 json.dump(data, f, sort_keys=True, separators=(",", ":"), cls=SetEncoder)
                 f.flush()
-            os.chmod(tmpfile, 0o664)
-            bb.utils.rename(tmpfile, sigfile)
-        except (OSError, IOError) as err:
-            try:
-                os.unlink(tmpfile)
-            except OSError:
-                pass
-            raise err
+                os.fchmod(fd, 0o664)
+                bb.utils.rename(tmpfile, sigfile)
+            except (OSError, IOError) as err:
+                try:
+                    os.unlink(tmpfile)
+                except OSError:
+                    pass
+                raise err
 
     def dump_sigfn(self, fn, dataCaches, options):
         if fn in self.taskdeps: