| Message ID | 20241005004451.79575-1-yumx@amazon.com |
|---|---|
| State | New |
| Series | sstate: remove corrupted artifacts from local mirror |
I don't think I can agree with this. This means stability regressions in
one's network infrastructure will go unnoticed and unfixed, and goes
contrary to some people wishing to know immediately when 'corrupted
sstate' occurs, and fix it there and then.

Similar patches have been rejected in the past. Please fix your infra.

Alex

On Sat, 5 Oct 2024 at 02:44, Yu, Max via lists.openembedded.org
<yumx=amazon.com@lists.openembedded.org> wrote:
>
> We observe sstate cache corruptions sometimes which cause rebuilds. That is not
> a fatal error as the package has to be rebuilt and updated artifact needs to be
> pushed to remote sstate cache mirror. Currently, Yocto does not handle
> corruptions properly, where the corrupted artifact is not deleted or
> renamed. Later, after the package is built the same corrupted artifact is pushed
> to remote mirror and the same procedure is circled again and again.
>
> This change verifies the outcome of the unpacking action and renames the
> artifact if a fatal error occurred ("tar" tool returns error 2). In such case we
> rename the artifact what causes that a proper one is created and uploaded
> overwriting the exisiting one - the corrupted one - in the remote mirror. That
> way we break the loop of uploading corrupted file again and again.
>
> Suggested-by: Przemyslaw Sobon <psobon@amazon.com>
> Signed-off-by: Max Yu <yumx@amazon.com>
> ---
>  meta/classes-global/sstate.bbclass | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/meta/classes-global/sstate.bbclass b/meta/classes-global/sstate.bbclass
> index 11bb892a42..5a7ce35341 100644
> --- a/meta/classes-global/sstate.bbclass
> +++ b/meta/classes-global/sstate.bbclass
> @@ -937,7 +937,12 @@ sstate_unpack_package () {
>          ZSTD="pzstd -p ${ZSTD_THREADS}"
>      fi
>
> -    tar -I "$ZSTD" -xvpf ${SSTATE_PKG}
> +    if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
> +        echo "Fatal error extracting sstate cache artifacts, file might be corrupted or truncated, renaming"
> +        mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
> +        exit 2
> +    fi
> +
>      # update .siginfo atime on local/NFS mirror if it is a symbolic link
>      [ ! -h ${SSTATE_PKG}.siginfo ] || [ ! -e ${SSTATE_PKG}.siginfo ] || touch -a ${SSTATE_PKG}.siginfo 2>/dev/null || true
>      # update each symbolic link instead of any referenced file
> --
> 2.40.1
Hi Alex,

Corruptions are an unavoidable part of life: disks and networks can inject
failures for unpredictable and unknown reasons, e.g.
https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723.
Even multi-layer protection is not perfect, as it depends on where the
error is injected.

We proposed this patch as it is aligned with how we currently handle
incorrect checksums:
https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.

For context, we have builds running at a large scale, almost 24/7. This
scale contributes to the following challenges:
1. For us, sstate corruption happens in roughly 1 in 10,000 builds. These
cases are extremely hard to reproduce and debug.
2. With the current behavior of re-uploading the corrupted sstate object,
we end up in an endless loop of rebuilding that we cannot break.
We consider this Yocto behavior a bug, specifically because of #2.

I hear your point, although following your reasoning we should fail the
entire build rather than rebuild the packages. We can explore other
options, such as parameterizing the behavior, but it would be very useful
to be able to break the bad rebuild loop somehow.

Thanks,
Max

On 10/7/24, 4:58 AM, "Alexander Kanavin" <alex.kanavin@gmail.com> wrote:

I don't think I can agree with this. This means stability regressions in
one's network infrastructure will go unnoticed and unfixed, and goes
contrary to some people wishing to know immediately when 'corrupted
sstate' occurs, and fix it there and then.

Similar patches have been rejected in the past. Please fix your infra.

Alex
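To make the "parameterizing the behavior" idea above a bit more concrete, here
is a rough, untested sketch; the variable name SSTATE_RENAME_CORRUPTED is
invented purely for illustration and is not an existing OE-Core variable:

    # Sketch only: gate the rename behind an invented opt-in variable, so the
    # default behaviour still fails loudly without touching the cache.
    if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
        if [ "${SSTATE_RENAME_CORRUPTED}" = "1" ]; then
            echo "Corrupted or truncated sstate artifact, renaming to ${SSTATE_PKG}.unpack_error"
            mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
        fi
        exit 2
    fi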
On Mon, 7 Oct 2024 at 22:09, Yu, Max <yumx@amazon.com> wrote:
> Corruptions are unavoidable part of our life, disks and network can
> inject failures due to unpredictable and unknown reasons like
> https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723.
> Even multi-layer protection is not perfect as it depends on where the
> error is injected.
>
> We proposed this patch as it is aligned with how we currently handle
> incorrect checksums:
> https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.
>
> For context, we have builds running at a large scale, almost 24/7. This
> scale contributes to the following challenges:
> 1. sstate corruption for us happens ~1/10000 of builds. These are
> extremely hard to reproduce and debug...
> 2. with current behavior of reuploading the corrupted sstate object, we
> end up in an endless loop of rebuilding that we cannot break.
> We considered this yocto behavior as a bug, specifically because of #2.
>
> I hear your point, even though following your reasoning we should fail
> the entire build rather than rebuilding the packages. We can explore
> other options like parameterizing the behavior, but it will be very
> useful to be able to break the bad loop of rebuilding somehow.

Perhaps you can describe how the issue can be reproduced and observed in a
local build? (e.g. build something, corrupt sstate for it, observe the
endless rebuild problem)

I'd like to understand particularly where in existing code the endless
cycle would be triggered.

Alex
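For reference, the kind of local reproduction Alex is asking about could look
roughly like this; the recipe name, paths and truncation size are placeholders
and this is an illustrative sketch, not a verified procedure:

    # In an existing build directory, with the default local sstate-cache:
    bitbake zlib-native                        # populate local sstate
    find sstate-cache -name "*zlib-native*populate_sysroot*.tar.zst" \
        -exec truncate -s 1024 {} \;           # simulate a corrupted/truncated object
    rm -rf tmp                                 # force tasks to be restored from sstate
    bitbake zlib-native                        # unpacking the truncated object fails and
                                               # the task re-runs; one can then inspect
                                               # whether the corrupted file in sstate-cache
                                               # was ever removed or replaced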
On Mon, 7 Oct 2024 at 22:09, Yu, Max <yumx@amazon.com> wrote:
> Corruptions are unavoidable part of our life, disks and network can
> inject failures due to unpredictable and unknown reasons like
> https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723.
> Even multi-layer protection is not perfect as it depends on where the
> error is injected.

I'm sorry but I am not convinced. You are the only one reporting the
issue, so I don't think it's unavoidable.

> We proposed this patch as it is aligned with how we currently handle
> incorrect checksums:
> https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.

What the patch is doing is essentially -c cleansstate on a cache used by
several consumers. This makes it a non-starter: you can't remove items
from live sstate like that. Once an object is in the cache, it needs to be
removed offline, or replaced atomically if it's corrupted.

Before we decide how to handle corrupted items, it's perhaps better to
consider first how they should be reported. Then users can decide what
they want to do with that information without yocto forcing something on
them.

So please: how can I observe the issue?

Alex
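As an aside, the "replaced atomically" option mentioned above usually boils
down to staging a known-good file on the same filesystem and renaming it over
the corrupted object; a minimal sketch with made-up paths, assuming an
out-of-band repair job rather than anything inside sstate.bbclass:

    GOOD=/staging/rebuilt-object.tar.zst             # placeholder path
    DEST=/sstate-mirror/ab/sstate-object.tar.zst     # placeholder path
    cp "$GOOD" "$DEST.tmp.$$"                        # stage next to the destination
    mv "$DEST.tmp.$$" "$DEST"                        # rename(2) on the same filesystem:
                                                     # readers see old or new, never partial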
diff --git a/meta/classes-global/sstate.bbclass b/meta/classes-global/sstate.bbclass
index 11bb892a42..5a7ce35341 100644
--- a/meta/classes-global/sstate.bbclass
+++ b/meta/classes-global/sstate.bbclass
@@ -937,7 +937,12 @@ sstate_unpack_package () {
         ZSTD="pzstd -p ${ZSTD_THREADS}"
     fi
 
-    tar -I "$ZSTD" -xvpf ${SSTATE_PKG}
+    if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
+        echo "Fatal error extracting sstate cache artifacts, file might be corrupted or truncated, renaming"
+        mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
+        exit 2
+    fi
+
     # update .siginfo atime on local/NFS mirror if it is a symbolic link
     [ ! -h ${SSTATE_PKG}.siginfo ] || [ ! -e ${SSTATE_PKG}.siginfo ] || touch -a ${SSTATE_PKG}.siginfo 2>/dev/null || true
     # update each symbolic link instead of any referenced file
We sometimes observe sstate cache corruptions which cause rebuilds. That is
not a fatal error, as the package simply has to be rebuilt and the updated
artifact pushed to the remote sstate cache mirror. Currently, Yocto does not
handle such corruptions properly: the corrupted artifact is neither deleted
nor renamed. Later, after the package is rebuilt, the same corrupted artifact
is pushed to the remote mirror and the whole cycle repeats again and again.

This change verifies the outcome of the unpacking step and renames the
artifact if a fatal error occurred (the "tar" tool returned exit code 2). In
that case renaming the artifact causes a proper one to be created and
uploaded, overwriting the existing - corrupted - one in the remote mirror.
That way we break the loop of uploading the corrupted file again and again.

Suggested-by: Przemyslaw Sobon <psobon@amazon.com>
Signed-off-by: Max Yu <yumx@amazon.com>
---
 meta/classes-global/sstate.bbclass | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
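For context on the exit code referenced above: GNU tar reserves exit status 2
for fatal errors, which includes a damaged or truncated archive. A quick,
illustrative way to observe this on a throwaway copy (paths are placeholders):

    cd "$(mktemp -d)"
    cp /path/to/some-sstate-object.tar.zst pkg.tar.zst   # placeholder input
    truncate -s 1024 pkg.tar.zst                         # damage the archive
    tar -I zstd -xvpf pkg.tar.zst
    echo "tar exit status: $?"                           # expected to be 2 here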