diff mbox series

sstate: remove corrupted artifacts from local mirror

Message ID 20241005004451.79575-1-yumx@amazon.com
State New
Headers show
Series sstate: remove corrupted artifacts from local mirror | expand

Commit Message

Yu, Max Oct. 5, 2024, 12:44 a.m. UTC
We observe sstate cache corruptions sometimes which cause rebuilds. That is not
a fatal error as the package has to be rebuilt and updated artifact needs to be
pushed to remote sstate cache mirror. Currently, Yocto does not handle
corruptions properly, where the corrupted artifact is not deleted or
renamed. Later, after the package is built the same corrupted artifact is pushed
to remote mirror and the same procedure is circled again and again.

This change verifies the outcome of the unpacking action and renames the
artifact if a fatal error occurred ("tar" tool returns error 2). In such case we
rename the artifact what causes that a proper one is created and uploaded
overwriting the exisiting one - the corrupted one - in the remote mirror. That
way we break the loop of uploading corrupted file again and again.

Suggested-by: Przemyslaw Sobon <psobon@amazon.com>
Signed-off-by: Max Yu <yumx@amazon.com>
---
 meta/classes-global/sstate.bbclass | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Alexander Kanavin Oct. 7, 2024, 11:57 a.m. UTC | #1
I don't think I can agree with this. This means stability regressions
in one's network infrastructure will go unnoticed and unfixed, and
goes contrary to some people wishing to know immediately when
'corrupted sstate' occurs, and fix it there and then. Similar patches
have been rejected in the past.

Please fix your infra.

Alex

On Sat, 5 Oct 2024 at 02:44, Yu, Max via lists.openembedded.org
<yumx=amazon.com@lists.openembedded.org> wrote:
>
> We observe sstate cache corruptions sometimes which cause rebuilds. That is not
> a fatal error as the package has to be rebuilt and updated artifact needs to be
> pushed to remote sstate cache mirror. Currently, Yocto does not handle
> corruptions properly, where the corrupted artifact is not deleted or
> renamed. Later, after the package is built the same corrupted artifact is pushed
> to remote mirror and the same procedure is circled again and again.
>
> This change verifies the outcome of the unpacking action and renames the
> artifact if a fatal error occurred ("tar" tool returns error 2). In such case we
> rename the artifact what causes that a proper one is created and uploaded
> overwriting the exisiting one - the corrupted one - in the remote mirror. That
> way we break the loop of uploading corrupted file again and again.
>
> Suggested-by: Przemyslaw Sobon <psobon@amazon.com>
> Signed-off-by: Max Yu <yumx@amazon.com>
> ---
>  meta/classes-global/sstate.bbclass | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/meta/classes-global/sstate.bbclass b/meta/classes-global/sstate.bbclass
> index 11bb892a42..5a7ce35341 100644
> --- a/meta/classes-global/sstate.bbclass
> +++ b/meta/classes-global/sstate.bbclass
> @@ -937,7 +937,12 @@ sstate_unpack_package () {
>                 ZSTD="pzstd -p ${ZSTD_THREADS}"
>         fi
>
> -       tar -I "$ZSTD" -xvpf ${SSTATE_PKG}
> +       if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
> +               echo "Fatal error extracting sstate cache artifacts, file might be corrupted or truncated, renaming"
> +               mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
> +               exit 2
> +       fi
> +
>         # update .siginfo atime on local/NFS mirror if it is a symbolic link
>         [ ! -h ${SSTATE_PKG}.siginfo ] || [ ! -e ${SSTATE_PKG}.siginfo ] || touch -a ${SSTATE_PKG}.siginfo 2>/dev/null || true
>         # update each symbolic link instead of any referenced file
> --
> 2.40.1
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#205237): https://lists.openembedded.org/g/openembedded-core/message/205237
> Mute This Topic: https://lists.openembedded.org/mt/108828269/1686489
> Group Owner: openembedded-core+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [alex.kanavin@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
Yu, Max Oct. 7, 2024, 8:09 p.m. UTC | #2
Hi Alex,

Corruptions are unavoidable part of our life, disks and network can inject failures due to unpredictable and unknown reasons like https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723. Even multi-layer protection is not perfect as it depends on where the error is injected.

We proposed this patch as it is aligned with how we currently handle incorrect checksums: https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.

For context, we have builds running at a large scale, almost 24/7. This scale contributes to the following challenges:
1. sstate corruption for us happens ~1/10000 of builds. These are extremely hard to reproduce and debug...
2. with current behavior of reuploading the corrupted sstate object, we end up in an endless loop of rebuilding that we cannot break.
We considered this yocto behavior as a bug, specifically because of #2.

I hear your point, even though following your reasoning we should fail the entire build rather than rebuilding the packages. We can explore other options like parameterizing the behavior, but it will be very useful to be able to break the bad loop of rebuilding somehow.

Thanks,
Max

On 10/7/24, 4:58 AM, "Alexander Kanavin" <alex.kanavin@gmail.com <mailto:alex.kanavin@gmail.com>> wrote:


CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.






I don't think I can agree with this. This means stability regressions
in one's network infrastructure will go unnoticed and unfixed, and
goes contrary to some people wishing to know immediately when
'corrupted sstate' occurs, and fix it there and then. Similar patches
have been rejected in the past.


Please fix your infra.


Alex


On Sat, 5 Oct 2024 at 02:44, Yu, Max via lists.openembedded.org
<yumx=amazon.com@lists.openembedded.org <mailto:amazon.com@lists.openembedded.org>> wrote:
>
> We observe sstate cache corruptions sometimes which cause rebuilds. That is not
> a fatal error as the package has to be rebuilt and updated artifact needs to be
> pushed to remote sstate cache mirror. Currently, Yocto does not handle
> corruptions properly, where the corrupted artifact is not deleted or
> renamed. Later, after the package is built the same corrupted artifact is pushed
> to remote mirror and the same procedure is circled again and again.
>
> This change verifies the outcome of the unpacking action and renames the
> artifact if a fatal error occurred ("tar" tool returns error 2). In such case we
> rename the artifact what causes that a proper one is created and uploaded
> overwriting the exisiting one - the corrupted one - in the remote mirror. That
> way we break the loop of uploading corrupted file again and again.
>
> Suggested-by: Przemyslaw Sobon <psobon@amazon.com <mailto:psobon@amazon.com>>
> Signed-off-by: Max Yu <yumx@amazon.com <mailto:yumx@amazon.com>>
> ---
> meta/classes-global/sstate.bbclass | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/meta/classes-global/sstate.bbclass b/meta/classes-global/sstate.bbclass
> index 11bb892a42..5a7ce35341 100644
> --- a/meta/classes-global/sstate.bbclass
> +++ b/meta/classes-global/sstate.bbclass
> @@ -937,7 +937,12 @@ sstate_unpack_package () {
> ZSTD="pzstd -p ${ZSTD_THREADS}"
> fi
>
> - tar -I "$ZSTD" -xvpf ${SSTATE_PKG}
> + if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
> + echo "Fatal error extracting sstate cache artifacts, file might be corrupted or truncated, renaming"
> + mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
> + exit 2
> + fi
> +
> # update .siginfo atime on local/NFS mirror if it is a symbolic link
> [ ! -h ${SSTATE_PKG}.siginfo ] || [ ! -e ${SSTATE_PKG}.siginfo ] || touch -a ${SSTATE_PKG}.siginfo 2>/dev/null || true
> # update each symbolic link instead of any referenced file
> --
> 2.40.1
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#205237): https://lists.openembedded.org/g/openembedded-core/message/205237 <https://lists.openembedded.org/g/openembedded-core/message/205237>
> Mute This Topic: https://lists.openembedded.org/mt/108828269/1686489 <https://lists.openembedded.org/mt/108828269/1686489>
> Group Owner: openembedded-core+owner@lists.openembedded.org <mailto:openembedded-core+owner@lists.openembedded.org>
> Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub <https://lists.openembedded.org/g/openembedded-core/unsub> [alex.kanavin@gmail.com <mailto:alex.kanavin@gmail.com>]
> -=-=-=-=-=-=-=-=-=-=-=-
>
Alexander Kanavin Oct. 7, 2024, 8:38 p.m. UTC | #3
On Mon, 7 Oct 2024 at 22:09, Yu, Max <yumx@amazon.com> wrote:
> Corruptions are unavoidable part of our life, disks and network can inject failures due to unpredictable and unknown reasons like https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723. Even multi-layer protection is not perfect as it depends on where the error is injected.
>
> We proposed this patch as it is aligned with how we currently handle incorrect checksums: https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.
>
> For context, we have builds running at a large scale, almost 24/7. This scale contributes to the following challenges:
> 1. sstate corruption for us happens ~1/10000 of builds. These are extremely hard to reproduce and debug...
> 2. with current behavior of reuploading the corrupted sstate object, we end up in an endless loop of rebuilding that we cannot break.
> We considered this yocto behavior as a bug, specifically because of #2.
>
> I hear your point, even though following your reasoning we should fail the entire build rather than rebuilding the packages. We can explore other options like parameterizing the behavior, but it will be very useful to be able to break the bad loop of rebuilding somehow.

Perhaps you can describe how the issue can be reproduced and observed
in a local build? (e.g. build something, corrupt sstate for it,
observe the endless rebuild problem)
I'd like to understand particularly where in existing code the endless
cycle would be triggered.

Alex
Alexander Kanavin Oct. 8, 2024, 9:37 a.m. UTC | #4
On Mon, 7 Oct 2024 at 22:09, Yu, Max <yumx@amazon.com> wrote:
> Corruptions are unavoidable part of our life, disks and network can inject failures due to unpredictable and unknown reasons like https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723. Even multi-layer protection is not perfect as it depends on where the error is injected.

I'm sorry but I am not convinced. You are the only one reporting the
issue, so I don't think it's unavoidable.

> We proposed this patch as it is aligned with how we currently handle incorrect checksums: https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.

What the patch is doing is essentially -c cleansstate on a cache used
by several consumers. This makes it a non-starter: you can't remove
items from live sstate like that. Once an object is in the cache, it
needs to be removed offline, or replaced atomically if it's corrupted.

Before we decide how to handle corrupted items, it's perhaps better to
consider first how they should be reported. Then users can decide what
they want to do with that information without yocto forcing something
on them. So please: how can I observe the issue?

Alex
diff mbox series

Patch

diff --git a/meta/classes-global/sstate.bbclass b/meta/classes-global/sstate.bbclass
index 11bb892a42..5a7ce35341 100644
--- a/meta/classes-global/sstate.bbclass
+++ b/meta/classes-global/sstate.bbclass
@@ -937,7 +937,12 @@  sstate_unpack_package () {
 		ZSTD="pzstd -p ${ZSTD_THREADS}"
 	fi
 
-	tar -I "$ZSTD" -xvpf ${SSTATE_PKG}
+	if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
+		echo "Fatal error extracting sstate cache artifacts, file might be corrupted or truncated, renaming"
+		mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
+		exit 2
+	fi
+
 	# update .siginfo atime on local/NFS mirror if it is a symbolic link
 	[ ! -h ${SSTATE_PKG}.siginfo ] || [ ! -e ${SSTATE_PKG}.siginfo ] || touch -a ${SSTATE_PKG}.siginfo 2>/dev/null || true
 	# update each symbolic link instead of any referenced file