diff mbox series

[AUH] upgrade-helper: add state module and --incremental mode

Message ID 20260601113410.1896488-1-daniel.turull@ericsson.com
State New
Headers show
Series [AUH] upgrade-helper: add state module and --incremental mode | expand

Commit Message

Daniel Turull June 1, 2026, 11:34 a.m. UTC
From: Daniel Turull <daniel.turull@ericsson.com>

Without this, every AUH run retries all candidates regardless of previous
outcomes, wasting time on recipes that consistently fail or were already
upgraded recently.

Add modules/state.py which persists upgrade results to a JSON state file
(auh-state.json in the upgrade-helper work directory, typically
$BUILDDIR/upgrade-helper/). The new --incremental flag activates this
tracking and skips package groups where all packages were recently attempted.

Behavior:
- Failed attempts are retried after retry_interval days (default 30).
- Successful/already-current upgrades are suppressed for success_max_age
  days (default 30).

Tested with master (2026-06-01):
command: ../auto-upgrade-helper/upgrade-helper.py all --incremental -s

- Run 1 (2026-06-01 07:38): 122 attempted, 98 succeeded, 24 failed, 0 skipped
- Run 2 (2026-06-01 11:17): 1 attempted, 0 succeeded, 1 failed, 26 skipped
- Run 3 (2026-06-01 11:20): 0 attempted, 0 succeeded, 0 failed, 27 skipped

Link: https://lists.openembedded.org/g/openembedded-architecture/message/2349

AI-Generated: kiro with claude-opus-4.6 model
Signed-off-by: Daniel Turull <daniel.turull@ericsson.com>
---
 modules/state.py    | 102 ++++++++++++++++++++++++++++++++++++++++++++
 upgrade-helper.conf |  10 ++++-
 upgrade-helper.py   |  41 ++++++++++++++++++
 3 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 modules/state.py

Comments

Alexander Kanavin June 2, 2026, 9:54 a.m. UTC | #1
On Mon, 1 Jun 2026 at 14:04, <daniel.turull@ericsson.com> wrote:
> Without this, every AUH run retries all candidates regardless of previous
> outcomes, wasting time on recipes that consistently fail or were already
> upgraded recently.

Thanks for working on these, much appreciated! Comments below.

> +    def should_skip(self, pn, version):
> +        entry = self.data.get(pn, {}).get(version)
> +        if not entry:
> +            return False
> +        age = time.time() - entry.get("timestamp", 0)
> +        if entry.get("result") == RESULT_SUCCESS:
> +            return age < self.success_max_age
> +        return age < self.cooldown
> +
> +    def _prune(self):
> +        """Remove stale entries older than their respective max-age."""
> +        now = time.time()
> +        for pn in list(self.data):
> +            versions = self.data[pn]
> +            for ver in list(versions):
> +                entry = versions[ver]
> +                age = now - entry.get("timestamp", 0)
> +                if entry.get("result") == RESULT_SUCCESS:
> +                    if age > self.success_max_age:
> +                        del versions[ver]
> +                else:
> +                    if age > self.cooldown:
> +                        del versions[ver]
> +            if not versions:
> +                del self.data[pn]

I am somewhat confused as to how these functions complement each
other, because they seem to repeat each other's timestamp logic. If
entries that are too old are already pruned at the start, then the
remaining entries will always have their age less than maximum
periods, and it's not necessary to compare times in should_skip(), and
it should simply return True. No?

I also think that any 'decisions' regarding pruning, and skipping
should be at least printed out, so that the log contains information
about why something wasn't attempted, or why it was attempted again,
including information about when it was attempted previously and what
was the result then.

> @@ -520,6 +542,10 @@ class Updater(object):
>                  g['error'] = e
>                  failed_pkggroups_ctx.append(g)
>
> +            # Capture upgrade error before commit_changes() which may
> +            # overwrite g['error'] with a git commit failure.
> +            upgrade_err = g.get('error') if self.state else None
> +

Such special-casing always makes me a bit uneasy. What is the scenario
where this matters? Did you hit that in testing? If the upgrade itself
succeeded, but the commit failed, then the overall result is 'failed',
and we shouldn't roll it back.

> +            if self.state:
> +                # UpgradeNotNeededError means the recipe is already current,
> +                # which is a successful outcome — no retry needed.
> +                if not upgrade_err or isinstance(upgrade_err, UpgradeNotNeededError):
> +                    result = RESULT_SUCCESS
> +                else:
> +                    result = RESULT_FAILURE

Looking at code, there isn't any place where UpgradeNotNeededError is
actually raised, so you could probably add a patch that removes it
altogether for lack of use?

Alex
Daniel Turull June 2, 2026, 1:18 p.m. UTC | #2
Thanks for the review.

> -----Original Message-----
> From: Alexander Kanavin <alex.kanavin@gmail.com>
> Sent: Tuesday, 2 June 2026 11:55
> To: Daniel Turull <daniel.turull@ericsson.com>
> Cc: yocto-patches@lists.yoctoproject.org
> Subject: Re: [AUH] [PATCH] upgrade-helper: add state module and --
> incremental mode
> 
> On Mon, 1 Jun 2026 at 14:04, <daniel.turull@ericsson.com> wrote:
> > Without this, every AUH run retries all candidates regardless of previous
> > outcomes, wasting time on recipes that consistently fail or were already
> > upgraded recently.
> 
> Thanks for working on these, much appreciated! Comments below.
> 
> > +    def should_skip(self, pn, version):
> > +        entry = self.data.get(pn, {}).get(version)
> > +        if not entry:
> > +            return False
> > +        age = time.time() - entry.get("timestamp", 0)
> > +        if entry.get("result") == RESULT_SUCCESS:
> > +            return age < self.success_max_age
> > +        return age < self.cooldown
> > +
> > +    def _prune(self):
> > +        """Remove stale entries older than their respective max-age."""
> > +        now = time.time()
> > +        for pn in list(self.data):
> > +            versions = self.data[pn]
> > +            for ver in list(versions):
> > +                entry = versions[ver]
> > +                age = now - entry.get("timestamp", 0)
> > +                if entry.get("result") == RESULT_SUCCESS:
> > +                    if age > self.success_max_age:
> > +                        del versions[ver]
> > +                else:
> > +                    if age > self.cooldown:
> > +                        del versions[ver]
> > +            if not versions:
> > +                del self.data[pn]
> 
> I am somewhat confused as to how these functions complement each
> other, because they seem to repeat each other's timestamp logic. If
> entries that are too old are already pruned at the start, then the
> remaining entries will always have their age less than maximum
> periods, and it's not necessary to compare times in should_skip(), and
> it should simply return True. No?

Yes, my first iteration had only should_skip and then I added the prune. 
Your suggestion is cleaner. I'll fix it in v2.

> 
> I also think that any 'decisions' regarding pruning, and skipping
> should be at least printed out, so that the log contains information
> about why something wasn't attempted, or why it was attempted again,
> including information about when it was attempted previously and what
> was the result then.

I'll add a log entry for it in v2.

> 
> > @@ -520,6 +542,10 @@ class Updater(object):
> >                  g['error'] = e
> >                  failed_pkggroups_ctx.append(g)
> >
> > +            # Capture upgrade error before commit_changes() which may
> > +            # overwrite g['error'] with a git commit failure.
> > +            upgrade_err = g.get('error') if self.state else None
> > +
> 
> Such special-casing always makes me a bit uneasy. What is the scenario
> where this matters? Did you hit that in testing? If the upgrade itself
> succeeded, but the commit failed, then the overall result is 'failed',
> and we shouldn't roll it back.

I cannot reproduce it now. I was over conservative with the failures.  I'll drop it.
> 
> > +            if self.state:
> > +                # UpgradeNotNeededError means the recipe is already current,
> > +                # which is a successful outcome — no retry needed.
> > +                if not upgrade_err or isinstance(upgrade_err,
> UpgradeNotNeededError):
> > +                    result = RESULT_SUCCESS
> > +                else:
> > +                    result = RESULT_FAILURE
> 
> Looking at code, there isn't any place where UpgradeNotNeededError is
> actually raised, so you could probably add a patch that removes it
> altogether for lack of use?

I'll create a patch to remove it and remove from the incremental code.

I'll rerun the full test with today's master and I'll send a v2 when I'm happy with the testing.

Thanks,
Daniel
diff mbox series

Patch

diff --git a/modules/state.py b/modules/state.py
new file mode 100644
index 0000000..01bc18f
--- /dev/null
+++ b/modules/state.py
@@ -0,0 +1,102 @@ 
+# SPDX-License-Identifier: GPL-2.0-or-later
+# vim: set ts=4 sw=4 et:
+#
+# Copyright (c) 2026 Ericsson AB
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License
+# as published by the Free Software Foundation; either version 2
+# of the License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+#
+# AUTHORS
+# Daniel Turull   <daniel.turull@ericsson.com>
+#
+
+import json
+import os
+import time
+
+from logging import warning as W
+
+RESULT_SUCCESS = "success"
+RESULT_FAILURE = "failure"
+
+STATE_FILENAME = "auh-state.json"
+STATE_VERSION = 1
+
+SECONDS_PER_DAY = 86400
+DEFAULT_COOLDOWN_DAYS = 30
+DEFAULT_SUCCESS_MAX_AGE_DAYS = 30
+
+
+class State:
+    """Tracks upgrade attempts as {recipe: {version: {timestamp, result}}}."""
+
+    def __init__(self, state_dir, cooldown_days=DEFAULT_COOLDOWN_DAYS,
+                 success_max_age_days=DEFAULT_SUCCESS_MAX_AGE_DAYS):
+        self.path = os.path.join(state_dir, STATE_FILENAME)
+        self.cooldown = cooldown_days * SECONDS_PER_DAY
+        self.success_max_age = success_max_age_days * SECONDS_PER_DAY
+        self.data = self._load()
+        self._prune()
+
+    def _load(self):
+        if os.path.exists(self.path):
+            try:
+                with open(self.path) as f:
+                    raw = json.load(f)
+            except (json.JSONDecodeError, OSError) as e:
+                W(" %s is corrupt (%s), starting fresh" % (self.path, e))
+                return {}
+            if not isinstance(raw, dict) or raw.get("version") != STATE_VERSION:
+                W(" %s: unsupported or missing version, starting fresh"
+                  % self.path)
+                return {}
+            return raw.get("recipes", {})
+        return {}
+
+    def save(self):
+        with open(self.path, "w") as f:
+            json.dump({"version": STATE_VERSION, "recipes": self.data},
+                      f, indent=2)
+
+    def record(self, pn, version, result):
+        entry = {"timestamp": time.time(), "result": result}
+        if pn not in self.data:
+            self.data[pn] = {}
+        self.data[pn][version] = entry
+
+    def should_skip(self, pn, version):
+        entry = self.data.get(pn, {}).get(version)
+        if not entry:
+            return False
+        age = time.time() - entry.get("timestamp", 0)
+        if entry.get("result") == RESULT_SUCCESS:
+            return age < self.success_max_age
+        return age < self.cooldown
+
+    def _prune(self):
+        """Remove stale entries older than their respective max-age."""
+        now = time.time()
+        for pn in list(self.data):
+            versions = self.data[pn]
+            for ver in list(versions):
+                entry = versions[ver]
+                age = now - entry.get("timestamp", 0)
+                if entry.get("result") == RESULT_SUCCESS:
+                    if age > self.success_max_age:
+                        del versions[ver]
+                else:
+                    if age > self.cooldown:
+                        del versions[ver]
+            if not versions:
+                del self.data[pn]
diff --git a/upgrade-helper.conf b/upgrade-helper.conf
index 269bde3..c5dfdde 100644
--- a/upgrade-helper.conf
+++ b/upgrade-helper.conf
@@ -50,7 +50,15 @@ 
 # passed; does not apply when layer_mode is enabled).
 #blacklist=python glibc gcc
 
-# specify the directory where work (patches) will be saved 
+# When running with --incremental, how many days to wait before retrying
+# a failed upgrade attempt for the same recipe version. Default is 30 days.
+#retry_interval=30
+
+# When running with --incremental, how many days to skip a successfully
+# upgraded recipe version before attempting it again. Default is 30 days.
+#success_max_age=30
+
+# specify the directory where work (patches) will be saved
 # (optional; default is BUILDDIR/upgrade-helper/)
 #workdir=
 
diff --git a/upgrade-helper.py b/upgrade-helper.py
index 40f31c4..339f096 100755
--- a/upgrade-helper.py
+++ b/upgrade-helper.py
@@ -59,6 +59,8 @@  from utils.emailhandler import Email
 from statistics import Statistics
 from steps import upgrade_steps
 from testimage import TestImage
+from state import (State, RESULT_SUCCESS, RESULT_FAILURE,
+                   DEFAULT_COOLDOWN_DAYS, DEFAULT_SUCCESS_MAX_AGE_DAYS)
 
 if not os.getenv('BUILDDIR', False):
     E(" You must source oe-init-build-env before running this script!\n")
@@ -104,6 +106,8 @@  def parse_cmdline():
                         help="do not compile, just change the checksums, remove PR, and commit")
     parser.add_argument("-c", "--config-file", default=None,
                         help="Path to the configuration file. Default is $BUILDDIR/upgrade-helper/upgrade-helper.conf")
+    parser.add_argument("--incremental", action="store_true",
+                        help="skip recipes already attempted (uses JSON state file)")
     parser.add_argument("--layer-names", nargs='*', action="store", default='',
                         help="layers to include in the upgrade research")
     parser.add_argument("--layer-dir", action="store", default='',
@@ -169,6 +173,16 @@  class Updater(object):
             self.email_handler = Email(settings)
         self.statistics = Statistics()
 
+        if self.args.incremental:
+            cooldown = int(settings.get('retry_interval', DEFAULT_COOLDOWN_DAYS))
+            success_max_age = int(settings.get('success_max_age',
+                                               DEFAULT_SUCCESS_MAX_AGE_DAYS))
+            self.state = State(self.uh_dir,
+                               cooldown_days=cooldown,
+                               success_max_age_days=success_max_age)
+        else:
+            self.state = None
+
     def _set_options(self):
         self.opts = {}
         self.opts['layer_mode'] = settings.get('layer_mode', '')
@@ -468,6 +482,14 @@  class Updater(object):
 
             pkggroups_ctx.append({"name":",".join([pkg_ctx['PN'] for pkg_ctx in pkgs_ctx]),"pkgs":pkgs_ctx,"error":None, 'base_dir':self.uh_recipes_all_dir})
         I(" ############################################################")
+        if self.state:
+            before = len(pkggroups_ctx)
+            pkggroups_ctx = [g for g in pkggroups_ctx
+                             if not self.state.should_skip(
+                                 g['pkgs'][0]['PN'], g['pkgs'][0]['NPV'])]
+            I(" %d/%d package groups skipped (incremental mode)"
+              % (before - len(pkggroups_ctx), before))
+            total_pkggroups = len(pkggroups_ctx)
         if pkggroups_ctx and not self.args.skip_compilation:
             I(" Building gcc runtimes ...")
             for machine in self.opts['machines']:
@@ -520,6 +542,10 @@  class Updater(object):
                 g['error'] = e
                 failed_pkggroups_ctx.append(g)
 
+            # Capture upgrade error before commit_changes() which may
+            # overwrite g['error'] with a git commit failure.
+            upgrade_err = g.get('error') if self.state else None
+
             try:
                 self.commit_changes(g)
             except Exception as e:
@@ -534,6 +560,21 @@  class Updater(object):
                     succeeded_pkggroups_ctx.remove(g)
                     failed_pkggroups_ctx.append(g)
 
+            if self.state:
+                # UpgradeNotNeededError means the recipe is already current,
+                # which is a successful outcome — no retry needed.
+                if not upgrade_err or isinstance(upgrade_err, UpgradeNotNeededError):
+                    result = RESULT_SUCCESS
+                else:
+                    result = RESULT_FAILURE
+                # All packages in a group share the same result because AUH
+                # upgrades them atomically; individual outcomes are not tracked.
+                for pkg_ctx in g['pkgs']:
+                    self.state.record(pkg_ctx['PN'], pkg_ctx['NPV'], result)
+
+        if self.state:
+            self.state.save()
+
         if self.opts['testimage']:
             ctxs = {}
             ctxs['succeeded'] = succeeded_pkggroups_ctx