Message ID: 20230224164555.67634-1-alexis.lothore@bootlin.com
Series: scripts/resulttool/regression: add metadata filtering
Hi Alexis,

Firstly, this looks very much improved, thanks. It is great to start to see some meaningful data from this.

On Fri, 2023-02-24 at 17:45 +0100, Alexis Lothoré via lists.openembedded.org wrote:
> From: Alexis Lothoré <alexis.lothore@bootlin.com>
>
> Hello,
> this new series is the follow-up of [1] to make regression reports more
> meaningful, by reducing noise and false positives.
>
> Changes since v2:
> - add filtering on the MACHINE field from the test results configuration: the
>   MACHINE should always match
> - add a "metadata guessing" mechanism based on Richard's proposal ([2]). Up to
>   the point where this series is merged, test results stored in git are not
>   enriched with OESELFTEST_METADATA. To allow proper test comparison even with
>   those tests, try to guess what oe-selftest command line was used to run the
>   corresponding tests, and generate OESELFTEST_METADATA accordingly
> - add a new tool to ease test results usage: yocto_testresults_query. For now
>   the tool only manages regression reports and is a thin layer between
>   send-qa-email (in yocto-autobuilder-helper) and resulttool. Its main role is
>   to translate regression report arguments (which are tags or branches) to
>   fixed revisions and to call resulttool accordingly. Most of its code is a
>   transfer from send-qa-email (another series for the autobuilder will follow
>   this one to make send-qa-email use this new helper, but this current series
>   works independently)
>
> Example: "yocto_testresults_query.py regression-report 4.2_M1 4.2_M2" will
> replay the regression report generated when 4.2_M2 was released.
> Changes since v1:
> - properly configure the "From" field in the series
>
> With those improvements, the regression report is significantly reduced and
> some useful data start to emerge from the removed noise:
> - with the MACHINE filtering, the 4.2_M2 report goes from 5.5GB to 627MB
> - with the OESELFTEST_METADATA enrichment + metadata guessing for older tests,
>   the report goes from 627MB to 1.5MB

That is just a bit more readable!

> After manual inspection of some entries, the remaining oe-selftest regressions
> raised in the report seem valid. There are still some issues to tackle:
> - it seems that one major remaining source of noise is now in the "runtime"
>   tests (comparison to tests not run on "target" results)
> - when a ptest managed by oe-selftest fails, I guess the remaining tests are
>   not run, so when 1 failure is logged we get many "PASSED->None" transitions
>   in the regression report; we should probably silence those
> - some transitions appear as regressions while they are in fact improvements
>   (e.g. "UNRESOLVED->PASSED")

I had a quick play. Firstly, if I try "yocto_testresults_query.py regression-report 4.2_M1 4.2_M2" in an openembedded-core repository instead of poky, it breaks. That isn't surprising, but we should either make it work or show a sensible error.

I also took a look at the report and wondered why the matching isn't quite right and why we have these "regressions". If we could remove that noise, I think we'd get down to the real issues. I ended up doing:

resulttool report --commit 4d19594b8bdacde6d809d3f2a25cff7c5a42295e . > /tmp/repa
resulttool report --commit 5e249ec855517765f4b99e8039cb888ffa09c211 . > /tmp/repb
meld /tmp/rep*

which was interesting, as it gave lots of warnings like:

"Warning duplicate ptest result 'acl.test/cp.test' for qemuarm64"

so it looks like we had a couple of different test runs for qemuarm64 ptests, which is confusing your new code.
I suspect this happened due to some autobuilder glitch during the release build which restarted some of the build pieces. I'm not sure how to handle that yet; I'll give it some further thought, but I wanted to share what I think is the source of some of the issues. Basically we need to get the regression report looking more like that meld output!

Cheers,

Richard
On Fri, 2023-02-24 at 18:06 +0000, Richard Purdie via lists.openembedded.org wrote:
> [...]
>
> so it looks like we had a couple of different test runs for qemuarm64
> ptests, which is confusing your new code. I suspect this happened due to
> some autobuilder glitch during the release build which restarted some
> of the build pieces. Not sure how to handle that yet, I'll give it some
> further thought but I wanted to share what I think is the source of
> some of the issues. Basically we need to get the regression report
> looking more like that meld output!

I was wrong about the duplication, that isn't the issue, or at least I found some other more pressing ones. For the ltp issue, I found an easy fix:

diff --git a/scripts/lib/resulttool/regression.py b/scripts/lib/resulttool/regression.py
index 1b0c8335a39..9d7c35942a6 100644
--- a/scripts/lib/resulttool/regression.py
+++ b/scripts/lib/resulttool/regression.py
@@ -146,6 +146,7 @@ def can_be_compared(logger, base, target):
     run with different tests sets or parameters. Return true if tests can be
     compared
     """
+    ret = True
     base_configuration = base['configuration']
     target_configuration = target['configuration']
 
@@ -165,7 +166,10 @@ def can_be_compared(logger, base, target):
         logger.debug(f"Enriching {target_configuration['STARTTIME']} with {guess}")
         target_configuration['OESELFTEST_METADATA'] = guess
 
-    return metadata_matches(base_configuration, target_configuration) \
+    if base_configuration.get('TEST_TYPE') == 'runtime' and any(result.startswith("ltpresult") for result in base['result']):
+        ret = target_configuration.get('TEST_TYPE') == 'runtime' and any(result.startswith("ltpresult") for result in target['result'])
+
+    return ret and metadata_matches(base_configuration, target_configuration) \
         and machine_matches(base_configuration, target_configuration)

i.e. only compare ltp to ltp. The issue is we don't use a special image name for the ltp test runs, we just extend a standard one, so it was comparing ltp to non-ltp.

We should also perhaps consider a clause in there which only compares runs with ptests against other runs with ptests? Our test matrix won't trigger that, but other usage might in future and it is a safe check?
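The "only compare ltp to ltp" rule from the diff above can be illustrated with a minimal standalone sketch. The run dictionaries below are hypothetical stand-ins for the structures resulttool passes to can_be_compared(); the real format carries many more configuration fields.

```python
# Sketch of the ltp-to-ltp comparability rule. An "ltp run" is a runtime run
# whose results contain any ltpresult.* entries; such a run should only be
# compared against another ltp run.

def has_ltp_results(run):
    """True for a runtime run that contains any ltpresult.* entries."""
    return (run['configuration'].get('TEST_TYPE') == 'runtime'
            and any(name.startswith("ltpresult") for name in run['result']))

def ltp_comparable(base, target):
    """Mirror the diff: only restrict when the base run is an ltp run."""
    if has_ltp_results(base):
        return has_ltp_results(target)
    return True

# Hypothetical sample runs, not real autobuilder data.
ltp_run = {'configuration': {'TEST_TYPE': 'runtime'},
           'result': {'ltpresult.math': {'status': 'PASSED'}}}
plain_run = {'configuration': {'TEST_TYPE': 'runtime'},
             'result': {'ping.PingTest.test_ping': {'status': 'PASSED'}}}
```

Note that, as in the diff, the check is asymmetric: a non-ltp base run is not prevented from matching an ltp target by this clause alone.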
A lot of the rest of the noise is poor test naming for ptests, e.g.:

ptestresult.lttng-tools.ust/buffers-pid/test_buffers_pid_10_-_Create_session_buffers-pid_in_-o_/tmp/tmp.XXXXXXXXXXrs_pid_trace_path.XTnDY5

which has a random string at the end. I'm wondering if we should pre-filter ptest result names and truncate a known list of them at the "-" (lttng-tools, babeltrace, babeltrace2). Curl could also be truncated at the ",":

ptestresult.curl.test_0010__10_out_of_1506,_remaining:_06:44,_took_1.075s,_duration:_00:02_

We can adjust the ptest generation code to do this at source (we should perhaps file a bug for that for the four above?) but that won't fix the older results, so we'll probably need some filtering in the code too.

There is something more going on with the ptest results too. I don't understand why quilt/python3 changed, but I suspect we just have to go through the issues step by step now.

I did look into the:

ptestresult.glibc-user.debug/tst-fortify-c-default-1

'regression' and it is because the test was renamed in the new glibc. I was therefore thinking a summary of added/removed tests would be useful, but only in these cases. Something along the lines of: if only tests were added, just summarise "X new tests added" and call it a match. If tests were both removed and added, list them, show a count summary (X removed, Y added) and call it a regression.

I think I might be tempted to merge this series and then we can change the code to improve from here, as this is clearly a vast improvement on where we were! Improvements can be incremental on top of these changes.

Cheers,

Richard
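The pre-filtering idea described above could look something like the sketch below in isolation. This is a hedged sketch, not the merged implementation; the prefix list and the "_-_" / "__" separators are taken from the example names in the thread.

```python
# Sketch of ptest-name pre-filtering: truncate known-noisy names so the
# random or timing-dependent suffix no longer defeats run-to-run matching.
# lttng-tools/babeltrace names are cut at "_-_", curl names at "__".

NOISY_PREFIXES = ("ptestresult.lttng-tools.",
                  "ptestresult.babeltrace.",
                  "ptestresult.babeltrace2.")

def normalize_ptest_name(name):
    """Return a stable version of a ptest result name."""
    if name.startswith(NOISY_PREFIXES) and "_-_" in name:
        return name.split("_-_")[0]
    if name.startswith("ptestresult.curl.") and "__" in name:
        return name.split("__")[0]
    return name
```

Applied to the two examples above, the lttng-tools name collapses to "ptestresult.lttng-tools.ust/buffers-pid/test_buffers_pid_10" and the curl one to "ptestresult.curl.test_0010"; any other name passes through unchanged.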
On Sat, 2023-02-25 at 09:15 +0000, Richard Purdie via lists.openembedded.org wrote:
> [...]
>
> I did look into the:
>
> ptestresult.glibc-user.debug/tst-fortify-c-default-1
>
> 'regression' and it is because the test was renamed in the new glibc. I
> was therefore thinking a summary of added/removed would be useful in
> but only in these cases. Something along the line of if only tests
> added, just summarise X new added and call it a match. If tests removed
> and added, list and show a count summary (X removed, Y added) and call
> it a regression.
>
> I think I might be tempted to merge this series and then we can change
> the code to improve from here as this is clearly a vast improvement on
> where we were!
> Improvements can be incremental on top of these changes.

This goes a long way to shrinking the report even further. It looks like the curl test reporting needs some work as the IDs look like they change, but this at least makes the issue clearer and the real deltas are becoming much easier to see outside the noise.

diff --git a/scripts/lib/resulttool/regression.py b/scripts/lib/resulttool/regression.py
index 1b0c8335a39..0d8948f012f 100644
--- a/scripts/lib/resulttool/regression.py
+++ b/scripts/lib/resulttool/regression.py
@@ -243,6 +247,21 @@ def regression_common(args, logger, base_results, target_results):
 
     return 0
 
+def fixup_ptest_names(results, logger):
+    for r in results:
+        for i in results[r]:
+            tests = list(results[r][i]['result'].keys())
+            for test in tests:
+                new = None
+                if test.startswith(("ptestresult.lttng-tools.", "ptestresult.babeltrace.", "ptestresult.babeltrace2")) and "_-_" in test:
+                    new = test.split("_-_")[0]
+                elif test.startswith(("ptestresult.curl.")) and "__" in test:
+                    new = test.split("__")[0]
+                if new:
+                    results[r][i]['result'][new] = results[r][i]['result'][test]
+                    del results[r][i]['result'][test]
+
+
 def regression_git(args, logger):
     base_results = {}
     target_results = {}
@@ -304,6 +323,9 @@ def regression_git(args, logger):
         base_results = resultutils.git_get_result(repo, revs[index1][2])
         target_results = resultutils.git_get_result(repo, revs[index2][2])
 
+    fixup_ptest_names(base_results, logger)
+    fixup_ptest_names(target_results, logger)
+
     regression_common(args, logger, base_results, target_results)
 
     return 0
I'll try and stop poking at this, but it is all rather interesting and I think we have spotted our first nasty regression. The quilt ptests really did stop running properly and reporting test results!

Looking at a recent master report:

https://autobuilder.yocto.io/pub/non-release/20230224-14/testresults/testresult-report.txt

you can see the quilt ptest count is still zero, as it was in M2 but not in M1. I'm thinking Ross might have been responsible with:

https://git.yoctoproject.org/poky/commit/?id=61bb4d8e75dfaaf980c32fbe992d34f794b7c537

!

The reason there are python3 and python3-cryptography changes is that there are 35,000 python3 ptests and about 250 changes, so that isn't really unexpected for what was probably a version change. I'd guess python3-cryptography made changes to their test suite. Overall the pass counts increased with no failures, so those overall counts may be good to have in the report too. We might want to log the version of the recipe in the data somewhere so we can show if this was a python3 version change.

Cheers,

Richard
Hello Richard,

As usual, thanks for the prompt feedback!

On 2/25/23 13:32, Richard Purdie wrote:
> [...]
>>> I had a quick play. Firstly, if I try "yocto_testresults_query.py
>>> regression-report 4.2_M1 4.2_M2" in an openembedded-core repository
>>> instead of poky, it breaks. That isn't surprising but we should either
>>> make it work or show a sensible error.

Oh right, I am working in a Poky build configuration, so I had assumed that this would be the only use case. Since the test results commits are tightly coupled to revisions in poky (not oecore), I plan to merely log an error about the revision not being found (and suggest the user check that the repository is poky and not oecore). But please let me know if I am missing a major use case here and a smarter fallback plan (shallow-clone poky if we are running in oecore?) is needed.

>> I think I might be tempted to merge this series and then we can change
>> the code to improve from here as this is clearly a vast improvement on
>> where we were! Improvements can be incremental on top of these changes.

I am in favor of this :) If it is OK for you, I will just re-submit a series with the fix for proper error logging when running the tool from oecore instead of poky.

Next we could introduce all the suggestions you have made, but I feel that with the quickly growing count of "hotfixes" needed to support issues with older test results, and for the sake of maintainability of resulttool and its submodules, those specific hotfixes should be properly isolated (and documented), for example in a "regression_quirks.py" or something like that. What do you think?

Alexis
On Sat, 2023-02-25 at 16:59 +0100, Alexis Lothoré wrote:
> [...]
> Since the test results commits are tightly coupled to revisions in poky (so not
> oecore), I plan to merely log an error about not found revision (and suggesting
> the user to check that the repository is poky and not oecore).
> But please let me know if I miss a major use case here and that a smarter
> fallback plan (shallow-clone poky if we are running in oecore ?) is needed

I'm happy for it just to give a human-readable error; someone can add this functionality if they need/want it.

> > > I think I might be tempted to merge this series and then we can change
> > > the code to improve from here as this is clearly a vast improvement on
> > > where we were! Improvements can be incremental on top of these changes.
>
> I am in favor of this :) If it is OK for you, I will just re-submit a series
> with the fix for the proper error logging when running the tool from oecore
> and not poky.
>
> Next we could introduce all the suggestions you have suggested, but I feel
> that with the quick increase of "hotfixes" count to support issues with older
> test results, and for the sake of maintainability of resulttool and its
> submodules, those specific hotfixes need to be properly isolated (and
> documented), like in a "regression_quirks.py" or something like that. What do
> you think ?

I'm hoping we don't have many of these quirks. We have a huge history at this point, so it would be sad if the tool couldn't work with it. From what I've seen so far, we can manage with the code in the regression module itself. I've tried to add some comments.

I wondered what to do with this series since I needed to get M3 built. Since this series was available and mostly usable, it would be better to have a nicer report this time; it is a good test of the code.

In the end I've merged most of it, along with my two tweaks to handle LTP and the bigger ptest results issue. I couldn't take one set of the selftests since they simply don't work. This will give us a useful real-world test of the M3 report.

I'm working on the assumption you'll send a follow-up series with the tests, the oe-core check and some of the other issues I've mentioned in other emails?

Cheers,

Richard
Hello Richard,

On 2/26/23 13:15, Richard Purdie wrote:
> [...]
>> Since the test results commits are tightly coupled to revisions in poky (so not
>> oecore), I plan to merely log an error about not found revision (and suggesting
>> the user to check that the repository is poky and not oecore).
>> But please let me know if I miss a major use case here and that a smarter
>> fallback plan (shallow-clone poky if we are running in oecore ?) is needed
>
> I'm happy for it just to give a human-readable error; someone can
> add this functionality if they need/want it.

ACK

> I'm hoping we don't have many of these quirks. We have a huge history
> at this point so it would be sad if the tool can't work with it. From
> what I've seen so far, we can manage with the code in the regression
> module itself. I've tried to add some comments.
>
> I wondered what to do with this series since I needed to get M3 built.
> Since this series was available and mostly usable, it would be better
> to have a nicer report this time, it is a good test of the code.
>
> In the end I've merged most of it, along with my two tweaks to handle
> LTP and the bigger ptest results issue. I couldn't take one set of the
> selftests since they simply don't work. This will give us a useful
> realworld test of the M3 report.

Ok, great.
For the selftests failing: my bad, adding proper logging was one of those "one last change before sending" edits, and I obviously forgot to re-run the tests before sending.

> I'm working on the assumption you'll send a follow up series with the
> tests, the oe-core check and some of the other issues I've mentioned in
> other emails?

Absolutely. Besides the tests, the oecore check and the improvements mentioned in this mail thread, the next thing I was keeping in mind is fixing the report generation against the "master-next" branches you mentioned a few weeks ago.

Regards,
On 25 Feb 2023, at 12:44, Richard Purdie via lists.openembedded.org
<richard.purdie=linuxfoundation.org@lists.openembedded.org> wrote:
>
> I'll try and stop poking at this but it is all rather interesting and I
> think we have spotted our first nasty regression. The quilt ptests
> really did stop running properly and reporting test results!
>
> Looking at a recent master report:
>
> https://autobuilder.yocto.io/pub/non-release/20230224-14/testresults/testresult-report.txt
>
> you can see the quilt ptest count is still zero, as it was in M2 but not
> in M1.
>
> I'm thinking Ross might have been responsible with:
>
> https://git.yoctoproject.org/poky/commit/?id=61bb4d8e75dfaaf980c32fbe992d34f794b7c537
>
> ! Grumble.

Yes, I was. Fixes sent.

Cheers,
Ross
On Sun, 2023-02-26 at 16:42 +0100, Alexis Lothoré wrote:
> > I'm hoping we don't have many of these quirks. We have a huge history
> > at this point so it would be sad if the tool couldn't work with it. From
> > what I've seen so far, we can manage with the code in the regression
> > module itself. I've tried to add some comments.
> >
> > I wondered what to do with this series since I needed to get M3 built.
> > Since this series was available and mostly usable, it would be better
> > to have a nicer report this time, and it is a good test of the code.
> >
> > In the end I've merged most of it, along with my two tweaks to handle
> > LTP and the bigger ptest results issue. I couldn't take one set of the
> > selftests since they simply don't work. This will give us a useful
> > real-world test of the M3 report.
>
> Ok, great. For the selftest failing, my bad: adding proper logging was one
> of those "one last change before sending" changes, and obviously I forgot
> to re-run the tests before sending.
>
> > I'm working on the assumption you'll send a follow-up series with the
> > tests, the oe-core check and some of the other issues I've mentioned in
> > other emails?
>
> Absolutely. Besides the tests, the oe-core check and the improvements
> mentioned in this mail thread, the next thing I was keeping in mind is
> fixing the report generation against the "master-next" branches you
> mentioned a few weeks ago.

FWIW, here is the result from M3 rc1:

https://autobuilder.yocto.io/pub/releases/yocto-4.2_M3.rc1/testresults/testresult-regressions-report.txt

It looks like a couple of runs were mismatched, so the LTP issue isn't quite
resolved with my tweak. All in all it is definitely a lot more readable than
it was, though, so progress :)

Looks like dbus may need the same fix as curl.

Cheers,

Richard
From: Alexis Lothoré <alexis.lothore@bootlin.com>

Hello,
this new series is the follow-up of [1] to make regression reports more
meaningful, by reducing noise and false positives.

Changes since v2:
- add filtering on the MACHINE field from the test results configuration: the
  MACHINE should always match
- add a "metadata guessing" mechanism based on Richard's proposal ([2]). Until
  this series is merged, test results stored in git are not enriched with
  OESELFTEST_METADATA. To allow proper test comparison even with those older
  results, try to guess which oe-selftest command line was used to run the
  corresponding tests, and generate OESELFTEST_METADATA accordingly
- add a new tool to ease test results usage: yocto_testresults_query. For now
  the tool only manages regression reports and is a thin layer between
  send-qa-email (in yocto-autobuilder-helper) and resulttool. Its main role is
  to translate regression report arguments (which are tags or branches) into
  fixed revisions and to call resulttool accordingly. Most of its code is
  transferred from send-qa-email (another series for the autobuilder will
  follow this one to make send-qa-email use this new helper, but the current
  series works independently).
  Example: "yocto_testresults_query.py regression-report 4.2_M1 4.2_M2" will
  replay the regression report generated when 4.2_M2 was produced.

Changes since v1:
- properly configure the "From" field in the series

With those improvements, the regression report is significantly reduced and
some useful data starts to emerge from the removed noise:
- with the MACHINE filtering, the 4.2_M2 report goes from 5.5GB to 627MB
- with the OESELFTEST_METADATA enrichment + metadata guessing for older tests,
  the report goes from 627MB to 1.5MB

After manual inspection of some entries, the remaining oe-selftest regressions
raised in the report seem valid.
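The matching rule described above could be sketched as follows; this is an illustration of the idea, not the code in scripts/lib/resulttool/regression.py, and the helper names (effective_metadata, can_be_compared) plus the placeholder guess are assumptions:

```python
# Sketch: two result-set configurations are only compared when their MACHINE
# matches AND their OESELFTEST_METADATA matches; results predating the series
# lack the metadata, so a guessed value is substituted for them.
def guess_oeselftest_metadata(cfg):
    # Placeholder for the real heuristic, which reconstructs the likely
    # oe-selftest command line from the stored results.
    return {"TEST_TYPE": "oeselftest"}

def effective_metadata(cfg):
    # Use the recorded metadata when present, otherwise fall back to a guess.
    return cfg.get("OESELFTEST_METADATA") or guess_oeselftest_metadata(cfg)

def can_be_compared(base_cfg, target_cfg):
    # The MACHINE should always match ...
    if base_cfg.get("MACHINE") != target_cfg.get("MACHINE"):
        return False
    # ... and so should the (possibly guessed) oe-selftest metadata.
    return effective_metadata(base_cfg) == effective_metadata(target_cfg)
```

Filtering pairs of result sets through such a predicate before diffing them is what shrinks the report: results from different machines or different oe-selftest invocations are simply never compared.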
There are still some issues to tackle:
- it seems that one major remaining source of noise is now in the "runtime"
  tests (comparisons against tests not run in the "target" results)
- when a ptest managed by oe-selftest fails, I guess the remaining tests are
  not run, so when one failure is logged we get many "PASSED->None"
  transitions in the regression report; we should probably silence those
- some transitions appear as regressions while they are in fact improvements
  (e.g. "UNRESOLVED->PASSED")

[1] https://lore.kernel.org/openembedded-core/20230214165309.63527-1-alexis.lothore@bootlin.com/
[2] https://lore.kernel.org/openembedded-core/124b9c9667b038b8502f6457ba7d894fc4ef3c58.camel@linuxfoundation.org/

Alexis Lothoré (6):
  scripts/oe-selftest: append metadata to tests results
  scripts/resulttool/regression: remove unused import
  scripts/resulttool/regression: add metadata filtering for oeselftest
  oeqa/selftest/resulttool: add test for metadata filtering on regression
  scripts: add new helper for regression report generation
  oeqa/selftest: add test for yocto_testresults_query.py

 .../oeqa/selftest/cases/resulttooltests.py    | 137 +++++++++++++++
 .../cases/yoctotestresultsquerytests.py       |  39 +++++
 meta/lib/oeqa/selftest/context.py             |  15 +-
 scripts/lib/resulttool/regression.py          | 163 +++++++++++++++++-
 scripts/yocto_testresults_query.py            | 106 ++++++++++++
 5 files changed, 458 insertions(+), 2 deletions(-)
 create mode 100644 meta/lib/oeqa/selftest/cases/yoctotestresultsquerytests.py
 create mode 100755 scripts/yocto_testresults_query.py
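The last two noise sources listed above could be silenced along these lines; this is only a possible shape (names and status strings are assumptions, not resulttool's actual data model): drop improvement transitions entirely, and collapse the mass of "PASSED->None" transitions that follow a single ptest failure into one summary entry.

```python
# Sketch: post-process a list of (test_name, old_status, new_status)
# transitions before they reach the regression report.
IMPROVEMENTS = {("UNRESOLVED", "PASSED"), ("FAILED", "PASSED")}

def summarize_transitions(transitions):
    kept = []
    vanished = 0
    for name, old, new in transitions:
        if (old, new) in IMPROVEMENTS:
            continue  # an improvement, not a regression: drop it
        if old == "PASSED" and new == "None":
            # Likely collateral of an earlier ptest failure aborting the run;
            # count it instead of listing every test individually.
            vanished += 1
            continue
        kept.append((name, old, new))
    if vanished:
        kept.append((f"{vanished} tests", "PASSED", "None (collapsed)"))
    return kept
```

With something like this, a ptest run that aborts after its first failure would contribute one real regression line plus one "N tests: PASSED->None" summary, instead of hundreds of entries.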