recipetool/create: Scan for SDPX-License-Identifier

Message ID 20220203170724.1319808-1-saul.wold@windriver.com
State New
Series recipetool/create: Scan for SDPX-License-Identifier

Commit Message

Saul Wold Feb. 3, 2022, 5:07 p.m. UTC
When a license file cannot be identified by checksum and it contains an
SPDX-License-Identifier tag, use that tag as the found license.

[YOCTO #14529]

Tested with LICENSE files that contain 1 or more SPDX-License-Identifier tags

Signed-off-by: Saul Wold <saul.wold@windriver.com>
---
 scripts/lib/recipetool/create.py | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

Comments

Richard Purdie Feb. 3, 2022, 9:24 p.m. UTC | #1
On Thu, 2022-02-03 at 09:07 -0800, Saul Wold wrote:
> When a file can not be identified by checksum and they contain an SPDX
> License-Identifier tag, use it as the found license.
> 
> [YOCTO #14529]
> 
> Tested with LICENSE files that contain 1 or more SPDX-License-Identifier tags
> 
> Signed-off-by: Saul Wold <saul.wold@windriver.com>
> ---
>  scripts/lib/recipetool/create.py | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/scripts/lib/recipetool/create.py b/scripts/lib/recipetool/create.py
> index 507a230511..9149c2d94f 100644
> --- a/scripts/lib/recipetool/create.py
> +++ b/scripts/lib/recipetool/create.py
> @@ -1221,14 +1221,20 @@ def guess_license(srctree, d):
>      for licfile in sorted(licfiles):
>          md5value = bb.utils.md5_file(licfile)
>          license = md5sums.get(md5value, None)
> +        license_list = []
>          if not license:
>              license, crunched_md5, lictext = crunch_license(licfile)
>              if lictext and not license:
> -                license = 'Unknown'
> -                logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
> -                    "and replace `Unknown` with the license:\n" \
> -                    "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
> -        if license:
> +                spdx_re = re.compile('SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
> +                license_list = re.findall(spdx_re, "\n".join(lictext))
> +                if not license_list:
> +                    license_list.append('Unknown')
> +                    logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
> +                        "and replace `Unknown` with the license:\n" \
> +                        "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
> +        else:
> +            license_list.append(license)
> +        for license in license_list:
>              licenses.append((license, os.path.relpath(licfile, srctree), md5value))
>  
>      # FIXME should we grab at least one source file with a license header and add that too?

I think to close this bug the code may need to go one step further and
effectively grep over the source tree. 

We'd probably want to list the value of any SPDX-License-Identifier: header
found in any of the source files for the user to then decide upon?

Or am I misunderstanding?

Cheers,

Richard
Saul Wold Feb. 3, 2022, 9:58 p.m. UTC | #2
On 2/3/22 13:24, Richard Purdie wrote:
> On Thu, 2022-02-03 at 09:07 -0800, Saul Wold wrote:
>> When a file can not be identified by checksum and they contain an SPDX
>> License-Identifier tag, use it as the found license.
>>
>> [YOCTO #14529]
>>
>> Tested with LICENSE files that contain 1 or more SPDX-License-Identifier tags
>>
>> Signed-off-by: Saul Wold <saul.wold@windriver.com>
>> ---
>>   scripts/lib/recipetool/create.py | 16 +++++++++++-----
>>   1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/scripts/lib/recipetool/create.py b/scripts/lib/recipetool/create.py
>> index 507a230511..9149c2d94f 100644
>> --- a/scripts/lib/recipetool/create.py
>> +++ b/scripts/lib/recipetool/create.py
>> @@ -1221,14 +1221,20 @@ def guess_license(srctree, d):
>>       for licfile in sorted(licfiles):
>>           md5value = bb.utils.md5_file(licfile)
>>           license = md5sums.get(md5value, None)
>> +        license_list = []
>>           if not license:
>>               license, crunched_md5, lictext = crunch_license(licfile)
>>               if lictext and not license:
>> -                license = 'Unknown'
>> -                logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
>> -                    "and replace `Unknown` with the license:\n" \
>> -                    "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
>> -        if license:
>> +                spdx_re = re.compile('SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
>> +                license_list = re.findall(spdx_re, "\n".join(lictext))
>> +                if not license_list:
>> +                    license_list.append('Unknown')
>> +                    logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
>> +                        "and replace `Unknown` with the license:\n" \
>> +                        "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
>> +        else:
>> +            license_list.append(license)
>> +        for license in license_list:
>>               licenses.append((license, os.path.relpath(licfile, srctree), md5value))
>>   
>>       # FIXME should we grab at least one source file with a license header and add that too?
> 
> I think to close this bug the code may need to go one step further and
> effectively grep over the source tree.
> 
> We'd probably want to list the value of any SPDX-License-Identifier: header
> found in any of the source files for the user to then decide upon?
> 
That's moving into create-spdx.bbclass territory, I think. The 
change would need to be much larger, and I will likely have to shelve 
it for a while.

> Or am I misunderstanding?
>
Maybe it's my misunderstanding; Tim mentioned the LICENSE-related 
files in the bug report.

Sau!


> Cheers,
> 
> Richard
Richard Purdie Feb. 3, 2022, 10:01 p.m. UTC | #3
On Thu, 2022-02-03 at 13:58 -0800, Saul Wold wrote:
> 
> On 2/3/22 13:24, Richard Purdie wrote:
> > On Thu, 2022-02-03 at 09:07 -0800, Saul Wold wrote:
> > > When a file can not be identified by checksum and they contain an SPDX
> > > License-Identifier tag, use it as the found license.
> > > 
> > > [YOCTO #14529]
> > > 
> > > Tested with LICENSE files that contain 1 or more SPDX-License-Identifier tags
> > > 
> > > Signed-off-by: Saul Wold <saul.wold@windriver.com>
> > > ---
> > >   scripts/lib/recipetool/create.py | 16 +++++++++++-----
> > >   1 file changed, 11 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/scripts/lib/recipetool/create.py b/scripts/lib/recipetool/create.py
> > > index 507a230511..9149c2d94f 100644
> > > --- a/scripts/lib/recipetool/create.py
> > > +++ b/scripts/lib/recipetool/create.py
> > > @@ -1221,14 +1221,20 @@ def guess_license(srctree, d):
> > >       for licfile in sorted(licfiles):
> > >           md5value = bb.utils.md5_file(licfile)
> > >           license = md5sums.get(md5value, None)
> > > +        license_list = []
> > >           if not license:
> > >               license, crunched_md5, lictext = crunch_license(licfile)
> > >               if lictext and not license:
> > > -                license = 'Unknown'
> > > -                logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
> > > -                    "and replace `Unknown` with the license:\n" \
> > > -                    "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
> > > -        if license:
> > > +                spdx_re = re.compile('SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
> > > +                license_list = re.findall(spdx_re, "\n".join(lictext))
> > > +                if not license_list:
> > > +                    license_list.append('Unknown')
> > > +                    logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
> > > +                        "and replace `Unknown` with the license:\n" \
> > > +                        "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
> > > +        else:
> > > +            license_list.append(license)
> > > +        for license in license_list:
> > >               licenses.append((license, os.path.relpath(licfile, srctree), md5value))
> > >   
> > >       # FIXME should we grab at least one source file with a license header and add that too?
> > 
> > I think to close this bug the code may need to go one step further and
> > effectively grep over the source tree.
> > 
> > We'd probably want to list the value of any SPDX-License-Identifier: header
> > found in any of the source files for the user to then decide upon?
> > 
> That's moving in to the create-spdx.bbclass territory I think. The 
> change would need to be much larger. and I will likely have to shelve 
> for a while.

This isn't related to create-spdx.

> 
> > Or am I misunderstanding?
> > 
> Maybe it's my misunderstanding, Tim has mentioned the LICENSE related 
> files in the bug report.

Right, we want to "guess" what the right LICENSE is for the new recipe. To do
that wouldn't we scan all the source for SPDX-License-Identifier: lines in the
headers, add those all together and suggest that as the LICENSE field?
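
A minimal sketch of that idea, assuming a standalone helper outside
recipetool. The names scan_spdx_identifiers and suggest_license, the
20-line header window and the " & " join are all assumptions made for
illustration, not existing code:

```python
import os
import re

# Hypothetical helper for illustration; not an existing recipetool function.
SPDX_TAG_RE = re.compile(r'SPDX-License-Identifier:[ \t]*([^\r\n]+)')

def scan_spdx_identifiers(srctree, max_lines=20):
    """Map each SPDX expression to the source files whose header carries it."""
    found = {}
    for root, dirs, files in os.walk(srctree):
        # Do not descend into version-control metadata.
        dirs[:] = [d for d in dirs if d not in ('.git', '.svn', '.hg')]
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                with open(path, 'r', errors='ignore') as f:
                    for lineno, line in enumerate(f):
                        if lineno >= max_lines:
                            break  # only look at the file header
                        m = SPDX_TAG_RE.search(line)
                        if m:
                            # Drop trailing whitespace and a closing "*/".
                            expr = m.group(1).rstrip(' \t*/')
                            if expr:
                                found.setdefault(expr, []).append(
                                    os.path.relpath(path, srctree))
                            break
            except OSError:
                continue  # unreadable file, skip it
    return found

def suggest_license(found):
    """Join the distinct expressions with ' & ' (assumed: all of them apply)."""
    return ' & '.join(sorted(found))
```

The result would only be a suggestion for the user to review. Expressions
that already contain OR or WITH are passed through untouched here, which
is probably too naive for a real LICENSE value.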

Cheers,

Richard
Stefan Herbrechtsmeier Feb. 4, 2022, 8:11 a.m. UTC | #4
Hi Saul,

Am 03.02.2022 um 18:07 schrieb Saul Wold via lists.openembedded.org:
> When a file can not be identified by checksum and they contain an SPDX
> License-Identifier tag, use it as the found license.
> 
> [YOCTO #14529]
> 
> Tested with LICENSE files that contain 1 or more SPDX-License-Identifier tags

Can you please give an example of a project which uses an 
SPDX-License-Identifier inside a license file?


> Signed-off-by: Saul Wold <saul.wold@windriver.com>
> ---
>   scripts/lib/recipetool/create.py | 16 +++++++++++-----
>   1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/scripts/lib/recipetool/create.py b/scripts/lib/recipetool/create.py
> index 507a230511..9149c2d94f 100644
> --- a/scripts/lib/recipetool/create.py
> +++ b/scripts/lib/recipetool/create.py
> @@ -1221,14 +1221,20 @@ def guess_license(srctree, d):
>       for licfile in sorted(licfiles):
>           md5value = bb.utils.md5_file(licfile)
>           license = md5sums.get(md5value, None)
> +        license_list = []

Could you please use another name? We already have licenses, and it is 
hard to distinguish between licenses and license_list.

>           if not license:
>               license, crunched_md5, lictext = crunch_license(licfile)
>               if lictext and not license:
> -                license = 'Unknown'
> -                logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
> -                    "and replace `Unknown` with the license:\n" \
> -                    "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
> -        if license:
> +                spdx_re = re.compile('SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
> +                license_list = re.findall(spdx_re, "\n".join(lictext))
> +                if not license_list:
> +                    license_list.append('Unknown')
> +                    logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
> +                        "and replace `Unknown` with the license:\n" \
> +                        "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
> +        else:
> +            license_list.append(license)
> +        for license in license_list:
>               licenses.append((license, os.path.relpath(licfile, srctree), md5value))
>   
>       # FIXME should we grab at least one source file with a license header and add that too?

Regards
   Stefan
Stefan Herbrechtsmeier Feb. 4, 2022, 9:05 a.m. UTC | #5
Hi Richard,

Am 03.02.2022 um 22:24 schrieb Richard Purdie via lists.openembedded.org:
> On Thu, 2022-02-03 at 09:07 -0800, Saul Wold wrote:
>> When a file can not be identified by checksum and they contain an SPDX
>> License-Identifier tag, use it as the found license.
>>
>> [YOCTO #14529]
>>
>> Tested with LICENSE files that contain 1 or more SPDX-License-Identifier tags
>>
>> Signed-off-by: Saul Wold <saul.wold@windriver.com>
>> ---
>>   scripts/lib/recipetool/create.py | 16 +++++++++++-----
>>   1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/scripts/lib/recipetool/create.py b/scripts/lib/recipetool/create.py
>> index 507a230511..9149c2d94f 100644
>> --- a/scripts/lib/recipetool/create.py
>> +++ b/scripts/lib/recipetool/create.py
>> @@ -1221,14 +1221,20 @@ def guess_license(srctree, d):
>>       for licfile in sorted(licfiles):
>>           md5value = bb.utils.md5_file(licfile)
>>           license = md5sums.get(md5value, None)
>> +        license_list = []
>>           if not license:
>>               license, crunched_md5, lictext = crunch_license(licfile)
>>               if lictext and not license:
>> -                license = 'Unknown'
>> -                logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
>> -                    "and replace `Unknown` with the license:\n" \
>> -                    "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
>> -        if license:
>> +                spdx_re = re.compile('SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
>> +                license_list = re.findall(spdx_re, "\n".join(lictext))
>> +                if not license_list:
>> +                    license_list.append('Unknown')
>> +                    logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
>> +                        "and replace `Unknown` with the license:\n" \
>> +                        "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
>> +        else:
>> +            license_list.append(license)
>> +        for license in license_list:
>>               licenses.append((license, os.path.relpath(licfile, srctree), md5value))
>>   
>>       # FIXME should we grab at least one source file with a license header and add that too?
> 
> I think to close this bug the code may need to go one step further and
> effectively grep over the source tree.

Please keep in mind that we need the full license text, and not only the 
license name, for license compliance. The current function only searches 
for license files with license text.

> We'd probably want to list the value of any SPDX-License-Identifier: header
> found in any of the source files for the user to then decide upon?

I think this is a separate feature, more like a license checker, because 
if you have an SPDX-License-Identifier without a license text you have a 
license violation.

This brings us to the problem that this code will interpret a file with 
only an SPDX-License-Identifier as a license file with license text.

Regards
   Stefan
Richard Purdie Feb. 4, 2022, 1:41 p.m. UTC | #6
On Fri, 2022-02-04 at 10:05 +0100, Stefan Herbrechtsmeier wrote:
> Hi Richard,
> 
> Am 03.02.2022 um 22:24 schrieb Richard Purdie via lists.openembedded.org:
> > On Thu, 2022-02-03 at 09:07 -0800, Saul Wold wrote:
> > > When a file can not be identified by checksum and they contain an SPDX
> > > License-Identifier tag, use it as the found license.
> > > 
> > > [YOCTO #14529]
> > > 
> > > Tested with LICENSE files that contain 1 or more SPDX-License-Identifier tags
> > > 
> > > Signed-off-by: Saul Wold <saul.wold@windriver.com>
> > > ---
> > >   scripts/lib/recipetool/create.py | 16 +++++++++++-----
> > >   1 file changed, 11 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/scripts/lib/recipetool/create.py b/scripts/lib/recipetool/create.py
> > > index 507a230511..9149c2d94f 100644
> > > --- a/scripts/lib/recipetool/create.py
> > > +++ b/scripts/lib/recipetool/create.py
> > > @@ -1221,14 +1221,20 @@ def guess_license(srctree, d):
> > >       for licfile in sorted(licfiles):
> > >           md5value = bb.utils.md5_file(licfile)
> > >           license = md5sums.get(md5value, None)
> > > +        license_list = []
> > >           if not license:
> > >               license, crunched_md5, lictext = crunch_license(licfile)
> > >               if lictext and not license:
> > > -                license = 'Unknown'
> > > -                logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
> > > -                    "and replace `Unknown` with the license:\n" \
> > > -                    "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
> > > -        if license:
> > > +                spdx_re = re.compile('SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
> > > +                license_list = re.findall(spdx_re, "\n".join(lictext))
> > > +                if not license_list:
> > > +                    license_list.append('Unknown')
> > > +                    logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
> > > +                        "and replace `Unknown` with the license:\n" \
> > > +                        "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
> > > +        else:
> > > +            license_list.append(license)
> > > +        for license in license_list:
> > >               licenses.append((license, os.path.relpath(licfile, srctree), md5value))
> > >   
> > >       # FIXME should we grab at least one source file with a license header and add that too?
> > 
> > I think to close this bug the code may need to go one step further and
> > effectively grep over the source tree.
> 
> Please keep in mind that we need a full license text and not only the 
> license name for license compliance. The current function only search 
> for license files with license text.
> 
> > We'd probably want to list the value of any SPDX-License-Identifier: header
> > found in any of the source files for the user to then decide upon?
> 
> I think this is an other feature like a license checker because if you 
> have a SPDX-License-Identifier without a license text you have a license 
> violation.
> 
> This brings us to the problem that this code will interpret a file with 
> only a SPDX-License-Identifier as a license file with license text.

As I understand it the tool is there to help write a recipe so filling out
LICENSE and highlighting a missing full license text would be a valid approach
for the tool and helpful to the user?

It certainly isn't intended as full validation, just intended to assist the
creation of a recipe.

Cheers,

Richard
Stefan Herbrechtsmeier Feb. 4, 2022, 2:40 p.m. UTC | #7
Am 04.02.2022 um 14:41 schrieb Richard Purdie:
> On Fri, 2022-02-04 at 10:05 +0100, Stefan Herbrechtsmeier wrote:
>> Am 03.02.2022 um 22:24 schrieb Richard Purdie via lists.openembedded.org:
>>> On Thu, 2022-02-03 at 09:07 -0800, Saul Wold wrote:
>>>> When a file can not be identified by checksum and they contain an SPDX
>>>> License-Identifier tag, use it as the found license.
>>>>
>>>> [YOCTO #14529]
>>>>
>>>> Tested with LICENSE files that contain 1 or more SPDX-License-Identifier tags
>>>>
>>>> Signed-off-by: Saul Wold <saul.wold@windriver.com>
>>>> ---
>>>>    scripts/lib/recipetool/create.py | 16 +++++++++++-----
>>>>    1 file changed, 11 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/scripts/lib/recipetool/create.py b/scripts/lib/recipetool/create.py
>>>> index 507a230511..9149c2d94f 100644
>>>> --- a/scripts/lib/recipetool/create.py
>>>> +++ b/scripts/lib/recipetool/create.py
>>>> @@ -1221,14 +1221,20 @@ def guess_license(srctree, d):
>>>>        for licfile in sorted(licfiles):
>>>>            md5value = bb.utils.md5_file(licfile)
>>>>            license = md5sums.get(md5value, None)
>>>> +        license_list = []
>>>>            if not license:
>>>>                license, crunched_md5, lictext = crunch_license(licfile)
>>>>                if lictext and not license:
>>>> -                license = 'Unknown'
>>>> -                logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
>>>> -                    "and replace `Unknown` with the license:\n" \
>>>> -                    "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
>>>> -        if license:
>>>> +                spdx_re = re.compile('SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
>>>> +                license_list = re.findall(spdx_re, "\n".join(lictext))
>>>> +                if not license_list:
>>>> +                    license_list.append('Unknown')
>>>> +                    logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
>>>> +                        "and replace `Unknown` with the license:\n" \
>>>> +                        "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
>>>> +        else:
>>>> +            license_list.append(license)
>>>> +        for license in license_list:
>>>>                licenses.append((license, os.path.relpath(licfile, srctree), md5value))
>>>>    
>>>>        # FIXME should we grab at least one source file with a license header and add that too?
>>>
>>> I think to close this bug the code may need to go one step further and
>>> effectively grep over the source tree.
>>
>> Please keep in mind that we need a full license text and not only the
>> license name for license compliance. The current function only search
>> for license files with license text.
>>
>>> We'd probably want to list the value of any SPDX-License-Identifier: header
>>> found in any of the source files for the user to then decide upon?
>>
>> I think this is an other feature like a license checker because if you
>> have a SPDX-License-Identifier without a license text you have a license
>> violation.
>>
>> This brings us to the problem that this code will interpret a file with
>> only a SPDX-License-Identifier as a license file with license text.
> 
> As I understand it the tool is there to help write a recipe so filling out
> LICENSE and highlighting a missing full license text would be a valid approach
> for the tool and helpful to the user?

Yes, but we should distinguish between license files, which are guessed 
via a hash of their content, and an SPDX-License-Identifier, which labels 
the source code's license. In this case the SPDX-License-Identifier is 
non-material text in a license file and should be filtered out inside the 
crunch_license function.
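
A rough sketch of that filtering step, assuming the license text is
available as a list of lines. strip_spdx_tags and md5_of_license_text are
hypothetical names and only loosely mirror what crunch_license does:

```python
import hashlib

SPDX_TAG = 'SPDX-License-Identifier:'

def strip_spdx_tags(lines):
    """Drop lines that carry an SPDX tag; they are not license text."""
    return [line for line in lines if SPDX_TAG not in line]

def md5_of_license_text(lines):
    """Hash what remains, so a tag line does not change the checksum."""
    text = '\n'.join(line.strip() for line in strip_spdx_tags(lines))
    return hashlib.md5(text.encode('utf-8')).hexdigest()
```

With this, a file that holds nothing but the tag ends up with no text at
all and would not be treated as a license file.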

Collecting all used licenses via SPDX-License-Identifier is an 
additional feature, and we need a warning if an SPDX-License-Identifier 
exists without a license file.
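
Such a warning could look roughly like the following. This is only a
sketch: find_ids_without_text and the logger name are made up, and it
assumes both lists use the same license naming:

```python
import logging

logger = logging.getLogger('recipetool-sketch')  # hypothetical logger name

def find_ids_without_text(source_ids, licfile_licenses):
    """Return identifiers seen in sources that no license file accounts for."""
    missing = sorted(set(source_ids) - set(licfile_licenses))
    for ident in missing:
        logger.warning("SPDX-License-Identifier '%s' found in sources but no "
                       "matching license file was identified", ident)
    return missing
```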

> It certainly isn't intended as full validation, just intended to assist the
> creation of a recipe.

But this patch is a regression because it doesn't distinguish between a 
license file with a known hash and a mostly empty file with an 
SPDX-License-Identifier.

Regards
   Stefan

Patch

diff --git a/scripts/lib/recipetool/create.py b/scripts/lib/recipetool/create.py
index 507a230511..9149c2d94f 100644
--- a/scripts/lib/recipetool/create.py
+++ b/scripts/lib/recipetool/create.py
@@ -1221,14 +1221,20 @@  def guess_license(srctree, d):
     for licfile in sorted(licfiles):
         md5value = bb.utils.md5_file(licfile)
         license = md5sums.get(md5value, None)
+        license_list = []
         if not license:
             license, crunched_md5, lictext = crunch_license(licfile)
             if lictext and not license:
-                license = 'Unknown'
-                logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
-                    "and replace `Unknown` with the license:\n" \
-                    "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
-        if license:
+                spdx_re = re.compile('SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
+                license_list = re.findall(spdx_re, "\n".join(lictext))
+                if not license_list:
+                    license_list.append('Unknown')
+                    logger.info("Please add the following line for '%s' to a 'lib/recipetool/licenses.csv' " \
+                        "and replace `Unknown` with the license:\n" \
+                        "%s,Unknown" % (os.path.relpath(licfile, srctree), md5value))
+        else:
+            license_list.append(license)
+        for license in license_list:
             licenses.append((license, os.path.relpath(licfile, srctree), md5value))
 
     # FIXME should we grab at least one source file with a license header and add that too?
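
Note on the hunk above: read by indentation, the new `else:` appears to
pair with the outer `if not license:`, so a license that crunch_license()
does recognize would leave license_list empty and nothing would be
appended for that file. The pattern is also a non-raw string and stops at
characters such as `+` or parentheses. The sketch below shows one way the
per-file decision could be written. licenses_for_file is a hypothetical
stand-in for the loop body, not the code that was posted or merged:

```python
import re

# Raw string; '+' and parentheses are accepted so that expressions such as
# "GPL-2.0+" or "(MIT OR Apache-2.0)" are captured whole.
SPDX_RE = re.compile(
    r'SPDX-License-Identifier:[ \t]*([-\w.+()]+(?:[ \t]+[-\w.+()]+)*)')

def licenses_for_file(md5_license, crunched_license, lictext):
    """Hypothetical stand-in for the per-file logic in guess_license()."""
    if md5_license:
        return [md5_license]        # identified by the md5 table
    if crunched_license:
        return [crunched_license]   # identified after crunching
    if lictext:
        # Fall back to SPDX tags, then to 'Unknown' as before the patch.
        return SPDX_RE.findall('\n'.join(lictext)) or ['Unknown']
    return []
```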