[V3] overview-manual: document hash equivalence

Message ID 20220107190840.784216-1-michael.opdenacker@bootlin.com
State New
Series [V3] overview-manual: document hash equivalence

Commit Message

Michael Opdenacker Jan. 7, 2022, 7:08 p.m. UTC
Signed-off-by: Michael Opdenacker <michael.opdenacker@bootlin.com>
---
 documentation/overview-manual/concepts.rst | 126 +++++++++++++++++++++
 1 file changed, 126 insertions(+)

Comments

Ulrich Ölmann Jan. 10, 2022, 6:29 a.m. UTC | #1
Hi Michael,

a good summary of hash equivalence! Perhaps you could add somewhere that
reproducibility is a key ingredient for the stability of the task's
output hash, hence the whole mechanism's efficiency strongly depends on
reproducibility.

Please find some minor fixes further down.

On Fri, Jan 07 2022 at 20:08 +0100, "Michael Opdenacker" <michael.opdenacker@bootlin.com> wrote:
> Signed-off-by: Michael Opdenacker <michael.opdenacker@bootlin.com>
> ---
>  documentation/overview-manual/concepts.rst | 126 +++++++++++++++++++++
>  1 file changed, 126 insertions(+)
>
> diff --git a/documentation/overview-manual/concepts.rst b/documentation/overview-manual/concepts.rst
> index 6f8a3def69..781ba1b070 100644
> --- a/documentation/overview-manual/concepts.rst
> +++ b/documentation/overview-manual/concepts.rst
> @@ -1938,6 +1938,132 @@ another reason why a task-based approach is preferred over a
>  recipe-based approach, which would have to install the output from every
>  task.
>  
> +Hash Equivalence
> +----------------
> +
> +The above section explained how BitBake skips the execution of tasks
> +which output can already be found in the Shared State cache.

s/which/whose/

> +
> +During a build, it may often be the case that the output / result of a task might
> +be unchanged despite changes in the task's input values. An example might be
> +whitespace changes in some input C code. In project terms, this is what we define
> +as "equivalence".
> +
> +To keep track of such equivalence, BitBake has to manage three hashes
> +for each task:
> +
> +- The *task hash* explained earlier: computed from the recipe metadata,
> +  the task code and the task hash values from its dependencies.
> +  When changes are made, these task hashes are therefore modified,
> +  causing the task to re-execute. The task hashes of tasks depending on this
> +  task are therefore modified too, causing the whole dependency
> +  chain to re-execute.
> +
> +- The *output hash*, a new hash computed from the output of Shared State tasks,
> +  tasks that save their resulting output to a Shared State tarball.
> +  The mapping between the task hash and its output hash is reported
> +  to a new *Hash Equivalence* server. This mapping is stored in a database
> +  by the server for future reference.
> +
> +- The *unihash*, a new hash, initially set to the task hash for the task.
> +  This is used to track the *unicity* of task output, and we will explain
> +  how its value is maintained.
> +
> +When Hash Equivalence is enabled, BitBake computes the task hash
> +for each task by using the unihash of its dependencies, instead
> +of their task hash.
> +
> +Now, imagine that a Shared State task is modified because of a change in
> +its code or metadata, or because of a change in its dependencies.
> +Since this modifies its task hash, this task will need re-executing.

s/re-executing/re-execution/

> +Its output hash will therefore be computed again.
> +
> +Then, the new mapping between the new task hash and its output hash
> +will be reported to the Hash Equivalence server. The server will
> +let BitBake know whether this output hash is the same as a previously
> +reported output hash, for a different task hash.
> +
> +If the output hash is already known, BitBake will update the task's
> +unihash to match the original task hash that generated that output.
> +Thanks to this, the depending tasks will keep a previously recorded
> +task hash, and BitBake will be able to retrieve their output from
> +the Shared State cache, instead of re-executing them. Similarly, the
> +output of further downstream tasks can also be retrieved from Shared
> +Shate.
> +
> +If the output hash is unknown, a new entry will be created on the Hash
> +Equivalence server, matching the task hash to that output.
> +The depending tasks, still having a new task hash because of the
> +change, will need to re-execute as expected. The change propagates
> +to the depending tasks.
> +
> +To summarize, when Hash Equivalence is enabled, a change in one of the
> +tasks in BitBake's run queue doesn't have to propagate to all the
> +downstream tasks that depend on the output of this task, causing a
> +full rebuild of such tasks, and so on with the next depending tasks.
> +Instead, when the output of this task remains identical to previously
> +recorded output, BitBake can safely retrieve all the downstream
> +task output from the Shared State cache.
> +
> +This applies to multiple scenarios:
> +
> +-  A "trivial" change to a recipe that doesn't impact its generated output,
> +   such as whitespace changes, modifications to unused code paths or
> +   in the ordering of variables.
> +
> +-  Shared library updates, for example to fix a security vulnerability.
> +   For sure, the programs using such a library should be rebuilt, but
> +   their new binaries should remain identical. The corresponding tasks should
> +   have a different output hash because of the change in the hash of their
> +   library dependency, but thanks to their output being identical, Hash
> +   Equivalence will stop the propagation down the dependency chain.
> +
> +-  Native tool updates. Though the depending tasks should be rebuilt,
> +   it's likely that they will generate the same output and be marked
> +   as equivalent.
> +
> +This mechanism is enabled by default in Poky, and is controlled by three
> +variables:
> +
> +-  :term:`bitbake:BB_HASHSERVE`, specifying a local or remote Hash
> +   Equivalence server to use.
> +
> +-  :term:`BB_HASHSERVE_UPSTREAM`, when ``BB_HASHSERVE = "auto"``,
> +   allowing to connect the local server to an upstream one.
> +
> +-  :term:`bitbake:BB_SIGNATURE_HANDLER`, which must be set  to ``OEEquivHash``.
> +
> +Therefore, the default configuration in Poky corresponds to the
> +below settings::
> +
> +   BB_HASHSERVE = "auto"
> +   BB_SIGNATURE_HANDLER = "OEEquivHash"
> +
> +Rather than starting a local server, another possibility is to rely
> +on a Hash Equivalence server on a network, by setting::
> +
> +   BB_HASHSERVE = "<HOSTNAME>:<PORT>"
> +
> +.. note::
> +
> +   The shared Hash Equivalence server needs to be maintained together with the
> +   Share State cache. Otherwise, the server could report Shared State hashes

s/Share State cache/Shared State cache/

Best regards
Ulrich


> +   that only exist on specific clients.
> +
> +   We therefore recommend that one Hash Equivalence server be set up to
> +   correspond with a given Shared State cache, and to start this server
> +   in *read-only mode*, so that it doesn't store equivalences for
> +   Shared State caches that are local to clients.
> +
> +   See the :term:`BB_HASHSERVE` reference for details about starting
> +   a Hash Equivalence server.
> +
> +See the `video <https://www.youtube.com/watch?v=zXEdqGS62Wc>`__
> +of Joshua Watt's `Hash Equivalence and Reproducible Builds
> +<https://elinux.org/images/3/37/Hash_Equivalence_and_Reproducible_Builds.pdf>`__
> +presentation at ELC 2020 for a very synthetic introduction to the
> +Hash Equivalence implementation in the Yocto Project.
> +
>  Automatically Added Runtime Dependencies
>  ========================================
Ulrich Ölmann Jan. 10, 2022, 11:25 a.m. UTC | #2
Hi Michael,

On Mon, Jan 10 2022 at 11:49 +0100, Michael Opdenacker <michael.opdenacker@bootlin.com> wrote:
> Hi Ulrich,
>
> Many thanks for the review!
>
> On 1/10/22 7:29 AM, Ulrich Ölmann wrote:
>> Hi Michael,
>>
>> a good summary of hash equivalence! Perhaps you could add somewhere that
>> reproducibility is a key ingredient for the stability of the task's
>> output hash, hence the whole mechanism's efficiency strongly depends on
>> reproducibility.
>
>
> This is a very good idea indeed. I will add this.
>
>>
>>>
>>> +Hash Equivalence
>>> +----------------
>>> +
>>> +The above section explained how BitBake skips the execution of tasks
>>> +which output can already be found in the Shared State cache.
>> s/which/whose/
>
>
> I thought that "whose" was only meant to refer to living things. Would
> you have a rule supporting what you're proposing?

It simply felt wrong to me, as there is no expression of possession. But
I am not 100% sure as I am no native speaker. The only thing that I
could come up with after searching the web is [1]. Alternatively, what
do you think of the following replacement

  s/which output/the output of which/ ?

[1] https://www.englishgrammar.org/relative-pronouns-3/

>>> +
>>> +Now, imagine that a Shared State task is modified because of a change in
>>> +its code or metadata, or because of a change in its dependencies.
>>> +Since this modifies its task hash, this task will need re-executing.
>> s/re-executing/re-execution/
>
>
> I used "needed re-executing" as in "this wall needs painting", using
> "need" plus a gerund to express a passive meaning. This doesn't sound
> wrong to me...

Ah, okay, I read it as an infinitive - now I understand, and your
version is correct, of course.

Best regards
Ulrich


>> s/Share State cache/Shared State cache/
>
>
> Oops, fixed, thanks!
> Thanks again
> Michael.
Michael Opdenacker Jan. 10, 2022, 1:22 p.m. UTC | #3
Richard, Ulrich,

On 1/10/22 12:30 PM, Richard Purdie wrote:
>>>> +Hash Equivalence
>>>> +----------------
>>>> +
>>>> +The above section explained how BitBake skips the execution of tasks
>>>> +which output can already be found in the Shared State cache.
>>> s/which/whose/
>>
>> I thought that "whose" was only meant to refer to living things. Would
>> you have a rule supporting what you're proposing?
> Whilst it isn't technically correct, it has been used since medieval times :)
>
> https://www.merriam-webster.com/words-at-play/whose-used-for-inanimate-objects


Thanks for the clarifications. I now understand that my usage of "which"
in this context (instead of "whose") was just wrong. That's a bit
embarrassing, but I'm happy to be corrected.
Thanks again!
Michael.
Joshua Watt Jan. 11, 2022, 9:23 p.m. UTC | #4
Michael,

A few minor comments. I'm not sure if they add to the clarity of the
documentation, so feel free to omit any changes that are not helpful.

On Fri, Jan 7, 2022 at 1:08 PM Michael Opdenacker
<michael.opdenacker@bootlin.com> wrote:
>
> Signed-off-by: Michael Opdenacker <michael.opdenacker@bootlin.com>
> ---
>  documentation/overview-manual/concepts.rst | 126 +++++++++++++++++++++
>  1 file changed, 126 insertions(+)
>
> diff --git a/documentation/overview-manual/concepts.rst b/documentation/overview-manual/concepts.rst
> index 6f8a3def69..781ba1b070 100644
> --- a/documentation/overview-manual/concepts.rst
> +++ b/documentation/overview-manual/concepts.rst
> @@ -1938,6 +1938,132 @@ another reason why a task-based approach is preferred over a
>  recipe-based approach, which would have to install the output from every
>  task.
>
> +Hash Equivalence
> +----------------
> +
> +The above section explained how BitBake skips the execution of tasks
> +which output can already be found in the Shared State cache.
> +
> +During a build, it may often be the case that the output / result of a task might
> +be unchanged despite changes in the task's input values. An example might be
> +whitespace changes in some input C code. In project terms, this is what we define
> +as "equivalence".
> +
> +To keep track of such equivalence, BitBake has to manage three hashes
> +for each task:
> +
> +- The *task hash* explained earlier: computed from the recipe metadata,
> +  the task code and the task hash values from its dependencies.
> +  When changes are made, these task hashes are therefore modified,
> +  causing the task to re-execute. The task hashes of tasks depending on this
> +  task are therefore modified too, causing the whole dependency
> +  chain to re-execute.
> +
> +- The *output hash*, a new hash computed from the output of Shared State tasks,
> +  tasks that save their resulting output to a Shared State tarball.
> +  The mapping between the task hash and its output hash is reported
> +  to a new *Hash Equivalence* server. This mapping is stored in a database
> +  by the server for future reference.
> +
> +- The *unihash*, a new hash, initially set to the task hash for the task.
> +  This is used to track the *unicity* of task output, and we will explain

Is "unicity" a word? Would "equivalence" be better?

> +  how its value is maintained.
> +
> +When Hash Equivalence is enabled, BitBake computes the task hash
> +for each task by using the unihash of its dependencies, instead
> +of their task hash.

It may be worth mentioning that bitbake queries the hash equivalence
server at parse time to look up the unihash for each taskhash it knows
about. The result is cached to speed up reparsing, but I don't know if
that detail is worth mentioning.
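
For illustration only, a minimal Python sketch of the two ideas discussed
here: a task hash that depends on the unihashes of its dependencies, and a
parse-time lookup that is cached. This is not BitBake's code. The names
compute_taskhash and CachedUnihashLookup, the use of SHA-256, and the sorting
of dependencies are all assumptions made for the example.

```python
import hashlib


def compute_taskhash(task_data, dep_unihashes):
    """Hash a task's own data together with the unihashes of its dependencies."""
    digest = hashlib.sha256()
    digest.update(task_data.encode("utf-8"))
    # Sorting is an assumption made here so that the result does not depend
    # on the order in which dependencies are listed.
    for unihash in sorted(dep_unihashes):
        digest.update(unihash.encode("utf-8"))
    return digest.hexdigest()


class CachedUnihashLookup:
    """Ask the server once per taskhash and remember the answer."""

    def __init__(self, query_server):
        # query_server is a hypothetical callable: taskhash -> unihash or None
        self._query_server = query_server
        self._cache = {}
        self.server_queries = 0

    def unihash(self, taskhash):
        if taskhash not in self._cache:
            self.server_queries += 1
            answer = self._query_server(taskhash)
            # A taskhash unknown to the server starts out as its own unihash.
            self._cache[taskhash] = answer if answer is not None else taskhash
        return self._cache[taskhash]
```

With this sketch, a dependency whose unihash stays the same leaves the task
hash of the depending task unchanged, even if the dependency's own task hash
changed.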

> +
> +Now, imagine that a Shared State task is modified because of a change in
> +its code or metadata, or because of a change in its dependencies.
> +Since this modifies its task hash, this task will need re-executing.
> +Its output hash will therefore be computed again.
> +
> +Then, the new mapping between the new task hash and its output hash
> +will be reported to the Hash Equivalence server. The server will
> +let BitBake know whether this output hash is the same as a previously
> +reported output hash, for a different task hash.
> +
> +If the output hash is already known, BitBake will update the task's
> +unihash to match the original task hash that generated that output.

The server reports this by returning the desired unihash for the task
that just finished executing. If it is the same as the previous
unihash that bitbake knew for the task, bitbake does nothing. If it is
a new unihash, bitbake propagates the change to the unihash through
the task graph.

> +Thanks to this, the depending tasks will keep a previously recorded
> +task hash, and BitBake will be able to retrieve their output from
> +the Shared State cache, instead of re-executing them. Similarly, the
> +output of further downstream tasks can also be retrieved from Shared
> +Shate.
> +
> +If the output hash is unknown, a new entry will be created on the Hash
> +Equivalence server, matching the task hash to that output.
> +The depending tasks, still having a new task hash because of the
> +change, will need to re-execute as expected. The change propagates
> +to the depending tasks.
> +
> +To summarize, when Hash Equivalence is enabled, a change in one of the
> +tasks in BitBake's run queue doesn't have to propagate to all the
> +downstream tasks that depend on the output of this task, causing a
> +full rebuild of such tasks, and so on with the next depending tasks.
> +Instead, when the output of this task remains identical to previously
> +recorded output, BitBake can safely retrieve all the downstream
> +task output from the Shared State cache.
> +
> +This applies to multiple scenarios:
> +
> +-  A "trivial" change to a recipe that doesn't impact its generated output,
> +   such as whitespace changes, modifications to unused code paths or
> +   in the ordering of variables.
> +
> +-  Shared library updates, for example to fix a security vulnerability.
> +   For sure, the programs using such a library should be rebuilt, but
> +   their new binaries should remain identical. The corresponding tasks should
> +   have a different output hash because of the change in the hash of their
> +   library dependency, but thanks to their output being identical, Hash
> +   Equivalence will stop the propagation down the dependency chain.
> +
> +-  Native tool updates. Though the depending tasks should be rebuilt,
> +   it's likely that they will generate the same output and be marked
> +   as equivalent.
> +
> +This mechanism is enabled by default in Poky, and is controlled by three
> +variables:
> +
> +-  :term:`bitbake:BB_HASHSERVE`, specifying a local or remote Hash
> +   Equivalence server to use.
> +
> +-  :term:`BB_HASHSERVE_UPSTREAM`, when ``BB_HASHSERVE = "auto"``,
> +   allowing to connect the local server to an upstream one.
> +
> +-  :term:`bitbake:BB_SIGNATURE_HANDLER`, which must be set  to ``OEEquivHash``.
> +
> +Therefore, the default configuration in Poky corresponds to the
> +below settings::
> +
> +   BB_HASHSERVE = "auto"
> +   BB_SIGNATURE_HANDLER = "OEEquivHash"
> +
> +Rather than starting a local server, another possibility is to rely
> +on a Hash Equivalence server on a network, by setting::
> +
> +   BB_HASHSERVE = "<HOSTNAME>:<PORT>"
> +
> +.. note::
> +
> +   The shared Hash Equivalence server needs to be maintained together with the
> +   Share State cache. Otherwise, the server could report Shared State hashes
> +   that only exist on specific clients.
> +
> +   We therefore recommend that one Hash Equivalence server be set up to
> +   correspond with a given Shared State cache, and to start this server
> +   in *read-only mode*, so that it doesn't store equivalences for
> +   Shared State caches that are local to clients.

I think this could be a little clearer, for example: If you have an
sstate cache that you are using as a read-only mirror (e.g. using
SSTATE_MIRRORS) that was generated using hash equivalence, you should
also publish the hash equivalence database using a read-only hash
equivalence server; otherwise you will get very poor sstate use from
the mirror.

> +
> +   See the :term:`BB_HASHSERVE` reference for details about starting
> +   a Hash Equivalence server.
> +
> +See the `video <https://www.youtube.com/watch?v=zXEdqGS62Wc>`__
> +of Joshua Watt's `Hash Equivalence and Reproducible Builds
> +<https://elinux.org/images/3/37/Hash_Equivalence_and_Reproducible_Builds.pdf>`__
> +presentation at ELC 2020 for a very synthetic introduction to the
> +Hash Equivalence implementation in the Yocto Project.
> +
>  Automatically Added Runtime Dependencies
>  ========================================
>
> --
> 2.25.1
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#2360): https://lists.yoctoproject.org/g/docs/message/2360
> Mute This Topic: https://lists.yoctoproject.org/mt/88268729/3616693
> Group Owner: docs+owner@lists.yoctoproject.org
> Unsubscribe: https://lists.yoctoproject.org/g/docs/unsub [JPEWhacker@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
Michael Opdenacker Jan. 13, 2022, 10:50 a.m. UTC | #5
Hi Joshua,

Many thanks for the review. See my comments below...

On 1/11/22 10:23 PM, Joshua Watt wrote:
>
>> +
>> +- The *unihash*, a new hash, initially set to the task hash for the task.
>> +  This is used to track the *unicity* of task output, and we will explain
> Is "unicity" a word? Would "equivalence" be better?


Well, yes, I found it in my 50-year-old Oxford dictionary. However, it
sounds rare. What about "uniqueness" instead?
By the way, I wanted to help readers understand and remember why this
hash is called "unihash". Didn't you have the word "unique" in mind when
you coined the term?

>
>> +  how its value is maintained.
>> +
>> +When Hash Equivalence is enabled, BitBake computes the task hash
>> +for each task by using the unihash of its dependencies, instead
>> +of their task hash.
> Mention that bitbake queries the hash equivalence server at parse time
> to look up the unihash for each taskhash it knows about.... this is
> cached, to speed up reparse but I don't know if that's worth
> mentioning?


This could be useful indeed. Actually, my current description only says
that the hash equivalence server stores the mapping between an output
hash and a task hash. How is the unihash stored?

>
>> +
>> +Now, imagine that a Shared State task is modified because of a change in
>> +its code or metadata, or because of a change in its dependencies.
>> +Since this modifies its task hash, this task will need re-executing.
>> +Its output hash will therefore be computed again.
>> +
>> +Then, the new mapping between the new task hash and its output hash
>> +will be reported to the Hash Equivalence server. The server will
>> +let BitBake know whether this output hash is the same as a previously
>> +reported output hash, for a different task hash.
>>> +
>>> +   See the :term:`BB_HASHSERVE` reference for details about starting
>>> +   a Hash Equivalence server.
>>> +
>>> +See the `video <https://www.youtube.com/watch?v=zXEdqGS62Wc>`__
>>> +of Joshua Watt's `Hash Equivalence and Reproducible Builds
>>> +<https://elinux.org/images/3/37/Hash_Equivalence_and_Reproducible_Builds.pdf>`__
>>> +presentation at ELC 2020 for a very synthetic introduction to the
>>> +Hash Equivalence implementation in the Yocto Project.
>>> +
>>>  Automatically Added Runtime Dependencies
>>>  ========================================
>>>
>>> --
>>> 2.25.1
>>>
>>>
>>>
>>>
>>>
>>> -=-=-=-=-=-=-=-=-=-=-=-
>>> Links: You receive all messages sent to this group.
>>> View/Reply Online (#2386): https://lists.yoctoproject.org/g/docs/message/2386
>>> Mute This Topic: https://lists.yoctoproject.org/mt/88268729/1051844
>>> Group Owner: docs+owner@lists.yoctoproject.org
>>> Unsubscribe: https://lists.yoctoproject.org/g/docs/unsub [michael.opdenacker@bootlin.com]
>>> -=-=-=-=-=-=-=-=-=-=-=-
>>>
>> +
>> +If the output hash is already known, BitBake will update the task's
>> +unihash to match the original task hash that generated that output.
> The server reports this by returning the desired unihash for the task
> that just finished executing. If it is the same as the previous
> unihash that bitbake new for the task, bitbake does nothing. If it is
> a new unihash, bitbake propagates the change to the unihash through
> the task graph


Well, I assume you could get a unihash that's not the previous unihash,
but a former one, right?
I tried to catch this case in the V4 of my patch which got merged:
https://git.yoctoproject.org/yocto-docs/tree/documentation/overview-manual/concepts.rst#n1986

Anyway, in the light of your explanations, my text looks incorrect
because I'm saying that it's the output hash that gets compared, not the
unihash.

>
>> +Thanks to this, the depending tasks will keep a previously recorded
>> +task hash, and BitBake will be able to retrieve their output from
>> +the Shared State cache, instead of re-executing them. Similarly, the
>> +output of further downstream tasks can also be retrieved from Shared
>> +Shate.
>> +
>> +If the output hash is unknown, a new entry will be created on the Hash
>> +Equivalence server, matching the task hash to that output.
>> +The depending tasks, still having a new task hash because of the
>> +change, will need to re-execute as expected. The change propagates
>> +to the depending tasks.
>> +
>> +To summarize, when Hash Equivalence is enabled, a change in one of the
>> +tasks in BitBake's run queue doesn't have to propagate to all the
>> +downstream tasks that depend on the output of this task, causing a
>> +full rebuild of such tasks, and so on with the next depending tasks.
>> +Instead, when the output of this task remains identical to previously
>> +recorded output, BitBake can safely retrieve all the downstream
>> +task output from the Shared State cache.
>> +
>> +This applies to multiple scenarios:
>> +
>> +-  A "trivial" change to a recipe that doesn't impact its generated output,
>> +   such as whitespace changes, modifications to unused code paths or
>> +   in the ordering of variables.
>> +
>> +-  Shared library updates, for example to fix a security vulnerability.
>> +   For sure, the programs using such a library should be rebuilt, but
>> +   their new binaries should remain identical. The corresponding tasks should
>> +   have a different output hash because of the change in the hash of their
>> +   library dependency, but thanks to their output being identical, Hash
>> +   Equivalence will stop the propagation down the dependency chain.
>> +
>> +-  Native tool updates. Though the depending tasks should be rebuilt,
>> +   it's likely that they will generate the same output and be marked
>> +   as equivalent.
>> +
>> +This mechanism is enabled by default in Poky, and is controlled by three
>> +variables:
>> +
>> +-  :term:`bitbake:BB_HASHSERVE`, specifying a local or remote Hash
>> +   Equivalence server to use.
>> +
>> +-  :term:`BB_HASHSERVE_UPSTREAM`, when ``BB_HASHSERVE = "auto"``,
>> +   allowing to connect the local server to an upstream one.
>> +
>> +-  :term:`bitbake:BB_SIGNATURE_HANDLER`, which must be set  to ``OEEquivHash``.
>> +
>> +Therefore, the default configuration in Poky corresponds to the
>> +below settings::
>> +
>> +   BB_HASHSERVE = "auto"
>> +   BB_SIGNATURE_HANDLER = "OEEquivHash"
>> +
>> +Rather than starting a local server, another possibility is to rely
>> +on a Hash Equivalence server on a network, by setting::
>> +
>> +   BB_HASHSERVE = "<HOSTNAME>:<PORT>"
>> +
>> +.. note::
>> +
>> +   The shared Hash Equivalence server needs to be maintained together with the
>> +   Share State cache. Otherwise, the server could report Shared State hashes
>> +   that only exist on specific clients.
>> +
>> +   We therefore recommend that one Hash Equivalence server be set up to
>> +   correspond with a given Shared State cache, and to start this server
>> +   in *read-only mode*, so that it doesn't store equivalences for
>> +   Shared State caches that are local to clients.
> I think this could be a little more clear, like: If you have an sstate
> cache that you are using as a read-only mirror (e.g. using
> SSTATE_MIRRORS) that was generated using hash equivalence, you should
> also publish the hash equivalence database using a read-only hash
> equiv server, otherwise you will get very poor sstate use from the
> mirror.

This sounds better indeed, thanks. I'll update this note.
Thanks again, and thanks in advance for the few clarifications about how
the unihash is stored.
Cheers
Michael.
Joshua Watt Jan. 13, 2022, 5:17 p.m. UTC | #6
On Thu, Jan 13, 2022 at 4:50 AM Michael Opdenacker
<michael.opdenacker@bootlin.com> wrote:
>
> Hi Joshua,
>
> Many thanks for the review. See my comments below...
>
> On 1/11/22 10:23 PM, Joshua Watt wrote:
> >
> >> +
> >> +- The *unihash*, a new hash, initially set to the task hash for the task.
> >> +  This is used to track the *unicity* of task output, and we will explain
> > Is "unicity" a word? Would "equivalence" be better?
>
>
> Well, yes, I found it in my 50-year old Oxford dictionary. However, it
> sounds rare. What about "uniqueness" instead?
> By the way, I wanted to help to understand and remember why this hash is
> called "unihash". Didn't you have the word "unique" in mind when you
> coined the term?

If memory serves it was "unified hash", the idea being that it
"unifies" multiple taskhashes together; I suppose in a sense it is
"unique".... I'm not really hung up on the name so much as how clearly
it's documented to the end user.

>
> >
> >> +  how its value is maintained.
> >> +
> >> +When Hash Equivalence is enabled, BitBake computes the task hash
> >> +for each task by using the unihash of its dependencies, instead
> >> +of their task hash.
> > Mention that bitbake queries the hash equivalence server at parse time
> > to look up the unihash for each taskhash it knows about.... this is
> > cached, to speed up reparse but I don't know if that's worth
> > mentioning?
>
>
> This could be useful indeed. Actually, my current description only says
> that the hash equivalence server stores the mapping between an output
> hash and a task hash. How is the unihash stored?

I'll describe it in some detail, and you can distill it down to the
important parts for documentation:

The server supports 2 operations. The first operation is a query to
decide what unihash should be used in place of a given taskhash
whenever bitbake doesn't already know (for example, during parsing).
We can call this operation "GET", and it looks like:

GET(taskhash) -> unihash

Every unique taskhash maps to only one unihash, but multiple taskhashes
can map to the same unihash. The server tries to make sure the mapping
of a taskhash to unihash is consistent across time (that is, it won't
be different at a later query), but this is not guaranteed.


The second operation the server supports is reporting a new output
hash that corresponds to a given taskhash, and is called after sstate
generation (since that is when the outhash is known). This method will
return the (potentially new) unihash that corresponds to the given
taskhash. We can call this "REPORT" and it looks like:

REPORT(taskhash, outhash) -> unihash

The server will identify reports that have the same output hash, and
modify its state so that the taskhashes that correspond to those
outhashes report the same unihash, thereby making them equivalent.
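
To make the two operations concrete, here is a toy in-memory model in
Python. It is only a sketch and not the real server: the class name
ToyHashEquivServer, the two dictionaries, and the rule that the first
taskhash reported for an outhash becomes the shared unihash are assumptions
made for illustration.

```python
class ToyHashEquivServer:
    """Toy model of the GET and REPORT operations described above."""

    def __init__(self):
        self._unihash_for_taskhash = {}
        self._unihash_for_outhash = {}

    def get(self, taskhash):
        """GET(taskhash) -> unihash, or None if the taskhash is unknown."""
        return self._unihash_for_taskhash.get(taskhash)

    def report(self, taskhash, outhash):
        """REPORT(taskhash, outhash) -> unihash."""
        # Keep an existing mapping stable over time (assumption of this toy).
        if taskhash in self._unihash_for_taskhash:
            return self._unihash_for_taskhash[taskhash]
        # Assumption: the first taskhash that produced this outhash becomes
        # the unihash shared by every later taskhash with the same outhash.
        unihash = self._unihash_for_outhash.setdefault(outhash, taskhash)
        self._unihash_for_taskhash[taskhash] = unihash
        return unihash
```

In this model, reporting a second taskhash with an already known outhash
returns the unihash of the first one, which is what makes the two taskhashes
equivalent.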

>
> >
> >> +
> >> +Now, imagine that a Shared State task is modified because of a change in
> >> +its code or metadata, or because of a change in its dependencies.
> >> +Since this modifies its task hash, this task will need re-executing.
> >> +Its output hash will therefore be computed again.
> >> +
> >> +Then, the new mapping between the new task hash and its output hash
> >> +will be reported to the Hash Equivalence server. The server will
> >> +let BitBake know whether this output hash is the same as a previously
> >> +reported output hash, for a different task hash.
> >>> +
> >>> +   See the :term:`BB_HASHSERVE` reference for details about starting
> >>> +   a Hash Equivalence server.
> >>> +
> >>> +See the `video <https://www.youtube.com/watch?v=zXEdqGS62Wc>`__
> >>> +of Joshua Watt's `Hash Equivalence and Reproducible Builds
> >>> +<https://elinux.org/images/3/37/Hash_Equivalence_and_Reproducible_Builds.pdf>`__
> >>> +presentation at ELC 2020 for a very synthetic introduction to the
> >>> +Hash Equivalence implementation in the Yocto Project.
> >>> +
> >>>  Automatically Added Runtime Dependencies
> >>>  ========================================
> >>>
> >>> --
> >>> 2.25.1
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >> +
> >> +If the output hash is already known, BitBake will update the task's
> >> +unihash to match the original task hash that generated that output.
> > The server reports this by returning the desired unihash for the task
> > that just finished executing. If it is the same as the previous
> > unihash that bitbake new for the task, bitbake does nothing. If it is
> > a new unihash, bitbake propagates the change to the unihash through
> > the task graph
>
>
> Well, I assume you could get a unihash that's not the previous unihash,
> but a former one, right?
> I tried to catch this case in the V4 of my patch which got merged:
> https://git.yoctoproject.org/yocto-docs/tree/documentation/overview-manual/concepts.rst#n1986
>
> Anyway, in the light of your explanations, my text looks incorrect
> because I'm saying that it's the output hash that gets compared, not the
> unihash.

Hopefully the above description is helpful: it's important to remember
that bitbake itself doesn't really "compare" for the purposes of
determining equivalence, it's just using the 2 operations above and
doing what the server says; the server drives the determination of
what's considered "equivalent".

Bitbake will compare the output of the REPORT operation to the
previous unihash it had for the input taskhash, but the purpose of
that is to determine if it needs to reconstruct the taskgraph with the
new unihash, not to decide "equivalence"
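
A toy Python model of that flow may make it concrete. This is only a
sketch under loud assumptions: the class, the method names and the hash
strings below are made up for illustration, they are not the hashserv API,
and the model leaves out most details of the real protocol. The only point
is that the server picks the unihash and the client merely checks whether
it moved.

```python
class ToyHashEquivServer:
    """Toy stand-in for a hash equivalence server (hypothetical, not hashserv)."""

    def __init__(self):
        self.unihash_by_outhash = {}   # outhash -> unihash of the first reporter
        self.unihash_by_taskhash = {}  # taskhash -> unihash chosen by the server

    def get(self, taskhash):
        # A taskhash nobody reported yet is simply its own unihash.
        return self.unihash_by_taskhash.get(taskhash, taskhash)

    def report(self, taskhash, outhash):
        # The first taskhash seen for a given output defines the unihash;
        # later taskhashes producing the same output are mapped onto it.
        unihash = self.unihash_by_outhash.setdefault(outhash, taskhash)
        self.unihash_by_taskhash[taskhash] = unihash
        return unihash


def unihash_changed(server, taskhash, outhash):
    """Client side: report the result and tell whether the unihash moved."""
    previous = server.get(taskhash)
    current = server.report(taskhash, outhash)
    # A changed unihash only means "rebuild the task graph with the new
    # value"; the equivalence decision itself was taken by the server.
    return current != previous, current


server = ToyHashEquivServer()
first = unihash_changed(server, "task-A", "out-1")   # new output
second = unihash_changed(server, "task-B", "out-1")  # same output, new taskhash
third = unihash_changed(server, "task-C", "out-2")   # different output
```

In this model only the second call reports a changed unihash, because the
server maps "task-B" onto the unihash first recorded for "out-1".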


Hopefully that all helps. If you want more clarification, let me know
and I can provide it.

>
> >
> >> +Thanks to this, the depending tasks will keep a previously recorded
> >> +task hash, and BitBake will be able to retrieve their output from
> >> +the Shared State cache, instead of re-executing them. Similarly, the
> >> +output of further downstream tasks can also be retrieved from Shared
> >> +State.
> >> +
> >> +If the output hash is unknown, a new entry will be created on the Hash
> >> +Equivalence server, matching the task hash to that output.
> >> +The depending tasks, still having a new task hash because of the
> >> +change, will need to re-execute as expected. The change propagates
> >> +to the depending tasks.
> >> +
> >> +To summarize, when Hash Equivalence is enabled, a change in one of the
> >> +tasks in BitBake's run queue doesn't have to propagate to all the
> >> +downstream tasks that depend on the output of this task, causing a
> >> +full rebuild of such tasks, and so on with the next depending tasks.
> >> +Instead, when the output of this task remains identical to previously
> >> +recorded output, BitBake can safely retrieve all the downstream
> >> +task output from the Shared State cache.
> >> +
> >> +This applies to multiple scenarios:
> >> +
> >> +-  A "trivial" change to a recipe that doesn't impact its generated output,
> >> +   such as whitespace changes, modifications to unused code paths or
> >> +   in the ordering of variables.
> >> +
> >> +-  Shared library updates, for example to fix a security vulnerability.
> >> +   For sure, the programs using such a library should be rebuilt, but
> >> +   their new binaries should remain identical. The corresponding tasks should
> >> +   have a different output hash because of the change in the hash of their
> >> +   library dependency, but thanks to their output being identical, Hash
> >> +   Equivalence will stop the propagation down the dependency chain.
> >> +
> >> +-  Native tool updates. Though the depending tasks should be rebuilt,
> >> +   it's likely that they will generate the same output and be marked
> >> +   as equivalent.
> >> +
> >> +This mechanism is enabled by default in Poky, and is controlled by three
> >> +variables:
> >> +
> >> +-  :term:`bitbake:BB_HASHSERVE`, specifying a local or remote Hash
> >> +   Equivalence server to use.
> >> +
> >> +-  :term:`BB_HASHSERVE_UPSTREAM`, when ``BB_HASHSERVE = "auto"``,
> >> +   allowing to connect the local server to an upstream one.
> >> +
> >> +-  :term:`bitbake:BB_SIGNATURE_HANDLER`, which must be set to ``OEEquivHash``.
> >> +
> >> +Therefore, the default configuration in Poky corresponds to the
> >> +below settings::
> >> +
> >> +   BB_HASHSERVE = "auto"
> >> +   BB_SIGNATURE_HANDLER = "OEEquivHash"
> >> +
> >> +Rather than starting a local server, another possibility is to rely
> >> +on a Hash Equivalence server on a network, by setting::
> >> +
> >> +   BB_HASHSERVE = "<HOSTNAME>:<PORT>"
> >> +
> >> +.. note::
> >> +
> >> +   The shared Hash Equivalence server needs to be maintained together with the
> >> +   Shared State cache. Otherwise, the server could report Shared State hashes
> >> +   that only exist on specific clients.
> >> +
> >> +   We therefore recommend that one Hash Equivalence server be set up to
> >> +   correspond with a given Shared State cache, and to start this server
> >> +   in *read-only mode*, so that it doesn't store equivalences for
> >> +   Shared State caches that are local to clients.
> > I think this could be a little more clear, like: If you have an sstate
> > cache that you are using as a read-only mirror (e.g. using
> > SSTATE_MIRRORS) that was generated using hash equivalence, you should
> > also publish the hash equivalence database using a read-only hash
> > equiv server, otherwise you will get very poor sstate use from the
> > mirror.
>
> This sounds better indeed, thanks. I'll update this note.
> Thanks again, and thanks in advance for the few clarifications about how
> the unihash is stored.
> Cheers
> Michael.
>
> --
> Michael Opdenacker, Bootlin
> Embedded Linux and Kernel engineering
> https://bootlin.com
>

Patch

diff --git a/documentation/overview-manual/concepts.rst b/documentation/overview-manual/concepts.rst
index 6f8a3def69..781ba1b070 100644
--- a/documentation/overview-manual/concepts.rst
+++ b/documentation/overview-manual/concepts.rst
@@ -1938,6 +1938,132 @@  another reason why a task-based approach is preferred over a
 recipe-based approach, which would have to install the output from every
 task.
 
+Hash Equivalence
+----------------
+
+The above section explained how BitBake skips the execution of tasks
+which output can already be found in the Shared State cache.
+
+During a build, the output of a task is often unchanged despite changes
+in the task's input values. An example might be whitespace changes in
+some input C code. In project terms, this is what we define as
+"equivalence".
+
+To keep track of such equivalence, BitBake has to manage three hashes
+for each task:
+
+- The *task hash* explained earlier: computed from the recipe metadata,
+  the task code and the task hash values from its dependencies.
+  When changes are made, these task hashes are therefore modified,
+  causing the task to re-execute. The task hashes of tasks depending on this
+  task are therefore modified too, causing the whole dependency
+  chain to re-execute.
+
+- The *output hash*, a new hash computed from the output of Shared State tasks,
+  tasks that save their resulting output to a Shared State tarball.
+  The mapping between the task hash and its output hash is reported
+  to a new *Hash Equivalence* server. This mapping is stored in a database
+  by the server for future reference.
+
+- The *unihash*, a new hash, initially set to the task hash for the task.
+  This is used to track the *unicity* of task output, and we will explain
+  how its value is maintained.
+
+When Hash Equivalence is enabled, BitBake computes the task hash
+for each task by using the unihash of its dependencies, instead
+of their task hash.
+
+Now, imagine that a Shared State task is modified because of a change in
+its code or metadata, or because of a change in its dependencies.
+Since this modifies its task hash, this task will need re-executing.
+Its output hash will therefore be computed again.
+
+Then, the new mapping between the new task hash and its output hash
+will be reported to the Hash Equivalence server. The server will
+let BitBake know whether this output hash is the same as a previously
+reported output hash, for a different task hash.
+
+If the output hash is already known, BitBake will update the task's
+unihash to match the original task hash that generated that output.
+Thanks to this, the depending tasks will keep a previously recorded
+task hash, and BitBake will be able to retrieve their output from
+the Shared State cache, instead of re-executing them. Similarly, the
+output of further downstream tasks can also be retrieved from Shared
+State.
+
+If the output hash is unknown, a new entry will be created on the Hash
+Equivalence server, matching the task hash to that output.
+The depending tasks, still having a new task hash because of the
+change, will need to re-execute as expected. The change propagates
+to the depending tasks.
+
+To summarize, when Hash Equivalence is enabled, a change in one of the
+tasks in BitBake's run queue doesn't have to propagate to all the
+downstream tasks that depend on the output of this task, causing a
+full rebuild of such tasks, and so on with the next depending tasks.
+Instead, when the output of this task remains identical to previously
+recorded output, BitBake can safely retrieve all the downstream
+task output from the Shared State cache.
+
+This applies to multiple scenarios:
+
+-  A "trivial" change to a recipe that doesn't impact its generated output,
+   such as whitespace changes, modifications to unused code paths or
+   in the ordering of variables.
+
+-  Shared library updates, for example to fix a security vulnerability.
+   The programs using such a library must of course be rebuilt, but
+   their new binaries should remain identical. The corresponding tasks should
+   have a different task hash because of the change in the hash of their
+   library dependency, but thanks to their output being identical, Hash
+   Equivalence will stop the propagation down the dependency chain.
+
+-  Native tool updates. Though the depending tasks should be rebuilt,
+   it's likely that they will generate the same output and be marked
+   as equivalent.
+
+This mechanism is enabled by default in Poky, and is controlled by three
+variables:
+
+-  :term:`bitbake:BB_HASHSERVE`, specifying a local or remote Hash
+   Equivalence server to use.
+
+-  :term:`BB_HASHSERVE_UPSTREAM`, used when ``BB_HASHSERVE = "auto"``
+   to connect the local server to an upstream one.
+
+-  :term:`bitbake:BB_SIGNATURE_HANDLER`, which must be set to ``OEEquivHash``.
+
+Therefore, the default configuration in Poky corresponds to the
+below settings::
+
+   BB_HASHSERVE = "auto"
+   BB_SIGNATURE_HANDLER = "OEEquivHash"
+
+Rather than starting a local server, another possibility is to rely
+on a Hash Equivalence server on a network, by setting::
+
+   BB_HASHSERVE = "<HOSTNAME>:<PORT>"
+
+.. note::
+
+   The shared Hash Equivalence server needs to be maintained together with the
+   Shared State cache. Otherwise, the server could report Shared State hashes
+   that only exist on specific clients.
+
+   We therefore recommend that one Hash Equivalence server be set up to
+   correspond with a given Shared State cache, and to start this server
+   in *read-only mode*, so that it doesn't store equivalences for
+   Shared State caches that are local to clients.
+
+   See the :term:`BB_HASHSERVE` reference for details about starting
+   a Hash Equivalence server.
+
+See the `video <https://www.youtube.com/watch?v=zXEdqGS62Wc>`__
+of Joshua Watt's `Hash Equivalence and Reproducible Builds
+<https://elinux.org/images/3/37/Hash_Equivalence_and_Reproducible_Builds.pdf>`__
+presentation at ELC 2020 for a concise introduction to the
+Hash Equivalence implementation in the Yocto Project.
+
 Automatically Added Runtime Dependencies
 ========================================