overview-manual: add details about hash equivalence

Message ID 20211217171859.54664-1-michael.opdenacker@bootlin.com
State New
Series overview-manual: add details about hash equivalence

Commit Message

Michael Opdenacker Dec. 17, 2021, 5:18 p.m. UTC
In particular, mention the different hashes which are
managed in Hash Equivalence mode: task hash, output hash and unihash.

Signed-off-by: Michael Opdenacker <michael.opdenacker@bootlin.com>
---
 documentation/overview-manual/concepts.rst | 97 +++++++++++++++++-----
 1 file changed, 76 insertions(+), 21 deletions(-)

Comments

Richard Purdie Jan. 7, 2022, 11:11 a.m. UTC | #1
On Fri, 2021-12-17 at 18:18 +0100, Michael Opdenacker wrote:
> In particular, mention the different hashes which are
> managed in Hash Equivalence mode: task hash, output hash and unihash.
> 
> Signed-off-by: Michael Opdenacker <michael.opdenacker@bootlin.com>
> ---
>  documentation/overview-manual/concepts.rst | 97 +++++++++++++++++-----
>  1 file changed, 76 insertions(+), 21 deletions(-)
> 
> diff --git a/documentation/overview-manual/concepts.rst b/documentation/overview-manual/concepts.rst
> index 2d3d6f8040..2df5011ef6 100644
> --- a/documentation/overview-manual/concepts.rst
> +++ b/documentation/overview-manual/concepts.rst
> @@ -1942,19 +1942,60 @@ Hash Equivalence
>  ----------------
>  
>  The above section explained how BitBake skips the execution of tasks
> -which output can already be found in the Shared State Cache.
> +which output can already be found in the Shared State cache.
>  
>  During a build, it may often be the case that the output / result of a task might
>  be unchanged despite changes in the task's input values. An example might be
>  whitespace changes in some input C code. In project terms, this is what we define
> -as "equivalence". We can create a hash / checksum which represents a task and two
> -input task hashes are said to be equivalent if the hash of the generated output
> -(as stored / restored by sstate) is the same.
> -
> -Once bitbake knows that two input hashes for a task have equivalent output,
> -this has important and useful implications for all tasks depending on this task.
> -
> -Thanks to this equivalence, a change in one of the tasks in BitBake's run queue
> +as "equivalence".
> +
> +To keep track of such equivalence, BitBake has to manage three hashes
> +for each task:
> +
> +- The *task hash* explained earlier: computed from the recipe metadata,
> +  the task code and the task hash values from its dependencies.
> +  When changes are made, these task hashes are therefore modified,
> +  causing the task to re-execute. The task hashes of tasks depending on this
> +  task are therefore modified too, causing the whole dependency
> +  chain to re-execute.
> +
> +- The *output hash*, a new hash computed from the output of Shared State tasks,
> +  tasks that save their resulting output to a Shared State tarball.
> +  The mapping between the task hash and its output hash is reported
> +  to a new *Hash Equivalence* server. This mapping is stored in a database
> +  by the server for future reference.
> +
> +- The *unihash*, a new hash, initially set to the task hash for the task.
> +  This is used to track the *unicity* of task output, and we will explain
> +  how its value is maintained.
> +
> +When Hash Equivalence is enabled, BitBake computes the task hash
> +for each task by using the unihash of its dependencies, instead
> +of their task hash.
> +
> +Now, imagine that a Shared State task is modified because of a change in
> +its code or metadata, or because of a change in its dependencies.
> +Since this modifies its task hash, this task will need re-executing.
> +Its output hash will therefore be computed again.
> +
> +Then, the new mapping between the new task hash and its output hash
> +will be reported to the Hash Equivalence server. The server will
> +let BitBake know whether this output hash is the same as a previously
> +reported output hash, for a different task hash.
> +
> +If the output hash is reported to be different, BitBake will update
> +the task's unihash, causing the task hash of depending tasks to be
> +modified too, and making such tasks re-execute. This change is
> +propagating to the depending tasks.
> +
> +On the contrary, if the output hash is reported to be identical
> +to the previously recorded output hash, BitBake will keep the
> +task's unihash unmodified. Thanks to this, the depending tasks
> +will keep the same task hash, and won't need re-executing. The
> +change is not propagating to the depending tasks.
> 

These paragraphs are reversed and this is an important detail to get right. The
output hash is always computed for a task that runs and the output hash is
queried on the hash equivalence server.

If the output hash is known, the unihash is updated to match the original input
hash that generated that output. If the output hash is unknown, a new entry is
created on the hash equivalence server matching that task hash to that output.

The unihash would therefore be unchanged for a new output hash and would update
if the output hash matched some other value already there.
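
To make the direction of the update concrete, here is a minimal Python sketch of the behaviour described above. It is a toy model under stated assumptions, not BitBake's actual hashserv code: ``ToyHashEquivServer``, ``report()`` and ``task_hash()`` are hypothetical names, the hash strings are placeholders, and the real server stores more than a single outhash-to-unihash mapping.

```python
import hashlib


def task_hash(task_code, dep_unihashes):
    """Hypothetical helper: hash a task's code with its dependencies' unihashes."""
    h = hashlib.sha256(task_code.encode())
    for unihash in sorted(dep_unihashes):
        h.update(unihash.encode())
    return h.hexdigest()


class ToyHashEquivServer:
    """Toy stand-in for a Hash Equivalence server (assumed, simplified model)."""

    def __init__(self):
        # output hash -> task hash that first produced this output
        self._unihash_by_outhash = {}

    def report(self, taskhash, outhash):
        """Return the unihash to use for a task that just ran."""
        # Unknown output hash: a new entry is created and the unihash
        # stays equal to the task hash.
        # Known output hash: the unihash becomes the original task hash
        # that generated that output.
        return self._unihash_by_outhash.setdefault(outhash, taskhash)


server = ToyHashEquivServer()

# First run: the output is new, so the unihash is the task hash itself.
first = server.report("taskhash-A", "outhash-1")
downstream_before = task_hash("dependent task code", [first])

# The inputs change (new task hash) but the output is identical:
# the unihash is updated to the earlier value.
second = server.report("taskhash-B", "outhash-1")
downstream_after = task_hash("dependent task code", [second])

# The inputs change and the output changes too: the unihash is unchanged,
# that is, still equal to the new task hash.
third = server.report("taskhash-C", "outhash-2")

print(first, second, third)                   # taskhash-A taskhash-A taskhash-C
print(downstream_before == downstream_after)  # True
```

In this model, the dependent task keeps the same task hash after the second run because it is computed from the dependency's unihash, so it would not need to re-execute.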

Cheers,

Richard
Michael Opdenacker Jan. 7, 2022, 6:55 p.m. UTC | #2
Hi Richard,

On 1/7/22 12:11 PM, Richard Purdie wrote:
> These paragraphs are reversed and this is an important detail to get right. The
> output hash is always computed for a task that runs and the output hash is
> queried on the hash equivalence server.
>
> If the output hash is known, the unihash is updated to match the original input
> hash that generated that output. If the output hash is unknown, a new entry is
> created on the hash equivalence server matching that task hash to that output.
>
> The unihash would therefore be unchanged for a new output hash and would update
> if the output hash matched some other value already there.


Many thanks for this important correction.
I knew I could use a review!

I'll post a new version right away.

Thanks again
Michael.

Patch

diff --git a/documentation/overview-manual/concepts.rst b/documentation/overview-manual/concepts.rst
index 2d3d6f8040..2df5011ef6 100644
--- a/documentation/overview-manual/concepts.rst
+++ b/documentation/overview-manual/concepts.rst
@@ -1942,19 +1942,60 @@  Hash Equivalence
 ----------------
 
 The above section explained how BitBake skips the execution of tasks
-which output can already be found in the Shared State Cache.
+which output can already be found in the Shared State cache.
 
 During a build, it may often be the case that the output / result of a task might
 be unchanged despite changes in the task's input values. An example might be
 whitespace changes in some input C code. In project terms, this is what we define
-as "equivalence". We can create a hash / checksum which represents a task and two
-input task hashes are said to be equivalent if the hash of the generated output
-(as stored / restored by sstate) is the same.
-
-Once bitbake knows that two input hashes for a task have equivalent output,
-this has important and useful implications for all tasks depending on this task.
-
-Thanks to this equivalence, a change in one of the tasks in BitBake's run queue
+as "equivalence".
+
+To keep track of such equivalence, BitBake has to manage three hashes
+for each task:
+
+- The *task hash* explained earlier: computed from the recipe metadata,
+  the task code and the task hash values from its dependencies.
+  When changes are made, these task hashes are therefore modified,
+  causing the task to re-execute. The task hashes of tasks depending on this
+  task are therefore modified too, causing the whole dependency
+  chain to re-execute.
+
+- The *output hash*, a new hash computed from the output of Shared State tasks,
+  tasks that save their resulting output to a Shared State tarball.
+  The mapping between the task hash and its output hash is reported
+  to a new *Hash Equivalence* server. This mapping is stored in a database
+  by the server for future reference.
+
+- The *unihash*, a new hash, initially set to the task hash for the task.
+  This is used to track the *unicity* of task output, and we will explain
+  how its value is maintained.
+
+When Hash Equivalence is enabled, BitBake computes the task hash
+for each task by using the unihash of its dependencies, instead
+of their task hash.
+
+Now, imagine that a Shared State task is modified because of a change in
+its code or metadata, or because of a change in its dependencies.
+Since this modifies its task hash, this task will need re-executing.
+Its output hash will therefore be computed again.
+
+Then, the new mapping between the new task hash and its output hash
+will be reported to the Hash Equivalence server. The server will
+let BitBake know whether this output hash is the same as a previously
+reported output hash, for a different task hash.
+
+If the output hash is reported to be different, BitBake will update
+the task's unihash, causing the task hash of depending tasks to be
+modified too, and making such tasks re-execute. This change is
+propagating to the depending tasks.
+
+On the contrary, if the output hash is reported to be identical
+to the previously recorded output hash, BitBake will keep the
+task's unihash unmodified. Thanks to this, the depending tasks
+will keep the same task hash, and won't need re-executing. The
+change is not propagating to the depending tasks.
+
+To summarize, when Hash Equivalence is enabled,
+a change in one of the tasks in BitBake's run queue
 doesn't have to propagate to all the downstream tasks that depend on the output
 of this task, causing a full rebuild of such tasks, and so on with the next
 depending tasks. Instead, BitBake can safely retrieve all the downstream
@@ -1970,18 +2011,21 @@  This applies to multiple scenarios:
    For sure, the programs using such a library should be rebuilt, but
    their new binaries should remain identical. The corresponding tasks should
    have a different output hash because of the change in the hash of their
-   library dependency, but thanks to their output being identical, hash
-   equivalence will stop the propagation down the dependency chain.
+   library dependency, but thanks to their output being identical, Hash
+   Equivalence will stop the propagation down the dependency chain.
 
 -  Native tool updates. Though the depending tasks should be rebuilt,
    it's likely that they will generate the same output and be marked
    as equivalent.
 
-This mechanism is enabled by default in Poky, and is controlled by two
+This mechanism is enabled by default in Poky, and is controlled by three
 variables:
 
--  :term:`bitbake:BB_HASHSERVE`, specifying a local or remote hash
-   equivalence server to use.
+-  :term:`bitbake:BB_HASHSERVE`, specifying a local or remote Hash
+   Equivalence server to use.
+
+-  ``BB_HASHSERVE_UPSTREAM``, used when ``BB_HASHSERVE = "auto"``
+   to connect the local server to an upstream one.
 
 -  :term:`bitbake:BB_SIGNATURE_HANDLER`, which must be set  to ``OEEquivHash``.
 
@@ -1991,19 +2035,30 @@  below settings::
    BB_HASHSERVE = "auto"
    BB_SIGNATURE_HANDLER = "OEEquivHash"
 
-Another possibility is to share a hash equivalence server on a network,
-by setting::
+Rather than starting a local server, another possibility is to rely
+on a Hash Equivalence server on a network, by setting::
 
    BB_HASHSERVE = "<HOSTNAME>:<PORT>"
 
 .. note::
 
-   The hash equivalence server needs to be maintained together with the
-   share state cache. Otherwise, the server could report shared state hashes
-   that do not exist.
+   The shared Hash Equivalence server needs to be maintained together with the
+   Shared State cache. Otherwise, the server could report Shared State hashes
+   that only exist on specific clients.
+
+   We therefore recommend that one Hash Equivalence server be set up to
+   correspond with a given Shared State cache, and to start this server
+   in *read-only mode*, so that it doesn't store equivalences for
+   Shared State caches that are local to clients.
+
+   See the :term:`BB_HASHSERVE` reference for details about starting
+   a Hash Equivalence server.
 
-   We therefore recommend that one hash equivalence server be set up to
-   correspond with a given shared state cache.
+See the `video <https://www.youtube.com/watch?v=zXEdqGS62Wc>`__
+of Joshua Watt's `Hash Equivalence and Reproducible Builds
+<https://elinux.org/images/3/37/Hash_Equivalence_and_Reproducible_Builds.pdf>`__
+presentation at ELC 2020 for a concise introduction to the
+Hash Equivalence implementation in the Yocto Project.
 
 Automatically Added Runtime Dependencies
 ========================================