Reproducible Builds vs Semantic Versioning

Today, the most common way for developers to address a dependency is through semantic versioning. You create project A that depends on B at 1.2.3, where 1 is the major version, 2 is the minor version and 3 is the patch version.

The rules of semantic versioning are:

  1. Major versions change when you make incompatible API changes.
  2. Minor versions change when you add functionality in a backwards-compatible manner.
  3. Patch versions change when you make backwards-compatible bug fixes.

Theoretically, you should be able to update the patch and minor versions of a dependency without any problems. In fact, many package managers recommend the "next significant release" notation: ~1.2 means >=1.2 <2.0.0, and ~1.2.3 means >=1.2.3 <1.3.0.
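To make the tilde notation concrete, here is a minimal sketch that expands a constraint into the range described above. The function name tilde_range is made up for this post, it only handles plain numeric versions, and it is not how any real package manager resolves constraints.

```shell
# tilde_range CONSTRAINT: expand "~MAJOR.MINOR" or "~MAJOR.MINOR.PATCH" into
# the explicit range it stands for. Hypothetical helper; it only understands
# plain numeric versions and is not any package manager's real resolver.
tilde_range() {
    version=${1#"~"}            # drop the leading "~"
    old_ifs=$IFS
    IFS=.
    set -- $version             # split on dots into $1 $2 $3
    IFS=$old_ifs
    case $# in
        2) echo ">=$1.$2 <$(($1 + 1)).0.0" ;;
        3) echo ">=$1.$2.$3 <$1.$(($2 + 1)).0" ;;
        *) echo "unsupported constraint" >&2; return 1 ;;
    esac
}

tilde_range '~1.2'      # prints: >=1.2 <2.0.0
tilde_range '~1.2.3'    # prints: >=1.2.3 <1.3.0
```

Note how wide the first range is: every future 1.x release satisfies it, and nothing in the notation says anything about what those releases contain.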

In practice this can lead to subtle bugs and occasionally flaky builds. Nothing guarantees that updating a patch version will not break a build, and software is sometimes so complex that the runtime behaviour of a minor version bump causes problems in production or at scale. Over time we've learned that dealing with these nightmares is not worth it, and that fixing versions is the right way to go. If we want to update, we must verify that the update works. Many package managers support fixed versions: PHP's Composer, for example, creates a composer.lock file that locks down all the versions until you run composer update again.

However, there's still one more problem: there's no guarantee that what is 1.2.3 today will be what's 1.2.3 a month from now. This is because there's no direct relationship between a semantic version tag and the actual contents of the dependency. Instead, there's just a third party that you trust when you ask them for the package that is 1.2.3. The only constraint that 1.2.3 enforces is that the package must say it's 1.2.3. The contents could have changed in the meantime while the package keeps claiming the same version.

If only there was a way to refer to some package or dependency, not by some arbitrary tag, but by an address that represented the contents of the package...

Well, there is! Check out Content Addressable Storage. The idea is simple: use a cryptographic hash function to generate a digest of the package contents, then store the package in a location that is referenced by that digest.
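As a rough illustration of the idea, the sketch below stores a file under its own digest and later checks that the stored bytes still match their address. The helper names cas_put and cas_verify are invented here, SHA-256 is just one reasonable choice of hash, and the code assumes the GNU sha256sum tool is installed.

```shell
# cas_put FILE STORE: copy FILE into the directory STORE, named after the
# SHA-256 digest of its contents, and print that digest.
# Hypothetical helper; assumes GNU coreutils' sha256sum is available.
cas_put() {
    digest=$(sha256sum "$1" | cut -d ' ' -f 1)
    mkdir -p "$2" && cp "$1" "$2/$digest" && echo "$digest"
}

# cas_verify DIGEST STORE: succeed only if the bytes stored under DIGEST
# still hash to DIGEST, i.e. the address still describes the contents.
cas_verify() {
    actual=$(sha256sum "$2/$1" | cut -d ' ' -f 1)
    [ "$actual" = "$1" ]
}
```

A dependency referenced this way cannot silently change: if someone swaps the contents, cas_verify fails, whereas a version tag would happily keep saying 1.2.3.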

While this is not a new idea, a short Google search shows that there are only 3 package management systems today that can be considered content addressable package systems:

Having content addressable dependencies is the first necessary factor for reproducible builds. The second necessary factor is mirrored dependency storage. You cannot rely on a third party to keep the dependency available: it may go away for a myriad of reasons, such as downtime, DDoS, network problems, or lawsuits like the leftpad fiasco.
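The two factors combine nicely: once a dependency is identified by its digest, it no longer matters which mirror serves it, as long as the bytes check out. Here is a minimal sketch of that, with mirrors modelled as local directories for simplicity. The name fetch_from_mirrors and the directory layout (one file per digest) are assumptions of this example, as is the availability of GNU sha256sum.

```shell
# fetch_from_mirrors DIGEST DEST MIRROR...: copy to DEST the first mirror's
# file named DIGEST whose SHA-256 really is DIGEST. Hypothetical helper;
# each MIRROR is a local directory standing in for a remote host.
fetch_from_mirrors() {
    want=$1 dest=$2
    shift 2
    for mirror in "$@"; do
        candidate=$mirror/$want
        [ -f "$candidate" ] || continue     # mirror is down or lacks the package
        got=$(sha256sum "$candidate" | cut -d ' ' -f 1)
        if [ "$got" = "$want" ]; then
            cp "$candidate" "$dest"
            return
        fi
        echo "digest mismatch at $mirror, trying the next mirror" >&2
    done
    echo "no mirror served $want" >&2
    return 1
}
```

Because the digest is checked after every fetch, an untrusted or stale mirror can at worst fail to deliver; it cannot hand you different contents under the same address.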

Nix and Git are the 2 systems that I'm most familiar with in the above list. It turns out both systems can satisfy both of the necessary factors, but not out of the box.

For Nix, we get content addressable dependencies, but the official nixpkgs does not do any dependency or source code mirroring (it does have a binary cache, but that's not what I'm talking about). This means that from time to time you can get build failures because the upstream source code has gone offline. In fact, this happened to me a month ago with the dependencies from freedesktop.org. You can work around this by creating your own mirrors of the upstream dependencies and writing Nix expressions for them. It is, however, not currently a first class feature.

For Git, content addressed package management relies on the cutting edge submodule feature of Git. So cutting edge that you need the latest Git 2.8 to get everything to work. Submodules are basically subproject Git repositories within a superproject Git repository. You can add any Git repository as a submodule, work on that subproject, and push from it to the upstream remote without affecting the superproject, other than updating the commit hash that the superproject tracks. The key feature is that the superproject tracks the commit hash of each subproject, hence all dependencies here are identified by their content address. When you later clone the superproject, you get exactly the dependencies that were specified by the content address, and nothing else. Git currently uses a SHA1 hash as its commit hash.

Here we demonstrate a simple way of using Git submodules:

> cd /tmp
> mkdir --parents SuperProject
> cd SuperProject
> git init
Initialized empty Git repository in /tmp/SuperProject/.git/
> echo 'Hello World' > ./README.md
> git add --all
> git commit --message='Hello World'
[master (root-commit) a60c612] Hello World
1 file changed, 1 insertion(+)
create mode 100644 README.md
> mkdir --parents modules
> git ls-remote https://github.com/CMCDragonkai/bashpp.git
5e94d3c285f4424f3fa4c6ddf036f405688b1d51 HEAD
5e94d3c285f4424f3fa4c6ddf036f405688b1d51 refs/heads/master
5e94d3c285f4424f3fa4c6ddf036f405688b1d51 refs/tags/0.0.1
> git submodule add https://github.com/CMCDragonkai/bashpp.git modules/bashpp
Cloning into 'modules/bashpp'...
remote: Counting objects: 344, done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 344 (delta 0), reused 0 (delta 0), pack-reused 338
Receiving objects: 100% (344/344), 66.20 KiB | 46.00 KiB/s, done.
Resolving deltas: 100% (170/170), done.
Checking connectivity... done.
> git diff --staged
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..10f797f
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "modules/bashpp"]
+ path = modules/bashpp
+ url = https://github.com/CMCDragonkai/bashpp.git
diff --git a/modules/bashpp b/modules/bashpp
new file mode 160000
index 0000000..5e94d3c
--- /dev/null
+++ b/modules/bashpp
@@ -0,0 +1 @@
+Subproject commit 5e94d3c285f4424f3fa4c6ddf036f405688b1d51
> git add --all
> git commit --message='Added bashpp as a dependency.'
[master fc545ee] Added bashpp as a dependency.
2 files changed, 4 insertions(+)
create mode 100644 .gitmodules
create mode 160000 modules/bashpp
> git submodule status modules/bashpp
5e94d3c285f4424f3fa4c6ddf036f405688b1d51 modules/bashpp (0.0.1)
> git ls-tree master modules/bashpp
160000 commit 5e94d3c285f4424f3fa4c6ddf036f405688b1d51 modules/bashpp
> cd modules/bashpp
> git checkout cefe74b6d6006ee75897790ed2090b8bc7741ed9
Note: checking out 'cefe74b6d6006ee75897790ed2090b8bc7741ed9'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

git checkout -b <new-branch-name>

HEAD is now at cefe74b... Changed shebang to be more generic
> cd ../..
> git status --verbose --verbose
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)

modified: modules/bashpp (new commits)

--------------------------------------------------
Changes not staged for commit:
diff --git i/modules/bashpp w/modules/bashpp
index 5e94d3c..cefe74b 160000
--- i/modules/bashpp
+++ w/modules/bashpp
@@ -1 +1 @@
-Subproject commit 5e94d3c285f4424f3fa4c6ddf036f405688b1d51
+Subproject commit cefe74b6d6006ee75897790ed2090b8bc7741ed9
no changes added to commit (use "git add" and/or "git commit -a")
> git add --all
> git commit --message='Fixed bashpp dependency to cefe74b6d6006ee75897790ed2090b8bc7741ed9'
[master dacb700] Fixed bashpp dependency to cefe74b6d6006ee75897790ed2090b8bc7741ed9
1 file changed, 1 insertion(+), 1 deletion(-)
> git submodule status modules/bashpp
cefe74b6d6006ee75897790ed2090b8bc7741ed9 modules/bashpp (0.0.1~2)
> git submodule update --remote --merge modules/bashpp
Updating cefe74b..5e94d3c
Fast-forward
bin/bashpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Submodule path 'modules/bashpp': merged in '5e94d3c285f4424f3fa4c6ddf036f405688b1d51'
> git status --verbose --verbose
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)

modified: modules/bashpp (new commits)

--------------------------------------------------
Changes not staged for commit:
diff --git i/modules/bashpp w/modules/bashpp
index cefe74b..5e94d3c 160000
--- i/modules/bashpp
+++ w/modules/bashpp
@@ -1 +1 @@
-Subproject commit cefe74b6d6006ee75897790ed2090b8bc7741ed9
+Subproject commit 5e94d3c285f4424f3fa4c6ddf036f405688b1d51
no changes added to commit (use "git add" and/or "git commit -a")

What we did above was create SuperProject and add a submodule called bashpp, which was put into modules/bashpp. We didn't specify any further constraints when it was added, so it was checked out at the tip of the default branch, master. We could then go into the submodule and check out a specific commit hash (for content addressing), branch (for rolling updates) or tag (for semantic versioning). For proper content addressable dependencies, you should prefer the commit hash, and use git ls-remote to acquire the commit hash for any particular tag or branch. After the checkout, the superproject recognised the change, and once we committed it, the superproject tracked the submodule at that specific commit.
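Turning a tag or branch into a commit hash with git ls-remote can be wrapped up in a small helper. This is only a sketch: resolve_ref is a name invented here, and it expects a full ref name such as refs/tags/0.0.1 or refs/heads/master. For an annotated tag it prefers the peeled "^{}" entry, since that is the commit rather than the tag object.

```shell
# resolve_ref REPO REF: print the commit hash REF currently points to in REPO.
# Hypothetical helper around git ls-remote. REF should be a full ref name,
# e.g. refs/tags/0.0.1; a short name may match more than one ref.
resolve_ref() {
    # Ask for REF and its peeled form; an annotated tag lists both, and the
    # "^{}" line carries the commit the tag points at.
    git ls-remote "$1" "$2" "$2^{}" | awk '
        NR == 1 { first = $1 }
        substr($2, length($2) - 2) == "^{}" { peeled = $1 }
        END {
            if (peeled != "")     print peeled
            else if (first != "") print first
            else                  exit 1   # no such ref on the remote
        }'
}
```

With the repository from the demo, resolve_ref https://github.com/CMCDragonkai/bashpp.git refs/tags/0.0.1 would print the hash to check out inside modules/bashpp. Resolve it once and commit the result; resolving on every build would bring back the moving target we are trying to avoid.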

Subsequently, you can push this to GitHub, and clone it along with its submodules at exactly the commit hashes that you previously committed to the SuperProject. Here's a demo:

> git clone <repo>
> cd <repo-path>
> git submodule update --init --recursive --depth 1

The above should acquire the transitive closure of dependencies, each at depth 1, that is, without the full history of each dependency. Currently, however, --depth 1 is a bit flaky, so it may be safer to run git clone --recursive <repo>, which acquires the entire repository for every submodule.

However, one feature is currently missing: the ability to acquire a clone or submodule of a Git repository at a specific commit and only at that commit, meaning a commit depth of 1. As you can see above, we currently have to acquire the entire development repository before fixing it at a particular commit hash. There's a discussion regarding this feature (or lack thereof) here: http://stackoverflow.com/questions/26135216/why-isnt-there-a-git-clone-specific-commit-option . Ideally, in the future we can just run git submodule add --branch dadbf912508272c268c809f97955e182854f79be --depth 1 <repository> [path] in order to acquire a dependency at exactly that commit hash. This does preclude us from developing within that submodule, but you should do that directly in a different clone.
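Outside of the submodule commands, a partial workaround may be a plain shallow fetch of the commit by its hash. This is a sketch under a big assumption: the remote must permit fetching a commit that is not a branch or tag tip (server-side settings such as uploadpack.allowReachableSHA1InWant or uploadpack.allowAnySHA1InWant govern this), so it will not work against every host. fetch_commit is a made-up helper name, not a Git command.

```shell
# fetch_commit URL COMMIT DIR: create DIR containing only COMMIT, no history.
# Hypothetical helper. Assumes the remote allows fetching by commit hash.
# URL must be absolute (or a real URL), because git -C changes directory first.
fetch_commit() {
    git init --quiet "$3" &&
    git -C "$3" fetch --quiet --depth 1 "$1" "$2" &&
    git -C "$3" checkout --quiet --detach FETCH_HEAD
}
```

The result is an ordinary repository in detached HEAD state, not a registered submodule, so it only helps for vendoring or inspecting a dependency at a known hash.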

Furthermore, the tooling for manipulating the hash that the superproject records for a submodule, called the "gitlink", is still not fleshed out. At the moment we can only read the gitlink using git ls-tree or git submodule status, but not change it directly. Instead, we have to change the gitlink indirectly by updating the submodule repository's HEAD reference (using the git checkout or git reset commands).


There are quite a few more features to submodules, including recursive operations, where submodules contain further submodules. As a feature, though, it still has quite a few rough edges, and the workflow isn't fully fleshed out.

Comparing the 2 workflows, Nix and Git: if you have a NixOS environment to deploy into, you should definitely prefer the Nix style. However, if you want your project to be distributable to people who aren't using NixOS, Git submodules are a viable alternative to get started using content addressable dependencies.

For both of them, it's important to consider the second necessary factor for reproducible builds. For Nix, make sure to mirror the source. For Git, you can just fork the project before adding it as a dependency. That said, I'm looking forward to the integration of IPFS with Nixpkgs.

Now does this mean semantic versioning is out, and dadbf912508272c268c809f97955e182854f79be is in? No. Semantic versioning is still very useful as a metatag of software releases. It communicates intent of the developers, and that's very handy. But it's not something we should be relying on when building production software.

In this post, I haven't addressed the "harmoniousness" of a package set (the idea that packages in a package set should work together out of the box with no conflicts). This is solved by Nix and Nixpkgs as well. Nix also solves the problem of duplicate dependencies by flattening the hierarchy and hardlinking dependencies with the same hash. Nonetheless, it still requires hard work from a community to make all the packages in a package set work without problems, so it's an ongoing effort.