Friday, August 24, 2012

About Versions

A version (like 1.2.3) is a remarkably ill defined concept, Wikipedia does not even have a lemma for it and several readers will remember the extensive discussions about version syntax between OSGi and Sun. For an industry where versioning (a word the spelling checker flags)  is at the root of their business this comes a bit as a shock. This article tries to define the concept of a version and to propose a versioning model for software artifacts with the idea to start a discussion.

A software version is a promise that a program will change in the future, the version is a discriminator between these different revisions of the same program. A program is conceptual, it represents source code, the design documents, the people, ideas, etc. A program is intended to be used as a library for other software developers or as an application. If we talk about the Apache commons project then it maintains multiple programs, like for example commons-lang. A revision is a reified (made concrete) representation of a program, for example a JAR file. The version of a revision discriminates the revision between all other revisions that exist and promises that this will also be true for future revisions.

This last requirement would be easy to fulfill with a unique identifier, for example a sufficiently large digest (e.g. SHA-1) of some identifying file in/of the revision. This was actually the approach taken in .NET. However, the clients of the program will receive multiple revisions over time; they will not only need to discriminate between the revisions (digests work fine for that), they will in general need to make decisions about compatibility.

The most common model for deciding compatibility is to make the version identifier comparable. The assumption behind is that if version a is higher than version b then a can substitute for b; higher versions are backward compatible with earlier versions. In this model, an integer would suffice. However, a version is a message to the future, it is a promise to evolve the program in time withe multiple revisions.

Since the version is such a handy little place to describe an artifact, versions over time were heavily abused to carry more information than just this single integer. Tiny Domain Specific Language (DSL) were developed to convey commitments to future users. As usual, the domain specific language ended as a developer specific language. Versions are especially abused when they convey variations of the same revision. For example, an artifact for Java 1.4 and the same source code compiled for Java 7. These are not versions, but variations, another dimension.

The lack of de-jura or de-facto standard for versioning made relying on the implicit promises in versions hard and haphazard. Worse of all, it makes it impossible to develop tools that take the chores of maintaining versions out of our hands.

A few years ago a movement in the industry coined semantic versioning. At about the same time the OSGi came out with the Semantic Versioning whitepaper that was based on some identical and some very similar ideas. Basically, these are attempts to standardize the version DSL so tools can take over the versioning chores. Tools are important because versioning is hard and humans are really bad in it. And with the exponentially increasing number of dependencies we are going to loose without tools.

Semantic versions consists of 4 parts (following the OSGi definition):
  1. major - Signals backward incompatible changes.
  2. minor - Signals backward compatibility for clients of an API, however, it breaks providers of this API.
  3. micro - Bug fix, fully compatible, also called patch.
  4. qualifier - Build identifier.
In general the industry is largely consensus on the first three parts, the contention is in the qualifier. Since the qualifier has only one requirement, being comparable and, in contrast with the first three parts, can hold more than digits, it is the outlet for developer creativity.  Long date strings, appended with git SHA digests, internal numbers, etc.

The qualifier's flexibility made a perfect candidate to signal the phase change any revision has to go through in its existence. The phase of a revision indicates where it fits in the development life cycle: developers sharing revisions because they work closely together, release to quality assurance for testing, approval from management to make it public, retiring of a revision because it is superseded by a newer revision, and in rare cases withdrawn when it contains serious bugs. The qualifier became the discriminator to signal some of these phases: qualifiers like BETA1, RC8, RELEASE, FINAL etc.
Using the qualifier to signal a phase implies a change after the revision has been quality assured, which implies a complete retest since changing a version can affect resolution processes, which can affect the test results. It also suffers from the qualifiers that invariably pop up with in this model called REALLYFINAL and REALLYREALLYFINAL, and PLEASEGODLETHISBETHEFINALONE. Also, this model does not allow a revision to retire or be withdrawn since the revision is out there, digested, and unmodifiable.

It should therefore be clear that the phase of a revision can logically not be part of that revision. The phase should therefore be maintained in the repository.The process I see is  as follows.

It all starts with a plan to build a new revision, lets say Foo-1.2.3 that is a new version of Foo-1.2.2. Since Foo-1.2.3 is a new version, the repository allows the developers to (logically) overwrite the previous revisions. That is, requesting Foo-1.2.3 from the repository returns the latest built, e.g. Foo-1.2.3-201208231010. As soon as possible the revisions should be built as if they are the final released revision.

At a certain point in time the revisions need to be approved by Quality Assurance (QA). The development group then changes the phase of the revisions to be tested to testing or approval. This effectively locks the revision in the repository, it is no longer possible to put new revisions with the same major, minor, micro parts. If QA approves the actual revisions then the phase is set to master, otherwise, it is set back to staging so the development group can continue to built. After a revision is in the master phase it becomes available to "outsiders", before this moment, only a selected group had visibility to the revision depending on repository policies.

If the revision is valid then existing projects that depend on that revision should never have to change unless they decide to use new functionality that is not available in their current dependency. However, repositories grow over time. Currently maven central is more than 500 Gb, contains over 4 million files,  more than 40.000 programs and a staggering 350.000 revisions. Most of these revisions have been replaced by later revisions, still a new user is confronted with all this information. It is clear that we need to archive revisions when they are superseded. Archiving must hide the revision for new users, it must still be available for existing users of the artifact.

Last but not least, it is also necessary to expire a revision in exceptional cases when the revision causes more harm, for example a significant security bug, than the resulting build failure.

Summary of the phases:
  1. staging - Available as major.minor.micro without qualifier, not visible in searches
  2. candidate - Can no longer be overwritten but is not searchable. Can potentially move back to staging but in general will move to master.
  3. master - Becomes available for searches and is ok to rely on.
  4. retired - Should no longer be used for new projects but is available for existing references
  5. withdrawn - Revision is withdrawn and might break builds
I am very interested in feedback and pointers.

   Peter Kriens