Declarative Stateless Migrations in Databases

Ever since we read Domen Kožar's thoughts on how Nix and NixOS is a superior configuration management system because it features:

  1. Declarative Semantics
  2. Stateless
    • Deterministic (pure)
    • Avoids dependency hell
    • Idempotent
  3. Atomic/Transactional
  4. Reversible or Rollbacks

We've been thinking about how to apply the Nix philosophy to live database migrations. (Specifically relational SQL databases.)

Our motivation is due to problems faced by current database migration solutions. Most solutions do not feature "desired-state" configuration. Instead the migrations are coded in imperative manner, coding the explicit actions to take in order to migrate schema or data. The migration can fail which means they are not-deterministic. In the instance of failure, they may not be able to be safely retried which means they are neither idempotent nor atomic/transactional. Once your state is corrupted due a partially failed migration, it is no longer possible to reverse. These are problems that are going to arise for any developer orchestrating large stateful systems.

Configuring an operating system is not fundamentally different from configuring a database. Is it possible for us to work out a theory that achieves those 4 properties (declarative, stateless, atomic, and reversible) for database migrations? Let's see!

Databases support 2 types of languages: Data Definition Languages (DDL) and Data Manipulation Languages (DML). During a migration both DDLs and DMLs can be used. During live operations, generally only DMLs are used, however there's nothing stopping one from using DDLs during live operations. DDL and DML can be used together, if you add a new column using DDL, you may also need to add new data using DML.

Any solution needs to achieve all of the properties for both DDLs and DMLs.

Declarative

SQL is somewhat declarative. But there are parts that are not declarative. For example a standard INSERT is a very imperative statement. In order to achieve the first property, we need to restrict ourselves to using declarative statements. A declarative statement is one that defines the end state that we want the database to be at. DML operations that are declarative include UPSERT and DELETE. Pretty much all DDL operations are declarative.

Stateless

To achieve the stateless property, we first need to understand what we mean by stateless configuration management. In the Nix and NixOS article, the author demonstrates that existing configuration management takes in the current state of the filesystem as a side-effectful input to the package build function.

Stateful Build

This makes the package build function impure. A pure function is a function where the return value is only determined by its input values, without observable side effects. The package build function inside Nix is a stateless function, as there are no observable side effects. All inputs are explicitly declared and passed into the build process.

Stateless Build

Simply put, stateless functions are deterministic.

We can see that SQL operations do not work as pure functions. They are side effectful, as the return value of the operation depends on the mutable state of the database, and the operation itself changes the state of the database. You could imagine that a SQL operation is a closure that is closed over the state of the database. If the state of the database was immutable, then the operation would be pure, but it isn't.

Nix and NixOS fixes this problem by forcing you to pass all required
dependencies into package build function. (An optimisation phase can find out shared dependencies and establish hardlinks, which saves space.) A package build function never has to rely on the existing filesystem state. Although the saving of the build output does in fact change the filesystem state. This mutation doesn't
affect subsequent executions of the build function, and doesn't affect the return value.

In order to acquire stateless SQL operations, each operation must either take in the entire database state as a parameter, or be considered a closure with a free variable that is filled by the database state, and return a new database state without changing the previous closed over database state.

The only kinds of databases that support such kind of SQL operations are the rare functional append-only databases like Datomic (sometimes called purely functional databases). However I do not know whether Datomic even supports append-only DDLs (schema changes).

It may be possible simulate a functional append-only database in normal SQL by creating unique database prefixes for each change. But this is totally impractical as the database structure is not shared. What we really need is some sort of persistent data structures or structural immutability or structure
sharing functionality
.

Thinking outside the box, an infrastructural solution is to shift your perspective on the single of source of truth to a temporal log structured database instead, and just consider your relational stores as a current-database representation/lens/view. This prevents the need to think about migrations as changing the source of truth, but instead simply changing the lens structure in which you perceive the data.

Idempotency

In general we can't achieve the stateless feature that Nix possesses in mainstream SQL databases. But we can achieve idempotent property in a slightly less expressive way. What we need is idempotent DDLs and DMLs in our migrations. Idempotence is the property that rerunning an operation should result in the same behaviour. Idempotency is a necessity for fault tolerance.

Idempotent DMLs rely not just on the availability of UPSERT and DELETE but also the exact semantics equality/inequality. For example the UPSERT command needs to branch between an UPDATE or INSERT. How does it decide whether the current state is conflicting with your UPSERT? What does it mean for 2 records to be conflicting? Most SQL systems do this through the existence of some sort of unique key. The uniqueness of that key defines the semantics of inequality, and thus signals to the database that there is a row conflict, which converts the UPSERT command into a UPDATE command. Another variation on UPSERT is the INSERT IGNORE or ON CONFLICT DO NOTHING, this basically branches between INSERT or a no-op. An interesting solution from service oriented architecture, is the usage of combining a unique hash for any given transaction {transaction_details, hash_id}, for example Stripe allows idempotent API charge calls.

Idempotent DDLs are even more challenging, as very few databases support this natively. So it requires a bit of hacking. Refer to these resources for more:

The advantage of idempotency is that it offers more safer migrations that can deal with migration failure, so theoretically you can just rerun all the migrations all the time.

The previous sentence needs to be considered carefully. It is only true, when you consider all mutations of the database as a migration in schema or data. If you consider that migrations scripts to be a subset of all mutations that have occurred on your database, then it is not safe to simply rerun all of your idempotent mutation scripts, because the current state of your data may no longer match your desired state specified in your migrations. You may end up affecting state that you didn't want changed.

In practice, if you are able to keep track all mutations as migrations, you can rerun migrations from the beginning, if not, then you always migrate forward and never backward. This implies a need to keep track of the the "version" your database is at, and to know where your migrations sit in the "version heritage tree" of your database.

Atomic/Transactional

Nix also makes its changes in an atomic manner. All package builds live in their own unique prefix, so changes do not affect other packages. Furthermore the packages are only symlinked to global configuration at the end after everything is built. That means if a package build is only partially completed because the network failed, or because the power failed, or you cancelled, the package will not have been symlinked to the global configuration, and you can run a garbage cleaning operation to remove any non-symlinked packages.

During a migration, our DDLs and DMLs must also be encapsulated in transactions. This is hard requirement. Not all databases support transactional DDLs (MySQL doesn't). Implicit commits are not legal here!

When any aspect of a migration fails, the entire migration should be rolled back. This is also important in the presence of syntax errors. You don't want a syntax error in one of your SQL statements to cause a partial operation.

Reversible or Rollbacks

Nix is also both reversible (revertible) and roll-backable. There's a difference between "reversing/reverting a migration" and "rolling back a migration". This difference is best demonstrated by the way git works. Reversing is a forward compensating action. Whereas rolling-back is a matter of rewriting history.

In NixOS every execution of system configuration is saved as a generation. You can easily rollback to a previous generation. Packages that exist in previous generations are not deleted unless you first garbage collect the old generations. It's also possible to make changes to your configuration that semantically "reverses" a previous configuration in a forward compensating manner.

While reversing a migration is a matter of writing a forward compensating migration, rolling-back migrations is a more difficult proposition for mainstream SQL databases, as it is not practical to hold all previous generations of your data. This problem is related to the stateless problem discussed above. Functional append-only databases provide a solution. However these databases would also need to be able to apply append-only logic to both DDL and DML.

Another problem here is coupling a DDL and DML migration together. In NoSQL databases, where the schema is just in the application logic, migrating DDL is just a matter of changing source code, but there's no obvious place to put your DML changes. The answer is to also couple it with your application deployment, something that is present in the same versioned commit that added your DDL changes.

For traditional SQL databases, there is a common theme to the alternatives proposed for statelessness and "rollbackability". It's basically some sort of log structured archiving, and point-in-time recovery. PostgreSQL offers "continuous archiving".. This feature is also called "Continuous Data Protection". Of
course it's easier to implement, if your database is already a log. This doesn't strictly give you properties of statelessness, but it does make migrations more safer.

Schema Evolution and Other Related Ideas

There is on-going database research to deal with these kinds of problems. The literature calls this "schema evolution". It's not just the evolution of the schema, but the underlying data as well. Futhering the idea is the ability of not just replaying data, or recovering data, but manipulating the past data as well, just like rewriting git history.

In programming language theory, there's a related concept called "modular extensibility". While for high-availability systems, hot-code reloading or zero-downtime deployments faces the exact same problems of declarativeness, statelessnes, atomic transactions and reversibility or "rollbackability". In game development, there's a similar issue of knowing how a change in a particular entity's property affects the game world's balance. This was investigated using graph theory and Neo4j.

Also checkout these slides talking about using git as a database.