Making changes to database schemas

Changing your data structures in a database (or any data store) involves a lot of hidden issues. There are race conditions to deal with, and data validation to introduce where none existed before. It can feel daunting to some, while others don't give it enough thought. With the right understanding it can be straightforward, but having that understanding is key.

Step 1 : Define the new data model

This is an obvious step, but make sure you've done it completely. Don't assume that you can easily change your mind later. You're going through this data migration, so you're probably well aware of the value of thinking ahead with your data schema.

Clearly define the new schema and itemize the differences from the old one: changes, additions, and removals.

Step 2 : Change the data model in the application code

It’s helpful to “future proof” the application code by adapting the data models to be more like the new schema. For example, let’s say that you’re moving a one-to-one data relationship to a one-to-many. You can update the data model to support the new array of objects, even if that array will always contain just one object for the time being. Introduce new attributes with NULL values, and drop anything slated for removal.
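As a minimal sketch of that one-to-one to one-to-many example, here is what the future-proofed model might look like. The names (User, Email, user_from_legacy_row, and the email attribute) are hypothetical, not from any particular codebase:

```python
from dataclasses import dataclass, field

# Hypothetical models for a one-to-one -> one-to-many email migration.
@dataclass
class Email:
    address: str
    primary: bool = True

@dataclass
class User:
    name: str
    # The model already speaks the new shape: a list of emails,
    # even though for now it always contains exactly one entry.
    emails: list = field(default_factory=list)

def user_from_legacy_row(name: str, email_address: str) -> User:
    # Wrap the single legacy email in the new one-to-many shape.
    return User(name=name, emails=[Email(address=email_address)])

print(user_from_legacy_row("Ada", "ada@example.com"))
```

The rest of the application can now iterate over `emails` everywhere, so nothing downstream has to change when the list really does start holding more than one entry.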

You know best from here. Make sure your tests pass and any application changes are made for the new data models.

Step 3 : Create the new schema

Now you define the new data structure in parallel to the old. This might mean a completely separate table or tables with new naming to reflect the new data schema, or it might mean redundant attributes in one table. They must be able to live side by side.

If you have a column that you want to persist, but you want to change the validation for that data, it may be helpful to create a second column with the new constraints. This is a case-by-case decision, but changing the requirements on the existing data can get tricky. For example, you might run a query and decide that all of the existing data complies with the new validation rules, then a user might insert a new row that doesn’t comply before you trigger the update to the database. This would cause the update to fail and could cause other headaches depending on your DB technology. Adopting a new name for the column might be annoying, but it’s simpler and carries less risk. Your application code will likely convert it back to the old name in your model, via step 4.
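Here is a sketch of the side-by-side idea using SQLite, with a hypothetical users table whose legacy email column is joined by a new user_emails table (all names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Legacy schema: the email lives directly on users, one-to-one.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
# New schema, created alongside the old under a new name, with the
# new constraints applied here rather than retrofitted onto users.email.
conn.execute("""
    CREATE TABLE user_emails (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        address TEXT NOT NULL,
        is_primary INTEGER NOT NULL DEFAULT 1
    )
""")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # both schemas coexist side by side
```

Nothing about the legacy table changed, so existing reads and writes keep working while the new structure waits for traffic.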

Step 4 : Make application code schema agnostic

Much of this was handled in step 2, but now you have the new schema in place. Write tests mocking the new schema, and make sure the application code can handle either schema, producing the same output data model either way.

Step 5 : Start writing data in the new schema

Now you have to “stop the bleeding” and make sure that all CREATE, UPDATE, and DELETE paths adhere to the new schema. This removes the race condition concern when backfilling the existing data. The important thing is that all “changes” are in the new schema.
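As a sketch of what “stopping the bleeding” means, assuming the same hypothetical users / user_emails layout: every write path targets only the new schema, and the legacy column is simply never written again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE TABLE user_emails (user_id INTEGER, address TEXT)")

def create_user(conn, user_id, name, address):
    # The legacy users.email column is left NULL on purpose.
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))
    conn.execute("INSERT INTO user_emails (user_id, address) VALUES (?, ?)",
                 (user_id, address))

def update_email(conn, user_id, address):
    # Updates also touch only the new table.
    conn.execute("UPDATE user_emails SET address = ? WHERE user_id = ?",
                 (address, user_id))

create_user(conn, 1, "Ada", "ada@example.com")
update_email(conn, 1, "ada@newdomain.com")
print(conn.execute("SELECT email FROM users WHERE id = 1").fetchone())  # (None,)
print(conn.execute(
    "SELECT address FROM user_emails WHERE user_id = 1").fetchone())
```

Once this is deployed, the set of rows still living only in the old schema can only shrink, which is what makes the backfill in the next step race-free.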

Step 6 : Backfill

With all new changes going to the new data schema, you just need to convert the old. If you want to pause here for a month or so, your job will just get easier and easier because natural usage will update many records thanks to step 5. But in any case, eventually you’ll want to clean things up and backfill the legacy data, converting it to the new data schema. You can either identify the legacy data rows based on the timestamp of when step 5 was deployed, or based on the nature of the data itself. I’d generally prefer the latter, but either way, run some queries first and make sure they identify the records correctly.
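Here is a sketch of the “identify by the nature of the data” approach, again assuming a hypothetical users table with a legacy email column and a new user_emails table: a legacy row is one with a non-NULL legacy column and no rows yet in the new table. Note the dry-run query before the actual backfill:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE TABLE user_emails (user_id INTEGER, address TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'legacy', 'old@example.com')")
conn.execute("INSERT INTO users VALUES (2, 'migrated', NULL)")
conn.execute("INSERT INTO user_emails VALUES (2, 'new@example.com')")

# Dry run first: verify which rows the backfill would touch.
legacy_ids = [r[0] for r in conn.execute("""
    SELECT id FROM users
    WHERE email IS NOT NULL
      AND id NOT IN (SELECT user_id FROM user_emails)
""")]
print(legacy_ids)  # only the genuinely legacy row

# The backfill itself: copy legacy emails into the new table.
conn.execute("""
    INSERT INTO user_emails (user_id, address)
    SELECT id, email FROM users
    WHERE email IS NOT NULL
      AND id NOT IN (SELECT user_id FROM user_emails)
""")
print(conn.execute("SELECT COUNT(*) FROM user_emails").fetchone()[0])
```

Because the selection criterion is the data itself, running the backfill twice is harmless: the second pass finds nothing left to convert.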

It might be tempting to map timestamps like created_at or updated_at to the new data schema, especially if it’s a new table, but don’t do that. Let them take new timestamps from the new creation or update. You don’t want to remove the integrity of those values. If you have application logic that relies on these values persisting through a backfill, that’s an issue in and of itself.

Step 7 : Simplify application code

Now, all data should be in the new schema. You can simplify your application code because it doesn’t have to be schema agnostic anymore. And you should actually have the code raise an exception if the old schema is used. You never expect it to exist, after all. So if you have some rogue hacky solution bypassing the application and inserting data in the old schema, this should alert you to that as quickly as possible.
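A sketch of the simplified reader, under the same hypothetical users / user_emails assumptions: the legacy fallback is gone, and any leftover legacy data raises loudly instead of being silently served.

```python
import sqlite3

def load_emails(conn, user_id):
    legacy = conn.execute(
        "SELECT email FROM users WHERE id = ? AND email IS NOT NULL",
        (user_id,)).fetchone()
    if legacy:
        # Something bypassed the application and wrote the old schema.
        raise RuntimeError(f"legacy email data found for user {user_id}")
    return [r[0] for r in conn.execute(
        "SELECT address FROM user_emails WHERE user_id = ?", (user_id,))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE TABLE user_emails (user_id INTEGER, address TEXT)")
# A rogue write that bypassed the application and used the old schema:
conn.execute("INSERT INTO users VALUES (1, 'rogue', 'should-not-exist@example.com')")

try:
    load_emails(conn, 1)
except RuntimeError as exc:
    print(exc)
```

In practice you would route the exception to your error tracker rather than print it, but the point is the same: the old schema is now a bug, not a branch.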

Step 8 : Remove legacy schema

Now, you’ll clean up the DB schema by removing anything that is no longer needed because it was exclusive to the legacy schema. Only do this if you’re certain that everything went to plan, because you won’t be able to convert any data after this point. I would write this ticket but hold it for a month or two before making it ready for developers to pick up, just in case there’s a surprise around the corner. What’s important is that you clean up behind you in a reasonable timeframe, but it doesn’t have to be a fast-follow effort.

Step 9 : Celebrate

You’re now in your new data schema, which probably unlocks the next features in your roadmap. Congrats!

I hope this helps with your next data migration. I also hope that it encourages you to put extra effort into planning your v1 schemas. Four hours of additional planning might save you weeks of data migration efforts. Think ahead to future features or, better yet, do your best to have the data schema accurately reflect the reality of the situation.

It’s better to support more plausible data states in the database, even if the application code doesn’t allow them; it’s much easier to change application code. For example, maybe a user can have many email addresses, even though for now each user always has exactly one, with a boolean column called “primary” always set to TRUE. This leaves you open to more options in the future. This is not a hard rule; too much of this will add more complexity than you need, so you have to strike a balance.
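That balance might look like the sketch below (a hypothetical user_emails table and set_email function): the table happily permits many emails per user, while the application, for now, maintains exactly one, marked primary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema allows many emails per user from day one.
conn.execute("""
    CREATE TABLE user_emails (
        user_id INTEGER NOT NULL,
        address TEXT NOT NULL,
        is_primary INTEGER NOT NULL DEFAULT 1
    )
""")

def set_email(conn, user_id, address):
    # The application enforces the one-email rule; the schema does not.
    conn.execute("DELETE FROM user_emails WHERE user_id = ?", (user_id,))
    conn.execute(
        "INSERT INTO user_emails (user_id, address, is_primary) VALUES (?, ?, 1)",
        (user_id, address))

set_email(conn, 1, "ada@example.com")
set_email(conn, 1, "ada@newdomain.com")  # replaces, rather than adds
print(conn.execute(
    "SELECT address, is_primary FROM user_emails WHERE user_id = 1").fetchall())
```

When the product later wants multiple emails per user, the change is a small one in application code, not a schema migration.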
