Post

The Zero-Downtime Migration: Moving Data Without the 'Incident Report'

The Zero-Downtime Migration: Moving Data Without the 'Incident Report'

Modifying a database schema on a live system with millions of rows is a high-risk operation. A standard ALTER TABLE to rename or delete a column will lock the table, causing the application to hang and crash as soon as the current code version tries to access the old schema.

To avoid service disruption, we use the Expand and Contract Pattern (also known as Parallel Changes). This pattern ensures backward compatibility by evolving the database in phases rather than a single destructive change.

For the architectural foundation, refer to Tim Wellhausen’s Expand and Contract Pattern and Martin Fowler’s guide to Parallel Changes.


The 5-Phase Strategy

Assume you need to rename old_column to new_column on a high-traffic table.

Phase 1: Expand

Add the new column (new_column) to the schema. It must be nullable or have a safe default value.

  • Status: Database has both columns. Application reads and writes only to old_column.

Phase 2: Dual Write

Update the application logic to write data to both columns during every INSERT and UPDATE operation.

  • Status: New data is synchronized across both “universes.” Existing historical data is still missing in the new column.

Phase 3: Backfill

Run a background worker to copy data from old_column to new_column for all historical rows.

Phase 4: Cut-over

Once the data is identical across both columns, update the application code to read exclusively from new_column.

Phase 5: Contract

After monitoring the logs for 48-72 hours, run a final migration to drop the old_column.

  • Status: Migration complete. The “Contract” phase is only safe once you are sure no legacy services or background tasks are touching the old field.

Operational Safety Protocols

1. Avoid Database Locks

Standard migrations can still hang a database under heavy load. Use online schema change tools to handle the heavy lifting:

  • MySQL: gh-ost (GitHub) or pt-online-schema-change (Percona).
  • PostgreSQL: Utilize CONCURRENTLY for index creation and ensure statement_timeout is configured.

2. The Multi-Release Requirement

A zero-downtime migration is physically impossible in a single deployment. It requires at least two release cycles:

  1. Release A: Add column + Dual Write + Backfill.
  2. Release B: Change Read + Drop Old Column.

Conclusion

Zero-downtime is achieved by trading speed for safety. By never renaming or deleting in a single step, you maintain the “Parallel Universe” where both the old and new code versions can survive.

This post is licensed under CC BY 4.0 by the author.