Growing Pains and Other Observations

from Using PostgreSQL at Black Duck Software

Manni Wood
Senior Software Engineer
Black Duck Software
mwood@blackducksoftware.com

Part I

Lessons learned from the past.

Budgets / business requirements often leave you with a setup that is not as "by the book" as you'd like.

The ideal:

two (or more) identical servers, mirrored, geographically distributed
enough disk/tape for full backup of database
enough time in business cycle for database upgrades
a database schema designed for growth

Our reality (a couple of years ago):

our primary system always out-spec'd our mirror system
our mirror system was off-site
our mirror system was on an 8 hour delay when replaying shipped WAL files (gave us an 8 hour window to correct mistakes like accidental drop table)
database schema used int for primary keys
largest tables had only been broken down into a handful of still-massive partitions of unequal size

Pain points resulting from our reality

WAL file shipping to our mirror kept up under regular load

but

WAL file shipping always got bogged down when autovacuuming began on one of our massive partitions

Fun with vacuuming

Vacuuming of large tables is all or nothing
If you kill the vacuuming of a large table because your WAL file partition is filling up, you do not get partial credit
The whole vacuum job has to start over again.

Used to have to plan around "vacuum storms": stop processes from feeding the database so that the only thing generating WAL was the vacuum job.
Large WAL back-log would be created, and eventually dissipate.
Resume feeding the database.
Often, large vacuums were deferred because they started at inopportune times; just made the problem worse later.

Fun with integer primary keys

One of our tables was going to surpass 1.2 billion rows.

Integer primary key would not be enough. Needed to change schema to bigint...

...and all software that was using int had to be converted to use long.

Part II

Evolutionary Solutions

(Altering the structure of a ship, while sailing it, without sinking it.)

The looming 2.1 billion row limit

Not enough disk to completely re-write our new table using alter table.

What to do?

Set aside a weekend
Let our 8-hour-lag mirror catch up
Sever all connections to db
Dump current large table (partitions)
Truncate large tables
Alter schemas from int to bigint
Drop all indexes and foreign keys
Load data into new tables
Re-enable all indexes and foreign keys.
Start up versions of all services that use long instead of int

Followed a similar pattern for upgrading from 8.3 to 9.0

(Perhaps we should have used pg_upgrade? We were unsure of its readiness for prime time.)

The ideal

full dump and restore. (The approved way to do database upgrades.)

The problem

only a weekend to do dump and restore; the restore itself would be OK, but rebuilding indexes and re-enabling foreign key constraints takes FOREVER.

The approach we took

Phased dump/restore.

Before Game Weekend

Begin dumping large, static partitions early, while db still in operation

Game Weekend

shut off all connections to db
final dump of schema from 8.3
load schema to 9.0
load smaller tables and enable their indexes and fks
load largest data tables with their indexes/fks disabled
begin enabling indexes and fks, in parellel where possible (but witness paging as memory contention starts to happen, esp with maintenence_work_mem cranked way up)

After Game Weekend

indexes finished building on largest tables, but still enabling fks!
was finished by Tuesday.

For continued growth of the db, we needed to master partitions and vacuuming.

How should we partition?

First Possibility

Fixed number of partitions, with rows spread evenly across all partitions

Decided not to go this route: ideal number of partitions now could turn out to not be the ideal number of partitions later.

Second Possibility

Growing number of partitions, each of which is not allowed to exceed X rows

This is the route we took, because it better matched out work load.

We use a numeric (bigint) key that grows over time, so we created partitions that contained primary keys n through n + partition_size
Our business processes almost always operate on the newest additions to our tables, so older partitions essentially become static: vacuum freeze very handy here. The most recent partition is often the only "live" partition.

Inevitable challenges to "linear partitioning"

Abandoned clever triggers to auto-create new partitions as needed: they were too slow. Our largest tables are our most active tables.

Instead, we create empty partitions out in front of our business processes, and have cron jobs check how many empty partitions we have left, warning us when too few empty partitions exist.

Clever cron jobs to auto-create partitions ahead of time also had to be abandoned: DDLs would cause odd race conditions, other issues.

Only choice: pick a time to create more empty partitions, shut down business processes, create more partitions, restart business processes.

Up side: this is very quick.

Fancy new partitions in place?

Let's fix our vacuuming issues.

Autovacuum is nice, but active vacuuming is a must.

Chances are, you know your busiest tables better than the autovacuumer does.

We have one table that is so active (and so large) that we vacuum it every day.

These vacuums take a few minutes because we do them every night; they used to take hours, and they would grind some things to a halt.

Another more general-purpose cron job runs every three hours, looks at the partitions of our huge tables, picks the partition that is closest to being autovacuumed, and vacuums it manually instead.

Now that we are more proactive (and agressive) about vacuuming, and now that each of our partitions is small enough that it vacuums in reasonable time, our database no longer has those times when it grinds to a halt.

The Story Continues: More $$$ and changing business needs.

Business need for 8 hour lag on WAL shipping is removed, and we are on 9.0, so we switch to streaming replication to a live mirror!

Weekly snapshot server is changed to all SSDs with loads of RAM! Is now often pressed into service for many read-only workloads that do not need most up-to-minute data.