Aggressive Cassandra Compaction

5 Nov 2017

Heavy update and delete workloads are not Cassandra's primary use case: tombstones for older, outdated cells can take up a lot of space, and getting rid of them requires more aggressive compaction.

However, Cassandra can be tuned to compact more aggressively so that tombstones are kept to a tolerable minimum.

For our test, let's put Cassandra in PostgreSQL-like durability mode, so that presumably our files will show up faster as they are flushed to disk. (As I was learning, I noticed that the files weren't even getting flushed to disk! Putting Cassandra in full durability mode fixed that.)

Before (cassandra.yaml):

commitlog_sync: periodic
#commitlog_sync_batch_window_in_ms: 2
commitlog_sync_period_in_ms: 10000

After (cassandra.yaml):

commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
#commitlog_sync_period_in_ms: 10000

Create a test keyspace and table:

create keyspace items
with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
and durable_writes = true;

drop table if exists items.items;
create table if not exists items.items (
  owner_id int,
  item_id text,
  item_count int,
  primary key (owner_id, item_id)  -- owner_id partition key, item_id clustering key
) with clustering order by (item_id asc)
and compaction = {
  'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
  'tombstone_compaction_interval': '60',
  'tombstone_threshold': '0.01',
  'unchecked_tombstone_compaction': 'true' }
and memtable_flush_period_in_ms = 10000  -- flush memtables to sstables every 10 seconds, no need for ./nodetool flush
and gc_grace_seconds = 60;

Note that the above settings are TOTALLY crazy aggressive, chosen to prove that Cassandra can be forced to do compactions WAY more often than the defaults. A tombstone_threshold of 0.01 marks an sstable as a candidate for a single-sstable tombstone compaction as soon as 1% of its data is droppable tombstones (the default is 20%), tombstone_compaction_interval of 60 allows that to happen just 60 seconds after the sstable is written (the default is one day), and unchecked_tombstone_compaction tells Cassandra to run those compactions without first checking whether the tombstones can actually be dropped.

Note on gc_grace_seconds:

gc_grace_seconds is aggressively lowered to 60 seconds instead of the default 10 days, so that after only 60 seconds tombstones are considered old enough to be permanently removed instead of merged into new files. WARNING: this means every down node has to be repaired within 60 seconds, or zombie data could start re-appearing. SO OBVIOUSLY THIS IS NOT A PRODUCTION SETTING, but a setting to see how aggressively we can get Cassandra to do compactions.

Another way of looking at this is that only 60 seconds' worth of tombstones will be kept around (instead of 10 days' worth), and given our high rate of deletions in this test, that will make things interesting.
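
If you want to double-check that these table options actually took effect, you can read them back out of the schema tables. Here is a quick sketch (not part of the original test harness) that reuses the same gocql setup as the loader below; the system_schema.tables columns queried here are the ones Cassandra 3.x exposes, so treat this as illustrative:

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.ProtoVersion = 4 // have to do this for Cass 3
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("Unable to connect to cluster: %v\n", err)
	}
	defer session.Close()

	// Read the effective table options back from the schema tables.
	var gcGrace int
	var compaction map[string]string
	err = session.Query(
		"select gc_grace_seconds, compaction from system_schema.tables where keyspace_name = ? and table_name = ?",
		"items", "items").Scan(&gcGrace, &compaction)
	if err != nil {
		log.Fatalf("Unable to read table settings: %v\n", err)
	}
	fmt.Printf("gc_grace_seconds: %d\ncompaction: %v\n", gcGrace, compaction)
}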

For our test, let's load up 250,000 rows in our table.

Let's write some Go to do that.

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.ProtoVersion = 4 // have to do this for Cass 3
	cluster.Keyspace = "items"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("Unable to connect to cluster: %v\n", err)
	}
	defer session.Close()

	var itemIdNum int
	var ownerId int
	for id := 1; id <= 250000; id++ {
		// Start a new owner every 100 items, so each owner_id partition
		// ends up with item ids 1 through 100.
		if itemIdNum%100 == 0 {
			itemIdNum = 0
			ownerId += 1
		}
		itemIdNum += 1
		itemId := fmt.Sprintf("iamaverylargeitemidthattakesupspace_%d", itemIdNum)
		err = session.Query("insert into items (owner_id, item_id, item_count) values (?, ?, ?)",
			ownerId, itemId, 0).Exec()

		if err != nil {
			log.Fatalf("Did not insert %d\n", id)
		}
		if id%100000 == 0 {
			fmt.Printf("inserted %d\n", id)
		}
	}
}

Let's get the files holding our sstable data:

$ ./sstableutil items items
Listing files...

Here is how much storage we took with our 250,000 items:

$ du -sh /home/mwood/apache-cassandra-3.10/data/data/items/items-577a43e01bc411e7b901ddd20f079ab4/
2.6M	/home/mwood/apache-cassandra-3.10/data/data/items/items-577a43e01bc411e7b901ddd20f079ab4/

Now that we have a bunch of rows in our Cassandra table, we need to do a bunch of mutations to see if compaction can keep up!

Here is a Go program that deletes 98 of 100 items for an owner:

package main

import (
	"fmt"
	"log"
	"os"
	"strconv"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.ProtoVersion = 4 // have to do this for Cass 3
	cluster.Keyspace = "items"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("Unable to connect to cluster: %v\n", err)
	}
	defer session.Close()

	ownerId, err := strconv.Atoi(os.Args[1])
	if err != nil {
		log.Fatalf("%s is not an integer (owner_id)\n", os.Args[1])
	}

	// Delete items 1 through 98, leaving 2 of the original 100 per owner.
	for i := 1; i < 99; i++ {
		itemId := fmt.Sprintf("iamaverylargeitemidthattakesupspace_%d", i)
		err := session.Query("delete from items where owner_id = ? and item_id = ?",
			ownerId, itemId).Exec()
		if err != nil {
			log.Fatalf("Did not delete docs for owner id %d, %s\n", ownerId, err)
		}
	}
}

Let's use this simple Go program to delete roughly 100,000 items:

$ for i in {1..1000}; do ./cass-delete $i; done

While running, we notice that storage goes UP a little bit:

$ du -sh /home/mwood/apache-cassandra-3.10/data/data/items/items-577a43e01bc411e7b901ddd20f079ab4/
2.7M	/home/mwood/apache-cassandra-3.10/data/data/items/items-577a43e01bc411e7b901ddd20f079ab4/

But by the time it has finished running and my super-aggressive tombstone-reaping settings have kicked in, storage drops:

$ du -sh /home/mwood/apache-cassandra-3.10/data/data/items/items-577a43e01bc411e7b901ddd20f079ab4/
1.9M	/home/mwood/apache-cassandra-3.10/data/data/items/items-577a43e01bc411e7b901ddd20f079ab4/
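
As a sanity check that the deletes really landed, we can count what is left for a single owner. Here is a quick sketch reusing the same connection setup as the other programs (again, not part of the original test harness): items 1 through 98 were deleted, so only 2 rows should survive per owner.

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.ProtoVersion = 4 // have to do this for Cass 3
	cluster.Keyspace = "items"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("Unable to connect to cluster: %v\n", err)
	}
	defer session.Close()

	// Owner 1 started with 100 items and the delete program removed items 1..98,
	// so we expect 2 rows to remain.
	var count int64
	if err := session.Query("select count(*) from items where owner_id = ?", 1).Scan(&count); err != nil {
		log.Fatalf("Count query failed: %v\n", err)
	}
	fmt.Printf("owner 1 still has %d items\n", count)
}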

So Cassandra can be made to do compactions on a super aggressive schedule!

A few other interesting notes:

How to find the files used by a table:

$ ./sstableutil items items
Listing files...

How to flush everything to disk:

$ ./nodetool flush

This is interesting reading that informed the settings I used for the table:

http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

In addition to apache-cassandra-3.10/bin, there is an apache-cassandra-3.10/tools/bin. Good to know!