Scaling CouchDB to 100 million patients

Supporting offline-first PWA users with filtered replication at scale.

20 March 2025


What is filtered replication?

Replication is the process by which CouchDB syncs data between two databases so at the end their contents are identical. Sometimes, however, you want one database to only have a subset of the documents, for example, when users don't have permission to see all docs, or when replicating to a phone which doesn't have the space to store the entire database. This is when filtered replication is needed.

The filter is essentially just a function which returns true or false for a given document. The logic of this function is custom to your application based on your permissions model, so it needs to be entirely configurable.

In CouchDB, the replication algorithm gets a list of changes in chronological order since the last time you synced. These are filtered one by one and the appropriate changes are returned. The changed docs are then fetched in batches from the source db and stored in the target db.

In the Community Health Toolkit we use filtered replication to sync docs from the server to a client side PouchDB instance. PouchDB natively supports the CouchDB replication protocol so it works out of the box. Initially the CHT was only used on deployments with hundreds of users, but as the scale increased we needed to constantly redevelop the way we performed filtered replication.
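
For illustration, this is roughly what filtered replication looks like from the client side with PouchDB, using a server-side filter function like the one shown in v1 below (the database names and filter name are placeholders, not the actual CHT configuration):

// Replicate from the server to a local PouchDB instance, keeping only
// the docs that pass the 'docs/by_user' filter defined on the server.
const local = new PouchDB('medic');
const remote = new PouchDB('https://example.com/medic');

local.replicate.from(remote, {
  live: true,   // keep listening for new changes
  retry: true,  // reconnect automatically after network failures
  filter: 'docs/by_user',
});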

This is the story of that journey and what I learned along the way.

Attempts

v1: native functions

0-10 users

The naive implementation is to provide a JavaScript function as a filter. With this approach, every doc must be fetched from disk and then serialized into JSON so it can be passed to the filter function to determine whether it should be returned or not.

function(doc, req) {
  if (doc.type === 'shared') {
    // all users can access shared docs
    return true;
  }
  if (doc.owner === req.userCtx.name) {
    // users can access their own docs
    return true;
  }
  return false;
}

Adding users means more sync requests and more docs to fetch, so it has O(N²) performance. Even with only one user, filtering a lot of documents is incredibly slow, and if around 10 users synced at the same time the requests would start to queue up and time out. Suffice it to say, this was abandoned very early on.

One evolutionary dead end was rewriting the function in Erlang. This gave roughly a 3x performance improvement, which is great, but the scalability was still unacceptable.

v2: filter by doc ID

10-100 users

An alternative to using a filter function is providing a list of document IDs. This performs much better because the algorithm can determine which changes are relevant without fetching the doc from the db. We implemented this by finding every ID in the target db and including the list of known IDs in the request to the server, which let us determine whether any docs had changed in a way that scaled well. With thousands of IDs, however, the request started to get chunky. Furthermore, the native solution had no way to fetch new documents that the client didn't know about yet.
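
A sketch of that client-side request with PouchDB (database names are placeholders):

// Collect every ID the client already knows about, then ask the server
// for changes to just those docs. New docs the client has never seen
// can't be requested this way, which is the limitation described above.
const local = new PouchDB('medic');
const remote = new PouchDB('https://example.com/medic');

const allDocs = await local.allDocs();
const knownIds = allDocs.rows.map(row => row.id);

await local.replicate.from(remote, { doc_ids: knownIds });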

v3: view lookups

100-1000 users

At this point we had exhausted all possibilities of the natively supported CouchDB functionality, and we took the leap to augment the process. We wanted to maintain protocol compatibility so that at least the client implementation wouldn't need to change. With this attempt the server intercepted the replication request, looked up the list of IDs the user was permitted to view, and then proxied the request to CouchDB for changes matching those IDs. The IDs were looked up using CouchDB map/reduce views, which perform well and do the caching for you so repeated lookups are fast. We thought we were done.
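
A minimal sketch of such a lookup view (the design doc name and owner field are illustrative, not the actual CHT schema):

// Design doc mapping each owner to the IDs of the docs they can access.
// CouchDB builds the index once and updates it incrementally, which is
// why repeated lookups are cheap.
const designDoc = {
  _id: '_design/docs_by_owner',
  views: {
    docs_by_owner: {
      map: function (doc) {
        if (doc.owner) {
          emit(doc.owner, null); // doc._id is included implicitly
        }
      }.toString(),
    },
  },
};

// Querying the view for one user then returns their permitted doc IDs:
//   GET /medic/_design/docs_by_owner/_view/docs_by_owner?key="user1"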

But as we kept scaling we found the next bottleneck...

v4: custom initial replication

1k-5k users

The majority of CHT users only see documents they have created plus a small subset of docs that everyone can view. This means the changes feed is very sparse for each user: if you have a thousand users, each user can access only ~0.1% of docs, so you have to scroll through a lot of changes before finding even one that passes the filter. When a user first logs in, the server therefore has to scan through every change that has ever been made before responding to the request. After a few years of operation the changes request starts hanging and eventually times out, at which point the client reissues the request from the start. When onboarding a new cohort of users for the first time this can lock up the server completely.

The worst case was users logging in for the first time to a server with millions of changes. This is a special case because the client db is empty, so you don't need to worry about document deletions or merge conflicts. Once again we found ourselves having to diverge from the native implementation. Instead of using the replication protocol, we implemented a custom API to get the latest version of every doc the user can access. Once that completes, native replication can take over from then on.
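
A sketch of that flow from the client side (the endpoint name and response shape are assumptions for illustration, not the actual CHT API):

// 1. Ask the server for every doc ID the user is allowed to access.
//    The server resolves this from its permission views, so the changes
//    feed is never scanned.
const res = await fetch('/api/replication/doc-ids', { credentials: 'include' });
const { docIds } = await res.json();

// 2. Fetch the latest version of each doc in batches and write them
//    straight into the empty local db. No deletions or conflicts to
//    handle, because the client starts from nothing.
const local = new PouchDB('medic');
const remote = new PouchDB('https://example.com/medic');
const BATCH_SIZE = 100;

for (let i = 0; i < docIds.length; i += BATCH_SIZE) {
  const batch = docIds.slice(i, i + BATCH_SIZE);
  const { results } = await remote.bulkGet({ docs: batch.map(id => ({ id })) });
  const docs = results.flatMap(r => r.docs.map(d => d.ok).filter(Boolean));
  await local.bulkDocs(docs, { new_edits: false }); // keep server revs intact
}

// 3. From here the regular replication protocol can take over.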

This solves the worst-case scenario but there are still edge cases. One example is users who don't connect often and end up with a lot of changes to crawl through again. Another is a migration that updates many documents at once, adding millions of changes to the feed in one go.

v5: custom ongoing replication

5k-100k users

Now that we had solved the initial replication problem, we decided to use this pattern for ongoing replication as well. The client now periodically calls a custom API that returns the latest version number of every doc the user is allowed to access. The client then finds which of these versions differ from the version in the local DB and fetches and merges those in batches. This completely removes the need to query the changes feed, which was the ultimate bottleneck for large-scale deployments.
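
The interesting part is the version comparison, sketched below (again, the endpoint and response shape are illustrative assumptions):

// 1. Get the latest rev of every doc the user may access, e.g.
//    [{ id: 'abc', rev: '3-def' }, ...]
const res = await fetch('/api/replication/doc-revs', { credentials: 'include' });
const serverRevs = await res.json();

// 2. Diff against the local db to find docs that are new or changed.
const local = new PouchDB('medic');
const allDocs = await local.allDocs();
const localRevs = new Map(allDocs.rows.map(row => [row.id, row.value.rev]));
const changed = serverRevs.filter(({ id, rev }) => localRevs.get(id) !== rev);

// 3. Only the changed docs are then fetched and merged in batches,
//    exactly as in the initial replication sketch above.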

This version finally supports tens of thousands of users updating millions of documents every month in production.

Other ideas

DB per user

CouchDB supports a db-per-user pattern where each user has their own DB instance. This removes the need for filtered replication because the source and target DBs have the same document access. For the CHT this would not have worked well because many docs are shared with other users; for example, all of a user's docs should be accessible to their supervisor. If db-per-user were used, the filtering would still have to be applied between every db and every other db on the server, which has the same scalability issue as more users are added.

Per document access control

One as-yet-unreleased feature is per-document access control, which adds permissions to the doc itself to natively support doc-level access. One downside is that access is stored in a field on the doc, which means that if the permissions change, multiple documents need to be updated to reflect the new access, adding a large number of entries to the changes feed. Watch this space.

Throwing hardware at it

One of the first attempts at improving replication throughput was increasing the number of cores available to CouchDB. At the time this was the latest and greatest v2, which maxed out at one CPU per data shard; the shard count defaults to 8, so there's no improvement past about 10 cores. It's difficult to increase the number of shards with existing data, so you definitely want to think carefully about how many shards you'll need before getting started. The latest CouchDB v3 improves the multi-threading to make better use of additional cores, and also makes it easier to increase the number of shards through resharding.
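
The shard count can only be chosen cheaply at database creation time, for example (host and credentials are placeholders):

// Create the database with 16 shards instead of the default 8 so more
// cores can be used in parallel; changing q later requires resharding.
await fetch('https://example.com:5984/medic?q=16', {
  method: 'PUT',
  headers: { Authorization: 'Basic ' + btoa('admin:password') },
});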

The next idea was to increase memory, as instances were using all available RAM, but CouchDB kept consuming 100% of whatever RAM it was given. For a database this is ideal behaviour, because we wanted it to cache as much as it was allowed to. However, adding memory made no noticeable improvement to overall scalability, because the problem was CPU and IO bound.

Clustering works extremely well in CouchDB; for example, when testing a three-node cluster we measured a 3x performance improvement. Much like with shards, it's difficult to change the number of nodes in a cluster once in production, so it pays to work out what scale you'll need before going live. The downside is that clustering significantly increases deployment complexity. Additionally, to get reliable uptime, CouchDB keeps replicas of the data on different nodes so that if one node is unavailable the server can still respond; this doubles the disk usage.

Using more shards and clustered nodes multiplied the performance and scalability of replication, and these gains were complementary with the algorithmic changes above.

Replacing CouchDB

Every time a new version was discussed we evaluated whether CouchDB was the right DB for our use case. Migrating to a different DB is always disruptive. This is especially true when using protocols which couple the server and client side DBs, which means you either need to replace both at the same time or reimplement the protocol on top of the new DB. It's also very difficult to test performance and scalability before you've done the work, so it's hard to know whether the effort will pay off.

With v5 the algorithm is no longer using the CouchDB replication protocol which makes it possible to change the server DB without changing the client implementation at all. This was an intentional secondary benefit to switching to using a custom algorithm.

Conclusion

This journey took the best part of a decade of iteration and experimentation: filtered replication at scale is difficult. In general I avoid reinventing the wheel, but in this case the requirement was a high degree of configurability of access rules while maintaining scalability. For a core feature of a product, it's worth taking the time to develop a custom solution that you have full control over.