Is it worth using custom indexing in DocumentDB?

When trying to improve the performance of any code or system, it's always nice to know in advance what kind of changes would make the greatest impact. Making the greatest impact with minimal work is something we seem to be fairly good at.

We faced this problem recently during our thesis writing. When we asked our supervisor (from the SQL world) what to focus on, they suggested indexes. Since we really didn't know the impact of the different possible optimizations, that seemed like a good place to start.

This post tries to document the difference a few indexes can make.

How I tested

We had already developed a test suite that included Web Tests and Load Tests, with several different characteristics. For the index comparison I singled out the write tests.
The selected tests were run for 15 minutes (with a 1 minute warm-up) before and after the change to see the actual performance difference.

The write test

The write test creates a new "user" (first request), then creates a game using this newly created user (second request).
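In rough C# terms, each test iteration boils down to two POSTs, something like the sketch below (the endpoints and payloads are made up for illustration; the real suite used Visual Studio Web Tests, and client is assumed to be an HttpClient with its BaseAddress pointed at the API):

// Illustrative sketch only; endpoints and payloads are invented.
// Requires System.Net.Http and System.Text.
var userResponse = await client.PostAsync("/api/users",
  new StringContent("{}", Encoding.UTF8, "application/json"));   // first request
var userId = await userResponse.Content.ReadAsStringAsync();

await client.PostAsync("/api/users/" + userId + "/games",
  new StringContent("{}", Encoding.UTF8, "application/json"));   // second request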

Indexing

The standard indexing policy in DocumentDB is to index everything. In our case we're storing users, and these users have games; since we're now in this fluffy NoSQL world, we store the games belonging to a user inside that user. Each such game is 300+ lines of JSON data, and nothing we're currently interested in querying against without knowing the user owning it, so we removed those parts from the index, except for three sub-values that we might need.
So we went from indexing 300+ data points at a maximum of 5 levels 'deep', to indexing just 20+ data points at a maximum of 3 levels.
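To make that concrete, here's a rough sketch of the stored document shape, inferred from the index rules shown further down (the property names match the index paths; everything else is a simplified stand-in):

using System.Collections.Generic;

// Rough sketch of the document shape; property names match the index
// paths below, the rest is simplified.
public class User
{
  public string Id { get; set; }
  public List<Game> Games { get; set; }  // each game is 300+ lines of JSON; excluded from the index
  public State State { get; set; }       // also excluded, apart from three sub-values
}

public class State
{
  public string Stage { get; set; }              // indexed
  public Challenger Challenger { get; set; }     // only Challenger.Id is indexed
  public int CompatibilityVersion { get; set; }  // indexed
}

public class Challenger { public string Id { get; set; } }
public class Game { /* lots of game data */ }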
Here's the resulting graph!

Index vs NoIndex chart

Yes, the solution that uses custom indexes seems to be a bit less consistent. But it's consistently ~5 times faster than the solution where we didn't specify indexes.
In terms of raw numbers: after 15 minutes (including the 1 minute warm-up) the slower variant had finished 1096 tests (2 requests per test), while the one where I had specified my own indexes had completed a whopping 4843 (roughly 4.4 times the throughput)!

This is exactly the same code, just with six index rules added:

// Keep the default root path so everything else is still indexed.
tempCollection.IndexingPolicy.IncludedPaths.Add(new IndexingPath { Path = "/" });

// Exclude the two big sub-trees (the 300+ line games live under /Games).
tempCollection.IndexingPolicy.ExcludedPaths.Add(@"/""Games""/*");
tempCollection.IndexingPolicy.ExcludedPaths.Add(@"/""State""/*");

// Re-include the three sub-values we might still query on.
tempCollection.IndexingPolicy.IncludedPaths.Add(new IndexingPath
{
  IndexType = IndexType.Hash,
  NumericPrecision = 3,
  StringPrecision = 3,
  Path = @"/""State""/""Stage""/?"
});

tempCollection.IndexingPolicy.IncludedPaths.Add(new IndexingPath
{
  IndexType = IndexType.Hash,
  StringPrecision = 3,
  Path = @"/""State""/""Challenger""/""Id""/?"
});

tempCollection.IndexingPolicy.IncludedPaths.Add(new IndexingPath
{
  IndexType = IndexType.Hash,
  NumericPrecision = 3,
  Path = @"/""State""/""CompatibilityVersion""/?"
});
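Worth noting: these rules are applied when the collection is created, so the policy needs to be in place before that call. A minimal sketch, assuming client is a DocumentClient and database an existing Database resource:

// Create the collection with the custom indexing policy attached.
// Sketch only: 'client' and 'database' are assumed to already exist.
var created = await client.CreateDocumentCollectionAsync(
  database.SelfLink, tempCollection);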

Brand new world, really?

Perhaps it's not such a new world after all. 'They' all say that it's now a brand new world: we can store schema-free data, and we don't really need to care that much about how the data actually looks. Sure, from one point of view there's a lot less we need to care about. We don't need to create tables and columns in advance, but we still need to care about what we're sending in. The difference is that we can now set and manage these indexes per request, when we actually supply the data.
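As a sketch of that per-request control (the document body here is invented), the SDK lets you pass an indexing directive along with the write:

// Sketch: opting a single document out of indexing at write time,
// regardless of the collection's policy. The document body is invented.
await client.CreateDocumentAsync(
  tempCollection.SelfLink,
  new { id = "game-42", State = new { Stage = "Lobby" } },
  new RequestOptions { IndexingDirective = IndexingDirective.Exclude });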
