Skip to main content

How I tamed a complex DB structure with an index service

So, i had to create an API layer on top of an existing (but well known) codebase.
Looks easy, huh? Yes, that’s what I thought, too.
Unfortunately, things were not so easy: as soon as I start looking into details, I found some traps.

Traps list

JOIN hell

The underlying entity map is pretty wide. There are many entities, all somehow related, and for some filters you need to apply a long JOIN chain.

Soft-delete master

The system uses a single soft-delete table, with the informations about deleted records for all entities. So this table had to be joined multiple times in each query.

Relations maze

Some entities have many-to-many relations.
There relations are stored in a separate table where each entity id / entity type couple is linked to another couple.
Furthermore you cannot even rely on a specific order: relation could be expressed in both direction, so you need to duplicate each query.

Very big data

The icing on the cake is that there are lots and lots of records. So, obviously, you are not allowed to make poor design decisions, as this will have a huge impact on performance.

So: without creativity, I would have built some methods with an infinite chain of JOINs, which makes a query unreadable and unmaintainable, and most likely a guarantee of bad execution.
You can see that the scenario was quite grim.

Nice solution, but not nice at all

One solution we experimented with in a recent project is to follow the CQRS principles and build a reading model.

This means I could run a background process to de-normalize all the data into huge read tables, where a filter doesn’t require any joins.

But this also has its disadvantages: updating the reading model is not instantaneous and requires some resources.

You need to take all possible places in the code where data is updated and fire an event to update the read model.

Or you can do it at the database level, but then you have some logic external to the application, which could prove cumbersome in the long run. Some update logic is also quite complex.

And you have to prevent multiple parallel executions.

I still believe that CQRS is a great concept, but it may take some time to build a usable implementation.

The working solution

As mentioned above, much of the complexity of the queries was due to the need to apply filters that were many JOINs away from the requested resource table.

So, as a possible evolution of the idea of reading tables, I thought of just building an index reading table.

In other words, I would de-normalize all relationships and create one or more tables with only related IDs from different tables. So I would do 2 queries:

the first in this index table to get all the IDs of an entity related to another entity ID
the second on the requested resources table, using this list of IDs as a simple WHERE IN (…)clause

Now I was going somewhere.

But before we started building this system, it evolved even further into a series of thoughts:

I have already experienced that many small queries can be faster than a single large query.
So I could use many small index tables instead of a single large one.
And since I use many small indexes, they change less often and I can cache each result for a while.
And since I cache the result, why create tables? I can just get the result and cache it.

Enter the IndexService.

The index service

The index service contains a set of methods with a common structure. This is an example:

public function getFooIdsByBar ($barValue)
  $cacheValue = $this->cache->getValue('getFooIdsByBar'. $barValue);
  if ($cacheValue) {
    return $cacheValue;

  $query = $this->connection
    ->select(" AS id")
    ->from("foo", "f")
    ->innerJoin("f", "softdeleted", "s", " = AND s.type = 'foo'")
    ->andWhere("s.deleted = 0")
    ->innerJoin("f", "relazioni", "r", " = AND r.type = 'foo'"
    ->andWhere("r.relatedId = '$ barValue'")
    ->andWhere("r.relatedType = 'bar')

  $id = $this->getIds($query);

    'getFooIdsByBar'. $barValue, $ids

  return $id;

Basically, each method starts by checking whether the cache already has the requested value, and if so, returns it. If not, create the query, get the value, and cache it for future use. There are many methods in the service; some of them may even use other methods from the same service to get filters.

The advantages


As a developer, the biggest win of this solution is that it each query in the application is really easy to understand. No more than a couple of JOINs in each query; most of them retrieve a single field which is the ID. Comparing to what I had to build at the start of the project, before realizing it was insane, this was a piece of cake.


The really nice thing, though, is that you win also from a performance point of view. Each method is cached separately. That means you could easily set different expiration times for each method, according to how often that data is changed and/or how important it is to have a really up-to-date result.


Since each query is so small, it will be run very quickly without loading the database and – as I mentioned already – many small queries may be run in a shorter time than a single big one. But that is not all: you cache all results, so the database is even less used. Isn’t that great?


Many small and simple methods are definitely easier to test. The most difficult part was building the fixtures data 🙂

The disadvantages

Honestly, I did not find any so far.

Of course, if you have some wrong data in the response and if you want to debug why, you may have to follow a few methods calls. But on the other hand, all of these methods are simpler to debug and to test, so that evens out in the end.

Currently I don’t know it this paradigm could be successfully applied to other systems. This was a very special case, with such a complex underlying database structure that it was mandatory to find a creative solution.

But I felt so good while I developed and extended the index system, that somehow I guess it will not be its last appearance.