Why adding an index to a DB table isn’t always a good idea
When we’re on site we talk to a lot of people who are struggling with databases and how they work. I always try and help by explaining the basic of how a RDBMS works and how indexes are generally used, however there’s one thing that always requires extra discussion, as it’s contrary to commonsense, why removing indexes can sometimes increase database performance.
In this blog there’s not really room for a full discussion of Database (aka “DB” from now on) problems, however new users of databases (or rather new users of production databases with a reasonable workload) often do the following;
a) Start logging queries that don’t have indexes (or run tuning analysers like those in Microsoft SQL Server’s Management Studio)
b) Add indexes with gusto; each table starts to have 3 or 4 indexes (I won’t use the term indices today…)
c) Notice new and horrible issues, some inserts and updates take longer than before, and some selects take longer than they did before the indexes were added.
Of course some people then repeat (b) and (c) again…
Two universal truths need to be stated;
1) A database optimiser won’t magically concatenate multiple index data together and use the composite index for a super-quick lookup. Instead you should estimate that a DB will only select a single index to use, and that’ll be based off the statistics of the index “selectivity” (aka “granularity” in some quarters).
2) Because we have more intelligence than machines (and we know the business domain) we can do things that the machine optimiser can’t. This does mean that in some instances it’s better to not use an index for a select statement, or even (in some difficult spots) tell the query engine to use a specific index for it’s lookup.
I should also add another “Universal Truth” that’s not quite true (it’s kind of the Newtonian Physics of the DB Tuning world - it’s not quite true but it’s near)
3) Adding an Index has a cost for each insert and update (and when those indexes need to be rebuilt). Yes, it’s not good to have superfluous Indexes but this is often over-stated to a point where people are scared to add indexes, even when they’re sure it’s a good idea.
Right, first we’ll attack UT (Universal Truth) 1;
Universal Truth 1
“A database optimiser won’t magically concatenate multiple index data together and use the composite index for a super-quick lookup”
Databases are created to handle data, and they do it well. They maintain this lead over the competition by being ruthlessly efficient and sticking to sensible rules.
A database will typically (there are exceptions but they’re rare) use a single index to look up data in a table. If there isn’t one then you’ll get a table scan (usually bad bad news if there are many rows - more on that later).
When a DB uses an index it tries to select the best route/index to cut down the number of records it then has to read from the main table and finish off it’s selection (of course if it doesn’t need to select any further from the table that’s great.. however unless the index has all the necessary return columns from the query then it’ll still need to read that table row).
So let’s create a sample table;
CREATE TABLE `propagandr`.`animal` (
`id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
`animalname` VARCHAR(45) NOT NULL,
`animaltypeid` INTEGER UNSIGNED NOT NULL,
`creationusername` VARCHAR(45) NOT NULL,
`creationdate` VARCHAR(45) NOT NULL,
PRIMARY KEY(`id`)
)
/* note: this is an example - if you want to add a few thousand content lines then go for it and you can try the queries out */
Now this has no user defined indexes (it has one - notice the primary key is ‘id’ - that’s an indexed column as that’s the way it’s laid out on the disk within MySQL), so if I say “SELECT id, animalname FROM animal WHERE creationusername = ‘boris’” then we know it’ll do a table scan.
So, let’s add an index for the animaltypeid (as that links to another table)
ALTER TABLE animal ADD INDEX `Index_animaltypeid`(`animaltypeid`);
We then re-run the query, and no surprise it doesn’t get used. So we’ll add an index for the creationusername…
ALTER TABLE animal ADD INDEX `Index_creationusername`(`creationusername`);
Re-running the query we notice it’s now using the index.. winner!
So, let’s change the query to be more “realistic” - “SELECT id, animalname FROM animal WHERE animaltypeid = 3 AND creationusername = ‘boris’”. This should use both indexes right? erm. no. It’ll select the ONE index that it believes is the most selective (so if there is only two people who create records then each row has a 50% chance of being allotted to a single user - not that selective, however there might be 500 animal types).
Let’s assume we really need this query to be lightening fast; we could add a further index to include both columns
ALTER TABLE animal ADD INDEX `Index_animaltypeid_vs_creationusername `(`animaltypeid`, `creationusername`);
Now, i’ve used the business domain specific discussion earlier to establish that animaltypeid is more selective, so i’ve put that first - after all a query just based on animaltypeid can still use that index. But, what for creationusername? Well that’s largely useless and in many cases it makes more sense to do a table scan (if that doesn’t sound right then think hard about what the DB will do - it’ll look at the index, note the selectivity is very poor AND that it needs to go to the table to get the data anyway… so it’ll do a single serial read through the table rather than use the index, compile the animal id’s that match and then look them up and return them).
Hope that helps with part 1… Now on to part 2 (which is far weirder for most folk)
Universal Truth 2
“Because we have more intelligence than machines (and we know the business domain) we can do things that the machine optimiser can’t”
In the above example we used our knowledge of the data to guide which way we should build a key. We sometimes hav to do the same with respect to table data.
Let’s imagine that the above Animal table is constantly being updated (like > 10 per second), so we can’t key the data too heavily, but a query we often run is asking whether we’ve had a tiger added in the last 5 minutes (this is an example bear with me! - other examples could be a wiki update log, a money audit log etc.)
Here we can use some business domain knowledge to help us out. We know that tigers (animaltypeid 3 btw) are added pretty regularly, and that they’re a regular but largely insignificant number of entries (10 entries per sec * 60 secs in a minute * 60 minutes per hour - that’s a fair few non-tigers coming through).
Now we’ve already got an index that addresses the animaltypeid, so we figure that’s a good candidate, so we write the following SQL (if you’re not familiar with the LIMIT keyword it just does what TOP does in SQL Server);
“SELECT * FROM Animal WHERE animaltypeid = 3 /* that’s a tiger remember */ AND creationdate > DATE_ADD ( NOW(), INTERVAL -1 HOUR) LIMIT 1;”
(note: the limit 1 just returns the first one we find as we only want to know IF there was a tiger added in the last hour)
Now, if you analyse the DB you’ll see it uses the animaltypeid index, and you may assume this is good, however look at the timings and work out what it’s doing.
Using that index means it’s gone to the index, retrieved all the ids for animals that are of animaltypeid 3, then it goes to the animal table and scans those rows and if they pass the criteria it copies it out for returning to the caller. Assuming we’ve got a thousand tigers in the DB (and a 100,000 entries) then that’s a lot of Disk IO for not much.
So, we re-write it slightly to only use the table (i’ll skip the index hint that tells the DB not to use that index here but you can add it if you need it, but in a circumstance like this we’d re-write the index so it went creationusername then animaltypeid so this query wouldn’t trigger it but callers using both those criteria still get the benefits)
“SELECT * FROM animal WHERE animaltypeid = 3 and creationdate > DATE_ADD( NOW(), INTERVAL -1 HOUR) ORDER BY id DESC LIMIT 1;”
Here we using a few things, the fact that we know the table is inserted into in order (so we add the order by id DESC to force the DB to look backwards from the current page - we could have said “ORDER BY creationdate DESC” but we know that the id rises as the creationdate does, and as MySQL orders tables by the primary key it makes sense to use that), and also we know that it won’t use the animaltypeid index (as we’ve given an index hint or we’ve removed the index).
Now, this query uses no indexes for the lookup, and table scans are always bad, right? Well, you’ll find in these instances this runs very quickly as the answer is likely to be on the first page of data that’s parsed, and because it’s updated so frequently it’s also a good likelihood that it’s in memory.
So, what about if we just added an index that allows us to do that lookup quicker? Let’s define an index that does that
ALTER TABLE animal ADD INDEX `Index_animaltypeid_vs_creationdate `(`animaltypeid`, `creationdate`);
Now, that seems an ideal situation: the query now uses that index and everybody’s happy? Well not really, that index is going to be appended to all the time and that means the index will be less than optimal, plus we have to write an additional index entry for every insert that we do (or updates if we change the creationdate for some reason??). This of course is part of Universal Truth 3
Adding an index is min-maxing, you need to profile the application performance before and after and get representative results back, otherwise you may as well be applying stuff in the dark.
Hopefully the above has helped some of you. If you’ve any comments then get in touch and let us know.