Database substring indexes
Goals
Imagine you have an application that keeps track of domain names, with millions of entries stored in a MySQL database. How do you efficiently find all the domain names that contain a given substring?
This article assumes some understanding of how SQL queries work.
Simple solution
Let's start with the simplest approach, the SQL "like" operator:
MySQL> select fqdn from domains where fqdn like '%drown%';
+--------------+
| fqdn |
+--------------+
| rs.drown.org |
| drown.org |
+--------------+
2 rows in set (0.78 sec)
Ok, that worked, but it took about three quarters of a second to finish. Can we go any faster?
Indexes
Let's look at the indexes on this table:
MySQL> show create table domains \G
*************************** 1. row ***************************
Table: domains
Create Table: CREATE TABLE `domains` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`fqdn` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `domains_fqdn` (`fqdn`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
The domains_fqdn index provides a quick way to look up a row if we know the fqdn, or what the fqdn starts with.
Using the "like" filter when we know what the domain starts with is quick, for example:
MySQL> select fqdn from domains where fqdn like 'rs.drown%';
+--------------+
| fqdn |
+--------------+
| rs.drown.org |
+--------------+
1 row in set (0.00 sec)
Why is there such a big difference between the two queries?
Searching for 'rs.drown%' took under 10ms, while searching for '%drown%' took 780ms. It was over 78 times slower.
The SQL command "explain" will tell us more about the difference between these two queries:
MySQL> explain select fqdn from domains where fqdn like '%drown%' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: domains
type: index
possible_keys: NULL
key: domains_fqdn
key_len: 768
ref: NULL
rows: 2000866
Extra: Using where; Using index
1 row in set (0.00 sec)
MySQL> explain select fqdn from domains where fqdn like 'rs.drown%' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: domains
type: range
possible_keys: domains_fqdn
key: domains_fqdn
key_len: 768
ref: NULL
rows: 1
Extra: Using where; Using index
1 row in set (0.01 sec)
Both queries use the domains_fqdn index, but there's a dramatic difference in the number of rows each one has to look through.
Why is there such a big difference in number of rows?
The "domains_fqdn" index maintains a sorted list of fqdn values. The database can use a very efficient binary search to find the matching rows as long as the comparison can eliminate half of the rows in the index using the sort order. This only works in this case when the comparison starts at the first character. Otherwise, the entire index has to be read because the relevant rows could be located anywhere inside of the index.
What can be done?
This is a job for a trigram index. With trigrams, a string is broken down into every three-character substring it contains. For example, "drown" has the trigrams "dro", "row", and "own". So any domain with the word "drown" in it will have entries for "dro", "row", and "own" in the trigram index.
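To make the decomposition concrete, here's a quick sketch in a Python shell (Python isn't part of this article's setup; it's just a convenient way to show how a string breaks apart, and the trigrams helper is my own):
>>> def trigrams(s):
...     return [s[i:i + 3] for i in range(len(s) - 2)]
...
>>> trigrams("drown")
['dro', 'row', 'own']
>>> trigrams("rs.drown.org")
['rs.', 's.d', '.dr', 'dro', 'row', 'own', 'wn.', 'n.o', '.or', 'org']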
PostgreSQL ships with this as the pg_trgm extension, and MySQL 5.7 and later include an ngram full-text parser. But let's implement a version of it ourselves to see what's happening under the hood.
Let's start with a table to store this index:
MySQL> show create table domain_substrings \G
*************************** 1. row ***************************
Table: domain_substrings
Create Table: CREATE TABLE `domain_substrings` (
`trigram` varchar(3) NOT NULL,
`id` int(11) NOT NULL,
PRIMARY KEY (`trigram`,`id`),
KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
The work of maintaining this index is too large for this blog post and is left as an exercise for the reader.
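To give a flavor of what that involves, here's a rough, non-production sketch of a one-time backfill in Python (assuming the mysql-connector-python driver; the connection details are placeholders, and a real implementation would also keep the index in sync as domains are added and removed):
import mysql.connector

# Placeholder connection details -- adjust for your environment.
conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="app")

def trigrams(s):
    # Every run of three consecutive characters in the string.
    return {s[i:i + 3] for i in range(len(s) - 2)}

read = conn.cursor()
write = conn.cursor()
# fetchall() is fine for a sketch; batch the reads for millions of rows.
read.execute("SELECT id, fqdn FROM domains WHERE fqdn IS NOT NULL")
for domain_id, fqdn in read.fetchall():
    rows = [(t, domain_id) for t in trigrams(fqdn)]
    if rows:
        # INSERT IGNORE makes the backfill safe to re-run.
        write.executemany(
            "INSERT IGNORE INTO domain_substrings (trigram, id) VALUES (%s, %s)",
            rows)
conn.commit()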
How do we use this index in a query?
Let's start with finding all possible domain IDs that might match our query:
MySQL> select
id
from
domain_substrings
where
trigram in ('dro','row','own')
group by id;
+---------+
| id |
+---------+
| 5 |
| 6 |
| 7 |
| 11 |
...
+---------+
15468 rows in set (0.05 sec)
Ok, that's better than 2 million, but that's still a lot of results. Can we narrow that down a bit?
MySQL> select
s1.id
from
domain_substrings s1
inner join domain_substrings s2 using (id)
inner join domain_substrings s3 using (id)
where
s1.trigram='dro' and s2.trigram='row' and s3.trigram='own';
+--------+
| id |
+--------+
| 111987 |
| 326022 |
| 338147 |
| 338148 |
| 430802 |
| 488584 |
| 603300 |
| 632624 |
| 637830 |
+--------+
9 rows in set (0.01 sec)
This query joins three copies of the domain_substrings index against each other to find the domain IDs that have all three trigrams. It found 9 possible matches in only 10ms. All we need now is to join those results to the domains table to find the matching domains and print out the fqdns:
MySQL> select
domains.fqdn
from
domain_substrings s1
inner join domain_substrings s2 using (id)
inner join domain_substrings s3 using (id)
inner join domains using (id)
where
s1.trigram='dro' and s2.trigram='row' and s3.trigram='own' and
domains.fqdn like '%drown%';
+--------------+
| fqdn |
+--------------+
| drown.org |
| rs.drown.org |
+--------------+
2 rows in set (0.01 sec)
The whole query now takes only 10ms, compared to the 780ms before.
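In a real application the search term won't be hard-coded, so the application layer would generate the trigram list and the self-joins dynamically. Here's a rough sketch of what that could look like (Python; the function and variable names are my own, not from any library):
def build_substring_query(term):
    # Break the search term into trigrams, the same way the index was built.
    grams = sorted({term[i:i + 3] for i in range(len(term) - 2)})
    if not grams:
        raise ValueError("search term must be at least 3 characters")
    joins = "\n".join(
        "inner join domain_substrings s%d using (id)" % i
        for i in range(2, len(grams) + 1))
    conditions = " and ".join(
        "s%d.trigram = %%s" % i for i in range(1, len(grams) + 1))
    sql = ("select domains.fqdn from domain_substrings s1\n"
           + joins + "\n"
           + "inner join domains using (id)\n"
           + "where " + conditions
           + " and domains.fqdn like %s")
    # One parameter per trigram, plus the final LIKE pattern that filters
    # out false positives (note: % and _ in the term are not escaped here).
    return sql, grams + ["%" + term + "%"]
The returned SQL string and parameter list can then be passed to the usual cursor.execute(sql, params) call.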
What are some other ways to solve this?
If your dataset is large enough, you might want to do this outside of the database entirely. Apache Lucene can build n-gram token indexes, and Elasticsearch is a common API layer on top of Lucene that scales to very large datasets.
I did not consider MySQL's older (non-n-gram) full-text index because it is designed for documents with spaces between words, which is not how domain names are structured.
Summary
We explored trigram indexes and how they can dramatically speed up substring searches on a list of domain names. Hopefully you've learned something about how trigrams work.