One universal technique for getting software systems to perform faster is to make them do less work. Consider two broad approaches:

1. Asking for less: request only the data you actually need.
2. Working more efficiently: organize the data so less of it must be touched to produce a result.

Both are applicable to querying databases. In this article I’ll cover how to apply these concepts to improve latency and reduce load on a database.

In SQL, asking for less means adjusting your queries to return:

- fewer rows
- fewer columns
- less data from each column
Consider a profile of you, my reader, and all other readers of this blog. Each row has a numeric id primary key, an assortment of columns like name, created, etc., and a properties column that contains JSON. I want to retrieve your favorite color and animal to render a friendly avatar.
One approach is to select everything and do the filtering in the client application.
SELECT *
FROM readers;
MEDIAN LATENCY: 💀
With the 20 million synthetic profiles I loaded, this crashes my backend. You might think, “nobody would ever do this”, but you would be wrong 😅. Regardless, let’s fix this query.
Enter the most fundamental way of asking for less in SQL: the WHERE clause. Here we only want one record, so it’s straightforward.
SELECT *
FROM readers
WHERE id = 123456;
MEDIAN LATENCY: 17.275 ms
Now I am asking for only one row out of 20 million, a twenty-millionth of the data. This is a good start.
Requesting fewer columns reduces network transfer cost regardless of your flavor of database. For column stores this is even more important, because it scopes disk IO, decoding, and row reconstruction to only the selected columns. Let’s modify our query.
SELECT properties
FROM readers
WHERE id = 123456;
MEDIAN LATENCY: 12.300 ms
Limiting the query to properties improves performance. But can we do better?
Returning only a portion of each column further reduces network transfer cost. This can often make a big difference for JSON columns. For example, here is some JSON from properties:
{
"age": 18,
"balance": 3805.24,
"boolean": false,
"city": "Austinborough",
"company": "Mcdonald, Blevins and Wood",
"favorite": {
"animal": "Elephant",
"color": "Red"
},
"hobby": "surface",
"ip": "203.98.0.243",
"job": "Multimedia specialist",
"long_string": "...",
"random_string": "Particularly public.",
"score": 18.7,
"status": "active",
"uuid": "e07e229c-2e1f-4410-b804-2fea3b24ae1d"
}
Retrieving just the favorite animal and color from this blob significantly reduces the amount of data transferred. Beyond transfer, at least in SingleStore, JSON columns are stored with each path “shredded” into a separate columnar structure, so scoping the requested paths further reduces disk IO, decoding, and row reconstruction cost. Let’s modify our query one more time.
SELECT
properties::favorite::$animal AS animal,
properties::favorite::$color AS color
FROM readers
WHERE id = 123456;
MEDIAN LATENCY: 3.706 ms
Now I’m fetching just one row, one column, and only the needed data from that column. We’ve taken the query from ~17 ms to ~12 ms to ~4 ms. This query has reached its final form in asking only for what it needs.
Here’s your favorite animal in your favorite color.
For databases, working efficiently means minimizing the data a query processes. If the result of a query derives from only a hundred rows, only those rows should be touched during execution. Achieving this ideal depends on how the data is organized.
Each database provides different controls for data organization. In SingleStore we can control data layout with partitions (SHARD KEY) and segments (SORT KEY). We can also create hash indexes (KEY) to further limit the number of rows involved in a query.
Partitions (SHARD KEY)

A good shard key evenly distributes the data across partitions and can be used with an equality constraint at query time. A unique identifier is often the best choice; that’s why in SingleStore, if you don’t specify a shard key, the primary key is used by default. So our query above is already optimized to go to a single partition, apparent when the EXPLAIN starts with Gather partitions:single.
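As a quick check, here’s a sketch of verifying this with EXPLAIN on the earlier point lookup (the full plan text will vary with your cluster; the relevant part is the first line):

EXPLAIN
SELECT
    properties::favorite::$animal AS animal,
    properties::favorite::$color AS color
FROM readers
WHERE id = 123456;

-- Per the above, the first line of the plan should begin with:
-- Gather partitions:single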
It’s difficult to select a good SHARD KEY other than a unique or mostly-unique identifier. If you pick a time-based column, you’ll often have hot spots as the data inserted and retrieved will correlate to the current time. If you pick something coarse-grained like company or country, you’ll often have hot spots because some companies and countries are larger than others.

Consider using an identifier that is shared by multiple tables you join. For example, if you have two tables A and B where A has id and B has a_id to accomplish the join, shard A on id and B on a_id to limit the cross-partition work required.
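As a sketch of that layout, assuming two minimal hypothetical tables (the payload columns are just placeholders):

-- Shard both sides of the join on the join key so matching rows
-- land on the same partition.
CREATE TABLE A (
    id INT NOT NULL,
    payload LONGTEXT,
    SHARD KEY (id)
);

CREATE TABLE B (
    id INT NOT NULL,
    a_id INT NOT NULL, -- refers to A.id
    payload LONGTEXT,
    SHARD KEY (a_id)
);

-- This join can now be satisfied largely within each partition:
SELECT A.id, B.id
FROM A
JOIN B ON A.id = B.a_id;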
Segments (SORT KEY)

Our query above is already using a unique PRIMARY KEY index, so the sort key doesn’t matter much. To explore the effect of a SORT KEY, let’s switch queries to some analytics I might want to perform over the readers table. First, here’s the table definition we’re starting with:
CREATE TABLE readers (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
long_string LONGTEXT NOT NULL
);
For a given cohort of readers who joined within a certain date range and share a certain favorite color and animal, I might want to find the top three by likes. Such a query could look like this:
SELECT id, likes
FROM readers
WHERE
properties::favorite::$color = 'blue'
AND properties::favorite::$animal = 'dog'
AND created > '2022-02-01'
AND created < '2022-02-08'
ORDER BY likes DESC
LIMIT 3;
MEDIAN LATENCY: 56.764 ms
This is already pretty fast in SingleStore due to the columnar execution, but let’s peek a little behind the scenes to see what the database is doing.
PROFILE SELECT id
FROM readers
WHERE
properties::favorite::$color = 'blue'
AND properties::favorite::$animal = 'dog'
AND created > '2022-02-01'
AND created < '2022-02-08'
ORDER BY likes DESC
LIMIT 3;
SHOW PROFILE;
The key problem from the profile is segments_scanned: 111 segments_skipped: 0. This query is touching 111 segments. We can do much better by specifying created as the SORT KEY for this table. A time-based column is often an excellent choice for SORT KEY because queries filter on time using ranges, and insertions often arrive close to append-order, which minimizes the reordering required. Here’s the updated definition:
CREATE TABLE readers (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
long_string LONGTEXT NOT NULL,
SORT KEY(created)
);
Now our time has improved nearly 10x, because the query is doing much less work, with segments_scanned: 16 segments_skipped: 80.
MEDIAN LATENCY: 6.192 ms
Hash indexes (KEY and other indexes)

We’re still doing more work than necessary: we’re decoding the properties column for every row in the scanned segments. In this case I want to limit the scan to a specific favorite color and animal. Since these values live inside a JSON column, it’s a two-step process. First, create persisted computed columns extracting the fields. Then, create a hash index on those columns. Since the query uses both, we can use a compound index.
CREATE TABLE readers (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
long_string LONGTEXT NOT NULL,
$fav_color AS properties::favorite::$color PERSISTED LONGTEXT,
$fav_animal AS properties::favorite::$animal PERSISTED LONGTEXT,
KEY($fav_color, $fav_animal),
SORT KEY(created)
);
Using this table, our time has improved another 2x, because the query is again doing much less work. This can be the final form of this table, at least for the two queries we’ve seen so far.
MEDIAN LATENCY: 3.124 ms
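The query itself didn’t change; the latency drop implies SingleStore matched the properties::favorite::$color and properties::favorite::$animal expressions to the persisted columns and used the hash index. Assuming that matching, the filter could equivalently reference the computed columns directly; a sketch:

SELECT id, likes
FROM readers
WHERE
    $fav_color = 'blue'
    AND $fav_animal = 'dog'
    AND created > '2022-02-01'
    AND created < '2022-02-08'
ORDER BY likes DESC
LIMIT 3;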
Look at how many likes these synthetic readers have given! One day this will all be real.
id | likes
---|---
3635100 | 99604
14255445 | 99340
11872783 | 99035
Every database is different, but the principles of speeding things up by asking for less and working more efficiently apply to pretty much any of them.
By asking for less, we improved query latency nearly 5x.
By organizing our data so the database could work more efficiently, we improved query latency about 18x (56.764 ms to 3.124 ms).
Even if there’s no meaningful real-world difference to an application between latency of 17 ms vs. 4 ms, or 56 ms vs. 3 ms, by using fewer resources you’ll require a less powerful database cluster, so these latency reductions can become cost reductions.
Hopefully you’ve taken away something useful from this article. To be one of the first to read others like this, or for questions/comments, please follow me on X. Thanks!
I loaded the data into a SingleStore S-2 cluster in the AWS US-EAST-1 region (low latency to SingleStore’s default Notebooks location) in a SingleStore Trial Account. I didn’t use the SingleStore Shared Tier because it limits the amount of data and is noisier for benchmarking. The data load script and procedure were based on loadit, customized by o3-mini.
Here’s my generation cell content:
import singlestoredb as s2
import io
import random
import os
import json
import base64
from faker import Faker
NUM_ROWS = 20_000_000
TARGET_TABLE_NAME = "readers"
def setup(cur):
cur.execute(f"DROP TABLE IF EXISTS {TARGET_TABLE_NAME}")
cur.execute(
f"""
CREATE TABLE {TARGET_TABLE_NAME} (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
long_string LONGTEXT NOT NULL
)
"""
)
# connection_url is provided by the SingleStore notebook environment
with s2.connect(connection_url) as conn:
with conn.cursor() as cur:
setup(cur)
faker = Faker(use_weighting=False)
def generate(i):
    # Generate a base64-encoded string of 1536 random bytes (exactly 2048 characters)
    long_str = base64.b64encode(os.urandom(1536)).decode('utf-8')
animals = ["Cat", "Dog", "Rabbit", "Elephant", "Lion", "Tiger"]
properties = {
"favorite": {
"color": faker.color_name(),
"animal": random.choice(animals)
},
"age": random.randint(18, 90),
"city": faker.city(),
"job": faker.job(),
"company": faker.company(),
"hobby": faker.word(),
"ip": faker.ipv4(),
"status": random.choice(["active", "inactive"]),
"score": round(random.uniform(0, 100), 2),
"uuid": faker.uuid4(),
"boolean": random.choice([True, False]),
"random_string": faker.text(max_nb_chars=20),
"balance": round(random.uniform(1000, 5000), 2),
"long_string": long_str
}
return [
i,
faker.name(),
faker
.date_time_between(start_date='-3y', end_date='now')
.strftime('%Y-%m-%d %H:%M:%S.%f'),
random.randint(1, 100000),
json.dumps(properties),
long_str
]
Running the above gave me:
Total rows: 20000000
Elapsed time: 1622s
Rows added: 20000000
Rows/sec: 12328
Estimated time remaining: 0.0s
done
I then made sure everything was fully optimized:
OPTIMIZE TABLE readers FULL;
And created the alternative tables:
CREATE TABLE readers_sort (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
SORT KEY(created)
) AS SELECT id, name, created, likes, properties FROM readers;
CREATE TABLE readers_sort_keys (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
$fav_color AS properties::favorite::$color PERSISTED LONGTEXT,
$fav_animal AS properties::favorite::$animal PERSISTED LONGTEXT,
KEY($fav_color, $fav_animal),
SORT KEY(created)
) AS SELECT id, name, created, likes, properties FROM readers;
I ran queries with this snippet (running it twice for each query and taking the second run):
import random
import time
import statistics

# s2 and connection_url carry over from the earlier notebook cell;
# the import is repeated here so this cell can run standalone.
import singlestoredb as s2
ITERATIONS = 1000
execution_times = []
with s2.connect(connection_url) as conn:
with conn.cursor() as cur:
for _ in range(ITERATIONS):
random_id = random.randint(0, 20000000)
query = f"""
SELECT *
FROM readers
WHERE id = {random_id};
"""
start = time.perf_counter()
cur.execute(query)
execution_times.append(time.perf_counter() - start)
median_time = statistics.median(execution_times)
print(f"Median execution time over {ITERATIONS} runs: {median_time:.6f} seconds")
That’s it! Create your own SingleStore Trial Account and give it a try yourself, or maybe see how this applies to some other database.