Jason Thorsness

Jan 31 2025

Fast Queries Do Less

One universal technique for getting software systems to perform faster is to make them do less work. Consider two broad approaches:

  1. Ask for less
  2. Work efficiently

Both are applicable to querying databases. In this article I’ll cover how to apply these concepts to improve latency and reduce load on a database.

Ask For Less

In SQL, asking for less means adjusting your queries to return:

  1. Fewer rows
  2. Fewer columns
  3. Less data from each column

Consider a profile of you, my reader, and all other readers of this blog. Each row has a numeric id primary key, an assortment of columns like name, created, etc., and a properties column that contains JSON. I want to retrieve your favorite color and animal to render a friendly avatar.

One approach is to select everything and do the filtering in the client application.

SELECT *
FROM readers;

MEDIAN LATENCY: 💀

With the 20 million synthetic profiles I loaded, this crashes my backend. You might think, “nobody would ever do this”, but you would be wrong 😅. Regardless, let’s fix this query.

Fewer Rows

Enter the most fundamental way of asking for less in SQL: the WHERE clause. Here we only want one record, so it’s straightforward.

SELECT *
FROM readers
WHERE id = 123456;

MEDIAN LATENCY: 17.275 ms

Now I am asking for only one row out of 20 million, a tiny fraction of the data. This is a good start.

Fewer Columns

Requesting fewer columns reduces network transfer cost regardless of your flavor of database. For column stores this is more important because it scopes disk IO, decoding, and row reconstruction to only selected columns. Let’s modify our query.

SELECT properties
FROM readers
WHERE id = 123456;

MEDIAN LATENCY: 12.300 ms

Limiting the query to properties improves performance. But can we do better?
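Why column pruning matters in a column store can be sketched with a toy model (a few lines of Python, not how any real column store is implemented): each column is stored separately, and a query touches only the columns it filters on or returns.

```python
# Toy column store: each column is stored separately, like files on disk.
# A query reads only the columns it filters on or returns; `name` and
# any other unselected columns are never scanned or decoded.
columns = {
    "id": [1, 2, 3],
    "name": ["Ada", "Grace", "Alan"],
    "properties": ['{"color":"Red"}', '{"color":"Blue"}', '{"color":"Green"}'],
}

def select(wanted, filter_col, filter_val):
    """Scan only filter_col, then read just the wanted columns."""
    matching = [i for i, v in enumerate(columns[filter_col]) if v == filter_val]
    return [tuple(columns[c][i] for c in wanted) for i in matching]

print(select(["properties"], "id", 2))  # [('{"color":"Blue"}',)]
```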

Less Data From Each Column

Returning only a portion of each column further reduces network transfer cost. This can often make a big difference for JSON columns. For example, here is some JSON from properties:

{
  "age": 18,
  "balance": 3805.24,
  "boolean": false,
  "city": "Austinborough",
  "company": "Mcdonald, Blevins and Wood",
  "favorite": {
    "animal": "Elephant",
    "color": "Red"
  },
  "hobby": "surface",
  "ip": "203.98.0.243",
  "job": "Multimedia specialist",
  "long_string": "...",
  "random_string": "Particularly public.",
  "score": 18.7,
  "status": "active",
  "uuid": "e07e229c-2e1f-4410-b804-2fea3b24ae1d"
}

Retrieving just the favorite animal and color from this blob significantly reduces the amount of data transferred. Beyond transfer, in SingleStore at least, JSON columns are stored with each path “shredded” into a separate columnar structure, so scoping the paths requested further reduces disk IO, decoding, and row reconstruction cost. Let’s modify our query one more time.

SELECT
    properties::favorite::$animal AS animal,
    properties::favorite::$color AS color
FROM readers
WHERE id = 123456;

MEDIAN LATENCY:  3.706 ms

Now I’m fetching just one row, one column, and only the needed data from that column. We’ve taken the query from ~17 ms to ~12 ms to ~4 ms. This query has reached its final form in asking only for what it needs.
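The same "extract only the paths you need" idea exists outside SingleStore too. Here's a minimal sketch using SQLite's json_extract from Python (a stand-in for illustration only; it does not model SingleStore's shredded storage, and the table here is a one-row toy):

```python
import json
import sqlite3

# In-memory stand-in for the readers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readers (id INTEGER PRIMARY KEY, properties TEXT)")
conn.execute(
    "INSERT INTO readers VALUES (?, ?)",
    (123456, json.dumps({"favorite": {"animal": "Elephant", "color": "Red"}})),
)

# Pull only the two paths we need instead of the whole JSON blob.
animal, color = conn.execute(
    "SELECT json_extract(properties, '$.favorite.animal'),"
    "       json_extract(properties, '$.favorite.color')"
    "  FROM readers WHERE id = 123456"
).fetchone()
print(animal, color)  # Elephant Red
```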

Here’s your favorite animal in your favorite color.

[avatar: dog]

Work Efficiently

Working efficiently for databases involves minimizing the data processed by a query. If the result of a query derives from only a hundred rows, only those rows should be touched during execution. Achieving this ideal depends on how the data is organized.

Each database provides different controls for data organization. In SingleStore we can control data layout with partitions (SHARD KEY) and segments (SORT KEY). We can also create hash indexes (KEY) to further limit the number of rows involved in a query.

Partitions (SHARD KEY)

A good shard key evenly distributes the data across partitions and can be used with an equality constraint at query time. A unique identifier is often the best choice. That’s why in SingleStore, if you don’t specify a shard key, the primary key is used by default. So our query above is already optimized to go to a single partition, as shown when the EXPLAIN output starts with Gather partitions:single.

It’s difficult to select a good SHARD KEY other than a unique or mostly-unique identifier. If you pick a time-based column, you’ll often have hot spots as the data inserted and retrieved will correlate to the current time. If you pick something coarse-grained like company or country, you’ll often have hot spots because some companies and countries are larger than others.

Consider using an identifier that is shared by multiple tables you join — for example, if you have two tables A and B where A has id and B has a_id to accomplish the join, shard A on id and B on a_id to limit the cross-partition work required.
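The co-location idea can be sketched in a few lines of Python (a toy model of hash sharding, not SingleStore's actual partitioning scheme):

```python
# Toy hash sharding: a row lands on partition hash(key) % PARTITIONS.
# Sharding A on id and B on a_id puts joinable rows on the same
# partition, so the join needs no cross-partition data movement.
PARTITIONS = 4

def partition_of(key):
    return hash(key) % PARTITIONS

a_rows = [{"id": i} for i in range(100)]
b_rows = [{"a_id": i} for i in range(100)]

a_parts = [[r for r in a_rows if partition_of(r["id"]) == p]
           for p in range(PARTITIONS)]
b_parts = [[r for r in b_rows if partition_of(r["a_id"]) == p]
           for p in range(PARTITIONS)]

# Each partition can join its local A and B rows independently.
for p in range(PARTITIONS):
    local_b_keys = {r["a_id"] for r in b_parts[p]}
    assert all(r["id"] in local_b_keys for r in a_parts[p])
```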

Segments (SORT KEY)

Our query above is already using a unique PRIMARY KEY index so the sort key doesn’t matter much. To explore the effect of a SORT KEY, let’s switch queries to some analytics I might want to perform over the readers table. First, here’s the table definition we’re starting with:

CREATE TABLE readers (
    id INT PRIMARY KEY,
    name LONGTEXT NOT NULL,
    created DATETIME(6) NOT NULL,
    likes BIGINT NOT NULL,
    properties JSON,
    long_string LONGTEXT NOT NULL
);

For a given cohort of readers that joined within a certain date range, and who share a certain favorite color and animal, I might want to find the top three by likes. This query could look like this:

SELECT id, likes
FROM readers
WHERE
    properties::favorite::$color = 'blue'
    AND properties::favorite::$animal = 'dog'
    AND created > '2022-02-01'
    AND created < '2022-02-08'
ORDER BY likes DESC
LIMIT 3;

MEDIAN LATENCY: 56.764 ms

This is already pretty fast in SingleStore due to the columnar execution, but let’s peek a little behind the scenes to see what the database is doing.

PROFILE SELECT id
FROM readers
WHERE
    properties::favorite::$color = 'blue'
    AND properties::favorite::$animal = 'dog'
    AND created > '2022-02-01'
    AND created < '2022-02-08'
ORDER BY likes DESC
LIMIT 3;
SHOW PROFILE;

The key problem from the profile is segments_scanned: 111 segments_skipped: 0. This query is touching 111 segments. We can do much better by specifying created as the SORT KEY for this table. A time-based column is often an excellent choice for SORT KEY because queries filter on time using ranges, and insertions often come in close to append-order which minimizes the reordering required. Here’s the updated definition:

CREATE TABLE readers (
    id INT PRIMARY KEY,
    name LONGTEXT NOT NULL,
    created DATETIME(6) NOT NULL,
    likes BIGINT NOT NULL,
    properties JSON,
    long_string LONGTEXT NOT NULL,
    SORT KEY(created)
);

Now, our time has improved nearly 10x, because the query is doing much less work, with segments_scanned: 16 segments_skipped: 80.

MEDIAN LATENCY:  6.192 ms
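Segment elimination itself is simple to sketch: each segment carries min/max metadata for the sort key, and a segment is scanned only if its range overlaps the query's filter. Here's a toy Python model with made-up segment ranges (not SingleStore's actual metadata format):

```python
from datetime import date

# Toy segment metadata: with SORT KEY(created), each segment covers a
# narrow created range, so min/max metadata lets most segments be skipped.
segments = [
    {"min": date(2022, 1, 1), "max": date(2022, 1, 31)},
    {"min": date(2022, 2, 1), "max": date(2022, 2, 28)},
    {"min": date(2022, 3, 1), "max": date(2022, 3, 31)},
]

# Filter: created > 2022-02-01 AND created < 2022-02-08.
lo, hi = date(2022, 2, 1), date(2022, 2, 8)

# A segment must be scanned only if its [min, max] overlaps the filter.
scanned = [s for s in segments if s["max"] > lo and s["min"] < hi]
print(len(scanned), "scanned,", len(segments) - len(scanned), "skipped")
# 1 scanned, 2 skipped
```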

Rows (KEY and other indexes)

We’re still doing more work than necessary: we’re decoding properties for every row in the scanned segments. In this case I want to limit the scan to rows with a specific favorite color and animal. Since these values live inside a JSON column, it’s a two-step process. First, create persisted computed columns extracting each field. Then, create a hash index on those columns. Since the query filters on both fields, we can use a compound index.

CREATE TABLE readers (
    id INT PRIMARY KEY,
    name LONGTEXT NOT NULL,
    created DATETIME(6) NOT NULL,
    likes BIGINT NOT NULL,
    properties JSON,
    long_string LONGTEXT NOT NULL,
    $fav_color AS properties::favorite::$color PERSISTED LONGTEXT,
    $fav_animal AS properties::favorite::$animal PERSISTED LONGTEXT,
    KEY($fav_color, $fav_animal),
    SORT KEY(created)
);

Using this table, our time has improved another 2x, because the query is again doing much less work. This can be the final form of this table, at least for the two queries we’ve seen so far.

MEDIAN LATENCY:  3.124 ms
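The effect of persisted computed columns plus a hash index can be sketched as a dictionary built once at write time (a toy Python model with hypothetical rows, not SingleStore's index implementation):

```python
import json

# Hypothetical rows: id -> raw JSON properties.
rows = {
    1: '{"favorite": {"color": "blue", "animal": "dog"}}',
    2: '{"favorite": {"color": "red", "animal": "cat"}}',
    3: '{"favorite": {"color": "blue", "animal": "dog"}}',
}

# Build the index once at write time (analogous to the PERSISTED
# computed columns plus KEY($fav_color, $fav_animal)): each JSON blob
# is decoded exactly once, not once per query.
index = {}
for rid, props in rows.items():
    fav = json.loads(props)["favorite"]
    index.setdefault((fav["color"], fav["animal"]), []).append(rid)

# A lookup now touches only the matching row ids.
print(index[("blue", "dog")])  # [1, 3]
```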

Look at how many likes these synthetic readers have given! One day this will all be real.

id       | likes
3635100  | 99604
14255445 | 99340
11872783 | 99035

Conclusion

Every database is different, but the principles of speeding things up by asking for less and working more efficiently apply to pretty much any of them.

  • By asking for less, we improved query latency nearly 5x.

  • By organizing our data so the database could work more efficiently, we improved query latency nearly 10x.

Even if there’s no meaningful real-world difference to an application between 17 ms and 4 ms, or 56 ms and 3 ms, queries that use fewer resources let you run a less powerful database cluster, so these latency reductions can become cost reductions.

Hopefully you’ve taken away something useful from this article. To be one of the first to read others like this, or for questions/comments, please follow me on X. Thanks!
