One universal technique for getting software systems to perform faster is to make them do less work. Consider two broad approaches:

1. Asking for less: request only the data you actually need.
2. Working more efficiently: organize the data so less of it must be touched to produce a result.

Both are applicable to querying databases. In this article I’ll cover how to apply these concepts to improve latency and reduce load on a database.

In SQL, asking for less means adjusting your queries to return:

- fewer rows
- fewer columns
- less data from each column
Consider a profile of you, my reader, and all other readers of this blog. Each row has a numeric id primary key, an assortment of columns like name, created, etc., and a properties column that contains JSON. I want to retrieve your favorite color and animal to render a friendly avatar.
One approach is to select everything and do the filtering in the client application.
SELECT *
FROM readers;
MEDIAN LATENCY: 💀
With the 20 million synthetic profiles I loaded, this crashes my backend. You might think, “nobody would ever do this”, but you would be wrong 😅. Regardless, let’s fix this query.
Enter the most fundamental way of asking for less in SQL: the WHERE clause. Here we only want one record, so it’s straightforward.
SELECT *
FROM readers
WHERE id = 123456;
MEDIAN LATENCY: 17.275 ms
Now I am asking for only one row out of 20 million, a twenty-millionth of the data. This is a good start.
Requesting fewer columns reduces network transfer cost regardless of your flavor of database. For column stores this is even more important, because it scopes disk IO, decoding, and row reconstruction to only the selected columns. Let’s modify our query.
SELECT properties
FROM readers
WHERE id = 123456;
MEDIAN LATENCY: 12.300 ms
Limiting the query to properties improves performance. But can we do better?
Returning only a portion of each column further reduces network transfer cost. This can often make a big difference for JSON columns. For example, here is some JSON from properties:
{
"age": 18,
"balance": 3805.24,
"boolean": false,
"city": "Austinborough",
"company": "Mcdonald, Blevins and Wood",
"favorite": {
"animal": "Elephant",
"color": "Red"
},
"hobby": "surface",
"ip": "203.98.0.243",
"job": "Multimedia specialist",
"long_string": "...",
"random_string": "Particularly public.",
"score": 18.7,
"status": "active",
"uuid": "e07e229c-2e1f-4410-b804-2fea3b24ae1d"
}
Retrieving just the favorite animal and color from this blob significantly reduces the amount of data transferred. Beyond transfer, at least in SingleStore, JSON columns are stored with each path “shredded” into a separate columnar structure, so scoping the requested paths further reduces disk IO, decoding, and row reconstruction cost. Let’s modify our query one more time.
SELECT
properties::favorite::$animal AS animal,
properties::favorite::$color AS color
FROM readers
WHERE id = 123456;
MEDIAN LATENCY: 3.706 ms
Now I’m fetching just one row, one column, and only the needed data from that column. We’ve taken the query from ~17 ms to ~12 ms to ~4 ms. This query has reached its final form in asking only for what it needs.
Here’s your favorite animal in your favorite color.
For databases, working efficiently means minimizing the data a query processes. If the result of a query derives from only a hundred rows, only those rows should be touched during execution. Achieving this ideal depends on how the data is organized.
Each database provides different controls for data organization. In SingleStore we can control data layout with partitions (SHARD KEY) and segments (SORT KEY). We can also create hash indexes (KEY) to further limit the number of rows involved in a query.
Partitions (SHARD KEY)

A good shard key evenly distributes the data across partitions and can be used with an equality constraint at query time. A unique identifier is often the best choice; that’s why in SingleStore, if you don’t specify a shard key, the primary key is used by default. So our query above is already optimized to go to a single partition, apparent when the EXPLAIN starts with Gather partitions:single.
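As a quick check, here’s a sketch of verifying this with EXPLAIN on the earlier point lookup (the full plan text will vary with your cluster; the relevant part is the first line):

EXPLAIN
SELECT
    properties::favorite::$animal AS animal,
    properties::favorite::$color AS color
FROM readers
WHERE id = 123456;

-- Per the above, the first line of the plan should begin with:
-- Gather partitions:single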
It’s difficult to select a good SHARD KEY other than a unique or mostly-unique identifier. If you pick a time-based column, you’ll often have hot spots as the data inserted and retrieved will correlate to the current time. If you pick something coarse-grained like company or country, you’ll often have hot spots because some companies and countries are larger than others.

Consider using an identifier that is shared by multiple tables you join. For example, if you have two tables A and B where A has id and B has a_id to accomplish the join, shard A on id and B on a_id to limit the cross-partition work required.
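As a sketch of that layout, assuming two minimal hypothetical tables (the payload columns are just placeholders):

-- Shard both sides of the join on the join key so matching rows
-- land on the same partition.
CREATE TABLE A (
    id INT NOT NULL,
    payload LONGTEXT,
    SHARD KEY (id)
);

CREATE TABLE B (
    id INT NOT NULL,
    a_id INT NOT NULL, -- refers to A.id
    payload LONGTEXT,
    SHARD KEY (a_id)
);

-- This join can now be satisfied largely within each partition:
SELECT A.id, B.id
FROM A
JOIN B ON A.id = B.a_id;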
Segments (SORT KEY)

Our query above is already using a unique PRIMARY KEY index, so the sort key doesn’t matter much. To explore the effect of a SORT KEY, let’s switch queries to some analytics I might want to perform over the readers table. First, here’s the table definition we’re starting with:
CREATE TABLE readers (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
long_string LONGTEXT NOT NULL
);
For a given cohort of readers who joined within a certain date range and share a certain favorite color and animal, I might want to find the top three by likes. Such a query could look like this:
SELECT id, likes
FROM readers
WHERE
properties::favorite::$color = 'blue'
AND properties::favorite::$animal = 'dog'
AND created > '2022-02-01'
AND created < '2022-02-08'
ORDER BY likes DESC
LIMIT 3;
MEDIAN LATENCY: 56.764 ms
This is already pretty fast in SingleStore due to the columnar execution, but let’s peek a little behind the scenes to see what the database is doing.
PROFILE SELECT id
FROM readers
WHERE
properties::favorite::$color = 'blue'
AND properties::favorite::$animal = 'dog'
AND created > '2022-02-01'
AND created < '2022-02-08'
ORDER BY likes DESC
LIMIT 3;
SHOW PROFILE;
The key problem from the profile is segments_scanned: 111 segments_skipped: 0. This query is touching 111 segments. We can do much better by specifying created as the SORT KEY for this table. A time-based column is often an excellent choice for SORT KEY because queries filter on time using ranges, and insertions often arrive close to append-order, which minimizes the reordering required. Here’s the updated definition:
CREATE TABLE readers (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
long_string LONGTEXT NOT NULL,
SORT KEY(created)
);
Now our time has improved nearly 10x, because the query is doing much less work, with segments_scanned: 16 segments_skipped: 80.
MEDIAN LATENCY: 6.192 ms
Hash indexes (KEY and other indexes)

We’re still doing more work than necessary: we’re decoding the properties column for every row in the scanned segments. In this case I want to limit the scan to a specific favorite color and animal. Since these values live inside a JSON column, it’s a two-step process. First, create persisted computed columns extracting the fields. Then, create a hash index on those columns. Since the query uses both, we can use a compound index.
CREATE TABLE readers (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
long_string LONGTEXT NOT NULL,
$fav_color AS properties::favorite::$color PERSISTED LONGTEXT,
$fav_animal AS properties::favorite::$animal PERSISTED LONGTEXT,
KEY($fav_color, $fav_animal),
SORT KEY(created)
);
Using this table, our time has improved another 2x, because the query is again doing much less work. This can be the final form of this table, at least for the two queries we’ve seen so far.
MEDIAN LATENCY: 3.124 ms
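The query itself didn’t change; the latency drop implies SingleStore matched the properties::favorite::$color and properties::favorite::$animal expressions to the persisted columns and used the hash index. Assuming that matching, the filter could equivalently reference the computed columns directly; a sketch:

SELECT id, likes
FROM readers
WHERE
    $fav_color = 'blue'
    AND $fav_animal = 'dog'
    AND created > '2022-02-01'
    AND created < '2022-02-08'
ORDER BY likes DESC
LIMIT 3;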
Look at how many likes these synthetic readers have given! One day this will all be real.
id | likes
---|---
3635100 | 99604
14255445 | 99340
11872783 | 99035
Every database is different, but the principles of speeding things up by asking for less and working more efficiently apply to pretty much any of them.
By asking for less, we improved query latency nearly 5x.
By organizing our data so the database could work more efficiently, we improved query latency about 18x (56.764 ms to 3.124 ms).
Even if there’s no meaningful real-world difference to an application between latency of 17 ms vs. 4 ms, or 56 ms vs. 3 ms, by using fewer resources you’ll require a less powerful database cluster, so these latency reductions can become cost reductions.
Hopefully you’ve taken away something useful from this article. To be one of the first to read others like this, or for questions/comments, please follow me on X. Thanks!
I loaded the data into a SingleStore S-2 cluster in the AWS US-EAST-1 region (low latency to SingleStore’s default Notebooks location) in a SingleStore Trial Account. I didn’t use the SingleStore Shared Tier because it limits the amount of data and is noisier for benchmarking. The data load script and procedure were based on loadit, customized by o3-mini.
Here’s my generation cell content:
import singlestoredb as s2
import io
import random
import os
import json
import base64
from faker import Faker
NUM_ROWS = 20_000_000
TARGET_TABLE_NAME = "readers"
def setup(cur):
cur.execute(f"DROP TABLE IF EXISTS {TARGET_TABLE_NAME}")
cur.execute(
f"""
CREATE TABLE {TARGET_TABLE_NAME} (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
long_string LONGTEXT NOT NULL
)
"""
)
# connection_url is provided by the SingleStore notebook environment
with s2.connect(connection_url) as conn:
with conn.cursor() as cur:
setup(cur)
faker = Faker(use_weighting=False)
def generate(i):
    # Generate a base64-encoded string of 1536 random bytes (exactly 2048 characters)
    long_str = base64.b64encode(os.urandom(1536)).decode('utf-8')
animals = ["Cat", "Dog", "Rabbit", "Elephant", "Lion", "Tiger"]
properties = {
"favorite": {
"color": faker.color_name(),
"animal": random.choice(animals)
},
"age": random.randint(18, 90),
"city": faker.city(),
"job": faker.job(),
"company": faker.company(),
"hobby": faker.word(),
"ip": faker.ipv4(),
"status": random.choice(["active", "inactive"]),
"score": round(random.uniform(0, 100), 2),
"uuid": faker.uuid4(),
"boolean": random.choice([True, False]),
"random_string": faker.text(max_nb_chars=20),
"balance": round(random.uniform(1000, 5000), 2),
"long_string": long_str
}
return [
i,
faker.name(),
faker
.date_time_between(start_date='-3y', end_date='now')
.strftime('%Y-%m-%d %H:%M:%S.%f'),
random.randint(1, 100000),
json.dumps(properties),
long_str
]
Running the above gave me:
Total rows: 20000000
Elapsed time: 1622s
Rows added: 20000000
Rows/sec: 12328
Estimated time remaining: 0.0s
done
I then made sure everything was fully optimized:
OPTIMIZE TABLE readers FULL;
And created the alternative tables:
CREATE TABLE readers_sort (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
SORT KEY(created)
) AS SELECT id, name, created, likes, properties FROM readers;
CREATE TABLE readers_sort_keys (
id INT PRIMARY KEY,
name LONGTEXT NOT NULL,
created DATETIME(6) NOT NULL,
likes BIGINT NOT NULL,
properties JSON,
$fav_color AS properties::favorite::$color PERSISTED LONGTEXT,
$fav_animal AS properties::favorite::$animal PERSISTED LONGTEXT,
KEY($fav_color, $fav_animal),
SORT KEY(created)
) AS SELECT id, name, created, likes, properties FROM readers;
I ran queries with this snippet (running it twice for each query and taking the second run):
import random
import time
import statistics

# s2 and connection_url carry over from the earlier notebook cell;
# the import is repeated here so this cell can run standalone.
import singlestoredb as s2
ITERATIONS = 1000
execution_times = []
with s2.connect(connection_url) as conn:
with conn.cursor() as cur:
for _ in range(ITERATIONS):
random_id = random.randint(0, 20000000)
query = f"""
SELECT *
FROM readers
WHERE id = {random_id};
"""
start = time.perf_counter()
cur.execute(query)
execution_times.append(time.perf_counter() - start)
median_time = statistics.median(execution_times)
print(f"Median execution time over {ITERATIONS} runs: {median_time:.6f} seconds")
That’s it! Create your own SingleStore Trial Account and give it a try yourself, or maybe see how this applies to some other database.