Jason Thorsness

Sep 25, 2024

Randomized Demo Challenge

SingleStore shines when it has a lot of data to work with. This comes up while scale-testing new features and also when demonstrating the value of SingleStore to potential customers. In those cases, a generic bulk data set is often not good enough — a highly-tailored data set is ideal. I’ve spent days writing test programs and scripts to generate custom data. But for me, the days of “days” are over now that I have started using a workflow where I delegate all the work to an LLM and to SingleStore.

Who should care about this? Maybe if you work in sales at SingleStore, if you’re one of the SingleStore partners, or, perhaps, YOU, potential SingleStore customer, who can now use this workflow to effortlessly sell yourself the SingleStore database. Once again, in the spirit of delegation 😉

The entire workflow is super-simple and available on GitHub for you to try out now. But if you came for the challenge, read on or 🍿 watch a video for a short walkthrough of the process.

Note On Prompt

The video uses a different sample prompt than the content below, but the prompt and process are the same.

Step 1: Create A SingleStore Database

Go to https://www.singlestore.com/try-free/ and make an account, then log in to the SingleStore Portal. Click on “New Deployment” in the upper left.

“Starter” is limited to 1 GiB of data, so if you want to test with more, create an S-00 instead. Pick “AWS US East 1 (N. Virginia)” since this is close to the notebooks environment where we will generate the data.

Click “Create Deployment” and wait for the deployment to be ready.

Step 2: Create a Notebook

Now on the left navigation, select “Data Studio” and in the upper-right, click “New Notebook”. Give it a memorable name like “Chocolate Data Loader” and pick “Blank Notebook”.

Step 3: Create the Data Loader

In your notebook, you will create three cells.

Cell 1: Prerequisites

From github.com/jasonthorsness/loadit, copy the first cell and run it.

%sql CREATE DATABASE IF NOT EXISTS test;
!pip install faker

⚠️ After this, from the drop-down in the upper-left of the notebook, make sure you select “test” as your target database, or the next step will fail.
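
If you want to double-check, the same %sql magic from the first cell can confirm which database the notebook is targeting; it should print test:

%sql SELECT DATABASE();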

Cell 2: Generate

Next, from github.com/jasonthorsness/loadit, copy the setup/generate template into ChatGPT. Pick the o1-preview model. Prompt it to rewrite the setup and generate code to match your scenario. For example:

Rewrite this code to generate a test data set for the owner of a chocolate factory who wants to track production in real-time. The data should be as rich and complex as the chocolate itself. Make sure it is suitable for showing the SingleStore features of JSON, fast aggregations, and full-text search.

I am not a prompt engineer, but that should do it. Let o1-preview mull it over, and in less than a minute you should get back something like the following. Copy it in as your second cell and run it. Typically it will produce bug-free code — we truly live in a new era.

# version 2024-09-25
import singlestoredb as s2
import io
import random
import os
import json
import datetime
from faker import Faker

# Globals should reflect the number of rows to load and the table name
NUM_ROWS = 1_000_000
TARGET_TABLE_NAME = 'production'

# The setup function is immediately invoked below to create the target table
def setup(cur):
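    # connection_default_database, like connection_url below, is predefined in the notebook's global scope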
    cur.execute(f'CREATE DATABASE IF NOT EXISTS {connection_default_database}')
    cur.execute(f'DROP TABLE IF EXISTS {TARGET_TABLE_NAME}')
    cur.execute(f'''
        CREATE TABLE {TARGET_TABLE_NAME} (
          `production_id` BIGINT PRIMARY KEY,
          `timestamp` DATETIME NOT NULL,
          `machine_id` VARCHAR(50) NOT NULL,
          `operator_id` VARCHAR(50) NOT NULL,
          `product_type` VARCHAR(100) NOT NULL,
          `quantity` INT NOT NULL,
          `weight` FLOAT NOT NULL,
          `quality_metrics` JSON NOT NULL,
          `description` TEXT,
          `ingredients` JSON NOT NULL,
          `production_line` VARCHAR(50) NOT NULL,
          FULLTEXT USING VERSION 2 (description)
        )
    ''')

# connection_url is already defined in global scope
with s2.connect(connection_url) as conn:
    with conn.cursor() as cur:
        setup(cur)

faker = Faker(use_weighting=False)

product_types = [
    'Milk Chocolate', 'Dark Chocolate', 'White Chocolate', 'Hazelnut Chocolate',
    'Mint Chocolate', 'Caramel Chocolate', 'Fruit & Nut Chocolate',
    'Chocolate Truffle'
]
machine_ids = [f"MACH-{n}" for n in range(1, 21)]
operator_ids = [f"OP-{n}" for n in range(1, 51)]
production_lines = [f"Line-{n}" for n in range(1, 6)]
possible_ingredients = [
    'Cocoa', 'Sugar', 'Milk', 'Hazelnut', 'Almond', 'Mint', 'Caramel', 'Vanilla',
    'Fruit'
]

# The generate function will be called once for each row
# i is a unique 0-based integer row number
# it should return fields in the order required to INSERT in the table above
def generate(i):
    # production_id
    production_id = i

    # timestamp: random datetime in the past month
    now = datetime.datetime.now()
    timestamp = now - datetime.timedelta(seconds=random.randint(0, 2592000))

    # machine_id
    machine_id = random.choice(machine_ids)

    # operator_id
    operator_id = random.choice(operator_ids)

    # product_type
    product_type = random.choice(product_types)

    # quantity: between 100 and 1000
    quantity = random.randint(100, 1000)

    # weight: quantity times random weight per unit between 0.05kg and 0.2kg
    weight_per_unit = random.uniform(0.05, 0.2)
    weight = quantity * weight_per_unit

    # quality_metrics: JSON data
    quality_metrics = {
        'temperature': round(random.uniform(20.0, 30.0), 2),  # in deg. Celsius
        'humidity': round(random.uniform(30.0, 70.0), 2),     # in percent
        'viscosity': round(random.uniform(1.0, 5.0), 2),      # arbitrary units
        'notes': faker.sentence()
    }

    # description: some text
    description = faker.paragraph()

    # ingredients: JSON data
    num_ingredients = random.randint(2, 5)
    ingredients_list = random.sample(possible_ingredients, num_ingredients)
    ingredients = {}
    for ingredient in ingredients_list:
        ingredients[ingredient] = f"{random.uniform(0.1, 5.0):.2f} kg"

    # production_line
    production_line = random.choice(production_lines)

    return [
        production_id,
        timestamp.strftime('%Y-%m-%d %H:%M:%S'),
        machine_id,
        operator_id,
        product_type,
        quantity,
        weight,
        json.dumps(quality_metrics),
        description,
        json.dumps(ingredients),
        production_line
    ]
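
One optional sanity check (my addition, not part of the template): call generate once and make sure the fields come back in the same order as the columns in the CREATE TABLE statement.

# optional sanity check: print a single generated row before loading a million
print(generate(0))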

Cell 3: Load

Finally, copy the loader script from github.com/jasonthorsness/loadit. I won’t show it here, because this cell never needs to be changed. Run the cell, and you should get 1,000,000 rows of delicious chocolate-themed data in your database. This should take less than a minute. Here’s what I got for the above:

Total rows: 1000000
Elapsed time: 31s
Rows added: 1000000
Rows/sec: 31937

I’m no Python expert, but I spent a bunch of time tuning this loader. It’s gated on client CPU in the SingleStore notebook environment. If you can find a faster way, please submit a PR!
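
For intuition only, here is a minimal sketch of the same generate-then-insert shape. This is not the tuned loader from the repo, just an illustration assuming the generate cell above has already run; the batch size is my own arbitrary choice.

# Illustrative sketch only -- the real, tuned loader lives in the loadit repo.
# Batches rows from generate() into multi-row INSERTs via DB-API executemany.
import singlestoredb as s2

BATCH_SIZE = 10_000  # arbitrary; the repo's loader tunes this carefully

# 11 columns in the production table created in cell 2
placeholders = ', '.join(['%s'] * 11)
sql = f'INSERT INTO {TARGET_TABLE_NAME} VALUES ({placeholders})'

with s2.connect(connection_url) as conn:
    with conn.cursor() as cur:
        batch = []
        for i in range(NUM_ROWS):
            batch.append(generate(i))
            if len(batch) == BATCH_SIZE:
                cur.executemany(sql, batch)
                batch = []
        if batch:  # flush any remaining partial batch
            cur.executemany(sql, batch)
    conn.commit()

Multi-row batches amortize round-trip and statement overhead, which is the main lever when client CPU is the bottleneck.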

Demo Time

Now you can demo SingleStore features like JSON, full-text search, and fast aggregations. You can execute the queries from another Notebook, or from the SQL Editor tab at the top of the screen.

When you’re ready, run the loader for longer (raise NUM_ROWS in cell 2 and re-run cells 2 and 3) to make the demo more impressive. Here are some sample queries for our chocolate factory owner:

Instant aggregations!

SELECT SUM(quantity), product_type
FROM production
GROUP BY product_type
SUM(quantity)  product_type
     69000960  Chocolate Truffle
     68531622  Milk Chocolate
     69087133  Caramel Chocolate
     68570899  Fruit & Nut Chocolate
     68968829  Dark Chocolate
     68470740  Mint Chocolate
     68585170  White Chocolate
     68759210  Hazelnut Chocolate

Fast and flexible JSON!

SELECT AVG(quality_metrics::%humidity)
FROM production
WHERE NOT ISNULL(ingredients::$Almond)
AVG(quality_metrics::%humidity)
50.015114248278564

Advanced full-text search!

SELECT * FROM production
WHERE
QUANTITY > 500 AND
MATCH (TABLE production) AGAINST
('description:eat^4 AND description:enjoy AND description:bag');
production_id  description
        79586  Pattern entire sing end. Eat stop fall maintain ball. Well audience main media enjoy fight bag.

What Next?

With the help of LLMs and notebooks, you can generate large amounts of highly-tailored data to explore and demonstrate SingleStore features. Beyond what this article covered, the loader templates also include a version for MongoDB data if you’d like to use SingleStore Kai.
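
The Kai version isn’t shown here either, but since Kai speaks the MongoDB wire protocol, the general shape is a standard pymongo client pointed at your deployment’s Kai endpoint (kai_connection_url below is a placeholder, not a predefined variable):

# Sketch only -- the actual MongoDB template is in the loadit repo.
# kai_connection_url is a placeholder for your deployment's Kai endpoint.
from pymongo import MongoClient

client = MongoClient(kai_connection_url)
db = client['test']
db.production.insert_many(
    [{'production_id': i, 'product_type': 'Milk Chocolate'} for i in range(10)]
)

Try it today!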
