Jason Thorsness

Aug 28, 2024

JSON Schema with Speed

JSON Schema is a popular technology used by OpenAI and many other companies. It’s often applied for input validation at API boundaries, but what if you could use it directly within your database? Other scenarios open up, such as:

  • Retrieving or counting documents that match or don’t match a given schema
  • Rejecting inserts and updates of JSON if they don’t match a schema
  • Moving documents that don’t match a schema to a different table

Well, now you can, using SingleStore’s CREATE EXTENSION command with s2valijson, an extension I wrote that wraps the Valijson library!

It’s now incredibly easy to write JSON Schemas using LLMs, so I think we’ll see an uptick in the popularity of this technology. In this article, I’ll cover how you can use a SingleStore Extension to add JSON Schema capabilities to your database with a simple SQL statement, then apply them to some practical scenarios.

Read on or 🍿 watch a video for a short walkthrough of this article.

Installing the JSON Schema Extension

I created the s2valijson extension by compiling Valijson for WebAssembly with a simple wrapper, using the same process I described in my article on s2re2.

SingleStore Extensions are easy to install straight from GitHub. To add it in a version-safe way, you can copy/paste the following SQL:

-- versioned s2valijson extension
CREATE EXTENSION s2valijson_2024_08_27
FROM HTTP
'https://github.com/jasonthorsness/s2valijson/raw/main/dist/2024_08_27.tar';

-- s2valijson_validate_json
DELIMITER //
CREATE OR REPLACE FUNCTION s2valijson_validate_json(
  input LONGTEXT,
  sch LONGTEXT)
  RETURNS TINYINT(1) AS
BEGIN
  RETURN s2valijson_2024_08_27_validate_json(input, sch);
END //
DELIMITER ;

-- s2valijson_validate_json_raise
DELIMITER //
CREATE OR REPLACE FUNCTION s2valijson_validate_json_raise(
  input LONGTEXT,
  sch LONGTEXT)
  RETURNS LONGTEXT AS
BEGIN
  RETURN s2valijson_2024_08_27_validate_json_raise(input, sch);
END //
DELIMITER ;

-- s2valijson_validate_json_errors
DELIMITER //
CREATE OR REPLACE FUNCTION s2valijson_validate_json_errors(
  input LONGTEXT,
  sch LONGTEXT)
  RETURNS LONGTEXT AS
BEGIN
  RETURN s2valijson_2024_08_27_validate_json_errors(input, sch);
END //
DELIMITER ;

Once the extension is installed, you can use it in all your queries. If you want to follow along, try it out on SingleStore’s Free Shared Tier.
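As a quick sanity check, you can validate a trivial document against a trivial schema right from the SQL prompt. This is just a sketch to confirm the wrapper functions are callable; both arguments here are minimal examples, not part of the extension itself:

-- should return 1: an object matches {"type": "object"}
SELECT s2valijson_validate_json('{"name": "test"}', '{"type": "object"}');

-- should return 0: a string is not an object
SELECT s2valijson_validate_json('"not an object"', '{"type": "object"}');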

Using the JSON Schema Extension

Let’s apply this JSON Schema extension to a few common scenarios. First, we need a schema. Gone are the days of writing these by hand — we can just ask an LLM in English and it will generate one for us. For example, here I’ve asked ChatGPT to create a schema for a blog post:

Write me a JSON schema for a blog post with a title, author, tags, and content.
All properties are required and must be non-zero length, and no extra
properties are allowed, and tags must be lowercase ascii strings.

Here’s the unedited schema ChatGPT generated for us. Pretty easy!

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "minLength": 1
    },
    "author": {
      "type": "string",
      "minLength": 1
    },
    "tags": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "string",
        "minLength": 1,
        "pattern": "^[a-z0-9]+$"
      }
    },
    "content": {
      "type": "string",
      "minLength": 1
    }
  },
  "required": ["title", "author", "tags", "content"],
  "additionalProperties": false
}

Let’s create a UDF to return the schema. This leads to more readable queries.

DELIMITER //
CREATE OR REPLACE FUNCTION json_schema_blog_post_v1()
RETURNS JSON COLLATE utf8mb4_bin NOT NULL AS
BEGIN
    RETURN '
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "minLength": 1
    },
    "author": {
      "type": "string",
      "minLength": 1
    },
    "tags": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "string",
        "minLength": 1,
        "pattern": "^[a-z0-9]+$"
      }
    },
    "content": {
      "type": "string",
      "minLength": 1
    }
  },
  "required": ["title", "author", "tags", "content"],
  "additionalProperties": false
}';
END //
DELIMITER ;
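Before going further, it’s worth a quick check that the UDF and the validator agree that a well-formed post passes. The sample document below is just an illustration:

-- should return 1: the sample post satisfies the schema
SELECT s2valijson_validate_json(
    '{"title": "t", "author": "a", "tags": ["tech"], "content": "c"}',
    json_schema_blog_post_v1());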

Now we can apply this schema to a number of interesting scenarios. Let’s create a posts table for the demo:

CREATE TABLE posts (
  id INT NOT NULL PRIMARY KEY,
  data JSON COLLATE utf8mb4_bin NOT NULL
);

Validating at Insert Time

If you want to validate that inserted data conforms to a schema, but your client application doesn’t support JSON Schema, or you are ingesting straight from a raw source like S3 or Kafka via a SingleStore Pipeline, you can validate right in your SQL. For example:

INSERT INTO posts (id, data) VALUES (0, s2valijson_validate_json_raise(
    '{
        "title": "Hello, World",
        "author": "Jason",
        "tags": ["tech", "database"],
        "content": "This is a blog post."
    }',
    json_schema_blog_post_v1()
    ));

That passed! But what if our tags are not lowercase and we’re missing the author?

INSERT INTO posts (id, data) VALUES (1, s2valijson_validate_json_raise(
    '{
        "title": "Hello, World",
        "tags": ["Tech", "database"],
        "content": "This is a blog post."
    }',
    json_schema_blog_post_v1()
    )
);

This returns the following errors, and no document is inserted.

 <root>.[tags].[0]: Failed to match regex specified by 'pattern' constraint.
 <root>.[tags]: Failed to validate item #0 in array.
 <root>: Failed to validate against schema associated with property name 'tags'.
 <root>: Missing required property 'author'.
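The same guard works when ingesting from a raw source via a pipeline. Here’s a rough sketch that validates each row as it lands; the bucket, region, credentials, and the tab-separated file layout (an id, then a JSON document per line) are placeholder assumptions for illustration only:

-- sketch only: assumes tab-separated files of (id, json) in a placeholder bucket
CREATE PIPELINE posts_pipeline AS
LOAD DATA S3 's3://my-bucket/posts/'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO TABLE posts
FIELDS TERMINATED BY '\t'
(id, @data)
SET data = s2valijson_validate_json_raise(@data, json_schema_blog_post_v1());

Any row that fails validation raises an error instead of landing in the table, so bad documents never make it in.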

Validation as a Query

Let’s say we have an existing collection of blog posts we want to validate. As I aspire to be the world’s most prolific blogger, let’s fill our collection with ten thousand valid random posts.

DELIMITER //
DO DECLARE
    arr ARRAY(RECORD(id INT NOT NULL, data JSON NOT NULL))
        = CREATE_ARRAY(9999);
    x INT;
BEGIN
    FOR i in 1 .. 9999 LOOP
    arr[i-1] = ROW(i,
    s2valijson_validate_json_raise(
        CONCAT(
        '{
            "title": "',RAND(),'",
            "author": "Jason",
            "tags": ["tech", "database", "',
                CHAR(FLOOR(RAND() * 26) + 97),'"],
            "content": "',RAND(),'"
        }'),json_schema_blog_post_v1()
        ));
    END LOOP;
    x = INSERT_ALL("posts", arr);
END //
DELIMITER ;

We can easily reconfirm that all the records in the table are valid:

SELECT
    SUM(s2valijson_validate_json(data, json_schema_blog_post_v1())),
    COUNT(*)
FROM posts;

What if we add 10,000 more, but some of them have invalid tags containing unacceptable characters such as curly braces and pipes?

DELIMITER //
DO DECLARE
    arr ARRAY(RECORD(id INT NOT NULL, data LONGTEXT NOT NULL))
        = CREATE_ARRAY(10000);
    x INT;
BEGIN
    FOR i in 10000 .. 19999 LOOP
        arr[i-10000] = ROW(i,
            CONCAT(
            '{
                "title": "',RAND(),'",
                "author": "Jason",
                "tags": ["tech", "database", "',
                    CHAR(FLOOR(RAND() * 28) + 97),'"],
                "content": "',RAND(),'"
            }'));
    END LOOP;
    x = INSERT_ALL("posts", arr);
END //
DELIMITER ;

Re-running the count query shows we now have 20,000 posts, but not all of them are valid. Let’s print the errors from some of the invalid ones:

SELECT
    SUM(s2valijson_validate_json(data, json_schema_blog_post_v1())),
    COUNT(*)
FROM posts;

SELECT
    s2valijson_validate_json_errors(data, json_schema_blog_post_v1())
FROM posts
WHERE s2valijson_validate_json(data, json_schema_blog_post_v1()) = 0
LIMIT 10;
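If you’d rather see a breakdown than a sample, the same functions compose with ordinary aggregation. A small variation on the query above:

-- count invalid documents grouped by their error text
SELECT
    s2valijson_validate_json_errors(data, json_schema_blog_post_v1()) AS errors,
    COUNT(*) AS invalid_count
FROM posts
WHERE s2valijson_validate_json(data, json_schema_blog_post_v1()) = 0
GROUP BY errors
ORDER BY invalid_count DESC;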

So what should we do with these bad posts? How about transactionally moving them all to a bad_posts table and deleting them from the main posts table:

CREATE TABLE bad_posts(
  id INT NOT NULL PRIMARY KEY,
  data JSON COLLATE utf8mb4_bin NOT NULL
);

START TRANSACTION;
INSERT INTO bad_posts
    SELECT id, data
    FROM posts
    WHERE s2valijson_validate_json(data, json_schema_blog_post_v1()) = 0;
DELETE posts FROM posts INNER JOIN bad_posts ON posts.id = bad_posts.id;
COMMIT;

Now my bad posts are nicely quarantined:

SELECT COUNT(*) FROM bad_posts;

SELECT
    SUM(s2valijson_validate_json(data, json_schema_blog_post_v1())),
    COUNT(*)
FROM posts;
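The same raise-on-failure trick also covers the update case from the list at the top. A minimal sketch, where the replacement value is deliberately invalid to show the rejection:

-- fails with a validation error and leaves the row unchanged:
-- the new value is missing required properties
UPDATE posts
SET data = s2valijson_validate_json_raise('{"title": "Oops"}', json_schema_blog_post_v1())
WHERE id = 0;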

These are just a few of the interesting applications for having JSON Schema validation directly within your database. Go and use it today!

Closing Thoughts

JSON Schema is a powerful tool for validating JSON data, and it’s now easy to use within SingleStore. Having JSON Schema directly within the database enables many scenarios beyond validation at insert time. While JSON is often thought of as schema-less, over time it can be a huge pain to manage all the different shapes of JSON documents that accumulate in a long-lived database. This structural evolution is almost inevitable, driven by changing business requirements, updates from third-party APIs, and many other factors, and JSON Schema is one way to understand and manage the complexity.
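For example, if a v2 of the blog post schema were introduced later (json_schema_blog_post_v2 below is hypothetical), a single query can report how much of the existing data still matches each version:

-- hypothetical: json_schema_blog_post_v2 is assumed to exist alongside v1
SELECT
    SUM(s2valijson_validate_json(data, json_schema_blog_post_v1())) AS matches_v1,
    SUM(s2valijson_validate_json(data, json_schema_blog_post_v2())) AS matches_v2,
    COUNT(*) AS total
FROM posts;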

Beyond this specific use case, the JSON Schema extension represents another useful application of SingleStore’s CREATE EXTENSION command and WebAssembly UDFs. I discussed these more in a previous article. This safe approach to extensibility will lead to a lot more interesting functionality being added to SingleStore, and maybe other databases should follow suit! If you have any ideas for more extensions, let me know on X @jasonthorsness.
