SQL excels at expressing what data to fetch and return — there’s a reason it’s been unbeatable for 50 years. But when it comes to data transformation, SQL lags general-purpose programming languages and their ecosystems. Recognizing this gap, some database engines, including SingleStore, allow extending SQL with C, C++, Rust, Go, C#, and many more languages, through the use of WebAssembly.
A challenge with WebAssembly has been ease-of-use: getting “wasm” files into the database. I’m
delighted to report that SingleStore’s new
CREATE EXTENSION
command eliminates this problem. Behold, a single-statement deployment of the
Google RE2 regular expression library into SingleStore:
-- install from https://github.com/jasonthorsness/s2re2
CREATE EXTENSION s2re2_2024_08_18
FROM HTTP
'https://github.com/jasonthorsness/s2re2/raw/main/dist/2024_08_18.tar';
-- returns true
SELECT s2re2_2024_08_18_match('abc', 'a(.*)c');
-- returns b (the contents of the first capture group)
SELECT s2re2_2024_08_18_match_extract('abc', 'a(.*)c');
RE2 is awesome and this s2re2 extension has immediate real-world applications:
Safe matching with untrusted patterns and input with linear time complexity.
🔥 It outperforms REGEXP
on some inputs.
Copy the SQL above and try it today on SingleStore’s free shared tier.
For more documentation and examples of how to use the s2re2 extension and what scenarios are a good fit, check out the s2re2 repo.
So it’s easy to use, but was it straightforward to create?
It was! And the same approach can be applied to create many other extensions. Read on or 🍿 watch a video for a complete walkthrough of how I created the extension and how you can create your own to use and share.
Let’s walk through the steps to create this extension from scratch.
The code below is all copied from the s2re2 repo. Use landscape mode to best view source blocks on mobile.
A Docker-based build provides a common approach for any extension, so make sure Docker is installed. I’ll also be using Bash-style shell commands, so if you’re on Windows, you should be using WSL.
What exactly are we building? SingleStore WASM UDFs are based on an early version of the WASI 0.1 Canonical ABI, an effort to standardize interfaces and calling conventions between WebAssembly modules and their host environments. There’s not a great reference for the version used by SingleStore; the most authoritative is the source code of a specific version of the WASI binding generator.
Fortunately, the contract is mostly simple, so rather than use a binding generator I prefer to just write the bindings by hand. To work as a UDF in SingleStore a WebAssembly module needs to expose the following:
Export | Purpose |
---|---|
_initialize | Optional function to initialize the module. |
canonical_abi_realloc | Function to allocate memory. |
canonical_abi_free | Function to free memory. |
my_function | Function implementing the UDF (any name is fine). |
During a query that involves a WASM UDF, SingleStore will create one or more instances of the
module. For each instance it will call _initialize
once, then my_function
as many times as it
needs. When calling my_function
:
Numbers are passed as 32-or-64-bit integers or floats.
Strings and arrays are passed as a 32-bit pointer followed by a 32-bit length.
Structures (RECORD
type in SingleStore) are passed as packed values.
If there is a single return value, it is returned directly. For multiple return values, including strings and arrays, the return value is a pointer to a memory region containing the actual return values.
To pass strings and arrays, SingleStore calls canonical_abi_realloc
to get a pointer, copies the
value to that location, and calls my_function
with the pointer and length. It’s the responsibility
of my_function
to free all string and array pointer parameters before returning.
For return values, my_function
should allocate memory for strings and arrays with the
understanding that SingleStore will call canonical_abi_free
on any returned string and array
pointers.
So for example, to expose a function in SQL like this:
CREATE FUNCTION foo(a INT NOT NULL, b BIGINT NOT NULL) RETURNS INT NOT NULL ...
All the types are primitives, so there’s no need to use the memory allocation functions. The function signature could look like this:
int32_t foo(int32_t a, int64_t b);
For a more complex example, consider a function that takes a LONGTEXT and returns a BOOL and LONGBLOB:
CREATE FUNCTION bar(a LONGTEXT NOT NULL)
RETURNS RECORD(m BOOL NOT NULL, s LONGBLOB NOT NULL) NOT NULL ...
Here the LONGTEXT parameter is passed as a pointer and length, and since the return has multiple values, it’s passed as the address of a structure containing those values. The LONGBLOB in the result is also a pointer and length.
struct bar_result_type
{
bool m;
char* s_ptr;
size_t s_len;
};
bar_result_type bar_result;
void *bar(char *a_ptr, size_t a_len)
{
// the caller will use canonical_abi_realloc to allocate memory for a_ptr
// remember to free a_ptr in this function
// ...
// on return, allocate memory for s_ptr and fill in bar_result
// the caller will free s_ptr with canonical_abi_free
//
return &bar_result;
}
These examples cover most of what you’d need for WASM UDFs in SingleStore. For more details on parameter passing, even though it’s for a newer version of the overall WASI contract, the Canonical ABI Explainer still has good information on parameter passing and returns (which it calls “lowering” and “lifting”).
Any WebAssembly module implementing these conventions can be used as a UDF in SingleStore, regardless of which language it was written in and what tool chain was used to create it. That’s one of the cool features of WebAssembly. For the s2re2 extension, we’ll use C++ since that’s the language of RE2.
We’ll write three short C++ files to wrap the RE2 library and expose it as a UDF:
File | Purpose |
---|---|
mem.cpp | canonical_abi_realloc and canonical_abi_free . |
main.cpp | Match and MatchExtract |
abs.cpp | A dummy Abseil Mutex, to help reduce the binary size. |
Our memory-management functions simply wrap realloc and free. Note the EMSCRIPTEN_KEEPALIVE tag; we’ll be compiling using Emscripten later so this is necessary to ensure the functions are not trimmed out of the generated module.
#include <emscripten.h>
extern "C" EMSCRIPTEN_KEEPALIVE void* canonical_abi_realloc(
void* ptr, size_t /* currentSize */, size_t /* align */, size_t newSize)
{
void* result = realloc(ptr, newSize);
if (result == 0)
{
abort();
}
return result;
}
extern "C" EMSCRIPTEN_KEEPALIVE void canonical_abi_free(
void* ptr, size_t /* size */, size_t /* align*/)
{
free(ptr);
}
The actual UDF implementations are Match
and MatchExtract
. These wrap the corresponding methods
of the RE2 API. We’ll need a few headers.
// We are using absl::string_view since RE2 already uses that; no need to bring
// in std::string_view and inflate the binary size.
//
#include "absl/strings/string_view.h"
#include <emscripten.h>
#include <re2/re2.h>
Since the same module instance can be used multiple times within a query, and typically the same pattern will be used, we can improve performance by caching a compiled RE2 instance.
// Globals to save a compiled pattern across invocations.
//
absl::string_view latchedPattern;
RE2* latchedInstance;
Now we’ll introduce the Match function, which will take the test and pattern strings, check whether
we can use the cached already-compiled instance, call RE2::PartialMatch
, and return the result.
Why the #ifdef COMPILE_MATCH
? There is a small performance advantage to keeping WASM files backing
UDFs as small as possible, especially when they are used in INSERT ... VALUES
statements.
Compiling a separate WASM module for each UDF gives the compiler toolchain more opportunities to
discard unused code. Each generated WASM file will end up smaller than a common one that has all the
functions together — so even though in this case I don’t think it makes a huge difference, I’ll
follow that best practice.
#ifdef COMPILE_MATCH
extern "C" EMSCRIPTEN_KEEPALIVE bool Match(
char* testPtr, size_t testLen, char* patternPtr, size_t patternLen)
{
absl::string_view test(testPtr, testLen);
absl::string_view pattern(patternPtr, patternLen);
if (latchedInstance != nullptr && pattern != latchedPattern)
{
bool result = RE2::PartialMatch(test, pattern);
free(patternPtr);
free(testPtr);
return result;
}
if (latchedInstance == nullptr)
{
latchedPattern = pattern;
latchedInstance = new RE2(pattern, RE2::Quiet);
}
else
{
free(patternPtr);
}
bool result = RE2::PartialMatch(test, *latchedInstance);
free(testPtr);
return result;
}
#endif
MatchResult is slightly more complicated, because it will return both whether or not the pattern matched the test string, but also the first submatch. We’ll define a structure for our result, declare a global variable of that structure, and return the address of that global variable.
#ifdef COMPILE_MATCH_EXTRACT
struct MatchExtractResult
{
bool matched;
char* submatchPtr;
size_t submatchLen;
};
// Since we have a multi-value return, we'll return the address of this global
// struct containing the multiple return values.
//
MatchExtractResult matchExtractResult;
// MatchExtract returns the result along with a single submatch
// To return the entire pattern, wrap the whole thing in parentheses
//
extern "C" EMSCRIPTEN_KEEPALIVE void* MatchExtract(
char* testPtr, size_t testLen, char* patternPtr, size_t patternLen)
{
absl::string_view test(testPtr, testLen);
absl::string_view pattern(patternPtr, patternLen);
if (latchedInstance != nullptr && pattern != latchedPattern)
{
absl::string_view extracted;
if (!RE2::PartialMatch(test, pattern, &extracted))
{
free(patternPtr);
free(testPtr);
matchExtractResult = {false, nullptr, 0};
return &matchExtractResult;
}
free(patternPtr);
// Rather than free testPtr, since we own it reuse it for the result
//
memcpy(testPtr, extracted.data(), extracted.size());
matchExtractResult = {true, testPtr, extracted.size()};
return &matchExtractResult;
}
if (latchedInstance == nullptr)
{
latchedPattern = pattern;
latchedInstance = new RE2(pattern, RE2::Quiet);
}
else
{
free(patternPtr);
}
absl::string_view extracted;
if (!RE2::PartialMatch(test, *latchedInstance, &extracted))
{
free(testPtr);
matchExtractResult = {false, nullptr, 0};
return &matchExtractResult;
}
// Rather than free testPtr, since we own it reuse it for the result
//
memcpy(testPtr, extracted.data(), extracted.size());
matchExtractResult = {true, testPtr, extracted.size()};
return &matchExtractResult;
}
#endif
Finally, there’s one more piece of C++. RE2 is thread-safe and uses an Abseil mutex, which increases the generated WASM size by ~30 KiB. Since WebAssembly is single-threaded anyway, we can just link in a stub version of the mutex instead.
#include "absl/synchronization/mutex.h"
// Since WebAssembly is single-threaded we can have a dummy Mutex
//
namespace absl
{
inline namespace lts_20240722
{
void Mutex::Lock()
{
}
void Mutex::Unlock()
{
}
void Mutex::ReaderLock()
{
}
void Mutex::ReaderUnlock()
{
}
} // namespace lts_20240722
} // namespace absl
We’ll use Emscripten to compile the C++ code into a WASM module. It’s also possible to use the WASI SDK, but I’ve found Emscripten to be generally easier to use.
ARG RE2_TAG=2024-07-02
ARG ABSEIL_TAG=20240722.0
# start with the emscripten builder image
FROM emscripten/emsdk:3.1.64
# get the build tools
RUN apt-get update && \
apt-get install -y git cmake build-essential
First we’ll build the Abseil library which RE2 depends on. We don’t need logging; setting
ABSL_MIN_LOG_LEVEL=99
and ABSL_MAX_VLOG_VERBOSITY=-99
will strip all logging support code from
the binary and significantly reduce its size.
# build abseil
ARG ABSEIL_TAG
RUN git clone --branch ${ABSEIL_TAG} \
--depth 1 https://github.com/abseil/abseil-cpp.git /abs
WORKDIR /abs
RUN mkdir build && \
cd build && \
emcmake cmake .. \
-DCMAKE_POSITION_INDEPENDENT_CODE=TRUE \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_STANDARD=20 \
-DCMAKE_CXX_FLAGS="-DSTANDALONE_WASM=1 -fno-rtti -fno-exceptions -flto \
-DABSL_MIN_LOG_LEVEL=99 -DABSL_MAX_VLOG_VERBOSITY=-99"
RUN cd build && \
emmake make -j$(nproc) && \
emmake make install
Now we’ll build RE2.
# build re2
ARG RE2_TAG
RUN git clone --branch ${RE2_TAG} \
--depth 1 https://github.com/google/re2.git /re2
WORKDIR /re2
RUN mkdir -p build && \
cd build && \
emcmake cmake .. \
-DRE2_BUILD_TESTING=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_STANDARD=20 \
-DCMAKE_CXX_FLAGS="-DSTANDALONE_WASM=1 -fno-rtti -fno-exceptions -flto \
-DABSL_MIN_LOG_LEVEL=99 -DABSL_MAX_VLOG_VERBOSITY=-99"
RUN cd build && \
emmake make -j$(nproc)
With the libraries built, we can compile the main modules. We’ll compile a separate one for each UDF; as mentioned above keeping the per-UDF WASM size as small as possible has a performance advantage over exporting multiple functions from the same UDF.
One interesting thing to note here is EVAL_CTORS. Emscripten has this awesome feature where it will run constructors for global variables at compile time and snapshot the results, eliminating the need to call them in subsequence uses of the module. In some cases this can have a significant performance benefit.
# build Match and MatchExtract
COPY main.cpp /main.cpp
COPY mem.cpp /mem.cpp
COPY abs.cpp /abs.cpp
# Match
RUN emcc -DCOMPILE_MATCH -o /match.js \
/main.cpp /mem.cpp /abs.cpp \
-O3 -s STANDALONE_WASM -s MALLOC=emmalloc -s EVAL_CTORS=2 \
-fno-rtti -fno-exceptions -flto --no-entry \
-s EXPORTED_FUNCTIONS='[ \
"_canonical_abi_realloc", \
"_canonical_abi_free", \
"_Match"]' \
-I/re2 \
/re2/build/libre2.a \
/abs/build/absl/strings/*.a \
/abs/build/absl/hash/*.a \
/abs/build/absl/container/*.a \
/abs/build/absl/base/*.a
# MatchExtract
RUN emcc -DCOMPILE_MATCH_EXTRACT -o /match_extract.js \
/main.cpp /mem.cpp /abs.cpp \
-O3 -s STANDALONE_WASM -s MALLOC=emmalloc -s EVAL_CTORS=2 \
-fno-rtti -fno-exceptions -flto --no-entry \
-s EXPORTED_FUNCTIONS='[ \
"_canonical_abi_realloc", \
"_canonical_abi_free", \
"_MatchExtract"]' \
-I/re2 \
/re2/build/libre2.a \
/abs/build/absl/strings/*.a \
/abs/build/absl/hash/*.a \
/abs/build/absl/container/*.a \
/abs/build/absl/base/*.a
Finally, it’s important to know whether the imports and exports are correct, so the build script extracts them for debugging purposes using the WebAssembly Binary Toolkit.
# extract imports/exports for debugging
RUN apt-get install -y wabt
RUN wasm2wat /match.wasm | \
grep -E "(import|export)" \
> /match.txt
RUN wasm2wat /match_extract.wasm | \
grep -E "(import|export)" \
> /match_extract.txt
After that, the resulting files need to be copied out of the Docker container. This is orchestrated with a script as follows:
# Update to create a new version
VERSION=2024_08_18
rm -rf ./build
mkdir -p ./build
docker build -t s2re2_$VERSION ./docker/
docker create --name s2re2_$VERSION s2re2_$VERSION
docker cp s2re2_$VERSION:/match.wasm ./build/match.wasm
docker cp s2re2_$VERSION:/match.txt ./build/match.txt
docker cp s2re2_$VERSION:/match_extract.wasm ./build/match_extract.wasm
docker cp s2re2_$VERSION:/match_extract.txt ./build/match_extract.txt
docker rm s2re2_$VERSION
The SingleStore CREATE EXTENSION
command takes a URL to a tar
file as input. This file must
contain an install SQL script along with supporting files such as the WASM files we build. Typically
there are two UDFs required for each function in the SQL script - the WASM function, and a SQL-based
function that handles NULL values (since NULL is not a valid parameter or return value for a WASM
function).
Match is straightforward:
CREATE FUNCTION s2re2_${VERSION}_match_inner(
input LONGTEXT NOT NULL,
pattern LONGTEXT NOT NULL)
RETURNS TINYINT(1) NOT NULL
AS WASM
FROM LOCAL INFILE 'match.wasm'
USING EXPORT 'Match';
DELIMITER //
CREATE FUNCTION s2re2_${VERSION}_match(
input LONGTEXT,
pattern LONGTEXT)
RETURNS TINYINT(1) AS
DECLARE
result RECORD(m TINYINT(1) NOT NULL, s LONGTEXT NOT NULL);
BEGIN
IF ISNULL(input) OR ISNULL(pattern) THEN
RETURN NULL;
END IF;
return s2re2_${VERSION}_match_inner(input, pattern);
END //
For MatchExtract, we also need to take the structure returned and return either the submatch (if matched) or NULL (if not matched).
CREATE FUNCTION s2re2_${VERSION}_match_extract_inner(
input LONGTEXT NOT NULL,
pattern LONGTEXT NOT NULL)
RETURNS RECORD(m TINYINT(1) NOT NULL, s LONGTEXT NOT NULL)
AS WASM
FROM LOCAL INFILE 'match_extract.wasm'
USING EXPORT 'MatchExtract';
DELIMITER //
CREATE FUNCTION s2re2_${VERSION}_match_extract(
input LONGTEXT NULL,
pattern LONGTEXT NULL)
RETURNS LONGTEXT AS
DECLARE
result RECORD(m TINYINT(1) NOT NULL, s LONGTEXT NOT NULL);
BEGIN
IF ISNULL(input) OR ISNULL(pattern) THEN
RETURN NULL;
END IF;
result = s2re2_${VERSION}_match_extract_inner(input, pattern);
IF result.m = 0 THEN
RETURN NULL;
ELSE
RETURN result.s;
END IF;
END //
This script, match.wasm, and match_extract.wasm are packaged into a TAR file, and our extension is complete!
The UDFs installed by the s2re2 are versioned. To install
a new version of the above, you can use DROP EXTENSION and CREATE EXTENSION. This is simple, but it
has a downside, which is that queries that depend on the s2re2_match
and s2re2_match_extract
functions will break between the DROP EXTENSION and CREATE EXTENSION. There is not yet a CREATE OR
REPLACE EXTENSION command which would make the swap atomic.
To enable non-breaking upgrades, I’ll also publish a versioned extension. The versioned extension
will expose versioned functions, like s2re2_2024_08_12_match
and s2re2_2024_08_12_match_extract
.
After a version of the extension is added, wrappers can be defined or updated in an atomic way like:
-- s2re2_match
DELIMITER //
CREATE OR REPLACE FUNCTION s2re2_match(
input LONGTEXT NULL,
pattern LONGTEXT NULL)
RETURNS TINYINT(1) NULL AS
BEGIN
RETURN s2re2_2024_08_18_match(input, pattern);
END //
DELIMITER ;
-- s2re2_match_extract
DELIMITER //
CREATE OR REPLACE FUNCTION s2re2_match_extract(
input LONGTEXT NOT NULL,
pattern LONGTEXT NOT NULL)
RETURNS LONGTEXT AS
BEGIN
RETURN s2re2_2024_08_18_match_extract(input, pattern);
END //
DELIMITER ;
Queries that depend on the wrappers will then continue to work across an upgrade without any error, solving the problem.
If you’ve stayed with me through the entire article, you might be just as excited about these
WebAssembly-based extensions as I am. I envision a world where a ton of amazing functionality is
added on GitHub and used by any SingleStore user with a simple CREATE EXTENSION
command.
In addition to public extensions, the build and deployment process is simple enough for individual
users to create high-performing specializations for their use case. In the case of RE2, a function
with a global RE2 object with a fixed pattern combined with the use of EVAL_CTORs will be even
faster than the general match
and match_extract
shown here. Other cases could be even more
interesting — I think I’ll look at JSON Schema next.
Finally, what if extensions could be so easy to write such that a language like Go could be the first thought for string or blob manipulation rather than fighting through using SQL? As it turns out, that might be a possibility, as TinyGo works quite well. I’ll save that too for another article.
Thanks for reading!