SIMD can accelerate many operations, especially in graphics and physics, roughly in proportion to the ratio between the scalar element size and the vector size. In the case of WebAssembly, an operation over int32 values (4 bytes) can be up to 4x faster when using a 128-bit vector, and an operation over individual bytes can be up to 16x faster, as we will see below.
The .NET System.Numerics.Vector and related types expose simple functions for common operations that internally use SIMD instructions when available. In addition, many of the basic operations on Span&lt;T&gt; are also accelerated using SIMD instructions.
Unfortunately, many of these operations are not yet accelerated with NativeAOT-LLVM (Issue). To be guaranteed a SIMD speedup, you have to write the code in C using SIMD intrinsics. But it's not actually that hard.
When you drag this slider, the bitwise complement of a buffer is taken 999 times, first without SIMD, then with SIMD. The total time taken for the 999 iterations is displayed.
Here are some example results from my machine (note the measured speedup can exceed the theoretical 16x, because the scalar C# loop carries per-element overhead):
| Size (KB) | Without SIMD | With SIMD | Speedup |
| --- | --- | --- | --- |
| 100 | 46 ms | 2 ms | 22x |
| 1000 | 461 ms | 19 ms | 24x |
To enable SIMD compilation there are a couple of steps. First, you need to pass the -msimd128 flag when compiling the C file. This is the flag that tells Emscripten to emit SIMD instructions.
<Target Name="CompileNativeLibrary" BeforeTargets="BeforeBuild">
  <Exec Command="emcc -msimd128 -c lib.c -O2 -o lib.o" />
</Target>
The same flag should also be passed to dotnet during the publish step:
/p:EmccExtraArgs="-msimd128 -s EXPORTED_FUNCTIONS=...
Now Emscripten will generate SIMD instructions. The intrinsics can be used in C code per the Emscripten guidance.
Here is a function to perform a bitwise complement:
#include <stddef.h>
#include <stdint.h>
#include <wasm_simd128.h>

void bitwise_complement(uint8_t* ptr, int length) {
    v128_t* simd_ptr = (v128_t*)ptr;
    // Process the buffer one 128-bit vector (16 bytes) at a time.
    size_t num_vectors = length / sizeof(v128_t);
    for (size_t i = 0; i < num_vectors; ++i) {
        wasm_v128_store(
            simd_ptr + i,
            wasm_v128_not(wasm_v128_load(simd_ptr + i)));
    }
}
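Note that the loop above only touches whole 16-byte vectors, so if the length is not a multiple of 16 the trailing bytes are left unmodified. Here is a minimal sketch of a variant with a scalar tail loop, guarded on the `__wasm_simd128__` macro (defined by the compiler under -msimd128) so it also builds without SIMD support; the name `bitwise_complement_full` is my own, not from the original example:

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#endif

// Sketch: complement a buffer of any length. Uses wasm SIMD for
// whole 16-byte vectors when compiled with -msimd128, then a scalar
// loop for any remaining tail bytes.
void bitwise_complement_full(uint8_t* ptr, size_t length) {
    size_t done = 0;
#ifdef __wasm_simd128__
    v128_t* simd_ptr = (v128_t*)ptr;
    size_t num_vectors = length / sizeof(v128_t);
    for (size_t i = 0; i < num_vectors; ++i) {
        wasm_v128_store(simd_ptr + i,
                        wasm_v128_not(wasm_v128_load(simd_ptr + i)));
    }
    done = num_vectors * sizeof(v128_t);
#endif
    // Scalar tail: bytes the vector loop did not cover.
    for (size_t i = done; i < length; ++i) {
        ptr[i] = (uint8_t)~ptr[i];
    }
}
```

In this demo the buffers are always multiples of 1024 bytes, so the tail loop never runs, but it matters for arbitrary lengths.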
For comparison, here is the "without SIMD" function used in the benchmark above:
[UnmanagedCallersOnly(EntryPoint = "BitwiseComplementSlow")]
public static unsafe void BitwiseComplementSlow(byte* f, int l)
{
    Span<byte> span = new(f, l);
    for (int i = 0; i < span.Length; i++)
    {
        span[i] = (byte)~span[i];
    }
}
One thing you might have noticed above is that the function takes a byte*. WebAssembly can only operate on its own memory space, so I've added functions to allocate and free memory (you can export malloc and free, but I wanted more control). My functions look like this:
[UnmanagedCallersOnly(EntryPoint = "Alloc")]
public static unsafe byte* Alloc(int length)
{
    return (byte*)NativeMemory.AlignedAlloc(
        (nuint)length,
        (nuint)Vector<byte>.Count);
}

[UnmanagedCallersOnly(EntryPoint = "Free")]
public static unsafe void Free(byte* ptr)
{
    NativeMemory.AlignedFree(ptr);
}
Here’s the web worker code for the example above. It does the following:

- Waits for the Emscripten runtime to initialize via the `onRuntimeInitialized` callback, then registers a message listener on `self`.
- On each message, allocates a buffer of `event.data * 1024` bytes.
- Calls the `BitwiseComplement` function 999 times, timing it.

For the “slow” version it’s the same, except the call is to `BitwiseComplementSlow`.
import "components/wasm/loader";

Module.onRuntimeInitialized = () => {
  addEventListener("message", (event: MessageEvent<number>) => {
    self._Init();
    const size = event.data * 1024;
    const buffer = self._Alloc(size);
    const start = performance.now();
    for (let i = 0; i < 999; ++i) {
      self._BitwiseComplement(buffer, size);
    }
    const end = performance.now();
    self._Free(buffer);
    postMessage({ value: event.data, time: end - start });
  });
  postMessage({ value: 0, time: 0 });
};
Really, this sort of thing may be temporary, as I'm sure the .NET team will get around to implementing the System.Numerics.Vector type with SIMD instructions. But for now, the acceleration can make the investment in doing it manually worthwhile.