SIMD can accelerate many operations, especially in graphics and physics, roughly in proportion to the ratio between the scalar element size and the vector size. In the case of WebAssembly, an operation over int32 values (4 bytes) can be up to 4x faster when using a 128-bit vector, and an operation over individual bytes can be up to 16x faster, as we will see below.
The .NET System.Numerics.Vector and related types expose simple functions for common operations that internally use SIMD instructions when available. In addition, many of the basic operations on Span&lt;T&gt; are also accelerated using SIMD instructions.
Unfortunately, many of these operations are not yet accelerated with NativeAOT-LLVM (Issue). To be guaranteed a SIMD speedup, you have to write the code in C using SIMD intrinsics. But it's not actually that hard.
When you drag this slider, the bitwise complement of a buffer is taken 999 times, first without SIMD, then with SIMD. The total time taken for the 999 iterations is displayed.
Here are some example results from my machine (note the measured speedup can exceed the theoretical 16x, because the scalar C# loop carries per-element overhead):
| Size (KB) | Without SIMD | With SIMD | Speedup |
| --- | --- | --- | --- |
| 100 | 46 ms | 2 ms | 22x |
| 1000 | 461 ms | 19 ms | 24x |
To enable SIMD compilation there are a couple of steps. First, you need to pass the -msimd128 flag when compiling the C file. This is the flag that tells Emscripten to emit SIMD instructions.
<Target Name="CompileNativeLibrary" BeforeTargets="BeforeBuild">
  <Exec Command="emcc -msimd128 -c lib.c -O2 -o lib.o" />
</Target>
The same flag should also be passed to dotnet during the publish step:
/p:EmccExtraArgs="-msimd128 -s EXPORTED_FUNCTIONS=...
Now Emscripten will generate SIMD instructions. The intrinsics can be used in C code per the Emscripten guidance.
Here is a function to perform a bitwise complement:
#include <stddef.h>
#include <stdint.h>
#include <wasm_simd128.h>

void bitwise_complement(uint8_t* ptr, int length) {
    v128_t* simd_ptr = (v128_t*)ptr;
    // Process the buffer one 128-bit vector (16 bytes) at a time.
    size_t num_vectors = length / sizeof(v128_t);
    for (size_t i = 0; i < num_vectors; ++i) {
        wasm_v128_store(
            simd_ptr + i,
            wasm_v128_not(wasm_v128_load(simd_ptr + i)));
    }
}
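Note that the loop above only touches whole 16-byte vectors, so if the length is not a multiple of 16 the trailing bytes are left unmodified. Here is a minimal sketch of a variant with a scalar tail loop, guarded on the `__wasm_simd128__` macro (defined by the compiler under -msimd128) so it also builds without SIMD support; the name `bitwise_complement_full` is my own, not from the original example:

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#endif

// Sketch: complement a buffer of any length. Uses wasm SIMD for
// whole 16-byte vectors when compiled with -msimd128, then a scalar
// loop for any remaining tail bytes.
void bitwise_complement_full(uint8_t* ptr, size_t length) {
    size_t done = 0;
#ifdef __wasm_simd128__
    v128_t* simd_ptr = (v128_t*)ptr;
    size_t num_vectors = length / sizeof(v128_t);
    for (size_t i = 0; i < num_vectors; ++i) {
        wasm_v128_store(simd_ptr + i,
                        wasm_v128_not(wasm_v128_load(simd_ptr + i)));
    }
    done = num_vectors * sizeof(v128_t);
#endif
    // Scalar tail: bytes the vector loop did not cover.
    for (size_t i = done; i < length; ++i) {
        ptr[i] = (uint8_t)~ptr[i];
    }
}
```

In this demo the buffers are always multiples of 1024 bytes, so the tail loop never runs, but it matters for arbitrary lengths.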
For comparison, here is the "without SIMD" function used in the benchmark above:
[UnmanagedCallersOnly(EntryPoint = "BitwiseComplementSlow")]
public static unsafe void BitwiseComplementSlow(byte* f, int l)
{
    Span<byte> span = new(f, l);
    for (int i = 0; i < span.Length; i++)
    {
        span[i] = (byte)~span[i];
    }
}
One thing you might have noticed above is that the function takes a byte*. WebAssembly can only operate on its own memory space, so I've added functions to allocate and free memory (you can export malloc and free, but I wanted more control). My functions look like this:
[UnmanagedCallersOnly(EntryPoint = "Alloc")]
public static unsafe byte* Alloc(int length)
{
    return (byte*)NativeMemory.AlignedAlloc(
        (nuint)length,
        (nuint)Vector<byte>.Count);
}

[UnmanagedCallersOnly(EntryPoint = "Free")]
public static unsafe void Free(byte* ptr)
{
    NativeMemory.AlignedFree(ptr);
}
Here’s the web worker code for the example above. It does the following:

- Waits for the Emscripten runtime to initialize via the `onRuntimeInitialized` callback, then registers a message listener on `self`.
- On each message, allocates a buffer of `event.data * 1024` bytes.
- Calls the `BitwiseComplement` function 999 times, timing it.

For the “slow” version it’s the same, except the call is to `BitwiseComplementSlow`.
import "components/wasm/loader";

Module.onRuntimeInitialized = () => {
  addEventListener("message", (event: MessageEvent<number>) => {
    self._Init();
    const size = event.data * 1024;
    const buffer = self._Alloc(size);
    const start = performance.now();
    for (let i = 0; i < 999; ++i) {
      self._BitwiseComplement(buffer, size);
    }
    const end = performance.now();
    self._Free(buffer);
    postMessage({ value: event.data, time: end - start });
  });
  postMessage({ value: 0, time: 0 });
};
Really, this sort of thing may be temporary, as I'm sure the .NET team will get around to implementing the System.Numerics.Vector type with SIMD instructions. But for now, the acceleration can make the investment in doing it manually worthwhile.