Jason Thorsness

Feb 11, 2025

Hell Is Microbenchmarking

Let’s say you have a collection of numbers and you want to know if a test number is included. Here are three common approaches:

Algorithm       Time Complexity   Notes
Linear Search   O(n)              Iterate through the list and check every item.
Binary Search   O(log n)          On a sorted list, eliminate half each iteration.
Hash Table      O(1)              Constant-time lookup.

I planned an article about how for small input sizes, the time complexity doesn’t matter as much as the constant factors, and how a benchmark will show linear search performing better than the other options.

Unfortunately, for my sin of hubris and oversimplification, when I attempted to gather data supporting this concept I was dragged through three of the many circles of microbenchmarking hell:

  1. Bad setup
  2. Poor implementations
  3. Unrepresentative scenarios

I emerged with data that sort of supported my initial premise, but only under a narrow set of conditions. Read on to understand the journey and find some practical lessons.

⚠️ Update: I have discovered a fourth circle of microbenchmarking hell: writing an entire article based on Chrome/V8, only to publish it and observe Safari on iOS make a mockery of all of my results. The benchmarks in this article are all interactive, so whatever system you are on, try them yourself!

A Naive Beginning

My plan was simple. For my article, I would implement each of the three search algorithms in TypeScript, and embed a runner into the article itself. By timing how long the algorithms took for a large number of iterations, I could compare their performance.

A quick consultation with o3-mini-high gave me the implementations and a simple scaffold, and in under a quarter hour I was able to click the button to get the results. That’s when the trouble began.

Bad Setup

In the first circle of microbenchmarking hell, software engineers are doomed to attempt reproducible performance tests in wildly unstable environments.

After an initial win of seeing the tests compile and run to completion, I quickly ran into a major problem: the results were different with every run! It’s challenging to ensure consistency in performance tests. Applying lessons from a previous article, I started with the following guidelines:

  1. Run with an otherwise-idle machine
  2. Don’t have the developer tools pane open
  3. Run the tests in a background worker
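
That third guideline looks something like this (a hypothetical sketch; the file name and message shape are mine, not my actual code):

// benchmark.worker.ts (hypothetical): timing on a worker thread keeps
// rendering and other main-thread work out of the measurements.
self.onmessage = (event: MessageEvent<{ length: number }>) => {
  const { length } = event.data;
  const haystack = Array.from({ length }, (_, i) => i);
  let found = 0;
  const start = performance.now();
  for (let i = 0; i < 1_000_000; i++) {
    if (haystack.includes(i % length)) found++;
  }
  // Report found along with the time so the work is observable.
  self.postMessage({ elapsedMs: performance.now() - start, found });
};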

For this article, I discovered three more:

  1. next build then next start will show better results than next dev
  2. A benchmarking framework is critical
  3. Always observe the results of the functions you are testing

The last two items turned out to be particularly important.

Benchmarking Frameworks

I was getting dramatically varying results each time, even when averaging over a large number of iterations. As it turns out, there are many challenges with benchmarking JavaScript, and a do-it-yourself approach is not recommended.

I’m familiar with Benchmark.NET, a stellar tool from the dotnet community, and with go test -bench and benchstat from Go, but I hadn’t used anything for JavaScript before. Upon searching, I found Benchmark.js, which, although recently archived, seems to be a solid option.
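
For flavor, here’s roughly what a Benchmark.js suite looks like (a minimal sketch, not my exact harness):

import Benchmark from "benchmark";

const haystack = [0, 1, 2, 3, 4, 5, 6, 7];
const haystackSet = new Set(haystack);

// Benchmark.js handles warmup, adaptive sample counts, and the statistics
// that a hand-rolled timing loop tends to get wrong.
const suite = new Benchmark.Suite();
suite
  .add("linear includes", () => haystack.includes(3))
  .add("Set.has", () => haystackSet.has(3))
  .on("cycle", (event: Benchmark.Event) => console.log(String(event.target)))
  .run();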

Even with this framework, I still found it best to let the benchmark run through at least length 2, then cancel and restart. This gave me the most repeatable results.

Observing Results

If you run a function without side effects, like Set.has, can an optimizer just eliminate it? Depending on the context, that might be the case. I don’t know exactly how this works in V8, but my tests of Set.has were achieving unreasonably good results until I added an && count++ to force it to run. Making the system you are benchmarking prove it has done the work you expect, through some tangible output, is an important safeguard.
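
In sketch form, reusing the names from the suite above (count and needle are mine):

let count = 0;
const needle = 3;

// Consume the boolean so the engine must actually perform the lookup;
// a result nobody observes is a candidate for dead-code elimination.
suite.add("Set.has (observed)", () => {
  haystackSet.has(needle) && count++;
});

// Read count after the run so it stays observably live.
suite.on("complete", () => console.log("sanity check:", count));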

Are We Done?

Applying these controls to my benchmarking setup allowed me to obtain consistent results: consistently weird. It was time to fix my inconsistently optimized implementations.

Poor Implementations

Programmers in the second circle are eternally duped into forming false beliefs by quirky details of their implementations.

When comparing performance of algorithm implementations, they should probably be good implementations. In my case, I had problems with the initial versions of both binary search and linear search.

Binary Search

Here’s what I started with for binary search:

function binarySearch<T>(haystack: T[], needle: T): boolean {
  let low = 0;
  let high = haystack.length - 1;
  while (low <= high) {
    const mid = Math.floor((low + high) / 2);
    const item = haystack[mid];
    if (item === needle) return true;
    else if (item < needle) low = mid + 1;
    else high = mid - 1;
  }
  return false;
}

Looks good: a one-shot implementation straight from o3-mini-high. But it has a major performance issue caused by the Math.floor and the division, which round-trip through floating point on every iteration. The calculation of mid can be optimized to stay on integer operations:

const mid = low + ((high - low) >> 1);

Here’s the difference in performance with an input of 1 to 256 items.

To give binary search a fair shot, I of course picked the faster implementation.

Linear Search

Things got more complicated for linear search. Take these three implementations:

Iterator

function linearSearch<T>(haystack: T[], needle: T): boolean {
  for (const item of haystack) {
    if (item === needle) return true;
  }
  return false;
}

Classic Loop

function linearSearch<T>(haystack: T[], needle: T): boolean {
  const len = haystack.length;
  for (let i = 0; i < len; i++) {
    if (haystack[i] === needle) return true;
  }
  return false;
}

Built-In

function linearSearch<T>(haystack: T[], needle: T) {
  return haystack.includes(needle);
}

Which do you think performs best? It turns out, it depends!

The built-in implementation has some kind of speed boost between lengths 32 and 64. For the purposes of this article, since I was evaluating small input sizes, I opted to use the Classic Loop. But it was at this point I started to worry that my article might be in trouble.

Hash Table

I didn’t want to implement my own hash table, so I went with JavaScript’s built-in Set. A very good implementation, maybe too good, as I discovered later on. At least this one has no room for adjustment!

function setSearch<T>(haystack: Set<T>, needle: T): boolean {
  return haystack.has(needle);
}
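
One detail of my setup worth calling out: the Set is built once, outside the timed function, so construction costs land in setup rather than in every lookup. Continuing the earlier sketch:

// Build the Set up front; constructing it inside the benchmarked function
// would make every "lookup" also pay for hashing and allocation.
const values = [3, 1, 4, 1, 5, 9, 2, 6];
const valueSet = new Set(values);

suite.add("Set.has", () => setSearch(valueSet, 5));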

Now Are We Done?

With three solid implementations, I once more ran the comparison. I found that depending on the items in the collection and the items being searched, the relative performance varied. I was going to have to think more about exactly what scenario I was testing.

Unrepresentative Scenarios

In the third circle, developers toil to repeatedly construct and test elaborate scenarios only to realize they don’t bear any practical significance.

The initial test produced by o3-mini-high filled the collection with [0, 1, 2...], then searched for -1, so it was always searching for an item that wasn’t present. This is the worst case for linear search and binary search, and it turns out it’s the best case for Chrome’s Set.has, which seems to be very fast at many small input sizes when checking for an absent item.

If this is your scenario, where the item almost always does not exist in the collection, then that’s perfectly fine to use in a benchmark. For the sake of my article, I decided that I would shift the scenario to evaluate the performance of checking for an item typically present.

But which item? The performance of binary search will also vary depending on the item being searched: if you always search for the item in the middle, it becomes an O(1) algorithm. So to eliminate this advantage, I changed the benchmark to check for each of the present items in a loop.
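
Concretely, the timed body became something like this sketch (counting matches doubles as the observability guard from earlier):

// Search for every present item once per timed iteration, so no single
// position (like the midpoint) can flatter binary search.
let found = 0;
for (const needle of haystack) {
  if (binarySearch(haystack, needle)) found++;
}
// found should equal haystack.length; checking it also keeps the work live.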

Even after adjusting the scenario, I found that Set still outperformed my expectations. Looking into it further, I discovered that although linear search and binary search don’t really care what the values of the integers are, Chrome’s Set does. It uses the integer as its own hash, so by using ascending positive integers, I was perfectly minimizing collisions. This is a best-case scenario for Set and probably not applicable to most real-world workloads.

To mitigate this advantage, I decided the input would consist of random integers between 1 and 65536. I applied a PRNG with a fixed seed so the tests would be repeatable.
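
Any small seedable generator works here; mulberry32 is one well-known option, shown as a sketch of what a fixed seed looks like in practice:

// mulberry32: a tiny seedable PRNG. The same seed always yields the same
// sequence, so every benchmark run sees identical input data.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rand = mulberry32(42); // the seed value here is arbitrary, not mine
const input = Array.from({ length: 256 }, () => 1 + Math.floor(rand() * 65536));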

Finally, I became concerned that my chosen lengths for testing, powers of 2, would align with a cyclical performance boost or dip with Set based on how it grew. I decided to add more data points in between the powers of 2. This concern turned out to be unfounded, but at least now I can sleep at night.
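
The extra data points were just midpoints between the powers of two, something like:

// Powers of two plus midpoints, to catch any cyclical effect tied to how
// Set grows its backing storage. (My illustration; exact lengths may differ.)
const lengths: number[] = [];
for (let n = 1; n <= 256; n *= 2) {
  lengths.push(n);
  if (n >= 2 && n * 1.5 <= 256) lengths.push(n * 1.5);
}
// lengths: [1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256]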

Before the scenario adjustments, the results appear to show binary search almost always beating linear, and hash overtaking linear near length 8 and binary near length 16.

After the scenario adjustments, linear dominates until length 10, then hash takes over. There’s never a time when binary maintains a lead!

A Conclusion, But At What Cost?

The final result I obtained is only valid under narrow conditions:

  1. When the item being searched for is almost always present in the collection.
  2. With integer keys between 1 and 65536.
  3. With a V8-based browser.

So what’s the point? Premature optimization continues to be the “root of all evil”? I think the main takeaway for me, beyond some experience benchmarking JavaScript, is that measuring performance remains hard and important.

It’s easy to make bad assumptions, whether from experience or from outdated measurements. If you want to make performance decisions based on reality and not vibes, you’d better start measuring, and make re-measuring part of your ongoing quality control.

If you liked this article, let’s connect on X. Thanks for reading!
