Let’s say you have a collection of numbers and you want to know if a test number is included. Here are three common approaches:
| Algorithm | Time Complexity | Notes |
| --- | --- | --- |
| Linear Search | O(n) | Iterate through the list and check every item. |
| Binary Search | O(log n) | On a sorted list, eliminate half each iteration. |
| Hash Table | O(1) | Constant-time lookup. |
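To make that concrete, here's roughly what each approach looks like at the call site (a minimal sketch with made-up data; the implementations actually benchmarked appear later in the article):

```typescript
const numbers = [15, 3, 92, 41, 7, 64];
const sorted = [...numbers].sort((a, b) => a - b); // binary search requires sorted input
const asSet = new Set(numbers);                    // hash table via the built-in Set
const needle = 41;

const foundByScan = numbers.includes(needle); // linear search: check items one by one
const foundByHash = asSet.has(needle);        // hash lookup: constant-time membership check
// A binary search would repeatedly halve `sorted`; a full implementation appears later.
```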
I planned an article about how, for small input sizes, the time complexity doesn’t matter as much as the constant factors, and how a benchmark would show linear search performing better than the other options.
Unfortunately, for my sin of hubris and oversimplification, when I attempted to gather data supporting this concept I was dragged through three of the many circles of microbenchmarking hell: unstable environments, quirky implementations, and impractical scenarios.
I emerged with data that sort-of supported my initial premise, but only under a narrow set of conditions. Read on to understand the journey and find some practical lessons.
⚠️ Update: I have discovered a fourth circle of microbenchmarking hell: writing an entire article based on Chrome/V8, only to publish it and observe Safari on iOS make a mockery of all of my results. The benchmarks in this article are all interactive, so whatever system you are on, try them yourself!
My plan was simple. For my article, I would implement each of the three search algorithms in TypeScript, and embed a runner into the article itself. By timing how long the algorithms took for a large number of iterations, I could compare their performance.
A quick consultation with o3-mini-high gave me the implementations and a simple scaffold, and in under a quarter hour I was able to click the button to get the results. That’s when the trouble began.
In the first circle of microbenchmarking hell, software engineers are doomed to attempt reproducible performance tests in wildly unstable environments.
After an initial win of seeing the tests compile and run to completion, I quickly ran into a major problem: the results were different with every run! It’s challenging to ensure consistency in performance tests. Applying lessons from a previous article, I started with the following guidelines:
For this article, I discovered three more:
- `next build` then `next start` will show better results than `next dev`
The last two items turned out to be particularly important.
I was getting dramatically varying results each time, even when averaging over a large number of iterations. As it turns out, there are many challenges with benchmarking JavaScript, and a do-it-yourself approach is not recommended.
I’m familiar with using Benchmark.NET, a stellar tool from the dotnet community, and Go Bench and BenchStat from Go, but I hadn’t used anything for JavaScript before. Upon searching, I found Benchmark.js, which, although recently archived, seems to be a solid option.
Even with this framework, I still found it best to let the benchmark run to at least length 2, then cancel and restart. This gave me the most repeatable results.
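For reference, a Benchmark.js suite gets wired up roughly like this. This is a minimal sketch with illustrative data, not the interactive runner embedded in the article.

```typescript
import Benchmark from "benchmark";

// Illustrative data; the real runner builds its own haystacks.
const haystack = Array.from({ length: 64 }, (_, i) => i);
const haystackSet = new Set(haystack);
const needle = 33;

new Benchmark.Suite()
  .add("Array.includes", () => {
    haystack.includes(needle);
  })
  .add("Set.has", () => {
    haystackSet.has(needle);
  })
  .on("cycle", (event: Benchmark.Event) => {
    // Prints "<name> x <ops/sec> ±<margin>% (<n> runs sampled)" for each case.
    console.log(String(event.target));
  })
  .on("complete", function (this: Benchmark.Suite) {
    console.log("Fastest is " + this.filter("fastest").map("name"));
  })
  .run({ async: true });
```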
If you run a function without side effects, like `Set.has`, can an optimizer just eliminate it? Depending on the context, that might be the case. I don’t know exactly how this works in V8, but I was observing my tests of `Set.has` achieving unreasonably good results, until I added an `&& count++` to force it to run. Making the system you are benchmarking prove that it has done the work you expect through some tangible output is an important safeguard.
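As a sketch of that safeguard (the names are illustrative, but the `&& count++` trick is the same):

```typescript
let count = 0;

// Membership checks have no observable effect on their own, so a sufficiently
// aggressive optimizer could in principle skip them. Folding the result into
// `count` makes the work visible.
function benchSetHas(haystack: Set<number>, needles: number[]): number {
  for (const needle of needles) {
    haystack.has(needle) && count++;
  }
  return count; // tangible output proving the lookups actually ran
}
```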
Applying these controls to my benchmarking setup allowed me to obtain consistent results: consistently weird. It was time to fix my inconsistently optimized implementations.
Programmers in the second circle are eternally duped into forming false beliefs by quirky details of their implementations.
When comparing the performance of algorithm implementations, the implementations should probably be good ones. In my case, I had problems with the initial versions of both binary search and linear search.
Here’s what I started with for binary search:
```typescript
function binarySearch<T extends number | string>(haystack: T[], needle: T): boolean {
  let low = 0;
  let high = haystack.length - 1;
  while (low <= high) {
    const mid = Math.floor((low + high) / 2);
    const item = haystack[mid];
    if (item === needle) return true;
    else if (item < needle) low = mid + 1;
    else high = mid - 1;
  }
  return false;
}
```
Looks good, a one-shot implementation straight from o3-mini-high. But it has a major performance issue caused by the `Math.floor` and the division. The calculation of `mid` can be optimized to:

```typescript
const mid = low + ((high - low) >> 1);
```
Here’s the difference in performance with an input of 1 to 256 items.
To give binary search a fair shot, I of course picked the faster implementation.
Things got more complicated for linear search. Take these three implementations:
```typescript
function linearSearch<T>(haystack: T[], needle: T): boolean {
  for (const item of haystack) {
    if (item === needle) return true;
  }
  return false;
}
```

```typescript
function linearSearch<T>(haystack: T[], needle: T): boolean {
  const len = haystack.length;
  for (let i = 0; i < len; i++) {
    if (haystack[i] === needle) return true;
  }
  return false;
}
```

```typescript
function linearSearch<T>(haystack: T[], needle: T): boolean {
  return haystack.includes(needle);
}
```
Which do you think performs best? It turns out, it depends!
The built-in `includes` implementation has some kind of speed boost between lengths 32 and 64. For the purposes of this article, since I was evaluating small input sizes, I opted to use the Classic Loop (the second, index-based implementation above). But it was at this point that I started to worry that my article might be in trouble.
I didn’t want to implement my own hash table, so I went with the built-in `Set` for JavaScript. A very good implementation — maybe too good, as I discovered later on. At least this one has no room for adjustment!
```typescript
function setSearch<T>(haystack: Set<T>, needle: T): boolean {
  return haystack.has(needle);
}
```
With three solid implementations, I once more ran the comparison. I found that depending on the items in the collection and the items being searched, the relative performance varied. I was going to have to think more about exactly what scenario I was testing.
In the third circle, developers toil to repeatedly construct and test elaborate scenarios only to realize they don’t bear any practical significance.
The initial test produced by o3-mini-high filled the collection with `[0, 1, 2...]`, then searched for `-1`. So it was always searching for an item that wasn’t present. This is the worst case for linear search and binary search, and it turns out it’s the best case for Chrome’s `Set.has`, which seems to be very fast when checking for an item that isn’t present, at many small input sizes.
If this is your scenario, where the item almost always does not exist in the collection, then that’s perfectly fine to use in a benchmark. For the sake of my article, I decided that I would shift the scenario to evaluate the performance of checking for an item typically present.
But which item? The performance of binary search will also vary depending on the item being searched: if you always search for the item in the middle, it becomes an O(1) algorithm. So to eliminate this advantage, I changed the benchmark to check for each of the present items in a loop.
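As a sketch of that adjustment (the helper name and callback shape are illustrative): instead of a single guaranteed miss, the benchmark body looks up every item that is actually present, so no single position can flatter one algorithm.

```typescript
function benchAllPresent(
  haystack: number[],
  search: (haystack: number[], needle: number) => boolean
): number {
  let found = 0;
  for (const needle of haystack) {
    if (search(haystack, needle)) found++;
  }
  return found; // should equal haystack.length
}
```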
Even after adjusting the scenario, I found that `Set` still outperformed my expectations. Looking into it further, I discovered that although linear search and binary search don’t really care what the values of the integers are, Chrome’s `Set` does. It uses the integer as its own hash, so by using ascending positive integers, I was perfectly minimizing collisions. This is a best-case scenario for `Set` and probably not applicable to most real-world scenarios.
To mitigate this advantage, I decided the input would consist of random integers between 1 and 65536. I applied a PRNG with a fixed seed so the tests would be repeatable.
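As an illustration, a tiny seedable generator like mulberry32 is enough for this; here's a sketch of the idea rather than the exact setup:

```typescript
// mulberry32: a small, fast, seedable PRNG returning values in [0, 1).
function mulberry32(seed: number): () => number {
  return () => {
    let t = (seed += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// A fixed seed makes the "random" haystack identical on every run.
const rand = mulberry32(42);
const haystack = Array.from({ length: 64 }, () => 1 + Math.floor(rand() * 65536));
```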
Finally, I became concerned that my chosen lengths for testing, powers of 2, would align with a cyclical performance boost or dip in `Set` based on how it grew. I decided to add more data points in between the powers of 2. This concern turned out to be unfounded, but at least now I can sleep at night.
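For illustration, a set of lengths like this does the trick (a sketch; the exact sizes used in the charts may differ):

```typescript
// Powers of 2 up to 256, plus the midpoint between each consecutive pair:
// 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256
const lengths: number[] = [];
for (let n = 1; n <= 256; n *= 2) {
  lengths.push(n);
  if (n > 1 && n < 256) lengths.push(n + n / 2);
}
lengths.sort((a, b) => a - b);
```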
Before the scenario adjustments, the results appear to show binary search almost always beating linear, and hash overtaking linear near length 8 and binary near length 16.
After the scenario adjustments, linear dominates until length 10, then hash takes over. There’s never a time when binary maintains a lead!
The final result I obtained is only valid under narrow conditions:

- the item being searched for is usually present in the collection,
- the values are random integers rather than neatly ascending ones,
- the collection is small (around a dozen items or fewer), and
- the benchmark is running on Chrome/V8.
So what’s the point? Premature optimization continues to be the “root of all evil”? I think the main takeaway for me, beyond some experience benchmarking JavaScript, is that measuring performance remains hard and important.
It’s easy to make bad assumptions, whether based on experience, or even outdated measurements. If you want to make correct decisions for performance based on reality and not vibes, you’d better start measuring, and re-measuring has to be part of your ongoing quality control.
If you liked this article, let’s connect on X. Thanks for reading!