Hash Table Benchmark - Analysis and Conclusion
A full walk-through of what the benchmark showed — across both integer and string keys, small and large values — and how to turn it into a concrete choice of hash table and hash function for a particular workload.
No single winner
The benchmark covers insertion, erasure, lookup (successful, unsuccessful, and a 50% mix), and iteration. It runs them on integer keys of several distributions and on strings of 12, 24 and 64 bytes, with both an 8-byte value and a 56-byte value (so the stored pair is 16 or 64 bytes), at sizes from 32 up to 10^7 elements, on two very different machines — an Intel Xeon E-2388G (Rocket Lake) and an Apple M1 Max. The one lesson that survives all of that variety is that the fastest combination is never the same twice: it changes with the operation that dominates the workload, with the type and distribution of the keys, with the size of the value, with how large the table is relative to the cache, and with the platform.
A second lesson, worth stating up front, is that a hash table and a hash function must be chosen together. Some tables assume the hash already spreads keys uniformly and break down badly if it does not; others do their own mixing and prefer the cheapest possible hash. Every chart therefore compares table + hash pairs rather than tables in isolation.
Lookups
Lookups are usually the operation people care about most, and they are where the design of the table matters most. A lookup breaks into three parts: computing the hash, mapping it to a slot, and loading and comparing keys. Which part dominates depends on the key type and on how much of the table fits in cache.
Integer keys
For integer keys a lookup's time is split between computing and
mapping the hash and the memory access that follows, and which one
dominates depends on how much of the table sits in cache. The hash is
not necessarily cheap: the identity std::hash costs nothing
and is fine when the keys are already spread out evenly, but keys whose
information sits in only a few bit positions — as with pointers and
aligned addresses (whose low bits are always zero) or small sequential
IDs (whose high bits are always zero) — collide badly under the identity
hash and need a real mixing hash such as absl::Hash (a
128-bit multiply and xor-shift) to scatter them across the table. The
right hash therefore depends on the keys (discussed further below). The
ranking moves through three regimes as the table grows. While it fits in
the L1/L2 cache the memory fetch is fast — only a few nanoseconds — so
the per-probe instruction count (the hash and the slot mapping) is
comparable to the fetch and is what separates the tables; the leanest
combination wins, and ska::flat_hash_map with the identity
std::hash is fastest at small sizes (about 1.3 ns per hit
at 1,024 elements on both machines), with
fph::DynamicFphMap a close second on the M1, trailing it by
only about 0.1 ns. In the middle of the range, where the table lives in
the L2/L3 caches, fph::DynamicFphMap takes the lead (about
3.4 ns at 200,000 and 10.9 ns at 1.2M on the Xeon) because its bounded
probe count keeps cache-line touches low. At the largest sizes, where
every lookup misses the cache and the memory access dominates,
ska::flat_hash_map is fastest again (about 14.7 ns on the
Xeon, 9.3 ns on the M1 at 10^7) because the table that reads the fewest
cache lines — a single compact slot array — wins regardless of the
hash.
Misses are more decisive. To prove a key absent, a table must rule
out every slot it could occupy, and tables that store a byte of metadata
per slot can do so without touching the full keys.
fph::MetaFphMap is in a class of its own: it rejects an
absent key with essentially one memory access, giving it the best
average miss time from about 6,000 elements up and a far tighter tail —
at 1.2M elements its P99 miss latency is about 34 ns, against about
106-111 ns for r_h::unordered_flat_map /
absl::flat_hash_map and 440-465 ns for the probe-walking
ska::flat_hash_map and tsl::robin_map. The
advantage lasts until the metadata array itself leaves the L3 cache
(around 10^7 here), after which it, too, pays one DRAM access.
absl::flat_hash_map is the best of the conventional tables
on misses, having been built around the same metadata idea. The 50%-hit
case is roughly the average of the two, with the alternating outcome
adding a branch-misprediction penalty that narrows every gap.
String keys
String keys change the picture because hashing costs much more — the whole string must be hashed and compared, rather than a single 64-bit word — and because longer strings may live on the heap. Two effects stand out.
First, the byte-oriented xxHash_xxh3 becomes the best
hash for almost every table, and increasingly so as strings lengthen:
hashing dominates, so a fast bytes hash matters more than the table.
Second, the in-cache lookup floor rises sharply with length. A
successful in-cache lookup costs about 1.3 ns for an integer key, about
6.5 ns for a 12-byte string, about 13 ns for a fixed 24-byte string, and
about 16 ns for a 64-byte string on the Xeon. The jump between 12 and 24
bytes is mostly the Small String Optimization (SSO): a string of up to
~15 bytes is stored inline in the std::string object, while
a longer one is heap-allocated, so a lookup must follow a pointer to a
separate cache line to compare the characters. This shows up directly in
the data — the max-length-24 keys, most of which are short
enough to stay inline, are about twice as fast as the
fixed-length-24 keys, which are all heap-allocated (about 7 ns
versus 13 ns at small sizes).
Apart from the higher floor, the ranking tracks the integer case: the
perfect-hash tables lead in cache (for 64-byte keys
fph::DynamicFphMap and fph::MetaFphMap with
xxHash_xxh3 are fastest for hits), the robin-hood tables
tsl::robin_map and ska::flat_hash_map overtake
once memory-bound, and fph::MetaFphMap again dominates
misses (about 8 ns at 1,024 and 60 ns at 1.2M for 64-byte keys, versus
about 63-68 ns for the next tables).
The effect of the value size
Enlarging the value from 8 to 56 bytes (a 64-byte pair) makes each
slot four times bigger, so fewer entries fit in each cache level and
every table becomes memory-bound earlier; a hit that cost about 15 ns at
10^7 with an 8-byte value costs about 21 ns with the 56-byte value. This
shifts the balance toward tables that touch the fewest cache lines:
fph::DynamicFphMap, with its bounded probe count, leads
across more of the range with the large value than it does with the
small one. If the values are large, weight the mid- and large-size
results more heavily than the small-size ones.
Building and modifying the table
Insertion inverts the lookup ranking for the perfect-hash tables. The
flat tables — absl::flat_hash_map,
ska::flat_hash_map, emhash::hash_map7,
tsl::robin_map, ankerl::unordered_dense_map —
are fastest to fill (about 4-6 ns per insert at small sizes and 25-35 ns
at 10^7 with capacity reserved), while std::unordered_map
is far slower (near 125 ns at 10^7) because it allocates a node per
element. The perfect-hash tables are slowest by a wide margin: building
a perfect hash costs fph::DynamicFphMap and
fph::MetaFphMap roughly 1,450-1,900 ns per element at 10^7
on the Xeon, about 12-15 times std::unordered_map (and
11-12 times on the M1). Dropping the reserve call roughly
doubles every table's insert time because growth then triggers repeated
rehashing.
The erase-insert test, which alternates one erase and one insert to
hold the size constant, favours the flat tables
(ska::flat_hash_map, tsl::robin_map,
absl::flat_hash_map); open-addressing tables leave
tombstones behind erased entries, and the perfect-hash tables fare
worst, timing out at the largest sizes because they cannot cheaply
absorb churn.
Key type and value size matter here too. With string keys, every insert and erase also pays the string hash, and for strings beyond the SSO limit a heap allocation and free for the characters — so the gap between 12-byte and 64-byte string workloads is large, while integer inserts are dominated purely by table mechanics. A larger value, as with lookups, pushes the memory-bound regime earlier.
Iteration
Iteration is decided almost entirely by storage layout, independent
of the hash function and nearly independent of the key type (the
iterator visits fixed-size table slots; it does not re-hash or, for
inline-stored entries, dereference the keys).
ankerl::unordered_dense_map and
emhash::hash_map7 keep their entries in a dense, contiguous
array and iterate in near-constant time per element regardless of load
factor — about 0.22 ns on the Xeon and 2.0 ns on the M1, flat across the
whole size range. The inline open-addressing tables must scan a sparse
slot array and skip empty slots, so their cost rises with size; the
node-based std::unordered_map is slowest, chasing
cache-missing pointers (about 35 ns per element at 10^7 on the Xeon).
The perfect-hash tables get no special benefit, since they also iterate
a sparse layout.
Memory and load factor
Footprint, measured on integer key-value pairs, splits the tables
into three groups at 10^7 elements. The one-metadata-byte SwissTables
absl::flat_hash_map and ska::bytell_hash_map
are most compact at about 272 MB; the node-based
std::unordered_map and absl::node_hash_map
come next (about 308 MB and 297 MB); the fph tables use
roughly twice the most compact (about 556-572 MB) for the index that
speeds their lookups; and ska::flat_hash_map and
tsl::robin_map are largest at about 768 MB, because they
keep a low maximum load factor and store the full, alignment-padded
value in every slot.
The load-factor chart explains that. Open-addressing tables grow by
doubling, so occupancy sawtooths between roughly 0.4 and 0.9;
ska::flat_hash_map and tsl::robin_map settle
to the lowest values (about 0.30 at large sizes), buying speed with
empty space, while std::unordered_map runs near 1.0 (a node
per element, no empty slots). Raising max_load_factor packs
the flat tables tighter — ska::flat_hash_map rises from
about 0.30 to 0.60 at 10^7, roughly halving its empty space — at the
cost of longer probes. Because footprint decides when a table crosses
each cache boundary, this is the structural reason behind the cache
tiers in the lookup results: a bulkier table falls out of cache at a
smaller element count.
The hash function matters as much as the table
For integer keys std::hash is the identity function. A
table that does its own mixing — most notably
ska::flat_hash_map — can use it safely and enjoy its zero
cost. But keys that are not uniformly random — those whose information
lives in only a handful of bit positions, like pointers or small
sequential IDs — make tables that assume a uniform hash
(absl::flat_hash_map, emhash::hash_map7,
tsl::robin_map) collide catastrophically with the identity
hash, badly enough that some combinations time out during construction,
so they need a good mixing hash like absl::Hash. Even a
good hash can have weak spots: robin_hood::hash spreads
some such key patterns poorly, producing an irregular mid-range slump
for tsl::robin_map with that hash. For string keys,
xxHash_xxh3 wins across the board.
Platform differences: Rocket Lake vs M1 Max
The two machines have very different memory systems. The Xeon E-2388G
(Rocket Lake) has a 48 KB L1, 512 KB L2 and 16 MB L3 with 4 KB pages;
the M1 Max has a 128 KB L1, 12 MB L2 and 48 MB system-level cache with
16 KB pages. The larger caches and pages on the M1 push every cache-tier
boundary to the right and reduce TLB pressure, so although the
ranking of tables is broadly consistent across platforms, the
sizes at which one overtakes another differ. One quirk: the libc++
std::unordered_map on the M1 indexes buckets with a modulo
that degenerates to a power-of-two mask when the bucket count is a power
of two, discarding the high bits of the hash, which makes that
combination anomalously slow at element counts that are exact powers of
two.
Choosing a hash table
Two practical questions decide most of the choice: how often is the table written versus read, and how big is it relative to the cache it actually gets.
On the first question: if the table is built once and then queried
far more than it is modified, the perfect-hash
fph::DynamicFphMap (or fph::MetaFphMap when
misses are common) gives the best lookup performance — at the cost of a
slow build, an inability to absorb frequent updates, and roughly twice
the memory of the most compact tables, so it suits a static dictionary,
a read-mostly index or a membership set far better than a constantly
changing table. If inserts, erases and lookups are mixed, a flat table
is the better all-rounder, and absl::flat_hash_map with
absl::Hash (or xxHash_xxh3 for strings) is a
strong, compact, distribution-robust default.
The second question is where "fast in cache" needs unpacking. A table
is "in cache" when its whole footprint — not the element count alone —
fits in a cache level: on the Xeon, roughly 16,000 small (16-byte)
entries fit in the L2 and about a million in the L3, and proportionally
fewer when the value is large or the keys are heap-allocated strings.
Therefore, ska::flat_hash_map, the fastest table while
everything stays resident, is the right pick mainly for genuinely small
tables, or when the memory can be spared. The catch is that
ska::flat_hash_map also has the largest footprint
of all (its low load factor is exactly what makes it fast), so for the
same number of elements it leaves cache sooner than a compact table, and
it competes harder for whatever cache is left to the rest of the
program. In a real application the cache is shared with everything else
running, so the effective threshold is well below the raw cache size; if
memory is tight, the cache is contended, or the table is large, the
compact absl::flat_hash_map or
ska::bytell_hash_map will usually beat it despite
ska::flat_hash_map's edge in a benchmark that owns the
whole cache. For iteration-heavy work,
ankerl::unordered_dense_map is the clear choice regardless
of size.
| Your situation | Consider |
|---|---|
| build once, then look up a lot (read-mostly / static) | fph::DynamicFphMap (hit-heavy) or
fph::MetaFphMap (miss-heavy) |
| general-purpose, mixed insert / erase / lookup | absl::flat_hash_map (with absl::Hash, or
xxHash_xxh3 for strings) — fast, compact, robust |
| small table, or plenty of spare and uncontended cache, want lowest lookup time | ska::flat_hash_map — but it is the largest table, so
prefer a compact one once it grows |
| memory tight, cache shared/contended, or table large | absl::flat_hash_map or
ska::bytell_hash_map (most compact) |
| dominated by iteration / frequent full scans | ankerl::unordered_dense_map |
| need reference / pointer stability | absl::node_hash_map or
std::unordered_map |
Final advice
std::unordered_map is the slowest table in almost every
throughput test because of its node-per-element layout, so it is worth
keeping mainly when its iterator and pointer-stability guarantees are
actually required. Beyond that, the table above is a starting point, not
a verdict: the full space of tables, hashes, key distributions, value
sizes, table sizes and platforms is enormous, and a real workload may
land between the cases measured here. The surest answer is always to
benchmark the two or three most promising candidates on the actual data
and operation mix.
- Link: https://renzibei.com/en/analysis-and-conclusion/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.