Hash Table Benchmark - Analysis and Conclusion

A full walk-through of what the benchmark showed — across both integer and string keys, small and large values — and how to turn it into a concrete choice of hash table and hash function for a particular workload.

No single winner

The benchmark covers insertion, erasure, lookup (successful, unsuccessful, and a 50% mix), and iteration. It runs them on integer keys of several distributions and on strings of 12, 24 and 64 bytes, with both an 8-byte value and a 56-byte value (so the stored pair is 16 or 64 bytes), at sizes from 32 up to 10^7 elements, on two very different machines — an Intel Xeon E-2388G (Rocket Lake) and an Apple M1 Max. The one lesson that survives all of that variety is that the fastest combination is never the same twice: it changes with the operation that dominates the workload, with the type and distribution of the keys, with the size of the value, with how large the table is relative to the cache, and with the platform.

A second lesson, worth stating up front, is that a hash table and a hash function must be chosen together. Some tables assume the hash already spreads keys uniformly and break down badly if it does not; others do their own mixing and prefer the cheapest possible hash. Every chart therefore compares table + hash pairs rather than tables in isolation.

Lookups

Lookups are usually the operation people care about most, and they are where the design of the table matters most. A lookup breaks into three parts: computing the hash, mapping it to a slot, and loading and comparing keys. Which part dominates depends on the key type and on how much of the table fits in cache.

Integer keys

For integer keys a lookup's time is split between computing and mapping the hash and the memory access that follows, and which one dominates depends on how much of the table sits in cache. The hash is not necessarily cheap: the identity std::hash costs nothing and is fine when the keys are already spread out evenly, but keys whose information sits in only a few bit positions — as with pointers and aligned addresses (whose low bits are always zero) or small sequential IDs (whose high bits are always zero) — collide badly under the identity hash and need a real mixing hash such as absl::Hash (a 128-bit multiply and xor-shift) to scatter them across the table. The right hash therefore depends on the keys (discussed further below). The ranking moves through three regimes as the table grows. While it fits in the L1/L2 cache the memory fetch is fast — only a few nanoseconds — so the per-probe instruction count (the hash and the slot mapping) is comparable to the fetch and is what separates the tables; the leanest combination wins, and ska::flat_hash_map with the identity std::hash is fastest at small sizes (about 1.3 ns per hit at 1,024 elements on both machines), with fph::DynamicFphMap a close second on the M1, trailing it by only about 0.1 ns. In the middle of the range, where the table lives in the L2/L3 caches, fph::DynamicFphMap takes the lead (about 3.4 ns at 200,000 and 10.9 ns at 1.2M on the Xeon) because its bounded probe count keeps cache-line touches low. At the largest sizes, where every lookup misses the cache and the memory access dominates, ska::flat_hash_map is fastest again (about 14.7 ns on the Xeon, 9.3 ns on the M1 at 10^7) because the table that reads the fewest cache lines — a single compact slot array — wins regardless of the hash.

Misses are more decisive. To prove a key absent, a table must rule out every slot it could occupy, and tables that store a byte of metadata per slot can do so without touching the full keys. fph::MetaFphMap is in a class of its own: it rejects an absent key with essentially one memory access, giving it the best average miss time from about 6,000 elements up and a far tighter tail — at 1.2M elements its P99 miss latency is about 34 ns, against about 106-111 ns for r_h::unordered_flat_map / absl::flat_hash_map and 440-465 ns for the probe-walking ska::flat_hash_map and tsl::robin_map. The advantage lasts until the metadata array itself leaves the L3 cache (around 10^7 here), after which it, too, pays one DRAM access. absl::flat_hash_map is the best of the conventional tables on misses, having been built around the same metadata idea. The 50%-hit case is roughly the average of the two, with the alternating outcome adding a branch-misprediction penalty that narrows every gap.

String keys

String keys change the picture because hashing costs much more — the whole string must be hashed and compared, rather than a single 64-bit word — and because longer strings may live on the heap. Two effects stand out.

First, the byte-oriented xxHash_xxh3 becomes the best hash for almost every table, and increasingly so as strings lengthen: hashing dominates, so a fast bytes hash matters more than the table. Second, the in-cache lookup floor rises sharply with length. A successful in-cache lookup costs about 1.3 ns for an integer key, about 6.5 ns for a 12-byte string, about 13 ns for a fixed 24-byte string, and about 16 ns for a 64-byte string on the Xeon. The jump between 12 and 24 bytes is mostly the Small String Optimization (SSO): a string of up to ~15 bytes is stored inline in the std::string object, while a longer one is heap-allocated, so a lookup must follow a pointer to a separate cache line to compare the characters. This shows up directly in the data — the max-length-24 keys, most of which are short enough to stay inline, are about twice as fast as the fixed-length-24 keys, which are all heap-allocated (about 7 ns versus 13 ns at small sizes).

Apart from the higher floor, the ranking tracks the integer case: the perfect-hash tables lead in cache (for 64-byte keys fph::DynamicFphMap and fph::MetaFphMap with xxHash_xxh3 are fastest for hits), the robin-hood tables tsl::robin_map and ska::flat_hash_map overtake once memory-bound, and fph::MetaFphMap again dominates misses (about 8 ns at 1,024 and 60 ns at 1.2M for 64-byte keys, versus about 63-68 ns for the next tables).

The effect of the value size

Enlarging the value from 8 to 56 bytes (a 64-byte pair) makes each slot four times bigger, so fewer entries fit in each cache level and every table becomes memory-bound earlier; a hit that cost about 15 ns at 10^7 with an 8-byte value costs about 21 ns with the 56-byte value. This shifts the balance toward tables that touch the fewest cache lines: fph::DynamicFphMap, with its bounded probe count, leads across more of the range with the large value than it does with the small one. If the values are large, weight the mid- and large-size results more heavily than the small-size ones.

Building and modifying the table

Insertion inverts the lookup ranking for the perfect-hash tables. The flat tables — absl::flat_hash_map, ska::flat_hash_map, emhash::hash_map7, tsl::robin_map, ankerl::unordered_dense_map — are fastest to fill (about 4-6 ns per insert at small sizes and 25-35 ns at 10^7 with capacity reserved), while std::unordered_map is far slower (near 125 ns at 10^7) because it allocates a node per element. The perfect-hash tables are slowest by a wide margin: building a perfect hash costs fph::DynamicFphMap and fph::MetaFphMap roughly 1,450-1,900 ns per element at 10^7 on the Xeon, about 12-15 times std::unordered_map (and 11-12 times on the M1). Dropping the reserve call roughly doubles every table's insert time because growth then triggers repeated rehashing.

The erase-insert test, which alternates one erase and one insert to hold the size constant, favours the flat tables (ska::flat_hash_map, tsl::robin_map, absl::flat_hash_map); open-addressing tables leave tombstones behind erased entries, and the perfect-hash tables fare worst, timing out at the largest sizes because they cannot cheaply absorb churn.

Key type and value size matter here too. With string keys, every insert and erase also pays the string hash, and for strings beyond the SSO limit a heap allocation and free for the characters — so the gap between 12-byte and 64-byte string workloads is large, while integer inserts are dominated purely by table mechanics. A larger value, as with lookups, pushes the memory-bound regime earlier.

Iteration

Iteration is decided almost entirely by storage layout, independent of the hash function and nearly independent of the key type (the iterator visits fixed-size table slots; it does not re-hash or, for inline-stored entries, dereference the keys). ankerl::unordered_dense_map and emhash::hash_map7 keep their entries in a dense, contiguous array and iterate in near-constant time per element regardless of load factor — about 0.22 ns on the Xeon and 2.0 ns on the M1, flat across the whole size range. The inline open-addressing tables must scan a sparse slot array and skip empty slots, so their cost rises with size; the node-based std::unordered_map is slowest, chasing cache-missing pointers (about 35 ns per element at 10^7 on the Xeon). The perfect-hash tables get no special benefit, since they also iterate a sparse layout.

Memory and load factor

Footprint, measured on integer key-value pairs, splits the tables into three groups at 10^7 elements. The one-metadata-byte SwissTables absl::flat_hash_map and ska::bytell_hash_map are most compact at about 272 MB; the node-based std::unordered_map and absl::node_hash_map come next (about 308 MB and 297 MB); the fph tables use roughly twice the most compact (about 556-572 MB) for the index that speeds their lookups; and ska::flat_hash_map and tsl::robin_map are largest at about 768 MB, because they keep a low maximum load factor and store the full, alignment-padded value in every slot.

The load-factor chart explains that. Open-addressing tables grow by doubling, so occupancy sawtooths between roughly 0.4 and 0.9; ska::flat_hash_map and tsl::robin_map settle to the lowest values (about 0.30 at large sizes), buying speed with empty space, while std::unordered_map runs near 1.0 (a node per element, no empty slots). Raising max_load_factor packs the flat tables tighter — ska::flat_hash_map rises from about 0.30 to 0.60 at 10^7, roughly halving its empty space — at the cost of longer probes. Because footprint decides when a table crosses each cache boundary, this is the structural reason behind the cache tiers in the lookup results: a bulkier table falls out of cache at a smaller element count.

The hash function matters as much as the table

For integer keys std::hash is the identity function. A table that does its own mixing — most notably ska::flat_hash_map — can use it safely and enjoy its zero cost. But keys that are not uniformly random — those whose information lives in only a handful of bit positions, like pointers or small sequential IDs — make tables that assume a uniform hash (absl::flat_hash_map, emhash::hash_map7, tsl::robin_map) collide catastrophically with the identity hash, badly enough that some combinations time out during construction, so they need a good mixing hash like absl::Hash. Even a good hash can have weak spots: robin_hood::hash spreads some such key patterns poorly, producing an irregular mid-range slump for tsl::robin_map with that hash. For string keys, xxHash_xxh3 wins across the board.

Platform differences: Rocket Lake vs M1 Max

The two machines have very different memory systems. The Xeon E-2388G (Rocket Lake) has a 48 KB L1, 512 KB L2 and 16 MB L3 with 4 KB pages; the M1 Max has a 128 KB L1, 12 MB L2 and 48 MB system-level cache with 16 KB pages. The larger caches and pages on the M1 push every cache-tier boundary to the right and reduce TLB pressure, so although the ranking of tables is broadly consistent across platforms, the sizes at which one overtakes another differ. One quirk: the libc++ std::unordered_map on the M1 indexes buckets with a modulo that degenerates to a power-of-two mask when the bucket count is a power of two, discarding the high bits of the hash, which makes that combination anomalously slow at element counts that are exact powers of two.

Choosing a hash table

Two practical questions decide most of the choice: how often is the table written versus read, and how big is it relative to the cache it actually gets.

On the first question: if the table is built once and then queried far more than it is modified, the perfect-hash fph::DynamicFphMap (or fph::MetaFphMap when misses are common) gives the best lookup performance — at the cost of a slow build, an inability to absorb frequent updates, and roughly twice the memory of the most compact tables, so it suits a static dictionary, a read-mostly index or a membership set far better than a constantly changing table. If inserts, erases and lookups are mixed, a flat table is the better all-rounder, and absl::flat_hash_map with absl::Hash (or xxHash_xxh3 for strings) is a strong, compact, distribution-robust default.

The second question is where "fast in cache" needs unpacking. A table is "in cache" when its whole footprint — not the element count alone — fits in a cache level: on the Xeon, roughly 16,000 small (16-byte) entries fit in the L2 and about a million in the L3, and proportionally fewer when the value is large or the keys are heap-allocated strings. Therefore, ska::flat_hash_map, the fastest table while everything stays resident, is the right pick mainly for genuinely small tables, or when the memory can be spared. The catch is that ska::flat_hash_map also has the largest footprint of all (its low load factor is exactly what makes it fast), so for the same number of elements it leaves cache sooner than a compact table, and it competes harder for whatever cache is left to the rest of the program. In a real application the cache is shared with everything else running, so the effective threshold is well below the raw cache size; if memory is tight, the cache is contended, or the table is large, the compact absl::flat_hash_map or ska::bytell_hash_map will usually beat it despite ska::flat_hash_map's edge in a benchmark that owns the whole cache. For iteration-heavy work, ankerl::unordered_dense_map is the clear choice regardless of size.

Your situation Consider
build once, then look up a lot (read-mostly / static) fph::DynamicFphMap (hit-heavy) or fph::MetaFphMap (miss-heavy)
general-purpose, mixed insert / erase / lookup absl::flat_hash_map (with absl::Hash, or xxHash_xxh3 for strings) — fast, compact, robust
small table, or plenty of spare and uncontended cache, want lowest lookup time ska::flat_hash_map — but it is the largest table, so prefer a compact one once it grows
memory tight, cache shared/contended, or table large absl::flat_hash_map or ska::bytell_hash_map (most compact)
dominated by iteration / frequent full scans ankerl::unordered_dense_map
need reference / pointer stability absl::node_hash_map or std::unordered_map

Final advice

std::unordered_map is the slowest table in almost every throughput test because of its node-per-element layout, so it is worth keeping mainly when its iterator and pointer-stability guarantees are actually required. Beyond that, the table above is a starting point, not a verdict: the full space of tables, hashes, key distributions, value sizes, table sizes and platforms is enormous, and a real workload may land between the cases measured here. The surest answer is always to benchmark the two or three most promising candidates on the actual data and operation mix.


← Back to Hash Table Benchmark index