Hash Table Benchmark - Integer Lookup Latency
The integer key lookup latency test.
Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.
In this test, we measure the lookup latency of hash tables in three kinds of situations:
- Look up the keys in the hash table (hit or successful find).
- Look up the keys not in the hash table (miss or unsuccessful find).
- Look up keys with a 50% probability of being in the hash table.
Latency tells a different story from throughput. The throughput test keeps many independent lookups in flight, so the CPU's out-of-order engine overlaps their memory accesses. The P99 latency below is instead the 99th-percentile time of a single lookup, so it captures the worst accesses that cannot be hidden — a lookup that misses several cache lines, walks a long probe sequence, or mispredicts a branch. Latency is measured on the x86-64 platform (Xeon E-2388G) only.
Two regimes dominate every chart. While the table still fits in
cache, the tail is set by how many memory accesses the worst-case lookup
performs: tables that bound their probe count (the perfect-hash
fph::* maps) or reject a key quickly through metadata keep
a tight tail, while tables that may walk a long probe chain under
collisions show a heavier one. Once the table spills out of the L3
cache, almost every 99th-percentile lookup incurs at least one DRAM
access, so the tail flattens onto a memory-latency floor of roughly
420-460 ns on this Rocket Lake platform and the choice of table matters
far less for the tail than it does for throughput. In each group the
second chart sets a large max_load_factor, which packs the
tables more tightly and slightly lengthens the probe chains, but leaves
the qualitative ordering unchanged.
Lookup keys in the table (hit)
Use default max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
For hit lookups the ordering shifts with size. At small sizes the
simple open-addressing tables have the tightest tail —
ska::flat_hash_map with std::hash reaches a
P99 of about 7.8 ns at 1,024 elements. In the L2/L3 regime the
perfect-hash tables take over: at 32,768 elements
fph::DynamicFphMap and fph::MetaFphMap with
std::hash have the best P99 (about 22 ns), because they
guarantee a small, bounded number of probes even in the worst case.
Beyond about one million elements every table converges to the ~420-460
ns DRAM floor; the node-based std::unordered_map is the
clear outlier, since a tail lookup chases two cache-missing pointers and
reaches 655-765 ns.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Use large max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Lookup keys not in the table (miss)
Use default max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
The miss case is where the metadata design shines most.
fph::MetaFphMap can confirm that a key is absent by reading
a single metadata cache line, without walking any probe sequence. As
long as that metadata array fits in cache this gives it a dramatically
tighter tail than any other table: at 1,200,000 elements its P99 miss
latency is about 34 ns, while the other tables range from roughly 106 ns
(r_h::unordered_flat_map, absl::flat_hash_map)
up to 440-460 ns (ska::flat_hash_map,
tsl::robin_map) — a more-than-tenfold gap at the tail. The
advantage is largest in the L3 regime and disappears at 10,000,000
elements, where the metadata array itself no longer fits in the L3
cache, so even a single metadata read becomes a DRAM miss and
fph::MetaFphMap falls back to the common ~450 ns floor.
This is exactly the workload fph::MetaFphMap was designed
for, and the main reason to prefer it over
fph::DynamicFphMap when unsuccessful lookups are
common.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Use large max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Lookup keys with a 50% probability in the table
Use default max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
The 50% case mixes hits and misses, and the alternating outcome adds
a branch-misprediction penalty that narrows the gaps between tables. The
perfect-hash maps still hold a mild tail advantage in the in-cache
regime (fph::DynamicFphMap is around 27 ns at 32,768
elements), but once the working set leaves the cache the memory-latency
floor dominates again and the tables become hard to separate, with only
the node-based tables clearly behind (absl::node_hash_map
at about 520-620 ns and std::unordered_map at about 675-785
ns at the largest sizes).
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Use large max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
- Link: https://renzibei.com/en/int-lookup-latency/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.