Hash Table Benchmark - Integer Lookup Latency

Lookup keys in the table (hit)
1. Use default max_load_factor
2. Use large max_load_factor
Lookup keys not in the table (miss)
1. Use default max_load_factor
2. Use large max_load_factor
Lookup keys with a 50% probability in the table
1. Use default max_load_factor
2. Use large max_load_factor

The integer key lookup latency test.

📄 Raw CSV data for these (x86-only) latency measurements: Xeon E-2388G.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the lookup latency of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

Latency tells a different story from throughput. The throughput test keeps many independent lookups in flight, so the CPU's out-of-order engine overlaps their memory accesses. The P99 latency below is instead the 99th-percentile time of a single lookup, so it captures the worst accesses that cannot be hidden — a lookup that misses several cache lines, walks a long probe sequence, or mispredicts a branch. Latency is measured on the x86-64 platform (Xeon E-2388G) only.

Two regimes dominate every chart. While the table still fits in cache, the tail is set by how many memory accesses the worst-case lookup performs: tables that bound their probe count (the perfect-hash fph::* maps) or reject a key quickly through metadata keep a tight tail, while tables that may walk a long probe chain under collisions show a heavier one. Once the table spills out of the L3 cache, almost every 99th-percentile lookup incurs at least one DRAM access, so the tail flattens onto a memory-latency floor of roughly 420-460 ns on this Rocket Lake platform and the choice of table matters far less for the tail than it does for throughput. In each group the second chart sets a large max_load_factor, which packs the tables more tightly and slightly lengthens the probe chains, but leaves the qualitative ordering unchanged.

Lookup keys in the table (hit)

Use default max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

For hit lookups the ordering shifts with size. At small sizes the simple open-addressing tables have the tightest tail — ska::flat_hash_map with std::hash reaches a P99 of about 7.8 ns at 1,024 elements. In the L2/L3 regime the perfect-hash tables take over: at 32,768 elements fph::DynamicFphMap and fph::MetaFphMap with std::hash have the best P99 (about 22 ns), because they guarantee a small, bounded number of probes even in the worst case. Beyond about one million elements every table converges to the ~420-460 ns DRAM floor; the node-based std::unordered_map is the clear outlier, since a tail lookup chases two cache-missing pointers and reaches 655-765 ns.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Use large max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Lookup keys not in the table (miss)

Use default max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

The miss case is where the metadata design shines most. fph::MetaFphMap can confirm that a key is absent by reading a single metadata cache line, without walking any probe sequence. As long as that metadata array fits in cache this gives it a dramatically tighter tail than any other table: at 1,200,000 elements its P99 miss latency is about 34 ns, while the other tables range from roughly 106 ns (r_h::unordered_flat_map, absl::flat_hash_map) up to 440-460 ns (ska::flat_hash_map, tsl::robin_map) — a more-than-tenfold gap at the tail. The advantage is largest in the L3 regime and disappears at 10,000,000 elements, where the metadata array itself no longer fits in the L3 cache, so even a single metadata read becomes a DRAM miss and fph::MetaFphMap falls back to the common ~450 ns floor. This is exactly the workload fph::MetaFphMap was designed for, and the main reason to prefer it over fph::DynamicFphMap when unsuccessful lookups are common.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Use large max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Lookup keys with a 50% probability in the table

Use default max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

The 50% case mixes hits and misses, and the alternating outcome adds a branch-misprediction penalty that narrows the gaps between tables. The perfect-hash maps still hold a mild tail advantage in the in-cache regime (fph::DynamicFphMap is around 27 ns at 32,768 elements), but once the working set leaves the cache the memory-latency floor dominates again and the tables become hard to separate, with only the node-based tables clearly behind (absl::node_hash_map at about 520-620 ns and std::unordered_map at about 675-785 ns at the largest sizes).

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Use large max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

← Back to Hash Table Benchmark index

Author: Renzibei
Link: https://renzibei.com/en/int-lookup-latency/
Copyright Notice: All articles on this website, unless otherwise stated, are created by the author and are licensed under CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.

post in 2026-06-13

tags:

{ hashtable }

{ benchmark }

{ algorithm }