Hash Table Benchmark - 64 Byte String Lookup

Throughput
Latency

The 64 byte string lookup test.

📄 Raw CSV data for this benchmark: Xeon E-2388G · M1 Max.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the lookup performance of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

There are two kinds of keys in this test: strings with a fixed length of 64 bytes, and strings with a max length of 64 bytes.

At 64 bytes every fixed-length key is well past the Small String Optimization limit, so all of them live on the heap and a lookup must follow a pointer to reach the characters before hashing or comparing them. The key is now four cache lines of data, which makes the hash function cost the dominant term in the lookup: feeding 64 bytes through the hash takes far longer than the slot arithmetic. As a result xxHash_xxh3, which is tuned for byte throughput, is the per-table winner for essentially every table on both machines in this test, and the differences between table layouts shrink because they all pay the same large hashing and pointer-chasing cost. The fixed-64 and max-64 variants again differ in that the max-length keys include many short, SSO-eligible strings that skip the heap indirection. Charts pair the Xeon E-2388G with the M1 Max for throughput; latency is Xeon-only.

Throughput

Lookup keys in the table (hit)

Use default max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

With hashing 64 bytes dominating, the field is closer together than in the shorter-key tests, but the ordering is still informative. On the Xeon the perfect-hash fph::DynamicFphMap and fph::MetaFphMap (xxh3) are fastest in cache (about 16.4 and 16.8 ns at 1,024, 54.6 and 56.5 ns at 32,768) thanks to their single-probe guarantee, which still matters even when the hash is expensive. At 10^7 they remain near the front (152.7 and 163.5 ns) but the short-probe robin-hood tables tsl::robin_map and ska::flat_hash_map catch up (151.7 and 152.2 ns). Every table uses xxh3 as its best hash here. The node-based std::unordered_map is clearly behind at 246.7 ns, paying a heap node dereference on top of the already-heavy key dereference.

The M1 Max shows the perfect-hash tables leading more decisively, with fph::DynamicFphMap fastest across the whole range (15.5 ns at 32,768, 112.7 ns at 10^7), helped by the large M1 caches keeping its sparser array resident. std::unordered_map is again last at 184.5 ns.

The max-length variant below is markedly faster, for instance tsl::robin_map runs a hit in about 9.9 ns at 1,024 versus 18.4 ns for fixed-64, because the short SSO-eligible keys avoid both the heap dereference and the cost of hashing a full 64 bytes. The large-max_load_factor charts keep the same ranking with denser packing.

<K,V>: <string with a max length of 64, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

<K,V>: <string with a max length of 64, uint64_t>

Lookup keys not in the table (miss)

Use default max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

Misses are again the case where fph::MetaFphMap stands out: its metadata lets it reject an absent key without dereferencing the heap-stored characters or comparing any bytes, so on the Xeon it leads at 8.2 ns at 1,024, 16.9 ns at 32,768, and stays fastest to 10^7 (95.5 ns). The SwissTable tables absl::flat_hash_map and ankerl::unordered_dense_map follow closely, while the robin-hood tables tsl::robin_map and ska::flat_hash_map fall behind in the mid-range (41.6 and 43.1 ns at 32,768) due to longer probe runs. Avoiding the full 64-byte compare matters a lot here, so the metadata and SwissTable schemes that short-circuit on a tag byte pull noticeably ahead. The M1 Max amplifies the metadata advantage: fph::MetaFphMap answers a miss in 5.9-18 ns up to 200,000 elements and just 54.2 ns at 10^7. std::unordered_map is the slowest at scale on both machines (231 ns Xeon, 125 ns M1 at 10^7).

The max-length variant follows the same ranking with smaller numbers, and the large-max_load_factor charts are consistent with this pattern.

<K,V>: <string with a max length of 64, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

<K,V>: <string with a max length of 64, uint64_t>

Lookup keys with a 50% probability in the table

Use default max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

The mixed workload sits between the two cases. On the Xeon absl::flat_hash_map, r_h::unordered_flat_map and fph::MetaFphMap (all xxh3) share the lead, around 11.7-12 ns at 1,024 and 137-148 ns at 10^7, with std::unordered_map trailing at 235 ns. On the M1 Max fph::MetaFphMap is fastest across most sizes (22.8 ns at 32,768, 104.2 ns at 10^7), since the miss half of the workload rewards its metadata while the hit half still benefits from its single-probe layout. The max-length variant and the large-max_load_factor appendix charts follow the same pattern.

<K,V>: <string with a max length of 64, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

<K,V>: <string with a max length of 64, uint64_t>

Latency

The P99 latency charts (Xeon only) show the tail. With 64-byte heap-allocated keys, a worst-case lookup can miss on the slot array, on the key's heap buffer, and on the page table, so the tails are the heaviest of the three string tests.

Lookup keys in the table (hit)

Use default max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

For fixed-64 hits the tails converge sharply: from about 67-92 ns at 1,024 they jump to the 550-620 ns range already at 32,768 and then to roughly 830-955 ns at 10^7 for the plotted tables, where r_h::unordered_flat_map (xxh3) is at the front at 830 ns and absl::node_hash_map is the slowest of them at 955 ns. std::unordered_map is slower still, but its tail runs past the chart's 1,000 ns display limit (about 1,025 ns at 10^7), so it is not drawn here. The steep early rise reflects the guaranteed heap miss on the key bytes once the buffers no longer fit in cache. The max-length variant has lower tails (the leading tables run roughly 290-330 ns at 32,768 versus 550+ for fixed) because the inline short keys spare many lookups that miss.

<K,V>: <string with a max length of 64, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

<K,V>: <string with a max length of 64, uint64_t>

Lookup keys not in the table (miss)

Use default max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

On the miss path fph::MetaFphMap keeps the lowest tail in cache (24.8 ns at 1,024, 105.6 ns at 32,768) because its metadata settles a miss without touching the heap-stored key bytes, and ankerl::unordered_dense_map is the next best. The robin-hood tables blow up earliest, past 400 ns at 32,768. The max-64 variant pushes the tail-blowup point out considerably, the leaders staying near 52-62 ns at 32,768, because most missing keys are short enough to be rejected from inline data. As at 24 bytes, this shows that the SSO boundary shapes the latency tail as much as the table algorithm does.

<K,V>: <string with a max length of 64, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

<K,V>: <string with a max length of 64, uint64_t>

Lookup keys with a 50% probability in the table

Use default max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

With half the queries hitting, the tail is governed by the heavier hit path and most fixed-64 curves bunch into the 805-890 ns band at 10^7 (many tables cluster near 805-845 ns, with fph::MetaFphMap the highest among them at about 890 ns); std::unordered_map again runs past the chart's 1,000 ns display limit (about 1,065 ns) and is not drawn. The metadata advantage that fph::MetaFphMap enjoys on pure misses is diluted because the hit half still requires the heap dereference and 64-byte compare. The max-length variant and the large-max_load_factor appendix charts follow the same pattern.

<K,V>: <string with a max length of 64, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 64, uint64_t>

<K,V>: <string with a max length of 64, uint64_t>

← Back to Hash Table Benchmark index

Author: Renzibei
Link: https://renzibei.com/en/64-byte-string-lookup/
Copyright Notice: All articles on this website, unless otherwise stated, are created by the author and are licensed under CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.

post in 2026-06-13

tags:

{ hashtable }

{ benchmark }

{ algorithm }