Hash Table Benchmark - 24 Byte String Lookup

Throughput
Latency

The 24 byte string lookup test.

📄 Raw CSV data for this benchmark: Xeon E-2388G · M1 Max.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the lookup performance of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

There are two kinds of keys in this test: strings with a fixed length of 24 bytes, and strings with a max length of 24 bytes.

At 24 bytes the keys have crossed the Small String Optimization boundary: a std::string of this length no longer fits inline (libstdc++ stores up to 15 bytes inline), so the fixed-24 keys are all heap-allocated and the lookup must dereference a pointer to reach the characters before it can hash or compare them. This adds a near-guaranteed cache miss per key to the hash/compare cost discussed in the 12-byte post, and it makes the fixed-length and max-length variants behave quite differently: in the max-24 case many keys are short enough to stay inline, avoiding that extra indirection. The four hashes tested are again std::hash, absl::Hash, robin_hood::hash, and xxHash_xxh3; with more bytes to digest per key, the bytes-optimized xxh3 now wins for almost every table. Each chart shows the best hash per table; Xeon E-2388G and M1 Max throughput are paired and latency is Xeon-only.

Throughput

Lookup keys in the table (hit)

Use default max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

Two things stand out compared to the 12-byte test. First, xxHash_xxh3 is now the per-table winner for essentially every table on the Xeon, because the larger key makes hashing a bigger fraction of the work and xxh3's byte throughput dominates. Second, the whole field is slower and more tightly bunched: with every fixed-24 key on the heap, a hit pays both the hashing of 24 bytes and a pointer chase to the stored characters for the comparison. On the Xeon the robin-hood-style ska::flat_hash_map and tsl::robin_map (xxh3) lead at scale, about 115 ns at 10^7, with the rest of the flat tables within 15-20 ns of them; std::unordered_map is the main outlier at 192 ns. In cache (1,024 elements) fph::DynamicFphMap and ankerl::unordered_dense_map are quickest at roughly 11.5-12 ns, the perfect-hash table again benefiting from its single-probe guarantee before memory traffic dominates.

The M1 Max separates the field more clearly: the perfect-hash tables fph::DynamicFphMap and fph::MetaFphMap (xxh3) lead through the mid-sizes (15.5 and 16.9 ns at 32,768) and stay near the front to 10^7 (125.7 and 139.0 ns), helped by the M1's large caches keeping their sparser arrays resident longer. std::unordered_map is again last (208 ns at 10^7).

The max-length variant below is noticeably faster, often nearly half the time at small sizes (for instance tsl::robin_map at about 6.8 ns vs 18.6 ns at 1,024 on the Xeon), precisely because most max-24 keys are short enough to stay inline and skip the heap dereference. The large-max_load_factor charts keep the same ranking with slightly denser packing.

<K,V>: <string with a max length of 24, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

Lookup keys not in the table (miss)

Use default max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

As at 12 bytes, misses are cheaper and fph::MetaFphMap leads because its metadata rejects an absent key without dereferencing the heap pointer or comparing any bytes. On the Xeon it answers a fixed-24 miss in about 6 ns at 1,024 and 10.6 ns at 32,768, ahead of the SwissTable tables (absl::flat_hash_map 11.6 ns, absl::node_hash_map 12.0 ns), and it stays fastest to 10^7 (69.8 ns). The robin-hood tables again lag in the mid-range (tsl::robin_map 24.6 ns, ska::flat_hash_map 25.1 ns at 32,768) because of their longer probe runs. The metadata advantage is even larger on the M1 Max, where fph::MetaFphMap resolves a miss in 4.7-9.5 ns up through 200,000 elements and only 40.5 ns at 10^7, well clear of the field.

The max-length variant follows the same ordering but with smaller absolute numbers since many keys stay inline; on the M1 Max fph::MetaFphMap dips to just 9-11 ns through 1.2M elements. The large-max_load_factor charts are again consistent with this pattern.

<K,V>: <string with a max length of 24, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

Lookup keys with a 50% probability in the table

Use default max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

The mixed workload lands between the two: the SwissTable tables absl::flat_hash_map and r_h::unordered_flat_map (xxh3) and fph::MetaFphMap share the lead, around 8.6-10.2 ns at 1,024 and roughly 105-117 ns at 10^7 on the Xeon, with std::unordered_map trailing at 187 ns. Because half the queries are misses that never touch the heap-stored bytes, the metadata-friendly tables do better here than in the pure-hit case. The M1 Max keeps the same set of flat tables in front with its usual flatter curves. The max-length variant and the large-max_load_factor appendix charts follow the same pattern.

<K,V>: <string with a max length of 24, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

Latency

The P99 latency charts (Xeon only) capture the worst 1% of lookups. The 24-byte heap layout makes these tails heavier than at 12 bytes, because a slow lookup can miss the cache on the slot array, on the heap-stored key bytes, and on the page table all at once.

Lookup keys in the table (hit)

Use default max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

For fixed-24 hits the tails climb steeply and converge: by 10^7 every flat table sits in a 750-830 ns band, with r_h::unordered_flat_map (xxh3) best at 750 ns and std::unordered_map worst at 930 ns. The jump from the in-cache regime is large, the tail rising from about 50 ns at 1,024 to several hundred nanoseconds already at 32,768, since the worst-case lookup now reliably misses on the heap-stored key. The max-length variant has visibly lower tails (e.g. roughly 535-695 ns at 10^7) because the inline short keys spare those lookups the extra heap miss.

<K,V>: <string with a max length of 24, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

Lookup keys not in the table (miss)

Use default max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

On the miss path the metadata tables keep their tails lowest in cache: fph::MetaFphMap and ankerl::unordered_dense_map hold around 22-51 ns up to 32,768, while the robin-hood tables already spike past 145 ns there. By 10^7 the tails again merge into the 600-700 ns range. The max-24 variant is the more telling one: there the tail stays low far longer (the leaders are near 130 ns even at 200,000 elements) before climbing, because most missing keys are rejected from inline data without a heap touch. This shows how the SSO boundary, not just the table algorithm, shapes the latency tail.

<K,V>: <string with a max length of 24, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

Lookup keys with a 50% probability in the table

Use default max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

With half the queries hitting, the tail is set by the heavier hit path and the fixed-24 curves converge into the 740-980 ns band at 10^7, r_h::unordered_flat_map best at 745 ns and std::unordered_map worst at 980 ns. The metadata advantage that fph::MetaFphMap enjoys on pure misses is diluted here because the hit half still pays the heap dereference and byte compare. As before the max-length variant and the large-max_load_factor appendix charts follow the same pattern.

<K,V>: <string with a max length of 24, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

← Back to Hash Table Benchmark index

Author: Renzibei
Link: https://renzibei.com/en/24-byte-string-lookup/
Copyright Notice: All articles on this website, unless otherwise stated, are created by the author and are licensed under CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.

post in 2026-06-13

tags:

{ hashtable }

{ benchmark }

{ algorithm }