Hash Table Benchmark - 12 Byte String Lookup

Throughput
Latency

The 12 byte string lookup test.

📄 Raw CSV data for this benchmark: Xeon E-2388G · M1 Max.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the lookup performance of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

There are two kinds of keys in this test: strings with a fixed length of 12 bytes, and strings with a max length of 12 bytes.

Unlike the integer tests, string-key lookup spends a large share of its time inside the hash function and the key comparison: the whole byte sequence has to be fed through the hash, and on a hit the candidate slot's bytes have to be compared against the query. This makes the choice of hash function much more visible than for integers. The four hashes tested here are std::hash, absl::Hash, robin_hood::hash, and xxHash_xxh3; xxh3 is purpose-built for hashing byte ranges, so it tends to win for the table whose lookup loop is hash-bound. A second string-specific effect is Small String Optimization (SSO): a 12-byte std::string is stored inline in the string object, so there is no separate heap allocation and the characters are reached without a pointer dereference. The fixed-length variant makes every key exactly 12 bytes, while the max-length variant lets keys vary up to 12 bytes, which also exercises the length-dependent branches of the hash and comparison code. Below, each chart shows the fastest hash per table (click the legend to reveal the rest); the Xeon E-2388G and M1 Max throughput charts are paired, and latency is measured on the Xeon only.

Throughput

Lookup keys in the table (hit)

Use default max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

On a successful lookup the table has to compute the hash, find the slot, and then actually compare the 12 query bytes against the stored key, so every table pays the full hash-plus-compare cost. On the Xeon E-2388G the perfect-hash fph::DynamicFphMap is the fastest while the working set stays in cache: with xxHash_xxh3 it does a hit in about 6.5 ns at 1,024 elements and 8.4 ns at 32,768, ahead of every conventional table. This is the regime where fph shines, because its minimal perfect hash guarantees the key lands in its slot on the first probe, so the only memory touch is the single slot it reads. As the table grows past the L3 cache the picture inverts: once the lookup becomes memory-bound, the tables that keep their probe sequence short and local win instead. At 10^7 elements tsl::robin_map and ska::flat_hash_map (both with xxh3) lead at about 42 ns, while fph::DynamicFphMap slows to 57 ns and the metadata-heavier fph::MetaFphMap to 64 ns, because the perfect-hash tables spread their entries over a sparser array and miss the cache more often at scale.

The hash choice is decisive here. xxHash_xxh3 is the per-table winner for most of the open-addressing and perfect-hash tables, since their lookup loop is dominated by hashing 12 bytes; the SwissTable-style emhash::hash_map7 and ska::bytell_hash_map instead pair best with absl::Hash. The node-based tables trail badly once they leave cache: absl::node_hash_map reaches 107 ns and std::unordered_map 118 ns at 10^7, each lookup chasing a heap node pointer on top of the hash work.

The M1 Max tells a similar but flatter story. With its larger caches the SwissTable-family tables lead at scale (emhash::hash_map7 and ska::bytell_hash_map around 60-66 ns at 10^7), and absl::Hash is the winning hash for several tables there rather than xxh3, reflecting the Arm-tuned implementation. The node-based tables again finish last (std::unordered_map at about 200 ns at 10^7).

The max-length variant below behaves almost identically: with all keys still fitting SSO and capped at 12 bytes, length variation barely changes the hash/compare cost, so the rankings and magnitudes match the fixed-length case. The large-max_load_factor charts pack the entries more densely, which slightly helps cache residency at large sizes but lengthens probe sequences, leaving the overall ordering unchanged.

<K,V>: <string with a max length of 12, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

Lookup keys not in the table (miss)

Use default max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

A miss is cheaper than a hit because most tables can reject the query from metadata alone, without ever comparing the 12 bytes. This is where fph::MetaFphMap is the clear winner: its per-slot metadata lets it answer "not present" after a single metadata check, so on the Xeon it returns a miss in about 3.6 ns at 1,024 elements and stays under 10 ns all the way to 1.2M elements (9.65 ns), far ahead of the field. Even at 10^7, where it becomes DRAM-bound and rises to 30.7 ns, it is competitive with the best SwissTable tables (absl::flat_hash_map at 25.1 ns, r_h::unordered_flat_map at 29.0 ns). The robin-hood-style tables tsl::robin_map and ska::flat_hash_map are much slower in the cache regime (16-26 ns at 32,768-200,000) because their backward-shift probing must walk a run of slots before concluding the key is absent.

Because a miss avoids the full byte comparison, the hash function matters slightly less here, and the per-table winner is more often absl::Hash than xxh3. The node-based std::unordered_map remains the slowest at scale (110 ns at 10^7), since even a miss requires hashing and then traversing a bucket's node chain. On the M1 Max the same ordering holds with smaller absolute numbers: fph::MetaFphMap answers misses in 4-6 ns through 1.2M elements and only 18.6 ns at 10^7, again the tightest curve on the platform.

The max-length variant and the large-max_load_factor charts repeat this pattern; the only visible change is that the denser large-load-factor tables lengthen the robin-hood probe runs slightly, widening their gap behind the metadata-based and SwissTable tables.

<K,V>: <string with a max length of 12, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

Lookup keys with a 50% probability in the table

Use default max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

The 50%-hit workload mixes hits and misses, so it sits between the two cases and the ranking changes accordingly. On the Xeon the SwissTable-style absl::flat_hash_map and r_h::unordered_flat_map (both with xxh3) now top the small-to-mid range at about 6 ns at 1,024 and 13-14 ns at 32,768, with fph::MetaFphMap right alongside them; the perfect-hash tables no longer dominate the cache regime as cleanly as in the pure-hit case because half the queries are misses that the metadata tables resolve very cheaply. At 10^7 the robin-hood tables again pull ahead on memory locality (tsl::robin_map 43.5 ns, ska::flat_hash_map 44.4 ns), while std::unordered_map trails at 119.5 ns.

The M1 Max ranks r_h::unordered_flat_map and absl::flat_hash_map first across most sizes, with the usual flatter curves; the per-table best hash is a mix of xxh3 and absl::Hash. As before, the max-length variant and the large-max_load_factor charts follow the same pattern with no qualitative change.

<K,V>: <string with a max length of 12, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

Latency

The P99 latency charts (Xeon only) show the tail of the lookup-time distribution: the 99th-percentile single lookup, which is dominated by the worst cache and TLB misses on the probe path. As the working set outgrows the L3 cache, every table's tail jumps to several hundred nanoseconds because the slowest 1% of lookups now take a full DRAM round trip.

Lookup keys in the table (hit)

Use default max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

For hits, the tail is set by how many cache lines the probe sequence touches in its worst case. In cache, fph::DynamicFphMap and fph::MetaFphMap have the lowest tails (about 21 ns at 1,024 and 36-37 ns at 32,768) because their single-probe guarantee bounds the worst case tightly. Once the table spills to DRAM the tails converge into the 460-560 ns band for the flat tables, where r_h::unordered_flat_map and absl::flat_hash_map are best at 10^7 (about 527 and 557 ns). The node-based tables have the worst tails throughout (absl::node_hash_map 680 ns, std::unordered_map 795 ns at 10^7), since a single lookup can miss the cache both on the bucket array and on the node it points to.

<K,V>: <string with a max length of 12, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

Lookup keys not in the table (miss)

Use default max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

The miss tail is where fph::MetaFphMap's metadata pays off most dramatically: its P99 stays extremely flat, only 16-35 ns from 1,024 up to 200,000 elements and 59.5 ns at 1.2M, where every other table has already climbed into the hundreds of nanoseconds. Because a miss is settled by a single metadata read, its worst case rarely needs a second cache line, so the tail does not blow up until the table no longer fits in DRAM-resident pages (482 ns at 10^7). The robin-hood tables tsl::robin_map and ska::flat_hash_map show the opposite behaviour, jumping to about 412 ns already at 200,000 elements because a worst-case miss walks a long probe run. std::unordered_map again has the heaviest tail (905 ns at 10^7).

<K,V>: <string with a max length of 12, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

Lookup keys with a 50% probability in the table

Use default max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

With half the queries hitting, the tail is governed by the more expensive hit path: the flat SwissTable tables r_h::unordered_flat_map and absl::flat_hash_map keep the best P99 (about 23 and 22 ns at 1,024, rising to 527 and 534 ns at 10^7). fph::MetaFphMap no longer has the flat advantage it showed on pure misses, since the hit half of the workload still requires the byte comparison and a possible extra cache line. The node-based std::unordered_map once more has the largest tail (850 ns at 10^7). The max-length variant and the large-max_load_factor appendix charts follow the same pattern.

<K,V>: <string with a max length of 12, uint64_t>

Use large max_load_factor

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

← Back to Hash Table Benchmark index

Author: Renzibei
Link: https://renzibei.com/en/12-byte-string-lookup/
Copyright Notice: All articles on this website, unless otherwise stated, are created by the author and are licensed under CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.

post in 2026-06-13

tags:

{ hashtable }

{ benchmark }

{ algorithm }