Hash Table Benchmark - 12 Byte String Lookup
The 12 byte string lookup test.
Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.
In this test, we measure the lookup performance of hash tables in three kinds of situations:
- Look up the keys in the hash table (hit or successful find).
- Look up the keys not in the hash table (miss or unsuccessful find).
- Look up keys with a 50% probability of being in the hash table.
There are two kinds of keys in this test: strings with a fixed length of 12 bytes, and strings with a max length of 12 bytes.
Unlike the integer tests, string-key lookup spends a large share of
its time inside the hash function and the key comparison: the whole byte
sequence has to be fed through the hash, and on a hit the candidate
slot's bytes have to be compared against the query. This makes the
choice of hash function much more visible than for integers. The four
hashes tested here are std::hash, absl::Hash,
robin_hood::hash, and xxHash_xxh3; xxh3 is
purpose-built for hashing byte ranges, so it tends to win for the table
whose lookup loop is hash-bound. A second string-specific effect is
Small String Optimization (SSO): a 12-byte std::string is
stored inline in the string object, so there is no separate heap
allocation and the characters are reached without a pointer dereference.
The fixed-length variant makes every key exactly 12 bytes, while the
max-length variant lets keys vary up to 12 bytes, which also exercises
the length-dependent branches of the hash and comparison code. Below,
each chart shows the fastest hash per table (click the legend to reveal
the rest); the Xeon E-2388G and M1 Max throughput charts are paired, and
latency is measured on the Xeon only.
Throughput
Lookup keys in the table (hit)
Use default max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
On a successful lookup the table has to compute the hash, find the
slot, and then actually compare the 12 query bytes against the stored
key, so every table pays the full hash-plus-compare cost. On the Xeon
E-2388G the perfect-hash fph::DynamicFphMap is the fastest
while the working set stays in cache: with xxHash_xxh3 it
does a hit in about 6.5 ns at 1,024 elements and 8.4 ns at 32,768, ahead
of every conventional table. This is the regime where fph shines,
because its minimal perfect hash guarantees the key lands in its slot on
the first probe, so the only memory touch is the single slot it reads.
As the table grows past the L3 cache the picture inverts: once the
lookup becomes memory-bound, the tables that keep their probe sequence
short and local win instead. At 10^7 elements
tsl::robin_map and ska::flat_hash_map (both
with xxh3) lead at about 42 ns, while fph::DynamicFphMap
slows to 57 ns and the metadata-heavier fph::MetaFphMap to
64 ns, because the perfect-hash tables spread their entries over a
sparser array and miss the cache more often at scale.
The hash choice is decisive here. xxHash_xxh3 is the
per-table winner for most of the open-addressing and perfect-hash
tables, since their lookup loop is dominated by hashing 12 bytes; the
SwissTable-style emhash::hash_map7 and
ska::bytell_hash_map instead pair best with
absl::Hash. The node-based tables trail badly once they
leave cache: absl::node_hash_map reaches 107 ns and
std::unordered_map 118 ns at 10^7, each lookup chasing a
heap node pointer on top of the hash work.
The M1 Max tells a similar but flatter story. With its larger caches
the SwissTable-family tables lead at scale
(emhash::hash_map7 and ska::bytell_hash_map
around 60-66 ns at 10^7), and absl::Hash is the winning
hash for several tables there rather than xxh3, reflecting the Arm-tuned
implementation. The node-based tables again finish last
(std::unordered_map at about 200 ns at 10^7).
The max-length variant below behaves almost identically: with all
keys still fitting SSO and capped at 12 bytes, length variation barely
changes the hash/compare cost, so the rankings and magnitudes match the
fixed-length case. The large-max_load_factor charts pack
the entries more densely, which slightly helps cache residency at large
sizes but lengthens probe sequences, leaving the overall ordering
unchanged.
<K,V>: <string with a max length of 12, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
<K,V>: <string with a max length of 12, uint64_t>
Lookup keys not in the table (miss)
Use default max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
A miss is cheaper than a hit because most tables can reject the query
from metadata alone, without ever comparing the 12 bytes. This is where
fph::MetaFphMap is the clear winner: its per-slot metadata
lets it answer "not present" after a single metadata check, so on the
Xeon it returns a miss in about 3.6 ns at 1,024 elements and stays under
10 ns all the way to 1.2M elements (9.65 ns), far ahead of the field.
Even at 10^7, where it becomes DRAM-bound and rises to 30.7 ns, it is
competitive with the best SwissTable tables
(absl::flat_hash_map at 25.1 ns,
r_h::unordered_flat_map at 29.0 ns). The robin-hood-style
tables tsl::robin_map and ska::flat_hash_map
are much slower in the cache regime (16-26 ns at 32,768-200,000) because
their backward-shift probing must walk a run of slots before concluding
the key is absent.
Because a miss avoids the full byte comparison, the hash function
matters slightly less here, and the per-table winner is more often
absl::Hash than xxh3. The node-based
std::unordered_map remains the slowest at scale (110 ns at
10^7), since even a miss requires hashing and then traversing a bucket's
node chain. On the M1 Max the same ordering holds with smaller absolute
numbers: fph::MetaFphMap answers misses in 4-6 ns through
1.2M elements and only 18.6 ns at 10^7, again the tightest curve on the
platform.
The max-length variant and the large-max_load_factor
charts repeat this pattern; the only visible change is that the denser
large-load-factor tables lengthen the robin-hood probe runs slightly,
widening their gap behind the metadata-based and SwissTable tables.
<K,V>: <string with a max length of 12, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
<K,V>: <string with a max length of 12, uint64_t>
Lookup keys with a 50% probability in the table
Use default max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
The 50%-hit workload mixes hits and misses, so it sits between the
two cases and the ranking changes accordingly. On the Xeon the
SwissTable-style absl::flat_hash_map and
r_h::unordered_flat_map (both with xxh3) now top the
small-to-mid range at about 6 ns at 1,024 and 13-14 ns at 32,768, with
fph::MetaFphMap right alongside them; the perfect-hash
tables no longer dominate the cache regime as cleanly as in the pure-hit
case because half the queries are misses that the metadata tables
resolve very cheaply. At 10^7 the robin-hood tables again pull ahead on
memory locality (tsl::robin_map 43.5 ns,
ska::flat_hash_map 44.4 ns), while
std::unordered_map trails at 119.5 ns.
The M1 Max ranks r_h::unordered_flat_map and
absl::flat_hash_map first across most sizes, with the usual
flatter curves; the per-table best hash is a mix of xxh3 and
absl::Hash. As before, the max-length variant and the
large-max_load_factor charts follow the same pattern with
no qualitative change.
<K,V>: <string with a max length of 12, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
<K,V>: <string with a max length of 12, uint64_t>
Latency
The P99 latency charts (Xeon only) show the tail of the lookup-time distribution: the 99th-percentile single lookup, which is dominated by the worst cache and TLB misses on the probe path. As the working set outgrows the L3 cache, every table's tail jumps to several hundred nanoseconds because the slowest 1% of lookups now take a full DRAM round trip.
Lookup keys in the table (hit)
Use default max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
For hits, the tail is set by how many cache lines the probe sequence
touches in its worst case. In cache, fph::DynamicFphMap and
fph::MetaFphMap have the lowest tails (about 21 ns at 1,024
and 36-37 ns at 32,768) because their single-probe guarantee bounds the
worst case tightly. Once the table spills to DRAM the tails converge
into the 460-560 ns band for the flat tables, where
r_h::unordered_flat_map and
absl::flat_hash_map are best at 10^7 (about 527 and 557
ns). The node-based tables have the worst tails throughout
(absl::node_hash_map 680 ns,
std::unordered_map 795 ns at 10^7), since a single lookup
can miss the cache both on the bucket array and on the node it points
to.
<K,V>: <string with a max length of 12, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
<K,V>: <string with a max length of 12, uint64_t>
Lookup keys not in the table (miss)
Use default max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
The miss tail is where fph::MetaFphMap's metadata pays
off most dramatically: its P99 stays extremely flat, only 16-35 ns from
1,024 up to 200,000 elements and 59.5 ns at 1.2M, where every other
table has already climbed into the hundreds of nanoseconds. Because a
miss is settled by a single metadata read, its worst case rarely needs a
second cache line, so the tail does not blow up until the table no
longer fits in DRAM-resident pages (482 ns at 10^7). The robin-hood
tables tsl::robin_map and ska::flat_hash_map
show the opposite behaviour, jumping to about 412 ns already at 200,000
elements because a worst-case miss walks a long probe run.
std::unordered_map again has the heaviest tail (905 ns at
10^7).
<K,V>: <string with a max length of 12, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
<K,V>: <string with a max length of 12, uint64_t>
Lookup keys with a 50% probability in the table
Use default max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
With half the queries hitting, the tail is governed by the more
expensive hit path: the flat SwissTable tables
r_h::unordered_flat_map and
absl::flat_hash_map keep the best P99 (about 23 and 22 ns
at 1,024, rising to 527 and 534 ns at 10^7).
fph::MetaFphMap no longer has the flat advantage it showed
on pure misses, since the hit half of the workload still requires the
byte comparison and a possible extra cache line. The node-based
std::unordered_map once more has the largest tail (850 ns
at 10^7). The max-length variant and the
large-max_load_factor appendix charts follow the same
pattern.
<K,V>: <string with a max length of 12, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 12, uint64_t>
<K,V>: <string with a max length of 12, uint64_t>
- Link: https://renzibei.com/en/12-byte-string-lookup/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.