Hash Table Benchmark - 64 Byte String Lookup
The 64 byte string lookup test.
Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.
In this test, we measure the lookup performance of hash tables in three kinds of situations:
- Look up the keys in the hash table (hit or successful find).
- Look up the keys not in the hash table (miss or unsuccessful find).
- Look up keys with a 50% probability of being in the hash table.
There are two kinds of keys in this test: strings with a fixed length of 64 bytes, and strings with a max length of 64 bytes.
At 64 bytes every fixed-length key is well past the Small String
Optimization limit, so all of them live on the heap and a lookup must
follow a pointer to reach the characters before hashing or comparing
them. The key is now four cache lines of data, which makes the hash
function cost the dominant term in the lookup: feeding 64 bytes through
the hash takes far longer than the slot arithmetic. As a result
xxHash_xxh3, which is tuned for byte throughput, is the
per-table winner for essentially every table on both machines in this
test, and the differences between table layouts shrink because they all
pay the same large hashing and pointer-chasing cost. The fixed-64 and
max-64 variants again differ in that the max-length keys include many
short, SSO-eligible strings that skip the heap indirection. Charts pair
the Xeon E-2388G with the M1 Max for throughput; latency is
Xeon-only.
Throughput
Lookup keys in the table (hit)
Use default max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
With hashing 64 bytes dominating, the field is closer together than
in the shorter-key tests, but the ordering is still informative. On the
Xeon the perfect-hash fph::DynamicFphMap and
fph::MetaFphMap (xxh3) are fastest in cache (about 16.4 and
16.8 ns at 1,024, 54.6 and 56.5 ns at 32,768) thanks to their
single-probe guarantee, which still matters even when the hash is
expensive. At 10^7 they remain near the front (152.7 and 163.5 ns) but
the short-probe robin-hood tables tsl::robin_map and
ska::flat_hash_map catch up (151.7 and 152.2 ns). Every
table uses xxh3 as its best hash here. The node-based
std::unordered_map is clearly behind at 246.7 ns, paying a
heap node dereference on top of the already-heavy key dereference.
The M1 Max shows the perfect-hash tables leading more decisively,
with fph::DynamicFphMap fastest across the whole range
(15.5 ns at 32,768, 112.7 ns at 10^7), helped by the large M1 caches
keeping its sparser array resident. std::unordered_map is
again last at 184.5 ns.
The max-length variant below is markedly faster, for instance
tsl::robin_map runs a hit in about 9.9 ns at 1,024 versus
18.4 ns for fixed-64, because the short SSO-eligible keys avoid both the
heap dereference and the cost of hashing a full 64 bytes. The
large-max_load_factor charts keep the same ranking with
denser packing.
<K,V>: <string with a max length of 64, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
<K,V>: <string with a max length of 64, uint64_t>
Lookup keys not in the table (miss)
Use default max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
Misses are again the case where fph::MetaFphMap stands
out: its metadata lets it reject an absent key without dereferencing the
heap-stored characters or comparing any bytes, so on the Xeon it leads
at 8.2 ns at 1,024, 16.9 ns at 32,768, and stays fastest to 10^7 (95.5
ns). The SwissTable tables absl::flat_hash_map and
ankerl::unordered_dense_map follow closely, while the
robin-hood tables tsl::robin_map and
ska::flat_hash_map fall behind in the mid-range (41.6 and
43.1 ns at 32,768) due to longer probe runs. Avoiding the full 64-byte
compare matters a lot here, so the metadata and SwissTable schemes that
short-circuit on a tag byte pull noticeably ahead. The M1 Max amplifies
the metadata advantage: fph::MetaFphMap answers a miss in
5.9-18 ns up to 200,000 elements and just 54.2 ns at 10^7.
std::unordered_map is the slowest at scale on both machines
(231 ns Xeon, 125 ns M1 at 10^7).
The max-length variant follows the same ranking with smaller numbers,
and the large-max_load_factor charts are consistent with
this pattern.
<K,V>: <string with a max length of 64, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
<K,V>: <string with a max length of 64, uint64_t>
Lookup keys with a 50% probability in the table
Use default max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
The mixed workload sits between the two cases. On the Xeon
absl::flat_hash_map, r_h::unordered_flat_map
and fph::MetaFphMap (all xxh3) share the lead, around
11.7-12 ns at 1,024 and 137-148 ns at 10^7, with
std::unordered_map trailing at 235 ns. On the M1 Max
fph::MetaFphMap is fastest across most sizes (22.8 ns at
32,768, 104.2 ns at 10^7), since the miss half of the workload rewards
its metadata while the hit half still benefits from its single-probe
layout. The max-length variant and the
large-max_load_factor appendix charts follow the same
pattern.
<K,V>: <string with a max length of 64, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
<K,V>: <string with a max length of 64, uint64_t>
Latency
The P99 latency charts (Xeon only) show the tail. With 64-byte heap-allocated keys, a worst-case lookup can miss on the slot array, on the key's heap buffer, and on the page table, so the tails are the heaviest of the three string tests.
Lookup keys in the table (hit)
Use default max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
For fixed-64 hits the tails converge sharply: from about 67-92 ns at
1,024 they jump to the 550-620 ns range already at 32,768 and then to
roughly 830-955 ns at 10^7 for the plotted tables, where
r_h::unordered_flat_map (xxh3) is at the front at 830 ns
and absl::node_hash_map is the slowest of them at 955 ns.
std::unordered_map is slower still, but its tail runs past
the chart's 1,000 ns display limit (about 1,025 ns at 10^7), so it is
not drawn here. The steep early rise reflects the guaranteed heap miss
on the key bytes once the buffers no longer fit in cache. The max-length
variant has lower tails (the leading tables run roughly 290-330 ns at
32,768 versus 550+ for fixed) because the inline short keys spare many
lookups that miss.
<K,V>: <string with a max length of 64, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
<K,V>: <string with a max length of 64, uint64_t>
Lookup keys not in the table (miss)
Use default max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
On the miss path fph::MetaFphMap keeps the lowest tail
in cache (24.8 ns at 1,024, 105.6 ns at 32,768) because its metadata
settles a miss without touching the heap-stored key bytes, and
ankerl::unordered_dense_map is the next best. The
robin-hood tables blow up earliest, past 400 ns at 32,768. The max-64
variant pushes the tail-blowup point out considerably, the leaders
staying near 52-62 ns at 32,768, because most missing keys are short
enough to be rejected from inline data. As at 24 bytes, this shows that
the SSO boundary shapes the latency tail as much as the table algorithm
does.
<K,V>: <string with a max length of 64, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
<K,V>: <string with a max length of 64, uint64_t>
Lookup keys with a 50% probability in the table
Use default max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
With half the queries hitting, the tail is governed by the heavier
hit path and most fixed-64 curves bunch into the 805-890 ns band at 10^7
(many tables cluster near 805-845 ns, with fph::MetaFphMap
the highest among them at about 890 ns); std::unordered_map
again runs past the chart's 1,000 ns display limit (about 1,065 ns) and
is not drawn. The metadata advantage that fph::MetaFphMap
enjoys on pure misses is diluted because the hit half still requires the
heap dereference and 64-byte compare. The max-length variant and the
large-max_load_factor appendix charts follow the same
pattern.
<K,V>: <string with a max length of 64, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 64, uint64_t>
<K,V>: <string with a max length of 64, uint64_t>
- Link: https://renzibei.com/en/64-byte-string-lookup/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.