Hash Table Benchmark - 24 Byte String Lookup
The 24 byte string lookup test.
Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.
In this test, we measure the lookup performance of hash tables in three kinds of situations:
- Look up the keys in the hash table (hit or successful find).
- Look up the keys not in the hash table (miss or unsuccessful find).
- Look up keys with a 50% probability of being in the hash table.
There are two kinds of keys in this test: strings with a fixed length of 24 bytes, and strings with a max length of 24 bytes.
At 24 bytes the keys have crossed the Small String Optimization
boundary: a std::string of this length no longer fits
inline (libstdc++ stores up to 15 bytes inline), so the fixed-24 keys
are all heap-allocated and the lookup must dereference a pointer to
reach the characters before it can hash or compare them. This adds a
near-guaranteed cache miss per key to the hash/compare cost discussed in
the 12-byte post, and it makes the fixed-length and max-length variants
behave quite differently: in the max-24 case many keys are short enough
to stay inline, avoiding that extra indirection. The four hashes tested
are again std::hash, absl::Hash,
robin_hood::hash, and xxHash_xxh3; with more
bytes to digest per key, the bytes-optimized xxh3 now wins for almost
every table. Each chart shows the best hash per table; Xeon E-2388G and
M1 Max throughput are paired and latency is Xeon-only.
Throughput
Lookup keys in the table (hit)
Use default max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
Two things stand out compared to the 12-byte test. First,
xxHash_xxh3 is now the per-table winner for essentially
every table on the Xeon, because the larger key makes hashing a bigger
fraction of the work and xxh3's byte throughput dominates. Second, the
whole field is slower and more tightly bunched: with every fixed-24 key
on the heap, a hit pays both the hashing of 24 bytes and a pointer chase
to the stored characters for the comparison. On the Xeon the
robin-hood-style ska::flat_hash_map and
tsl::robin_map (xxh3) lead at scale, about 115 ns at 10^7,
with the rest of the flat tables within 15-20 ns of them;
std::unordered_map is the main outlier at 192 ns. In cache
(1,024 elements) fph::DynamicFphMap and
ankerl::unordered_dense_map are quickest at roughly 11.5-12
ns, the perfect-hash table again benefiting from its single-probe
guarantee before memory traffic dominates.
The M1 Max separates the field more clearly: the perfect-hash tables
fph::DynamicFphMap and fph::MetaFphMap (xxh3)
lead through the mid-sizes (15.5 and 16.9 ns at 32,768) and stay near
the front to 10^7 (125.7 and 139.0 ns), helped by the M1's large caches
keeping their sparser arrays resident longer.
std::unordered_map is again last (208 ns at 10^7).
The max-length variant below is noticeably faster, often nearly half
the time at small sizes (for instance tsl::robin_map at
about 6.8 ns vs 18.6 ns at 1,024 on the Xeon), precisely because most
max-24 keys are short enough to stay inline and skip the heap
dereference. The large-max_load_factor charts keep the same
ranking with slightly denser packing.
<K,V>: <string with a max length of 24, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
<K,V>: <string with a max length of 24, uint64_t>
Lookup keys not in the table (miss)
Use default max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
As at 12 bytes, misses are cheaper and fph::MetaFphMap
leads because its metadata rejects an absent key without dereferencing
the heap pointer or comparing any bytes. On the Xeon it answers a
fixed-24 miss in about 6 ns at 1,024 and 10.6 ns at 32,768, ahead of the
SwissTable tables (absl::flat_hash_map 11.6 ns,
absl::node_hash_map 12.0 ns), and it stays fastest to 10^7
(69.8 ns). The robin-hood tables again lag in the mid-range
(tsl::robin_map 24.6 ns, ska::flat_hash_map
25.1 ns at 32,768) because of their longer probe runs. The metadata
advantage is even larger on the M1 Max, where
fph::MetaFphMap resolves a miss in 4.7-9.5 ns up through
200,000 elements and only 40.5 ns at 10^7, well clear of the field.
The max-length variant follows the same ordering but with smaller
absolute numbers since many keys stay inline; on the M1 Max
fph::MetaFphMap dips to just 9-11 ns through 1.2M elements.
The large-max_load_factor charts are again consistent with
this pattern.
<K,V>: <string with a max length of 24, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
<K,V>: <string with a max length of 24, uint64_t>
Lookup keys with a 50% probability in the table
Use default max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
The mixed workload lands between the two: the SwissTable tables
absl::flat_hash_map and
r_h::unordered_flat_map (xxh3) and
fph::MetaFphMap share the lead, around 8.6-10.2 ns at 1,024
and roughly 105-117 ns at 10^7 on the Xeon, with
std::unordered_map trailing at 187 ns. Because half the
queries are misses that never touch the heap-stored bytes, the
metadata-friendly tables do better here than in the pure-hit case. The
M1 Max keeps the same set of flat tables in front with its usual flatter
curves. The max-length variant and the
large-max_load_factor appendix charts follow the same
pattern.
<K,V>: <string with a max length of 24, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
<K,V>: <string with a max length of 24, uint64_t>
Latency
The P99 latency charts (Xeon only) capture the worst 1% of lookups. The 24-byte heap layout makes these tails heavier than at 12 bytes, because a slow lookup can miss the cache on the slot array, on the heap-stored key bytes, and on the page table all at once.
Lookup keys in the table (hit)
Use default max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
For fixed-24 hits the tails climb steeply and converge: by 10^7 every
flat table sits in a 750-830 ns band, with
r_h::unordered_flat_map (xxh3) best at 750 ns and
std::unordered_map worst at 930 ns. The jump from the
in-cache regime is large, the tail rising from about 50 ns at 1,024 to
several hundred nanoseconds already at 32,768, since the worst-case
lookup now reliably misses on the heap-stored key. The max-length
variant has visibly lower tails (e.g. roughly 535-695 ns at 10^7)
because the inline short keys spare those lookups the extra heap
miss.
<K,V>: <string with a max length of 24, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
<K,V>: <string with a max length of 24, uint64_t>
Lookup keys not in the table (miss)
Use default max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
On the miss path the metadata tables keep their tails lowest in
cache: fph::MetaFphMap and
ankerl::unordered_dense_map hold around 22-51 ns up to
32,768, while the robin-hood tables already spike past 145 ns there. By
10^7 the tails again merge into the 600-700 ns range. The max-24 variant
is the more telling one: there the tail stays low far longer (the
leaders are near 130 ns even at 200,000 elements) before climbing,
because most missing keys are rejected from inline data without a heap
touch. This shows how the SSO boundary, not just the table algorithm,
shapes the latency tail.
<K,V>: <string with a max length of 24, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
<K,V>: <string with a max length of 24, uint64_t>
Lookup keys with a 50% probability in the table
Use default max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
With half the queries hitting, the tail is set by the heavier hit
path and the fixed-24 curves converge into the 740-980 ns band at 10^7,
r_h::unordered_flat_map best at 745 ns and
std::unordered_map worst at 980 ns. The metadata advantage
that fph::MetaFphMap enjoys on pure misses is diluted here
because the hit half still pays the heap dereference and byte compare.
As before the max-length variant and the
large-max_load_factor appendix charts follow the same
pattern.
<K,V>: <string with a max length of 24, uint64_t>
Use large max_load_factor
<K,V>: <string with a fixed length of 24, uint64_t>
<K,V>: <string with a max length of 24, uint64_t>
- Link: https://renzibei.com/en/24-byte-string-lookup/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.