Hash Table Benchmark - Integer Iterate
The integer key iteration test.
Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.
In this test, we measure the performance of iterating over the hash table.
Unlike lookup or insertion, iteration speed is dominated almost entirely by how a table stores its entries in memory, not by the hash function. There are three broad storage strategies among the tables we test:
- Dense array storage.
ankerl::unordered_dense_mapandemhash::hash_map7keep all key-value pairs packed in a contiguous array and store only indices (or small metadata) in the hash slots. Iterating is then a linear scan over a dense array, so the cost per element is small and, crucially, independent of the load factor. - Inline open addressing.
ska::flat_hash_map,ska::bytell_hash_map,tsl::robin_map,absl::flat_hash_map,fph::*androbin_hood::unordered_flat_mapstore the entries directly in a sparse slot array. To iterate they must walk the whole slot array and skip the empty slots, so the work per element grows as the table becomes emptier and the array grows larger than the cache. - Node-based storage.
std::unordered_mapandabsl::node_hash_mapallocate each entry in a separate node. Iterating chases pointers between nodes that may be scattered across the heap, which is cheap while the nodes stay in cache but degrades sharply once they spill to DRAM.
Throughput
<K,V>: <uint64_t with several split bits masked, uint64_t>
The charts confirm the picture above. On both platforms
ankerl::unordered_dense_map is the fastest by a wide margin
and is essentially flat across the whole size range — about 0.22 ns per
element on the Xeon E-2388G and about 2.0 ns on the M1 Max — because it
always iterates a packed array regardless of how many slots the table
has. emhash::hash_map7 is the runner-up with the same flat
behaviour (about 0.6 ns on the Xeon and 2.6 ns on the M1).
Each inline open-addressing table behaves differently: its
per-element cost rises with the number of elements as the slot array
grows past the cache. ska::flat_hash_map, for example,
climbs from roughly 1 ns at small sizes to about 9-10 ns at 10^7
elements on the Xeon, because most of the time is then spent reading
empty slots from memory. The node-based std::unordered_map
is the slowest at large sizes — around 35 ns per element at 10^7 on the
Xeon and 22 ns on the M1 — since iterating its node list becomes a
stream of cache-missing pointer dereferences.
A couple of combinations stop after the mid-size points.
absl::flat_hash_map and absl::node_hash_map
assume a well-distributed hash, but the integer std::hash
is the identity function; on the masked-bit keys it collides heavily, so
construction times out at large sizes and those data points are recorded
as zero and not plotted.
The remaining integer distributions and the 56-byte value type, shown below, give the same ranking. The dense-storage tables stay flat and fastest; the only effect of the larger 56-byte slot is that the open-addressing tables, which now move 64 bytes per slot, become memory-bound a little earlier.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Latency
The P99 latency of a single iteration step tells the same story from
the tail end (latency is measured on the x86-64 platform only).
ankerl::unordered_dense_map stays flat at about 1.6 ns
regardless of size, because advancing the iterator over a dense array
never misses far. The inline and node-based tables develop growing tails
as the backing storage outgrows the cache: at 10^7 elements the P99 step
reaches tens of nanoseconds for the open-addressing tables and hundreds
of nanoseconds for the node-based std::unordered_map and
absl::node_hash_map (about 374 ns and 439 ns respectively),
with each spike corresponding to a cache miss on the next slot or
node.
<K,V>: <uint64_t with several split bits masked, uint64_t>
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
- Link: https://renzibei.com/en/int-iterate/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.