Hash Table Benchmark - Integer Iterate

The integer key iteration test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the performance of iterating over the hash table.

Unlike lookup or insertion, iteration speed is dominated almost entirely by how a table stores its entries in memory, not by the hash function. There are three broad storage strategies among the tables we test:

  • Dense array storage. ankerl::unordered_dense_map and emhash::hash_map7 keep all key-value pairs packed in a contiguous array and store only indices (or small metadata) in the hash slots. Iterating is then a linear scan over a dense array, so the cost per element is small and, crucially, independent of the load factor.
  • Inline open addressing. ska::flat_hash_map, ska::bytell_hash_map, tsl::robin_map, absl::flat_hash_map, fph::* and robin_hood::unordered_flat_map store the entries directly in a sparse slot array. To iterate they must walk the whole slot array and skip the empty slots, so the work per element grows as the table becomes emptier and the array grows larger than the cache.
  • Node-based storage. std::unordered_map and absl::node_hash_map allocate each entry in a separate node. Iterating chases pointers between nodes that may be scattered across the heap, which is cheap while the nodes stay in cache but degrades sharply once they spill to DRAM.

Throughput

<K,V>: <uint64_t with several split bits masked, uint64_t>

The charts confirm the picture above. On both platforms ankerl::unordered_dense_map is the fastest by a wide margin and is essentially flat across the whole size range — about 0.22 ns per element on the Xeon E-2388G and about 2.0 ns on the M1 Max — because it always iterates a packed array regardless of how many slots the table has. emhash::hash_map7 is the runner-up with the same flat behaviour (about 0.6 ns on the Xeon and 2.6 ns on the M1).

Each inline open-addressing table behaves differently: its per-element cost rises with the number of elements as the slot array grows past the cache. ska::flat_hash_map, for example, climbs from roughly 1 ns at small sizes to about 9-10 ns at 10^7 elements on the Xeon, because most of the time is then spent reading empty slots from memory. The node-based std::unordered_map is the slowest at large sizes — around 35 ns per element at 10^7 on the Xeon and 22 ns on the M1 — since iterating its node list becomes a stream of cache-missing pointer dereferences.

A couple of combinations stop after the mid-size points. absl::flat_hash_map and absl::node_hash_map assume a well-distributed hash, but the integer std::hash is the identity function; on the masked-bit keys it collides heavily, so construction times out at large sizes and those data points are recorded as zero and not plotted.

The remaining integer distributions and the 56-byte value type, shown below, give the same ranking. The dense-storage tables stay flat and fastest; the only effect of the larger 56-byte slot is that the open-addressing tables, which now move 64 bytes per slot, become memory-bound a little earlier.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Latency

The P99 latency of a single iteration step tells the same story from the tail end (latency is measured on the x86-64 platform only). ankerl::unordered_dense_map stays flat at about 1.6 ns regardless of size, because advancing the iterator over a dense array never misses far. The inline and node-based tables develop growing tails as the backing storage outgrows the cache: at 10^7 elements the P99 step reaches tens of nanoseconds for the open-addressing tables and hundreds of nanoseconds for the node-based std::unordered_map and absl::node_hash_map (about 374 ns and 439 ns respectively), with each spike corresponding to a cache miss on the next slot or node.

<K,V>: <uint64_t with several split bits masked, uint64_t>

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>


← Back to Hash Table Benchmark index

  • Author: Renzibei
  • Link: https://renzibei.com/en/int-iterate/
  • Copyright Notice: All articles on this website, unless otherwise stated, are created by the author and are licensed under Creative Commons License CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.