Hash Table Benchmark - Integer Lookup Throughput
- Exploring the Connection between Lookup Performance and Memory Hierarchy
- Lookup keys in the table (hit)
- Lookup keys not in the table (miss)
- Lookup keys with a 50% probability in the table
- Hit Appendix
- Miss Appendix
- 50% Hit Appendix
The integer key lookup throughput test.
Click the labels on the legend to view or hide data lines corresponding to specific hash tables and hash functions.
In this test, we measure the lookup performance of hash tables in three kinds of situations:
- Look up the keys in the hash table (hit or successful find).
- Look up the keys not in the hash table (miss or unsuccessful find).
- Look up keys with a 50% probability of being in the hash table.
Exploring the Connection between Lookup Performance and Memory Hierarchy
Before looking at the details of lookup speed, it is important to note the strong correlation between hash table lookup speed and the cache hit rate. With integer key-value pairs, the hash table lookup operation mainly involves loading content from memory, which consumes most of the operation time.
Modern computer systems use a hierarchical memory design. For instance, Intel Rocket Lake has registers, L1 cache, L2 cache, L3 cache, and DRAM, with speed decreasing in that order. The M1 Max has L1 cache, L2 cache, SLC cache, and DRAM.
As the cache miss rate at a given level rises, the overall lookup
time becomes limited by the memory hardware speed of the next, slower
level. Besides cache misses, TLB misses also contribute to the time
penalty. The chart below helps illustrate this concept. It shows the P50
(median) hit-lookup latency on the Xeon E-2388G. The structure is
easiest to read by isolating a single open-addressing table such as
ska::flat_hash_map with std::hash (click the
other legend entries to hide them). For such a table the median lookup
succeeds after a single key comparison, so the P50 value mostly reflects
the latency of one memory load and tracks the memory hierarchy
closely.
The figure above shows that on the Intel Rocket Lake architecture, lookup performance separates into four tiers, each determined by the number of elements:
- Elements that fully fit into the L1 cache. Given a
sizeof(value_type) of 16 bytes and an L1 data cache of 48 KB, this can
hold approximately 3,072 elements. Since the default load factor of
ska::flat_hash_mapis usually less than 0.5, this range in the figure lies between 32 and 1,024. - Elements that fully fit into the L2 cache. Given a
sizeof(value_type) of 16 bytes and an L2 cache of 512 KB, this can hold
approximately 32,768 elements. For
ska::flat_hash_mapin this test, the range is 1,500 to 8,192. - Elements that fully fit into the L3 cache. Given a
sizeof(value_type) of 16 bytes and an L3 cache of 16 MB, this can hold
approximately 1,048,576 elements. For
ska::flat_hash_mapin this test, the range is 12,000 to 400,000. - The case where reading from RAM becomes inevitable. This typically occurs when the L3 cache can't hold all the elements, generally when there are more than 1,048,576 elements.
In addition to cache misses, TLB misses also have a significant influence on hash table lookup speed. Rocket Lake has 64 L1 DTLB (in 4KB mode, 32 in 2MB mode) entries and 1536 STLB entries. This means if a 4KB page size is used, considerable L1 TLB misses will occur when the number of elements exceeds 16,384, and L2 TLB misses when the number exceeds 393,216. Using huge pages can greatly reduce TLB misses. On macOS with M1 Max, a 16 KB page size is used by default, resulting in much smaller penalties due to TLB misses compared to the default 4 KB pages on the Rocket Lake platform. In the P50 curve this TLB-miss penalty adds an extra rise once the working set pushes the page table beyond the reach of the L2 TLB, on top of the cache-tier steps described above.
The M1 Max can utilize 128KB L1 data cache, 12MB L2 cache, and 48MB SLC cache per thread, resulting in a higher cache hit rate and therefore superior performance for most data sizes in the hash table test.
Lookup keys in the table (hit)
Use default max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
Broadly speaking, the lookup time for a hash table can be split into four parts:
- Time required for the hash function to compute the hash value.
- Time needed to map the hash value to a specific memory address.
- Time taken to load content from the given memory address.
- Time spent comparing the loaded content with the target key value, with additional penalty time if the comparison is unsuccessful.
When the element count is low, the CPU cache can often hold all elements. Here, the third step is relatively fast, so the other three components become important. The first and second steps can trade work with each other. If the hash function (in the first step) does not distribute elements well across the hash space, the second step needs extra computation to distribute these non-uniform hash values into the underlying slot array. If the first step uses a high-quality hash function that uniformly distributes hash values, the second step typically only needs a quick bitwise AND instruction for truncation (if the number of slots is a power of 2).
This can be seen from the charts above. Both libc++ and libstdc++'s
implementations of std::hash use the identity hash for
uint64_t, making the first step extremely fast. If a bitwise AND
instruction is directly used to get the slot-array index in the second
step, there will be many collisions, leading to redundant comparisons in
the fourth step. Hash tables that use a simple method in the second
step, such as absl::flat_hash_map,
absl::node_hash_map, emhash::hash_map7 and
tsl::robin_map require high-quality hash functions, so
std::hash is not enough.
robin_hood::unordered_flat_map goes slightly further in the
second step but doesn't achieve optimal performance with
std::hash. libc++'s implementation of
std::unordered_map also shows poor performance when using
std::hash and the number of elements is a power of 2.
robin_hood::hash, though providing insufficient hash
quality for most hash tables demanding good hash functions, is good
enough for robin_hood::unordered_flat_map, as illustrated
in the range 100,000 to 3,100,000 data points.
In contrast, hash tables that do extra work in the second step can
still achieve good performance even with the simplest identity hash in
the first step, as seen in ska::flat_hash_map,
ska::bytell_hash_map, fph::DynamicFphMap,
fph::MetaFphMap and libstdc++'s
std::unordered_map. I recommend checking out the innovative
trick ska::flat_hash_map uses in the second step, called Fibonacci
Hashing: The Optimization that the World Forgot (or: a Better
Alternative to Integer Modulo). This method uses a 64-bit
multiplication and a right-shift instruction. Since there are no
arithmetic instructions involved in the hash function (for identity
hash), these are pretty much all the arithmetic instructions the CPU's
ALU has to handle.
On the other hand, a hash table requiring a "good" hash function and a bitwise AND instruction for the second step lacks a hash function that can be implemented with a cost equal to or less than these two instructions, and still achieve satisfactory hash quality for most data distributions, to the best of my knowledge.
For this reason, the combination of ska::flat_hash_map
and std::hash currently ranks as the fastest when all the
data fits into the L2 cache on Intel Rocket Lake (or L1 cache on M1
Max). Almost any other combination requires more instructions in the
first and second steps combined. Within this data range, the
tsl::robin_map with absl::hash is the second
fastest on Rocket Lake, with a very minor difference. On M1 Max, the
fph::DynamicFphMap with std::hash combination
is almost just as fast, differing by less than 0.1 nanoseconds per
operation.
However, when the L2 cache can't hold all elements but the L3 cache
can, ska::flat_hash_map no longer holds the top spot. On
Intel Rocket Lake, the combination of fph::DynamicFphMap
and std::hash proves to be the fastest within the range of
8,192 to 3,100,000 elements (though absl::flat_hash_map
with absl::hash is slightly faster at 400,000 elements). On
M1 Max, this combination also excels when the element count is within
8,192 to 1,200,000. fph::MetaFphMap is the runner-up in
this range. Considering that both ska::flat_hash_map and
tsl::robin_map use an aggressive expansion strategy leading
to a low load factor and earlier cache misses, and that
fph::DynamicFphMap uses perfect hashing (i.e., no
conflict), these results make sense.
When the data can't fit into the L3 cache,
ska::flat_hash_map with std::hash becomes the
fastest combination once more. The tsl::robin_map and
absl::hash combination follows closely, with the difference
of one or two instructions becoming almost negligible compared to the
cost of cache misses. These two are still the fastest even with a
massive number of elements, partially because many other hash tables use
auxiliary arrays to store additional information. For example,
absl::flat_hash_map employs a metadata array which, when
dealing with large data, can have a high and costly cache miss rate. In
contrast, ska::flat_hash_map and
tsl::robin_map simply use one slot array to store key-value
data, allowing one memory-to-cache line load to suffice.
Analyzing this chain of observations, we find that high-performance
hash tables tend to have few failed comparisons in the fourth lookup
step (either ska::flat_hash_map limits failures, or
fph::DynamicFphMap has no failures as a perfect hash
table).
Under this premise, when the data is small enough to be fully loaded
into the cache, the cost of loading from memory in the third step
becomes minimal. During this stage, a small number and simplicity of
instructions in the first and second steps are crucial for speed. A hash
table and hash function combination capable of this
(ska::flat_hash_map and std::hash) is the
fastest.
When the amount of data slightly increases, and collisions become
more frequent, hash tables that avoid collisions can gain an advantage,
such as fph::DynamicFphMap.
absl::flat_hash_map, which uses SIMD to resolve collisions
to a certain degree, can also hold an advantage with certain data
sizes.
When the amount of data continues to increase to the point where any
memory access results in a cache miss, the hash table with the least
memory accesses becomes the fastest. This requires fewer hash
collisions, and also highlights that any metadata will increase the
number of memory fetches at this stage. ska::flat_hash_map
and tsl::robin_map are the fastest choices at this
point.
In conclusion, a high-performance hash table and hash function
combination will depend on several factors. When data fits entirely
within cache, ska::flat_hash_map with
std::hash performs best, due to the minimal instructions
required in the first two steps. As data size increases, hash tables
that can effectively avoid collisions or utilize SIMD to resolve them
gain the advantage, such as fph::DynamicFphMap and
absl::flat_hash_map. Lastly, when data size surpasses cache
capacity, hash tables that minimize memory accesses, like
ska::flat_hash_map and tsl::robin_map, come
out on top.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
When the size of value_type expands to 64 bytes, cache
can accommodate fewer elements. Additionally, a cache line can only hold
a single element, making hash table collisions more costly. However,
modern CPUs' powerful prefetching capabilities typically keep these
costs low for hash tables using linear or quadratic probing.
Overall, the performance between hash tables remains mostly
consistent with the <uint64_t, uint64_t> case. The
combination of ska::flat_hash_map and
std::hash is fastest when data volume is low or high, while
the combination of fph::DynamicFphMap and
std::hash excels with medium-sized data.
Use large max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
Prior tests used default load factors. However, different hash tables often use different load factors, leading to performance differences due to distinct memory size requirements.
Hash tables respond differently to increased maximum load factors. While larger load factors reduce memory usage and potential cache miss rates, improving performance and minimizing footprint, they can also increase hash table collisions, slowing lookup speed.
Thus, if hash table performance is sensitive to load factor and
collision probability, larger load factors can degrade performance, as
seen with ska::flat_hash_map and
tsl::robin_map. However, for tables less affected by load
factor, such as fph::DynamicFphMap and
absl::flat_hash_map, performance remains stable or even
improves with larger load factors.
The comparison between larger and default
max_load_factor confirms these observations.
ska::flat_hash_map and std::hash remain the
fastest combination with few elements. As element count rises,
performance of ska::flat_hash_map and
tsl::robin_map falls while fph::DynamicFphMap
and fph::MetaFphMap performance generally increases. Other
hash tables exhibit little change. Thus, for larger data sizes,
fph::DynamicFphMap and std::hash offer the
fastest lookup, followed by combinations of fph::MetaFphMap
and std::hash, and absl::flat_hash_map with
absl::hash.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
When the value_type size is 64 bytes, increasing
max_load_factor gives results similar to
<uint64_t, uint64_t>. ska::flat_hash_map
with std::hash is fastest for fewer elements, while
fph::DynamicFphMap with std::hash takes the
lead with more elements.
Lookup keys not in the table (miss)
Use default max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
Finding nonexistent elements in a hash table differs from locating existing ones. Full comparisons confirm whether the target key equals a stored key, but different hash values are enough to prove that the keys are different. Therefore, hash tables that store hash values or partial hashes can speed up this task. In particular, when partial hash values (e.g., 1 byte) are used as metadata, they use less cache space than complete keys.
Although hash tables like absl::flat_hash_map,
r_h::unordered_flat_map, and fph::MetaFphMap
use partial hashes as metadata, they are not always fastest for small
lookups because the additional instruction cost can outweigh the cache
savings. However, the advantage of this approach appears once the L1
cache is insufficient.
On Intel Rocket Lake, fph::MetaFphMap with
std::hash is fastest when element count exceeds 6,000, and
stays ahead almost the whole way up; only at the very largest counts
(around 10 million) does absl::flat_hash_map with
absl::Hash edge ahead. On M1 Max,
ska::flat_hash_map is fastest for under 3,000 elements,
closely followed by fph::DynamicFphMap. For 6,000 to
150,000 elements, fph::DynamicFphMap leads, but
fph::MetaFphMap prevails above 200,000 elements. The M1
Max's larger cache capacity likely influences this variation, with
metadata-based methods gaining advantage at higher element counts.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
When using a 64-byte value_type, the overall situation
is similar to <uint64_t, uint64_t>. On M1 Max,
fph::MetaFphMap starts to dominate from 45,000 elements.
This change is caused by the fact that the number of elements the cache
can hold becomes smaller.
Use large max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
When a larger maximum load factor is set, the overall ranking is
consistent with the default maximum load factor. The range where
ska::flat_hash_map is ahead is reduced, because its
performance is sensitive to load factor. On Intel Rocket Lake,
ska::flat_hash_map with std::hash is the
fastest when the number of elements is not greater than 1024. With more
elements, fph::MetaFphMap with std::hash is
faster. On M1 Max, the range where ska::flat_hash_map is
ahead is also smaller.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
As above, the overall relative speed relationship is similar to that when using the default load factor.
Lookup keys with a 50% probability in the table
Use default max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
When the lookup key has a 50% probability of being in a hash table, the throughput decreases due to the increased lookup time caused by the branch prediction failure penalty. The lookup time gap between hash tables narrows. We have to show only the fastest hash function for each hash table by default to make the graph a bit clearer and easier to read.
From the above graphs, we can see that the performance of the hash
tables can be considered very close, except for
std::unordered_map.
When the number of elements is relatively small,
ska::flat_hash_map with std::hash is still the
fastest at most data points.
On Intel Rocket Lake, when the number of elements is between 25,000
and 2,200,000, fph::MetaFphMap and
fph::DynamicFphMap are the fastest when paired with
std::hash. absl::flat_hash_map and
ska::flat_hash_map are very close in performance on some
data points. When the number of elements is in the range of 2,200,000 to
6,000,000, absl::flat_hash_map with absl::hash
is the fastest pair. When the number of elements is not less than
6,000,000, ska::flat_hash_map with std::hash
returns to first place. The gap is very small, and the leading table
changes with the element count.
On the M1 Max, fph::MetaFphMap and
absl::flat_hash_map are the fastest hash tables when the
number of elements is greater than 32,768. The performance of
absl::flat_hash_map fluctuates with the number of elements
while fph::MetaFphMap is relatively stable. When the number
of elements is extremely large, ska::flat_hash_map returns
to first place.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
Use large max_load_factor
<K,V>: <uint64_t with several split bits masked, uint64_t>
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
Hit Appendix
Use default max_load_factor
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Use large max_load_factor
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Miss Appendix
Use default max_load_factor
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Use large max_load_factor
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
50% Hit Appendix
Use default max_load_factor
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
Use large max_load_factor
<K,V>: <uint64_t with high position bits masked, uint64_t>
<K,V>: <uint64_t with low position bits masked, uint64_t>
<K,V>: <uint64_t uniformly distributed, uint64_t>
- Link: https://renzibei.com/en/int-lookup-throughput/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.