Hash Table Benchmark - Integer Lookup Throughput

  1. Exploring the Connection between Lookup Performance and Memory Hierarchy
  2. Lookup keys in the table (hit)
    1. Use default max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
    2. Use large max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
  3. Lookup keys not in the table (miss)
    1. Use default max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
    2. Use large max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
  4. Lookup keys with a 50% probability in the table
    1. Use default max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
    2. Use large max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
  5. Hit Appendix
    1. Use default max_load_factor
      1. <K,V>: <uint64_t with high position bits masked, uint64_t>
      2. <K,V>: <uint64_t with low position bits masked, uint64_t>
      3. <K,V>: <uint64_t uniformly distributed, uint64_t>
    2. Use large max_load_factor
      1. <K,V>: <uint64_t with high position bits masked, uint64_t>
      2. <K,V>: <uint64_t with low position bits masked, uint64_t>
      3. <K,V>: <uint64_t uniformly distributed, uint64_t>
  6. Miss Appendix
    1. Use default max_load_factor
      1. <K,V>: <uint64_t with high position bits masked, uint64_t>
      2. <K,V>: <uint64_t with low position bits masked, uint64_t>
      3. <K,V>: <uint64_t uniformly distributed, uint64_t>
    2. Use large max_load_factor
      1. <K,V>: <uint64_t with high position bits masked, uint64_t>
      2. <K,V>: <uint64_t with low position bits masked, uint64_t>
      3. <K,V>: <uint64_t uniformly distributed, uint64_t>
  7. 50% Hit Appendix
    1. Use default max_load_factor
      1. <K,V>: <uint64_t with high position bits masked, uint64_t>
      2. <K,V>: <uint64_t with low position bits masked, uint64_t>
      3. <K,V>: <uint64_t uniformly distributed, uint64_t>
    2. Use large max_load_factor
      1. <K,V>: <uint64_t with high position bits masked, uint64_t>
      2. <K,V>: <uint64_t with low position bits masked, uint64_t>
      3. <K,V>: <uint64_t uniformly distributed, uint64_t>

The integer key lookup throughput test.

Click the labels on the legend to view or hide data lines corresponding to specific hash tables and hash functions.

In this test, we measure the lookup performance of hash tables in three kinds of situations:

  1. Look up the keys in the hash table (hit or successful find).
  2. Look up the keys not in the hash table (miss or unsuccessful find).
  3. Look up keys with a 50% probability of being in the hash table.

Exploring the Connection between Lookup Performance and Memory Hierarchy

Before looking at the details of lookup speed, it is important to note the strong correlation between hash table lookup speed and the cache hit rate. With integer key-value pairs, the hash table lookup operation mainly involves loading content from memory, which consumes most of the operation time.

Modern computer systems use a hierarchical memory design. For instance, Intel Rocket Lake has registers, L1 cache, L2 cache, L3 cache, and DRAM, with speed decreasing in that order. The M1 Max has L1 cache, L2 cache, SLC cache, and DRAM.

As the cache miss rate at a given level rises, the overall lookup time becomes limited by the memory hardware speed of the next, slower level. Besides cache misses, TLB misses also contribute to the time penalty. The chart below helps illustrate this concept. It shows the P50 (median) hit-lookup latency on the Xeon E-2388G. The structure is easiest to read by isolating a single open-addressing table such as ska::flat_hash_map with std::hash (click the other legend entries to hide them). For such a table the median lookup succeeds after a single key comparison, so the P50 value mostly reflects the latency of one memory load and tracks the memory hierarchy closely.

The figure above shows that on the Intel Rocket Lake architecture, lookup performance separates into four tiers, each determined by the number of elements:

  1. Elements that fully fit into the L1 cache. Given a sizeof(value_type) of 16 bytes and an L1 data cache of 48 KB, this can hold approximately 3,072 elements. Since the default load factor of ska::flat_hash_map is usually less than 0.5, this range in the figure lies between 32 and 1,024.
  2. Elements that fully fit into the L2 cache. Given a sizeof(value_type) of 16 bytes and an L2 cache of 512 KB, this can hold approximately 32,768 elements. For ska::flat_hash_map in this test, the range is 1,500 to 8,192.
  3. Elements that fully fit into the L3 cache. Given a sizeof(value_type) of 16 bytes and an L3 cache of 16 MB, this can hold approximately 1,048,576 elements. For ska::flat_hash_map in this test, the range is 12,000 to 400,000.
  4. The case where reading from RAM becomes inevitable. This typically occurs when the L3 cache can't hold all the elements, generally when there are more than 1,048,576 elements.

In addition to cache misses, TLB misses also have a significant influence on hash table lookup speed. Rocket Lake has 64 L1 DTLB (in 4KB mode, 32 in 2MB mode) entries and 1536 STLB entries. This means if a 4KB page size is used, considerable L1 TLB misses will occur when the number of elements exceeds 16,384, and L2 TLB misses when the number exceeds 393,216. Using huge pages can greatly reduce TLB misses. On macOS with M1 Max, a 16 KB page size is used by default, resulting in much smaller penalties due to TLB misses compared to the default 4 KB pages on the Rocket Lake platform. In the P50 curve this TLB-miss penalty adds an extra rise once the working set pushes the page table beyond the reach of the L2 TLB, on top of the cache-tier steps described above.

The M1 Max can utilize 128KB L1 data cache, 12MB L2 cache, and 48MB SLC cache per thread, resulting in a higher cache hit rate and therefore superior performance for most data sizes in the hash table test.

Lookup keys in the table (hit)

Use default max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

Broadly speaking, the lookup time for a hash table can be split into four parts:

  1. Time required for the hash function to compute the hash value.
  2. Time needed to map the hash value to a specific memory address.
  3. Time taken to load content from the given memory address.
  4. Time spent comparing the loaded content with the target key value, with additional penalty time if the comparison is unsuccessful.

When the element count is low, the CPU cache can often hold all elements. Here, the third step is relatively fast, so the other three components become important. The first and second steps can trade work with each other. If the hash function (in the first step) does not distribute elements well across the hash space, the second step needs extra computation to distribute these non-uniform hash values into the underlying slot array. If the first step uses a high-quality hash function that uniformly distributes hash values, the second step typically only needs a quick bitwise AND instruction for truncation (if the number of slots is a power of 2).

This can be seen from the charts above. Both libc++ and libstdc++'s implementations of std::hash use the identity hash for uint64_t, making the first step extremely fast. If a bitwise AND instruction is directly used to get the slot-array index in the second step, there will be many collisions, leading to redundant comparisons in the fourth step. Hash tables that use a simple method in the second step, such as absl::flat_hash_map, absl::node_hash_map, emhash::hash_map7 and tsl::robin_map require high-quality hash functions, so std::hash is not enough. robin_hood::unordered_flat_map goes slightly further in the second step but doesn't achieve optimal performance with std::hash. libc++'s implementation of std::unordered_map also shows poor performance when using std::hash and the number of elements is a power of 2. robin_hood::hash, though providing insufficient hash quality for most hash tables demanding good hash functions, is good enough for robin_hood::unordered_flat_map, as illustrated in the range 100,000 to 3,100,000 data points.

In contrast, hash tables that do extra work in the second step can still achieve good performance even with the simplest identity hash in the first step, as seen in ska::flat_hash_map, ska::bytell_hash_map, fph::DynamicFphMap, fph::MetaFphMap and libstdc++'s std::unordered_map. I recommend checking out the innovative trick ska::flat_hash_map uses in the second step, called Fibonacci Hashing: The Optimization that the World Forgot (or: a Better Alternative to Integer Modulo). This method uses a 64-bit multiplication and a right-shift instruction. Since there are no arithmetic instructions involved in the hash function (for identity hash), these are pretty much all the arithmetic instructions the CPU's ALU has to handle.

On the other hand, a hash table requiring a "good" hash function and a bitwise AND instruction for the second step lacks a hash function that can be implemented with a cost equal to or less than these two instructions, and still achieve satisfactory hash quality for most data distributions, to the best of my knowledge.

For this reason, the combination of ska::flat_hash_map and std::hash currently ranks as the fastest when all the data fits into the L2 cache on Intel Rocket Lake (or L1 cache on M1 Max). Almost any other combination requires more instructions in the first and second steps combined. Within this data range, the tsl::robin_map with absl::hash is the second fastest on Rocket Lake, with a very minor difference. On M1 Max, the fph::DynamicFphMap with std::hash combination is almost just as fast, differing by less than 0.1 nanoseconds per operation.

However, when the L2 cache can't hold all elements but the L3 cache can, ska::flat_hash_map no longer holds the top spot. On Intel Rocket Lake, the combination of fph::DynamicFphMap and std::hash proves to be the fastest within the range of 8,192 to 3,100,000 elements (though absl::flat_hash_map with absl::hash is slightly faster at 400,000 elements). On M1 Max, this combination also excels when the element count is within 8,192 to 1,200,000. fph::MetaFphMap is the runner-up in this range. Considering that both ska::flat_hash_map and tsl::robin_map use an aggressive expansion strategy leading to a low load factor and earlier cache misses, and that fph::DynamicFphMap uses perfect hashing (i.e., no conflict), these results make sense.

When the data can't fit into the L3 cache, ska::flat_hash_map with std::hash becomes the fastest combination once more. The tsl::robin_map and absl::hash combination follows closely, with the difference of one or two instructions becoming almost negligible compared to the cost of cache misses. These two are still the fastest even with a massive number of elements, partially because many other hash tables use auxiliary arrays to store additional information. For example, absl::flat_hash_map employs a metadata array which, when dealing with large data, can have a high and costly cache miss rate. In contrast, ska::flat_hash_map and tsl::robin_map simply use one slot array to store key-value data, allowing one memory-to-cache line load to suffice.

Analyzing this chain of observations, we find that high-performance hash tables tend to have few failed comparisons in the fourth lookup step (either ska::flat_hash_map limits failures, or fph::DynamicFphMap has no failures as a perfect hash table).

Under this premise, when the data is small enough to be fully loaded into the cache, the cost of loading from memory in the third step becomes minimal. During this stage, a small number and simplicity of instructions in the first and second steps are crucial for speed. A hash table and hash function combination capable of this (ska::flat_hash_map and std::hash) is the fastest.

When the amount of data slightly increases, and collisions become more frequent, hash tables that avoid collisions can gain an advantage, such as fph::DynamicFphMap. absl::flat_hash_map, which uses SIMD to resolve collisions to a certain degree, can also hold an advantage with certain data sizes.

When the amount of data continues to increase to the point where any memory access results in a cache miss, the hash table with the least memory accesses becomes the fastest. This requires fewer hash collisions, and also highlights that any metadata will increase the number of memory fetches at this stage. ska::flat_hash_map and tsl::robin_map are the fastest choices at this point.

In conclusion, a high-performance hash table and hash function combination will depend on several factors. When data fits entirely within cache, ska::flat_hash_map with std::hash performs best, due to the minimal instructions required in the first two steps. As data size increases, hash tables that can effectively avoid collisions or utilize SIMD to resolve them gain the advantage, such as fph::DynamicFphMap and absl::flat_hash_map. Lastly, when data size surpasses cache capacity, hash tables that minimize memory accesses, like ska::flat_hash_map and tsl::robin_map, come out on top.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

When the size of value_type expands to 64 bytes, cache can accommodate fewer elements. Additionally, a cache line can only hold a single element, making hash table collisions more costly. However, modern CPUs' powerful prefetching capabilities typically keep these costs low for hash tables using linear or quadratic probing.

Overall, the performance between hash tables remains mostly consistent with the <uint64_t, uint64_t> case. The combination of ska::flat_hash_map and std::hash is fastest when data volume is low or high, while the combination of fph::DynamicFphMap and std::hash excels with medium-sized data.

Use large max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

Prior tests used default load factors. However, different hash tables often use different load factors, leading to performance differences due to distinct memory size requirements.

Hash tables respond differently to increased maximum load factors. While larger load factors reduce memory usage and potential cache miss rates, improving performance and minimizing footprint, they can also increase hash table collisions, slowing lookup speed.

Thus, if hash table performance is sensitive to load factor and collision probability, larger load factors can degrade performance, as seen with ska::flat_hash_map and tsl::robin_map. However, for tables less affected by load factor, such as fph::DynamicFphMap and absl::flat_hash_map, performance remains stable or even improves with larger load factors.

The comparison between larger and default max_load_factor confirms these observations.

ska::flat_hash_map and std::hash remain the fastest combination with few elements. As element count rises, performance of ska::flat_hash_map and tsl::robin_map falls while fph::DynamicFphMap and fph::MetaFphMap performance generally increases. Other hash tables exhibit little change. Thus, for larger data sizes, fph::DynamicFphMap and std::hash offer the fastest lookup, followed by combinations of fph::MetaFphMap and std::hash, and absl::flat_hash_map with absl::hash.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

When the value_type size is 64 bytes, increasing max_load_factor gives results similar to <uint64_t, uint64_t>. ska::flat_hash_map with std::hash is fastest for fewer elements, while fph::DynamicFphMap with std::hash takes the lead with more elements.

Lookup keys not in the table (miss)

Use default max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

Finding nonexistent elements in a hash table differs from locating existing ones. Full comparisons confirm whether the target key equals a stored key, but different hash values are enough to prove that the keys are different. Therefore, hash tables that store hash values or partial hashes can speed up this task. In particular, when partial hash values (e.g., 1 byte) are used as metadata, they use less cache space than complete keys.

Although hash tables like absl::flat_hash_map, r_h::unordered_flat_map, and fph::MetaFphMap use partial hashes as metadata, they are not always fastest for small lookups because the additional instruction cost can outweigh the cache savings. However, the advantage of this approach appears once the L1 cache is insufficient.

On Intel Rocket Lake, fph::MetaFphMap with std::hash is fastest when element count exceeds 6,000, and stays ahead almost the whole way up; only at the very largest counts (around 10 million) does absl::flat_hash_map with absl::Hash edge ahead. On M1 Max, ska::flat_hash_map is fastest for under 3,000 elements, closely followed by fph::DynamicFphMap. For 6,000 to 150,000 elements, fph::DynamicFphMap leads, but fph::MetaFphMap prevails above 200,000 elements. The M1 Max's larger cache capacity likely influences this variation, with metadata-based methods gaining advantage at higher element counts.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

When using a 64-byte value_type, the overall situation is similar to <uint64_t, uint64_t>. On M1 Max, fph::MetaFphMap starts to dominate from 45,000 elements. This change is caused by the fact that the number of elements the cache can hold becomes smaller.

Use large max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

When a larger maximum load factor is set, the overall ranking is consistent with the default maximum load factor. The range where ska::flat_hash_map is ahead is reduced, because its performance is sensitive to load factor. On Intel Rocket Lake, ska::flat_hash_map with std::hash is the fastest when the number of elements is not greater than 1024. With more elements, fph::MetaFphMap with std::hash is faster. On M1 Max, the range where ska::flat_hash_map is ahead is also smaller.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

As above, the overall relative speed relationship is similar to that when using the default load factor.

Lookup keys with a 50% probability in the table

Use default max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

When the lookup key has a 50% probability of being in a hash table, the throughput decreases due to the increased lookup time caused by the branch prediction failure penalty. The lookup time gap between hash tables narrows. We have to show only the fastest hash function for each hash table by default to make the graph a bit clearer and easier to read.

From the above graphs, we can see that the performance of the hash tables can be considered very close, except for std::unordered_map.

When the number of elements is relatively small, ska::flat_hash_map with std::hash is still the fastest at most data points.

On Intel Rocket Lake, when the number of elements is between 25,000 and 2,200,000, fph::MetaFphMap and fph::DynamicFphMap are the fastest when paired with std::hash. absl::flat_hash_map and ska::flat_hash_map are very close in performance on some data points. When the number of elements is in the range of 2,200,000 to 6,000,000, absl::flat_hash_map with absl::hash is the fastest pair. When the number of elements is not less than 6,000,000, ska::flat_hash_map with std::hash returns to first place. The gap is very small, and the leading table changes with the element count.

On the M1 Max, fph::MetaFphMap and absl::flat_hash_map are the fastest hash tables when the number of elements is greater than 32,768. The performance of absl::flat_hash_map fluctuates with the number of elements while fph::MetaFphMap is relatively stable. When the number of elements is extremely large, ska::flat_hash_map returns to first place.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

Use large max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

Hit Appendix

Use default max_load_factor

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Use large max_load_factor

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Miss Appendix

Use default max_load_factor

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Use large max_load_factor

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

50% Hit Appendix

Use default max_load_factor

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Use large max_load_factor

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>


← Back to Hash Table Benchmark index