Hash Table Benchmark
This is yet another benchmark to compare different hash tables (hash maps) with different hash functions in C++, attempting to evaluate the performance of the lookup, insertion, deletion, iteration, etc. on different datasets as comprehensively as possible.
We show the performance of hash tables with hash functions for different operations on different types of datasets of different sizes. The reader can refer to these results and choose the hash table and hash function that best match the target application.
The benchmarks were collected in 2022–2023 (machine configurations are listed below); this writeup was compiled and published in 2026.
Before viewing the benchmark results
Anyone familiar enough with hash tables knows that even a well-known and widely used hash table can have a data distribution it is not very good at. In other words, no hash table is the fastest on all datasets for all operations.
The best practice for selecting a hash table is to consider the data characteristics, operation mix, requirements, and hash function together.
This benchmark tries to use a concise and effective method to test the performance of different operations of the hash tables on some of the most common data distributions. But there will always be data distributions that are very different from the data we use for testing, and different users have different requirements for different indicators. Therefore, the best test is still one in the real application.
Methodology
Test Items
We measure combinations of different hash tables with different hash functions. For each combination, we measured its insert, delete, lookup (including successful and failed lookups), and iteration performance under different data. Below is a more detailed table of test items. Please note that in the following we will use "hash table" or "hash map" interchangeably to refer to the same concept.
| Index | Test items | Notes |
|---|---|---|
| 1 | Insert with reserve | Call map.reserve(n) before insert n elements |
| 2 | Insert without reserve | Insert n elements without prior reserve |
| 3 | Erase and insert | Repeatedly do one erase after one insert, keep the map size constant |
| 4 | Look up keys in the map (hit) | Repeatedly look up the elements that are in the map |
| 5 | Look up keys that are not in the map (miss) | Repeatedly look up the elements that are not in the map |
| 6 | Look up keys with 50% probability in the map | Repeatedly look up the elements that have a 50% probability in the map |
| 7 | Look up keys in the map with large max_load_factor (hit) | Same as Test Item 4 except that the map is set a max_load_factor of 0.9 and rehashed before the lookup operations |
| 8 | Look up keys that are not in the map with large max_load_factor (miss) | Same as Test Item 5 except that the map is set a max_load_factor of 0.9 and rehashed before the lookup operations |
| 9 | Look up keys with 50% probability in the map with large max_load_factor | Same as Test Item 6 except that the map is set a max_load_factor of 0.9 and rehashed before the lookup operations |
| 10 | Iterate the table | Iterate the whole table several times |
| 11 | Heap memory size and load factor with default and large max_load_factor | Record the heap memory size and load factor when constructing the map in Test Items 4 and 7 |
As may be noticed, several test items measure lookup speed with a
larger upper limit on the load factor. The load factor measures how full
the hash table is, and max_load_factor is the STL API for
controlling its upper bound. This is because each hash table may have a
different expansion strategy and max_load_factor, so even
with the same number of elements, different tables may choose different
load factors and occupy different amounts of memory. The load factor and
memory footprint greatly affect lookup performance, so using a hash
table with a smaller max_load_factor may have worse (or
better) lookup performance. On the other hand, a higher load factor may
lead to a higher probability of collision, thus reducing lookup
performance.
In addition, extreme lookup performance usually requires making the
space used by the hash table as small as possible to reduce the cache
miss rate. When available memory is very limited, a larger load factor
may also be preferred. One way to do this is to set a higher
max_load_factor, and then rehash (or set a large
max_load_factor before the main construction process of the
table).
For each of the tests above, we tested the throughput and latency (when the platform under test meets the conditions for the latency test). The throughput results will be more representative, because modern software runs on CPUs with pipelined architectures. And almost all operations will have other instructions before and after them, which can make full use of the pipeline. However, for some specific uses, latency data is important. The latency measurement results here are for reference only for special needs and have relatively large limitations.
Dataset
All the data used in the benchmark are randomly generated; the user can choose different seeds for the test data. We tested the performance of each hash table at different sizes from 32 to 10^7.
The tested keys consist of 64-bit integers of different distributions and strings of different lengths. The detailed test data is shown in the table below.
| Index | Key Type | Value Type | Notes |
|---|---|---|---|
| 1 | uint64_t with several split bits masked | uint64_t | The keys have such characteristics: only bits in some positions may
be 1, and all other bits are 0. For test data of size n, at most
ceil[log2(n)] fixed bits may be 1. e.g. If the key type is uint8_t (it
is uint64_t in reality) and the test size is 7, the keys will be
generated with the method rng() & 0b10010001. The
distribution characteristics of such bits can relatively comprehensively
examine whether hash tables and hash functions can handle keys that only
have effective information in specific bit positions. |
| 2 | uint64_t, uniformly distributed in [0, UINT64_MAX] | uint64_t | The keys follow a uniform distribution in the range [0, UINT64_MAX]. |
| 3 | uint64_t, bits in high position are masked out | uint64_t | The bits in the high position are set to 0. For test data of size n,
at most ceil[log2(n)] fixed bits may be 1. For example, if the key type
is uint8_t (uint64_t in reality) and the test size is 7, the keys will
be generated with the method rng() & 0b00000111 |
| 4 | uint64_t, bits in low position are masked out | uint64_t | The bits in the low position are set to 0. For test data of size n,
at most ceil[log2(n)] fixed bits may be 1. For example, if the key type
is uint8_t (uint64_t in reality) and the test size is 7, the keys will
be generated with the method rng() & 0b11100000 |
| 5 | uint64_t with several bits masked | 56 bytes struct | The keys are the same as the distribution of the data 1. The payload
is a 56 bytes long struct, which makes the
sizeof(std::pair<key, value>)==64 |
| 6 | Small string with a max length of 12 | uint64_t | The key type is a string with a maximum length of 12. Both length and characters are randomly generated. The compiler/library may use Small String Optimization (SSO). |
| 7 | Small string with a fixed length of 12 | uint64_t | The key type is a string with a fixed length of 12. The characters are randomly generated. The compiler/library may use Small String Optimization (SSO). |
| 8 | Mid string with a max length of 24 | uint64_t | The key type is a string with a maximum length of 24. Both length and characters are randomly generated. |
| 9 | Mid string with a fixed length of 24 | uint64_t | The key type is a string with a fixed length of 24. The characters are randomly generated. |
| 10 | Large string with a max length of 64 | uint64_t | The key type is a string with a maximum length of 64. Both length and characters are randomly generated. |
| 11 | Large string with a fixed length of 64 | uint64_t | The key type is a string with a fixed length of 64. The characters are randomly generated. |
Different distributions within the range representable by uint64_t are chosen as keys. Uniformly distributed integers in the range of uint64_t are the easiest to generate with pseudo-random numbers, but they are rare in real situations.
If users are concerned with performance using integer keys, we strongly recommend focusing on the results on the first dataset rather than the second dataset (i.e. the dataset with a uniform random distribution). The data from the first dataset can better examine the ability of hash tables and hash functions to deal with more diverse patterns, while the test on uniform random distributions barely verifies the ability of hash tables to handle other distributions. Moreover, in real data distributions, few keys happen to be uniformly randomly distributed over the [0, 2^63 - 1] range.
With this in mind, our analysis for integer keys focuses mainly on the first dataset. To keep the articles shorter and easier to read, the other integer datasets — including the second (uniform random) one and the high-/low-bit-masked ones — are mostly shown only in the appendix of each test, without detailed discussion.
For the string datasets, different character sets are used. For fixed-length strings, the pattern is like the first dataset, where several split bits are masked. In other words, only the bits in some positions may differ among these datasets. This pattern is intended to test the quality of the hash function.
For the strings with variable length, a subset of the printable characters can appear in the string.
Real data distributions are often biased. If a combination of hash function and hash table can only handle one distribution but cannot handle other distributions, this combination is not robust to unknown distributions. If the distribution of the data is known in advance, the user can pick the fastest and most stable hash table for that data.
Tested Hash functions and Hash maps
Below is the list of hash functions we tested.
| Name | Type | Notes | Link |
|---|---|---|---|
| std::hash | Normal | Implemented by compiler; identity hash is used for integer type in libc++ and libstdc++ | |
| absl::hash | Normal | Implemented by Google; Uses 128-bit product of multiplication and an xor-shift. | https://github.com/abseil/abseil-cpp |
| robin_hood::hash | Normal | For integer keys, it uses xor-shift, multiplication, xor-shift; For string keys it is similar to absl::hash | https://github.com/martinus/robin-hood-hashing |
| xxHash_xxh3 | Bytes | Designed for string; We use identity hash for integer type to pass compilation; It won't show in the results of integer keys | https://github.com/Cyan4973/xxHash |
Originally we had some seed hash functions in the tests, which are hash functions that take both a key and a seed as arguments. We removed these hash functions to keep the test subjects simple, and we use the no-seed version of all the hash tables.
We will not show the results of hash xxHash_xxh3 in
tests on integer keys. For the early versions of
absl::Hash, the behavior on the arm64 platform was
different from that on the x86-64 platform, and it was poor for some
datasets. So we once had a uint128_mul::hash to compare
with it, which is similar to the absl::Hash on the x86-64
platform. Since the newest version of absl::Hash has fixed
this problem, we deleted the uint128_mul::hash.
The following table lists the hash tables we tested. Some of these hash tables rely on a "good" hash function to work properly, which can generate hash values that are as uniformly distributed as possible for unbalanced keys. If a hash function that does not have such a property (e.g. identity hash) is used, then the performance of these hash tables may drop drastically. These hash tables may assume that the hash values of the keys from the dataset are uniformly distributed in the output range. This requires hash functions to have properties like uniformity or diffusion.
The implication here is that a "good" hash function tends to be more complex than the simplest hash function (the identity hash), requiring more instructions to complete the computation. Some hash tables do not rely on a good hash function, perhaps because they do some extra work to improve the uniformity of the hash values. For such a hash table, the simpler the hash function, the better, preferably the identity hash. So we should always compare the combinations of hash tables and hash functions, rather than fixing the hash function to compare the hash table, or vice versa.
Here are the hash maps we tested.
| Name | Requires a good hash function | Notes | Link |
|---|---|---|---|
| std::unordered_map | No* | Implemented by the STL library; May differ in libc++ and libstdc++. | |
| ska::flat_hash_map | No | Very fast and simple; Uses robin hood hash; Memory overhead: alignof(value_type) per element; Requires small load factor | https://github.com/skarupke/flat_hash_map |
| ska::bytell_hash_map | No | A little slower than ska::flat_hash_map but one byte per element memory overhead | https://github.com/skarupke/flat_hash_map |
| absl::flat_hash_map | Yes | Uses SIMD and metadata; Fast when looking up keys that are not in the map; One byte per element memory overhead | https://abseil.io/about/design/swisstables |
| absl::node_hash_map | Yes | Slower than absl::flat_hash_map but does not invalidate the pointer after rehash | https://github.com/abseil/abseil-cpp |
| tsl::robin_map | Yes | A fast hash table using robin hood hash; Memory overhead is no less than ska::flat_hash_map | https://github.com/Tessil/robin-map |
| emhash::HashMap7 | Yes | Fast in lookup hit operations. | https://github.com/ktprime/emhash |
| fph::DynamicFphMap | No | A dynamic perfect hash table; Ultra-fast in lookup but slow in insert; 2~8 bits per element memory overhead | https://github.com/renzibei/fph-table |
| fph::MetaFphMap | No | A dynamic perfect hash table using metadata; Better than fph::DynamicFphMap in the miss lookup case. | https://github.com/renzibei/fph-table |
| robin_hood::unordered_flat_map | Yes | A table using robin hood hash; | https://github.com/martinus/robin-hood-hashing |
| ankerl::unordered_dense_map | Yes | Stores entries in a dense array; fastest to iterate; compact footprint | https://github.com/martinus/unordered_dense |
* Note: For the tested libc++ and libstdc++ versions, the libc++
implementation requires a good hash function but libstdc++ has no such
requirement. If using std::hash, the performance can be
poor when the size is the power of 2 for libc++.
At a quick glance, it is easy to see that many of the hash tables listed use the robin hood hashing technique in the pursuit of speed.
Experiments and Results
The code of this benchmark is available at https://github.com/renzibei/hashtable-bench.
Testing Platform
Platform 1: Intel Xeon E-2388G CPU @ 3.20 GHz, boost to 5.1 GHz; x86-64; Rocket Lake.
Platform 2: M1 Max Macbook Pro 16 inch, 2021; arm64; Firestorm.
Due to the lack of a high-precision time stamp counter on the arm64 (M1 Max) platform, we only measured the latency on the x86-64 platform (even the AMD CPU has some problems when measuring the latency using the TSC, so we only test latency on the Intel CPU). In addition, for the x86-64 platform, we have also taken some measures to ensure the stability of the measurement results, including the following:
- Use the
tasksetcommand to set CPU core affinity - Turn off hyperthreading
- Isolate the cores by adding
isolcpus=andrcu_nocbs=inGRUB_CMDLINE_LINUXin/etc/default/grub - Turn off some power-saving options, including disabling the
ondemandsystemd service, and settingidle=pollandintel_idle.max_cstate=0in the grub command line. - Turn off timer tick interrupts, recompile the kernel with
CONFIG_NO_HZ_FULL=yand setnohz_full=in the grub command line. - Other adjustments that de-jitter the system latency. You can refer to https://rigtorp.se/low-latency-guide/.
These measures cannot be done on macOS. But as we do not measure the latency of operations on macOS, it doesn't matter that much.
Results
For throughput data, performance will be represented by the average time per operation. We will plot the average time per operation for different scales of data. The shorter the time, the better the performance.
For the latency data, due to the limitation of the article length, we only show the latency of the 99th percentile in most test cases, which can help to show the worst time complexity and long tail latency of the hash table. And that's not even enough to reflect worst-case latency. For a distribution with long-tailed features, the 0.99th, 0.999th, and 0.9999th quantiles can all have very different values. If the application has strict requirements on real-time performance and tail latency (such as gaming and high-frequency trading), then this data metric should be worth paying attention to.
If too much time is spent in a test, we will count it as the timeout and set the time as zero, and that data point won't be plotted.
You can click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.
We divide the results into different groups according to data type and operation type.
- Integer Key
- String Key
- Memory Usage and Load Factor
- Analysis & Conclusion
Conclusion
In short, no single hash table is best for every workload — the right
pick depends on which operations dominate, what the keys look like, and
how much memory is available. If lookups dominate and the table is built
once, the perfect-hash fph::DynamicFphMap (and
fph::MetaFphMap when misses are common) is hard to beat, at
the cost of slow construction. For a general-purpose map with a mix of
inserts and lookups, absl::flat_hash_map with
absl::Hash is a fast, compact and robust default;
ska::flat_hash_map is the quickest while the data stays in
cache, and ankerl::unordered_dense_map is the one to reach
for when iteration is frequent. std::unordered_map is the
slowest in most tests and is worth keeping mainly for its
pointer-stability guarantees.
For the full reasoning behind these picks, with a per-test and per-workload comparison, see the Analysis & Conclusion.
Restrictions
Exclusive access to resources
In our tests, almost all computer resources can be monopolized by the test program, especially cache resources. And this is relatively rare in practical applications. In fact, other processes and tasks may occupy part of the cache. In practical applications users should expect a lower available cache size.
Cold memory and warm cache
We neither did a warmup, nor did we specifically test the cold start scenario. In our tests, we repeatedly test an operation with a range of data many times. Therefore, when the number of operations is much greater than the amount of data, it can be considered that most operations are accessing the warm cache. When the number of operations is less than the number of data, most operations are accessing cold memory. In our test, limited by the test time, when the amount of data is small, the number of operations will be much greater than the amount of data; and when the amount of data is large, the number of operations will be equal to the amount of data.
Huge test space
The size of the test space contains at least 1
|hash table set| x |hash function set| x |data sets| x |operation set| x |hardware platform set| x |compiler set|
Therefore, in order to choose the most suitable hash table and hash function for the user's purpose, real tests should be carried out in the application scenario.
Postscript
This hash table benchmark series took at least four years from start to finish. The first version of the data was already available in 2022, when the M1 Max was still a fairly new CPU; now even the M5 is out. It took this long because organizing so many charts and analyses was genuinely tedious. I finished the main body of the posts roughly between 2022 and 2023, but left a lot of cleanup work undone out of laziness. Because this kept hanging over me, the blog also fell into a kind of head-of-line blocking: I kept thinking, "I still haven't finished this series," and ended up not publishing other posts either.
Over these years, many things have changed. CPUs have gone through several generations, new hash functions and hash tables have appeared, and LLM technology has also advanced rapidly. Many things discussed in these posts may already be somewhat out of date. All I can say is: time passes like a river, never stopping day or night.
Whatever the quality of this benchmark series may be, I have decided to stop iterating on it and publish it as it is.
- Link: https://renzibei.com/en/hashtable-bench/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.