Hash Table Benchmark - Integer Insert and Construct

The integer key insert test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the time spent constructing the hash table. The construction is done with the emplace( const value_type& value ) operation. We test both with and without a reserve before inserting. The time taken by reserve is not counted in the total time.

We originally used the insert( const value_type &value) function to build the hash table, but we found that some hash tables do not use std::pair<const Key, T> as their value_type (the type used by the STL container). As a result, we cannot use the insert function uniformly across all the tables we test.

During the test, each hash table is constructed many times, and we record the total insertion time in the throughput test.

Throughput

The y axis value is the average time per operation. This result is obtained by time/op = sum{construct time} / (number of construct * number of elements).

<K,V>: <uint64_t with several split bits masked, uint64_t>

Insert with reserve

When analyzing insert performance after reserve, the first obvious thing is that some hash tables, such as emhash::hash_map7, tsl::robin_map, absl::node_hash_map and absl::flat_hash_map, are not suitable for use with std::hash. They need a better hash function. Of course, there are some hash tables that haven't exposed this problem in the insert test yet, and they will show this problem in the later lookup test.

When we remove these hash tables that do not use the appropriate hash function (the reader can click the labels in the legend to hide the data points of these hash tables), we can see that the slowest hash tables to insert are the perfect-hash fph::DynamicFphMap and fph::MetaFphMap, by a large margin. This is the honest tradeoff for their fast lookups: building a perfect hash is expensive, and the cost grows with the number of elements. On the Xeon E-2388G, the fph tables already take 37 to 51 ns per insert at 1024 elements (vs about 20 ns for std::unordered_map and 4 to 6 ns for the fastest flat tables), so even with a small table the gap is real. As the element count grows the gap widens further: at 10^7 elements the fph insert time is about 1450 to 1900 ns, roughly 12 to 15 times that of std::unordered_map (which sits near 125 ns) on the Intel CPU, and about 11 to 12 times on the Apple M1 Max (where the fph tables are near 1550 ns and std::unordered_map near 134 ns).

In addition, we unexpectedly found that the combination of std::unordered_map and std::hash is surprisingly slow for some data points on the M1 Max. Further observation shows that the number of elements of these data points is exactly a power of two: at 1024 elements this combination needs about 165 ns per insert, at 2048 about 311 ns, and at 8192 it explodes to roughly 2500 ns, while at the neighboring non-power-of-two sizes (e.g. 800, 1500, 3000, 6000) it stays around 20 to 32 ns. We infer this comes from the libc++ unordered_map implementation used by clang on the M1: it takes the hash value modulo the number of buckets to obtain a slot index. When the bucket count is exactly a power of two, that modulo degenerates into keeping only the low-order bits of the hash and discarding the high-order bits. Since std::hash for integers is the identity function, any entropy that lives in the high bits is thrown away, which produces heavy collisions at exactly the power-of-two sizes.

When looking for the fastest hash table for insert operations, the results are a little different on the x86-64 and arm64 architectures.

On Intel Rocket Lake, several flat tables are essentially tied at small sizes: emhash::hash_map7 with absl::Hash, ska::flat_hash_map with std::hash, and ankerl::unordered_dense_map all sit around 4 to 5 ns per insert when the element count is no larger than a few thousand. In the medium range, absl::flat_hash_map with absl::Hash is the most consistent winner: it stays near 6 to 12 ns from 6,000 up to about 800,000, where the other flat tables start to climb faster. At the largest sizes the picture shifts again, and in an irregular way. tsl::robin_map with robin_hood::hash is actually the fastest at 10^7 elements (about 25 ns), with ankerl::unordered_dense_map and ska::flat_hash_map close behind (about 27 to 28 ns). Yet the very same combination slumps in the 200,000 to 1,200,000 range (rising to about 30 to 42 ns) before recovering — a non-monotonic profile that suggests robin_hood::hash produces a weaker distribution for these masked-bit keys in that range rather than a clean cache-tier effect. robin_hood::unordered_flat_map, on the other hand, does not suffer from this hash-quality problem: it stays competitive with either absl::Hash or robin_hood::hash. This means that though it requires a good hash function, it does not rely on the hash quality as much as other hash tables like absl::flat_hash_map.

On Apple M1 Max, the story is a little different. When the element count is no larger than about 6,000, ska::flat_hash_map with std::hash is the fastest, dipping below 4 ns per insert. When the number of elements is larger than 6,000, absl::flat_hash_map with absl::Hash takes the lead, staying around 4.5 to 10 ns all the way to about 1,200,000 while the others trail. But when the number of elements approaches 10^7, the gap between these flat maps shrinks and they converge to roughly 20 to 22 ns. This is because the working set no longer fits in cache when there are many elements. This phenomenon can be observed more clearly in the <K,V>: <uint64_t with several split bits masked, 56 bytes struct> data because cache is under more pressure in that situation.

Insert without reserve

When there is no reserve operation before insert, the rankings of these hash tables change again.

On Intel Rocket Lake, absl::flat_hash_map with absl::Hash is almost always the fastest across all data scales (about 16 to 44 ns), with emhash::hash_map7 and ankerl::unordered_dense_map close behind. When the number of elements is small, tsl::robin_map and ska::flat_hash_map are also nearly as fast, but they fall back in the larger ranges because their aggressive growth strategy triggers more rehashing.

On Apple Silicon, ska::flat_hash_map with std::hash is the fastest at the smallest sizes (about 14 ns at 1024). When the number of elements grows beyond a few thousand, absl::flat_hash_map with absl::Hash again becomes the fastest, staying around 20 to 32 ns across the rest of the range. emhash::hash_map7 with absl::Hash comes a close second.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

Insert with reserve

The pattern of keys is the same as the <K,V>: <uint64_t with several split bits masked, uint64_t> . The difference is that the size of the pair is 4 times larger: 64 bytes per element instead of the original 16 bytes. Therefore, the working set runs out of cache earlier.

On Intel Rocket Lake, at small sizes ankerl::unordered_dense_map, ska::flat_hash_map with std::hash and absl::flat_hash_map with absl::Hash are all close (about 5 to 8 ns). In the medium range, absl::flat_hash_map with absl::Hash is the fastest, staying near 10 to 15 ns up to about 200,000 elements. But at the largest sizes absl::flat_hash_map falls behind: at 10^7 it costs about 65 ns while ska::flat_hash_map (about 43 ns) and tsl::robin_map with absl::Hash (about 40 ns) are clearly faster. This is because absl::flat_hash_map keeps its metadata and slots in two separate arrays, so once the working set no longer fits in cache it pays two memory loads per probe; the larger 64-byte element makes this penalty more visible.

On Apple Silicon, ska::flat_hash_map with std::hash is the fastest when the number of elements is small (no larger than about 6,000). For the medium-to-large range, absl::flat_hash_map with absl::Hash takes the lead, holding it up to about 1,200,000 elements. When the element count is higher than that, ska::flat_hash_map again pulls roughly even with or ahead of absl::flat_hash_map. This is because when the data reaches this scale, the M1 Max runs out of cache, and absl::flat_hash_map relies on cache to get decent performance. Its metadata and slots are in two different arrays, while ska::flat_hash_map has only one slot array. So even if there is a cache miss, ska::flat_hash_map only fetches data from RAM once.

Insert without reserve

When there is no prior reserve, the results are quite similar to those of the data <uint64_t, uint64_t>. This may be because most of the time is spent on allocation and deallocation of memory.

There is one difference: absl::node_hash_map gets a better ranking than in the <uint64_t, uint64_t> data. This shows that when the sizeof(value_type) is large, or when the construction of the value_type is time-consuming, absl::node_hash_map has some advantages because it does not move the value between slots.

Latency

<K,V>: <uint64_t with several split bits masked, uint64_t>

Insert with reserve latency

Insert without reserve latency

We found that even if there is no reserve before inserting, the P99 latency of the insert operation is basically at the same level as the P99 latency with reserve. But theoretically, without reserve, the worst-case time complexity of an insert operation is O(n), because the hash table needs to grow. If a reservation is made in advance, rehash can be avoided when inserting.

Therefore, we can additionally look at the P100 latency (aka max latency) of insert operations. The following two graphs show the insertion P100 latency when the reserve operation is performed in advance and the insertion P100 latency when the reserve operation is not performed.

It can be seen that only the fph family of hash tables and robin_hood::unordered_flat_map have large maximum insertion latency when a prior reserve is performed. The fph tables reach tens of millions of nanoseconds at 10^7 elements (their perfect-hash rebuild is the culprit), and robin_hood::unordered_flat_map climbs to over a million ns. From the experimental data, the worst-case time complexity of insertion in the other hash tables, after reserve, does not appear to be O(n). The P100 insertion latency of ska::flat_hash_map and ska::bytell_hash_map is only about 760 to 830 ns even when the number of elements is 10^7, which is a good result.

When there is no prior reserve, the P100 latency of the insert operation arguably reflects the worst time complexity of that operation on the hash table: O(n). The maximum insertion latency is proportional to the number of elements, reaching the order of 10^8 ns at 10^7 elements for the flat tables. Relatively speaking, absl::flat_hash_map and robin_hood::unordered_flat_map (together with ankerl::unordered_dense_map) have a smaller maximum latency when expanding.

If the user wants stable insert time, then reserve should be performed in advance, which can avoid rehash on most hash tables.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

Insert with reserve latency

Insert without reserve latency

When the size of value_type is 64 bytes, the rankings are very similar to <uint64_t, uint64_t>.

Throughput Appendix

<K,V>: <uint64_t with high position bits masked, uint64_t>

Insert with reserve

Insert without reserve

<K,V>: <uint64_t with low position bits masked, uint64_t>

Insert with reserve

Insert without reserve

<K,V>: <uint64_t uniformly distributed, uint64_t>

Insert with reserve

Insert without reserve

Latency Appendix

<K,V>: <uint64_t with high position bits masked, uint64_t>

Insert with reserve latency

Insert without reserve latency

<K,V>: <uint64_t with low position bits masked, uint64_t>

Insert with reserve latency

Insert without reserve latency

<K,V>: <uint64_t uniformly distributed, uint64_t>

Insert with reserve latency

Insert without reserve latency


← Back to Hash Table Benchmark index