Hash Table Benchmark - Integer Insert and Construct
The integer key insert test.
Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.
In this test, we measure the time spent constructing the hash table.
The construction is done with the
emplace( const value_type& value ) operation. We test
both with and without a reserve before inserting. The time taken by
reserve is not counted in the total time.
We originally used the
insert( const value_type &value) function to build the
hash table, but we found that some hash tables do not use
std::pair<const Key, T> as their
value_type (the type used by the STL container). As a
result, we cannot use the insert function uniformly across
all the tables we test.
During the test, each hash table is constructed many times, and we record the total insertion time in the throughput test.
Throughput
The y axis value is the average time per operation. This result is
obtained by
time/op = sum{construct time} / (number of construct * number of elements).
<K,V>: <uint64_t with several split bits masked, uint64_t>
Insert with reserve
When analyzing insert performance after reserve, the
first obvious thing is that some hash tables, such as
emhash::hash_map7, tsl::robin_map,
absl::node_hash_map and absl::flat_hash_map,
are not suitable for use with std::hash. They need a better
hash function. Of course, there are some hash tables that haven't
exposed this problem in the insert test yet, and they will show this
problem in the later lookup test.
When we remove these hash tables that do not use the appropriate hash
function (the reader can click the labels in the legend to hide the data
points of these hash tables), we can see that the slowest hash tables to
insert are the perfect-hash fph::DynamicFphMap and
fph::MetaFphMap, by a large margin. This is the honest
tradeoff for their fast lookups: building a perfect hash is expensive,
and the cost grows with the number of elements. On the Xeon E-2388G, the
fph tables already take 37 to 51 ns per insert at 1024 elements (vs
about 20 ns for std::unordered_map and 4 to 6 ns for the
fastest flat tables), so even with a small table the gap is real. As the
element count grows the gap widens further: at 10^7 elements the fph
insert time is about 1450 to 1900 ns, roughly 12 to 15 times that of
std::unordered_map (which sits near 125 ns) on the Intel
CPU, and about 11 to 12 times on the Apple M1 Max (where the fph tables
are near 1550 ns and std::unordered_map near 134 ns).
In addition, we unexpectedly found that the combination of
std::unordered_map and std::hash is
surprisingly slow for some data points on the M1 Max. Further
observation shows that the number of elements of these data points is
exactly a power of two: at 1024 elements this combination needs about
165 ns per insert, at 2048 about 311 ns, and at 8192 it explodes to
roughly 2500 ns, while at the neighboring non-power-of-two sizes (e.g.
800, 1500, 3000, 6000) it stays around 20 to 32 ns. We infer this comes
from the libc++ unordered_map implementation used by clang
on the M1: it takes the hash value modulo the number of buckets to
obtain a slot index. When the bucket count is exactly a power of two,
that modulo degenerates into keeping only the low-order bits of the hash
and discarding the high-order bits. Since std::hash for
integers is the identity function, any entropy that lives in the high
bits is thrown away, which produces heavy collisions at exactly the
power-of-two sizes.
When looking for the fastest hash table for insert operations, the results are a little different on the x86-64 and arm64 architectures.
On Intel Rocket Lake, several flat tables are essentially tied at
small sizes: emhash::hash_map7 with
absl::Hash, ska::flat_hash_map with
std::hash, and ankerl::unordered_dense_map all
sit around 4 to 5 ns per insert when the element count is no larger than
a few thousand. In the medium range, absl::flat_hash_map
with absl::Hash is the most consistent winner: it stays
near 6 to 12 ns from 6,000 up to about 800,000, where the other flat
tables start to climb faster. At the largest sizes the picture shifts
again, and in an irregular way. tsl::robin_map with
robin_hood::hash is actually the fastest at 10^7 elements
(about 25 ns), with ankerl::unordered_dense_map and
ska::flat_hash_map close behind (about 27 to 28 ns). Yet
the very same combination slumps in the 200,000 to 1,200,000 range
(rising to about 30 to 42 ns) before recovering — a non-monotonic
profile that suggests robin_hood::hash produces a weaker
distribution for these masked-bit keys in that range rather than a clean
cache-tier effect. robin_hood::unordered_flat_map, on the
other hand, does not suffer from this hash-quality problem: it stays
competitive with either absl::Hash or
robin_hood::hash. This means that though it requires a good
hash function, it does not rely on the hash quality as much as other
hash tables like absl::flat_hash_map.
On Apple M1 Max, the story is a little different. When the element
count is no larger than about 6,000, ska::flat_hash_map
with std::hash is the fastest, dipping below 4 ns per
insert. When the number of elements is larger than 6,000,
absl::flat_hash_map with absl::Hash takes the
lead, staying around 4.5 to 10 ns all the way to about 1,200,000 while
the others trail. But when the number of elements approaches 10^7, the
gap between these flat maps shrinks and they converge to roughly 20 to
22 ns. This is because the working set no longer fits in cache when
there are many elements. This phenomenon can be observed more clearly in
the <K,V>: <uint64_t with
several split bits masked, 56 bytes struct> data because cache is
under more pressure in that situation.
Insert without reserve
When there is no reserve operation before insert, the
rankings of these hash tables change again.
On Intel Rocket Lake, absl::flat_hash_map with
absl::Hash is almost always the fastest across all data
scales (about 16 to 44 ns), with emhash::hash_map7 and
ankerl::unordered_dense_map close behind. When the number
of elements is small, tsl::robin_map and
ska::flat_hash_map are also nearly as fast, but they fall
back in the larger ranges because their aggressive growth strategy
triggers more rehashing.
On Apple Silicon, ska::flat_hash_map with
std::hash is the fastest at the smallest sizes (about 14 ns
at 1024). When the number of elements grows beyond a few thousand,
absl::flat_hash_map with absl::Hash again
becomes the fastest, staying around 20 to 32 ns across the rest of the
range. emhash::hash_map7 with absl::Hash comes
a close second.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
Insert with reserve
The pattern of keys is the same as the <K,V>: <uint64_t with several split bits masked, uint64_t> . The difference is that the size of the pair is 4 times larger: 64 bytes per element instead of the original 16 bytes. Therefore, the working set runs out of cache earlier.
On Intel Rocket Lake, at small sizes
ankerl::unordered_dense_map,
ska::flat_hash_map with std::hash and
absl::flat_hash_map with absl::Hash are all
close (about 5 to 8 ns). In the medium range,
absl::flat_hash_map with absl::Hash is the
fastest, staying near 10 to 15 ns up to about 200,000 elements. But at
the largest sizes absl::flat_hash_map falls behind: at 10^7
it costs about 65 ns while ska::flat_hash_map (about 43 ns)
and tsl::robin_map with absl::Hash (about 40
ns) are clearly faster. This is because absl::flat_hash_map
keeps its metadata and slots in two separate arrays, so once the working
set no longer fits in cache it pays two memory loads per probe; the
larger 64-byte element makes this penalty more visible.
On Apple Silicon, ska::flat_hash_map with
std::hash is the fastest when the number of elements is
small (no larger than about 6,000). For the medium-to-large range,
absl::flat_hash_map with absl::Hash takes the
lead, holding it up to about 1,200,000 elements. When the element count
is higher than that, ska::flat_hash_map again pulls roughly
even with or ahead of absl::flat_hash_map. This is because
when the data reaches this scale, the M1 Max runs out of cache, and
absl::flat_hash_map relies on cache to get decent
performance. Its metadata and slots are in two different arrays, while
ska::flat_hash_map has only one slot array. So even if
there is a cache miss, ska::flat_hash_map only fetches data
from RAM once.
Insert without reserve
When there is no prior reserve, the results are quite
similar to those of the data <uint64_t, uint64_t>.
This may be because most of the time is spent on allocation and
deallocation of memory.
There is one difference: absl::node_hash_map gets a
better ranking than in the <uint64_t, uint64_t> data.
This shows that when the sizeof(value_type) is large, or
when the construction of the value_type is time-consuming,
absl::node_hash_map has some advantages because it does not
move the value between slots.
Latency
<K,V>: <uint64_t with several split bits masked, uint64_t>
Insert with reserve latency
Insert without reserve latency
We found that even if there is no reserve before inserting, the P99 latency of the insert operation is basically at the same level as the P99 latency with reserve. But theoretically, without reserve, the worst-case time complexity of an insert operation is O(n), because the hash table needs to grow. If a reservation is made in advance, rehash can be avoided when inserting.
Therefore, we can additionally look at the P100 latency (aka max latency) of insert operations. The following two graphs show the insertion P100 latency when the reserve operation is performed in advance and the insertion P100 latency when the reserve operation is not performed.
It can be seen that only the fph family of hash tables and
robin_hood::unordered_flat_map have large maximum insertion
latency when a prior reserve is performed. The fph tables reach tens of
millions of nanoseconds at 10^7 elements (their perfect-hash rebuild is
the culprit), and robin_hood::unordered_flat_map climbs to
over a million ns. From the experimental data, the worst-case time
complexity of insertion in the other hash tables, after reserve, does
not appear to be O(n). The P100 insertion latency of
ska::flat_hash_map and ska::bytell_hash_map is
only about 760 to 830 ns even when the number of elements is 10^7, which
is a good result.
When there is no prior reserve, the P100 latency of the insert
operation arguably reflects the worst time complexity of that operation
on the hash table: O(n). The maximum insertion latency is proportional
to the number of elements, reaching the order of 10^8 ns at 10^7
elements for the flat tables. Relatively speaking,
absl::flat_hash_map and
robin_hood::unordered_flat_map (together with
ankerl::unordered_dense_map) have a smaller maximum latency
when expanding.
If the user wants stable insert time, then reserve should be performed in advance, which can avoid rehash on most hash tables.
<K,V>: <uint64_t with several split bits masked, 56 bytes struct>
Insert with reserve latency
Insert without reserve latency
When the size of value_type is 64 bytes, the rankings
are very similar to <uint64_t, uint64_t>.
Throughput Appendix
<K,V>: <uint64_t with high position bits masked, uint64_t>
Insert with reserve
Insert without reserve
<K,V>: <uint64_t with low position bits masked, uint64_t>
Insert with reserve
Insert without reserve
<K,V>: <uint64_t uniformly distributed, uint64_t>
Insert with reserve
Insert without reserve
Latency Appendix
<K,V>: <uint64_t with high position bits masked, uint64_t>
Insert with reserve latency
Insert without reserve latency
<K,V>: <uint64_t with low position bits masked, uint64_t>
Insert with reserve latency
Insert without reserve latency
<K,V>: <uint64_t uniformly distributed, uint64_t>
Insert with reserve latency
Insert without reserve latency
- Link: https://renzibei.com/en/int-insert-construct/
-
Copyright Notice:
All articles on this website, unless otherwise stated, are created by the author and are licensed under
CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.