Hash Table Benchmark - Integer Erase and Insert

The integer key erase and insert test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we first construct a hash table with size N, and then repeat the following operations M times:

  1. Insert a new element into the hash table
  2. Randomly erase an element from the hash table.

It should be noted that the results of this test are also highly correlated with the distribution of the data, especially the relationship between deleted data and inserted data. Here we randomly select elements to delete with equal probability. In reality, elements may not be selected with equal probability; for example, the most likely deleted element may be the most recently inserted element.

Throughput

We record the time spent in the whole process, which includes both insert and erase operations.

The y axis value is the average time per operation. This result is obtained by time/op = (time for insert + time for erase) / (2 * M). This is the average time taken for insert and erase. This number reflects the efficiency of making modifications to the hash table.

<K,V>: <uint64_t with several split bits masked, uint64_t>

On Intel Rocket Lake, ska::flat_hash_map with std::hash is the fastest or near-fastest across almost the whole range (about 5 to 37 ns per operation). emhash::hash_map7 with absl::Hash and absl::flat_hash_map with absl::Hash are close behind in the medium range. tsl::robin_map is fast at small sizes and again at very large sizes (about 39 ns at 10^7), but it slumps in the medium range (rising to 60 to 90 ns near 400,000 to 800,000) because of the poorer distribution of robin_hood::hash on these key patterns.

On Apple M1 Max, the combination of ska::flat_hash_map and std::hash has a comparative advantage in almost every data scale (about 6 to 33 ns).

It is also worth noting that on the M1 Max chip, absl::node_hash_map shows a large performance degradation, with a pronounced bump in the data range from about 45,000 to 100,000 elements (peaking near 135 ns). std::unordered_map also exhibits performance degradation, climbing steadily and peaking around 100,000 to 300,000 elements (up to about 600 ns). It is not clear what the cause of this degradation is. It may be related to the system's memory allocation policy, as both hash tables require memory allocation and recycling operations for each insertion and deletion. This phenomenon is more pronounced on datasets with <uint64_t, 56 bytes>.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

On Intel Rocket Lake, the rankings are similar to those of <K,V>: <uint64_t with several split bits masked, uint64_t>: ska::flat_hash_map with std::hash keeps a comfortable lead (about 8 to 59 ns), and absl::flat_hash_map is the closest follower.

On M1 Max, the relative performance of both absl::flat_hash_map and emhash::hash_map7 increases a bit when the number of elements is in the range of roughly 32,768 to 1,200,000, where emhash::hash_map7 with absl::Hash actually edges ahead of ska::flat_hash_map at several points.

Latency

We record the latency of insert and erase operations separately. The insert latency here is different from that in the "Insert and Construct" test. The latency statistics in the construct test include all operations from size 0 to size N, while in this test the size of the hash table is always N or N + 1.

Insert (after erase)

<K,V>: <uint64_t with several split bits masked, uint64_t> Insert Latency

First consider P50 latency. When the data size is small, both ska::flat_hash_map with std::hash and tsl::robin_map with absl::Hash are in the first tier in terms of speed (about 7 to 9 ns). When the number of elements is larger, absl::flat_hash_map and absl::node_hash_map with absl::Hash have a greater advantage: past roughly 300,000 elements the ska/tsl tables jump up (to about 50 ns and beyond) while the absl tables stay near 24 to 28 ns. In addition, the median latency of most open-addressed hash tables converges again as the number of elements approaches 10^7 (around 92 to 99 ns).

For P99 latency, emhash::hash_map7 with absl::Hash has the smallest tail latency through small and medium sizes (about 31 ns at small counts, staying ahead up to roughly 200,000 elements). When the number of elements is larger than that, absl::flat_hash_map comes out on top.

Another point that has to be mentioned about modifying open-addressed hash tables is the tombstone mechanism used in the implementation. For some hash tables, when a delete operation is performed, a special marker (tombstone) is placed on the slot where the deleted element was located. A tombstone marker is not the same as an empty marker. If a hash table has too many tombstone markers, its lookup performance will be affected. Therefore, some hash tables will rehash when the number of tombstones reaches a certain percentage. This can give an insert operation mixed with erase a poor maximum latency. The P100 latency helps show this.

In addition to the element counts that are powers of 2 (where the first insertion may lead to expansion), some hash tables have P100 latency at some data points proportional to the number of elements. This phenomenon is observed for both the absl-series hash tables and robin_hood::unordered_flat_map. These hash tables should not be selected if the user requires strict maximum latency for modification operations.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct> Insert Latency

When the size of value_type becomes 64 bytes, the advantage of ska::flat_hash_map with std::hash over tsl::robin_map increases when the amount of data is small (about 8 ns vs 11 ns at the smallest sizes for P50).

For P99 latency, at small element counts the smallest tail belongs to absl::flat_hash_map with absl::Hash (about 33 ns), ahead of ska::flat_hash_map and emhash::hash_map7.

Erase

<K,V>: <uint64_t with several split bits masked, uint64_t> Erase Latency

For P50 latency, ska::flat_hash_map with std::hash almost always has the smallest latency, except in the range from about 300,000 to 800,000 elements. In that range ska::flat_hash_map jumps up (to about 100 to 160 ns) because its aggressive expansion and low load factor push it into a slower memory tier, while absl::flat_hash_map and robin_hood::unordered_flat_map stay lower (about 45 to 120 ns) and perform relatively better. Above 1,200,000 elements the tables converge again.

For P99 latency, emhash::hash_map7 with absl::Hash performs the fastest when the number of elements is small (about 19 ns, leading up to roughly 3,000 elements). The tail latency of absl::flat_hash_map with absl::Hash is smaller in the medium range, roughly from 8,000 to 200,000 elements. When the number of elements is larger, the performance of many hash tables is close.

<K,V>: <uint64_t with several split bits masked, 56 bytes struct> Erase Latency

When the size of the value_type is 64 bytes, it is basically the same as <uint64_t, uint64_t>.

Throughput Appendix

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

Latency Appendix

Insert (after erase)

<K,V>: <uint64_t with high position bits masked, uint64_t> Insert Latency

<K,V>: <uint64_t with low position bits masked, uint64_t> Insert Latency

<K,V>: <uint64_t uniformly distributed, uint64_t> Insert Latency

Erase

<K,V>: <uint64_t with high position bits masked, uint64_t> Erase Latency

<K,V>: <uint64_t with low position bits masked, uint64_t> Erase Latency

<K,V>: <uint64_t uniformly distributed, uint64_t> Erase Latency


← Back to Hash Table Benchmark index

  • Author: Renzibei
  • Link: https://renzibei.com/en/int-erase-insert/
  • Copyright Notice: All articles on this website, unless otherwise stated, are created by the author and are licensed under Creative Commons License CC BY-NC-ND 4.0 License. Please observe the agreement and cite the source when quoting and reproducing.