
MikeMirzayanov's blog

By MikeMirzayanov, 9 years ago, translation, In English

Hello, Codeforces!

In late December and early January, I wrote a proof of concept of a separate service in C++, to move heavy data processing out of the Codeforces Java code into C++. I had not written C++ for a long time, and it was a funny feeling of immersion into another world.

I was surprised to find that there is no open-addressing hash map in the C++ standard library, nor in Boost or other popular libraries. This is strange, because open addressing is often better than separate chaining in both time and memory, and certainly so for the small objects I intend to store.
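
The idea itself is simple: keep the entries in one flat array and resolve collisions by probing nearby slots, so lookups walk contiguous memory instead of chasing list nodes. A toy sketch of the technique (illustration only, not the oaht code):

// Open addressing with linear probing, illustration only (not the oaht code):
// all entries live in one flat array; on a collision we just move to the next slot.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

struct FlatMap {
    struct Slot { int64_t key; int64_t value; bool used; };

    std::vector<Slot> slots;
    std::size_t count;

    FlatMap() : slots(16, Slot{0, 0, false}), count(0) {}

    void put(int64_t key, int64_t value) {
        if (2 * (count + 1) > slots.size()) grow();        // keep load factor <= 0.5
        std::size_t mask = slots.size() - 1;               // capacity is a power of two
        std::size_t i = std::hash<int64_t>()(key) & mask;
        while (slots[i].used && slots[i].key != key)
            i = (i + 1) & mask;                            // linear probing: next slot
        if (!slots[i].used) ++count;
        slots[i] = Slot{key, value, true};
    }

    const int64_t* get(int64_t key) const {
        std::size_t mask = slots.size() - 1;
        std::size_t i = std::hash<int64_t>()(key) & mask;
        while (slots[i].used) {
            if (slots[i].key == key) return &slots[i].value;
            i = (i + 1) & mask;
        }
        return nullptr;                                    // reached an empty slot: absent
    }

    void grow() {
        std::vector<Slot> old(2 * slots.size(), Slot{0, 0, false});
        old.swap(slots);                                   // slots is now the larger table
        count = 0;
        for (const Slot& s : old)
            if (s.used) put(s.key, s.value);               // reinsert into the new table
    }
};

int main() {
    FlatMap m;
    for (int64_t i = 0; i < 100; ++i) m.put(i * i, i);
    std::printf("%lld\n", static_cast<long long>(*m.get(49 * 49)));  // prints 49
}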

I quickly sketched a prototype which actually shows that open addressing is 2-3 times faster than the standard std::unordered_map. Here is the benchmark output on my laptop:

std::map takes 15779 ms
std::unordered_map takes 4698 ms
oaht::hash_map takes 1473 ms
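
The essence of the measurement is simple: insert a few million pseudo-random keys into each container and time it. A sketch of such a comparison (not the actual benchmark code from my repository):

// A timing sketch (not the actual benchmark.cpp): insert a few million
// pseudo-random keys into any map-like container and report elapsed time.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <map>
#include <unordered_map>

template <typename Map>
void run(const char* name, int n) {
    auto start = std::chrono::steady_clock::now();
    Map m;
    uint64_t x = 1;
    int64_t sum = 0;
    for (int i = 0; i < n; ++i) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;  // cheap LCG keys
        m[static_cast<int64_t>(x)] = i;
    }
    for (const auto& kv : m) sum += kv.second;                    // touch everything
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%s takes %lld ms (checksum %lld)\n",
                name, static_cast<long long>(ms), static_cast<long long>(sum));
}

int main() {
    const int n = 4000000;
    run<std::map<int64_t, int64_t>>("std::map", n);
    run<std::unordered_map<int64_t, int64_t>>("std::unordered_map", n);
    // run<oaht::hash_map<int64_t, int64_t>>("oaht::hash_map", n);  // with oaht.h included
}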

I do not think there is a good implementation of such a container in STL style with C++11 support (move semantics). Maybe someone can point one out, but I have not found any.

Here is my prototype on GitHub: https://github.com/MikeMirzayanov/open_addressing_hash_table. Unfortunately, I am not a great C++ expert, and I have no time to implement it properly myself.

On the other hand, there have been many C++ discussions on Codeforces, and it seems there are many users here who understand modern C++. Besides, algorithms are the air and water of Codeforces! I hope that the collective mind of Codeforces and the skilled hands of someone in the community will help bring such a container to life. Of course, we want maximum compatibility with std::unordered_map (the API and semantics should be equivalent wherever possible). Here is a sample list of features I want to see (a rough interface outline is sketched after the list):

  1. One more template type parameter class Pred = std::equal_to<Key>
  2. Member function at
  3. Member function clear
  4. Member function empty
  5. Member function insert
  6. Member function equal_range
  7. Member function swap
  8. Member function operator=
  9. Member functions load_factor, max_load_factor, rehash, reserve
  10. Const-qualified member functions
  11. Member types: difference_type, hasher, key_type, mapped_type, reference, size_type, value_type, pointer, const_reference, key_equal
  12. Iterators: member types iterator, const_iterator, member functions begin, end, cbegin, cend
  13. Member function emplace
  14. Member function emplace_hint
  15. Relational operators
  16. Non-member function overload: swap
  17. Move semantics
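
Roughly, the target could look like the outline below. This is a declaration-only sketch of an std::unordered_map-compatible API; the template parameter names other than Pred, and the exact signatures, are my guesses, not code taken from oaht.h.

// Declaration-only outline of the requested API (a sketch, not oaht.h).
#include <cstddef>
#include <functional>
#include <utility>

namespace oaht {

template <class Key,
          class Value,
          class Hash = std::hash<Key>,
          class Pred = std::equal_to<Key>>      // item 1: key equality predicate
class hash_map {
public:
    // item 11: member types
    using key_type        = Key;
    using mapped_type     = Value;
    using value_type      = std::pair<const Key, Value>;
    using hasher          = Hash;
    using key_equal       = Pred;
    using size_type       = std::size_t;
    using difference_type = std::ptrdiff_t;
    using reference       = value_type&;
    using const_reference = const value_type&;
    using pointer         = value_type*;

    // item 12: iterators
    class iterator;
    class const_iterator;
    iterator begin();
    iterator end();
    const_iterator cbegin() const;
    const_iterator cend() const;

    // items 2-9, 13, 14, 17: lookup, modifiers, hash policy, move semantics
    mapped_type& at(const key_type& key);
    std::pair<iterator, bool> insert(const value_type& value);
    template <class... Args> std::pair<iterator, bool> emplace(Args&&... args);
    template <class... Args> iterator emplace_hint(const_iterator hint, Args&&... args);
    std::pair<iterator, iterator> equal_range(const key_type& key);
    void clear();
    bool empty() const;
    size_type size() const;
    void swap(hash_map& other);
    hash_map& operator=(const hash_map& other);
    hash_map& operator=(hash_map&& other);
    float load_factor() const;
    float max_load_factor() const;
    void max_load_factor(float factor);
    void rehash(size_type buckets);
    void reserve(size_type count);
};

// items 15, 16: relational operators and non-member swap
template <class K, class V, class H, class P>
bool operator==(const hash_map<K, V, H, P>& a, const hash_map<K, V, H, P>& b);

template <class K, class V, class H, class P>
void swap(hash_map<K, V, H, P>& a, hash_map<K, V, H, P>& b);

}  // namespace oaht

Everything else (const overloads, exact return types) should simply mirror std::unordered_map.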

In addition, I want it to keep working on popular C++11 compilers: g++, Visual Studio 2013+, clang. Please try to follow the code style of oaht.h.

I'll be happy to delegate the management of this project to someone who has the enthusiasm, knowledge of C++, and proven experience with such problems. I'm looking for a volunteer! A Codeforces t-shirt and +5 to your honor are guaranteed.

P.S. In the end we want a really good implementation, so the intent is to accept only correct and well-written code into the project.


»
9 years ago, # |

Will we be able to use oaht::hash_map h; after #include "codeforces.h" on codeforces in the future? It sounds very interesting :)
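
If the API ends up compatible with std::unordered_map, usage would presumably look something like this (purely hypothetical: "codeforces.h" does not exist, and I'm assuming operator[] behaves as in std::unordered_map):

// Hypothetical usage, assuming an std::unordered_map-like interface.
// "oaht.h" is the header from the repository; "codeforces.h" does not exist (yet).
#include "oaht.h"
#include <cstdint>
#include <cstdio>

int main() {
    oaht::hash_map<int64_t, int64_t> h;
    h[42] = 7;                                              // assumed operator[]
    std::printf("%lld\n", static_cast<long long>(h[42]));   // prints 7
}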

»
9 years ago, # |
Rev. 3

Have you tried SGI hash_map?

Unfortunately, I couldn't find the collision resolution method used in it, but I tried your benchmark.cpp on my laptop and it shows better performance than std::unordered_map, so you might want to check it out.

Header file: <ext/hash_map>

Namespace: __gnu_cxx

Edit1: It's not SGI's hash_map; I misinterpreted this comment in <ext/hash_map>:

/** @file backward/hash_map
 *  This file is a GNU extension to the Standard C++ Library (possibly
 *  containing extensions from the HP/SGI STL subset).
 */

Edit2: After some digging into the header files, I found that it uses the closed-addressing (separate chaining) method for collision resolution.

You can find that in the header file <backward/hashtable.h> function name ... insert_equal_noresize(...).
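
For anyone who wants to try it, the minimal usage is just the header and namespace above (a small sketch; GCC treats this as a deprecated SGI-style extension, so it may print a warning):

// Minimal usage of the GNU extension container mentioned above (a sketch).
#include <ext/hash_map>
#include <cstdio>

int main() {
    __gnu_cxx::hash_map<int, int> m;     // chained hashing internally (see Edit2)
    for (int i = 0; i < 1000; ++i)
        m[i * i] = i;
    std::printf("%d\n", m[49 * 49]);     // prints 49
}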

  • »
    »
    9 years ago, # ^ |

    It seems it has worse performance than the open-addressing hash map proof-of-concept:

    n          oaht::hash_map   __gnu_cxx::hash_map
    500000     204 ms           374 ms
    1000000    408 ms           841 ms
    2000000    860 ms           1760 ms
    4000000    1731 ms          3787 ms
    8000000    3956 ms          8250 ms
»
9 years ago, # |

Have you tried Google Dense Hashmap? Though it might not match your specs 100%, it's a lot faster than the STL containers and really nicely written.

https://code.google.com/p/sparsehash/

  • »
    »
    9 years ago, # ^ |

    Looks promising, thanks. But I tried test<GOOGLE_NAMESPACE::sparse_hash_map<int64_t,int64_t>>("GOOGLE_NAMESPACE::sparse_hash_map"); and it takes ~2000 ms, while the open-addressing hash map proof-of-concept takes ~400 ms. Right now I can't use it with MinGW (it doesn't compile), so I'm using MS VS 2013.

    • »
      »
      »
      9 years ago, # ^ |

      I think dense_hash_map is more time efficient while sparse_hash_map is more space efficient. Can you try dense_hash_map?

    • »
      »
      »
      9 years ago, # ^ |

      I checked, and the default dense_hash_map is about 2x slower on my machine.

      I wonder how much of this difference is due to the fact that your proof of concept supports few operations, namely only inserts but not deletes.
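
      For context, erase is what usually complicates open addressing: a removed entry cannot simply be marked empty, or it would cut probe chains, so implementations typically keep a 'deleted' (tombstone) state that later lookups must step over. A toy sketch of that lookup rule (illustration only, not oaht or dense_hash_map code):

      // Why erase is not free in open addressing (illustration only): a removed
      // slot is marked DELETED, not EMPTY, so probe chains stay intact, and every
      // later lookup still has to step over those tombstones.
      #include <cstdint>
      #include <cstdio>
      #include <vector>

      enum State : uint8_t { EMPTY, USED, DELETED };
      struct Slot { int64_t key; int64_t value; State state; };

      const int64_t* find(const std::vector<Slot>& slots, int64_t key) {
          std::size_t mask = slots.size() - 1;                    // power-of-two size
          std::size_t i = static_cast<std::size_t>(key) & mask;   // trivial hash for the sketch
          while (slots[i].state != EMPTY) {                       // DELETED keeps the probe going
              if (slots[i].state == USED && slots[i].key == key)
                  return &slots[i].value;
              i = (i + 1) & mask;
          }
          return nullptr;
      }

      int main() {
          std::vector<Slot> t(8, Slot{0, 0, EMPTY});
          t[3] = Slot{3, 30, USED};
          t[4] = Slot{11, 110, USED};     // key 11 also hashes to slot 3, probed forward to 4
          t[3].state = DELETED;           // erase key 3: the tombstone keeps key 11 reachable
          const int64_t* v = find(t, 11);
          std::printf("%lld\n", v ? static_cast<long long>(*v) : -1LL);  // prints 110
      }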

      • »
        »
        »
        »
        9 years ago, # ^ |

        Did you use the benchmark code from my GitHub repository? Then dense_hash_map wasn't tested on erases either. I do not see why supporting more functionality in oaht should reduce performance.

        Also, in my case I will not use erases often.

        How do you use dense_hash_map? I've got Assertion failed: settings.use_empty(), file google\sparsehash/internal/densehashtable.h, line 476.

        • »
          »
          »
          »
          »
          9 years ago, # ^ |

          Yes, I cloned your GitHub code.

          I also had that error with dense_hash_map, but it is explained in the readme:

             dense_hash_map requires you to set aside one key value as the
             'empty bucket' value, set via the set_empty_key() method.  This
             *MUST* be called before you can use the dense_hash_map.  It is
             illegal to insert any elements into a dense_hash_map whose key is
             equal to the empty-key.
          

          So, I created a new test with

          #include <sparsehash/dense_hash_map>   // or <google/dense_hash_map> in older releases
          #include <cstdint>

          // Init: one key value must be reserved as the 'empty bucket' marker
          google::dense_hash_map<int64_t, int64_t> m;
          m.set_empty_key(1LL << 62);
          // do operations
        • »
          »
          »
          »
          »
          9 years ago, # ^ |

          Sorry for the multiple replies. I just noticed that dense_hash_map also uses open addressing, with quadratic probing for collision resolution.

          So I think these efficiency differences are more about the particular data set or hash function used than about the hash map implementation.

          • »
            »
            »
            »
            »
            »
            9 years ago, # ^ |

            I believe that the +1 (linear probing) strategy could be better than quadratic probing or double hashing because of better CPU cache usage.
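
            To make the difference concrete, here is the probe index arithmetic only (a sketch assuming a power-of-two table; dense_hash_map's exact scheme may differ):

            // Probe index arithmetic only (a sketch, power-of-two table assumed).
            // Linear (+1) probing visits adjacent slots; quadratic (triangular)
            // probing, one common scheme for power-of-two tables, jumps farther.
            #include <cstdio>

            int main() {
                const unsigned capacity = 64;                 // power of two
                const unsigned mask = capacity - 1;
                const unsigned home = 37;                     // hash value reduced mod capacity

                for (unsigned i = 0; i < 5; ++i) {
                    unsigned linear    = (home + i) & mask;                 // 37 38 39 40 41
                    unsigned quadratic = (home + i * (i + 1) / 2) & mask;   // 37 38 40 43 47
                    std::printf("probe %u: linear %u, quadratic %u\n", i, linear, quadratic);
                }
            }

            With the +1 step the probed slots sit next to each other, so a short probe run usually stays within one or two cache lines; the quadratic sequence spreads clusters out at the price of extra cache misses.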

            • »
              »
              »
              »
              »
              »
              »
              9 years ago, # ^ |

              I think it depends on the data set. Linear probing has the risk of clumping, i.e. too many adjacent buckets becoming occupied.

              If you feel this is unlikely to occur in the data you'll use, then it's probably better.

              My feeling is that libraries like dense_hash_map are designed to limit the damage an adversarial input can do, which may not be the best trade-off for your case.

              • »
                »
                »
                »
                »
                »
                »
                »
                9 years ago, # ^ |

                It depends on the load factor. Linear probing has a worse expected cluster length as a function of the load factor. Because of hashing, it doesn't depend much on the input (except for anti-hash cases).

                Currently I'm looking into Walrus's implementation at https://code.google.com/p/morzhovets/source/browse. It looks like a good trade-off between time and memory, and much better than std::unordered_map.

          • »
            »
            »
            »
            »
            »
            9 years ago, # ^ |

            I remember that a long time ago I tried to implement my own oaht, and it was also considerably faster than dense_hash_map on all benchmarks. I was using double hashing to resolve collisions.

            For some reason, in the case of hash maps, it seems that the more generic the code is, the less performant it becomes. Maybe it has something to do with working well against most data sets, or maybe it really is just a load factor issue; I don't know.
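
            For reference, double hashing derives the probe step from a second hash of the key, so two keys that land in the same home slot usually follow different probe sequences. A sketch of the index computation only (the mixing constants here are arbitrary, not the ones I actually used):

            // Double hashing, index computation only (a sketch): the probe step is
            // derived from a second hash of the key, forced to be odd so that it
            // is coprime with a power-of-two table size and visits every slot.
            #include <cstdint>
            #include <cstdio>

            int main() {
                const uint64_t capacity = 64;                      // power of two
                const uint64_t key = 123456789;

                uint64_t h1 = key * 0x9E3779B97F4A7C15ULL;         // primary hash: home slot
                uint64_t h2 = (key ^ (key >> 33)) * 0xC2B2AE3D27D4EB4FULL;
                uint64_t step = h2 | 1;                            // odd step

                for (uint64_t i = 0; i < 5; ++i) {
                    uint64_t slot = (h1 + i * step) & (capacity - 1);
                    std::printf("probe %d -> slot %d\n",
                                static_cast<int>(i), static_cast<int>(slot));
                }
            }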

»
9 years ago, # |

How can this be used on Codeforces? Which compiler should be chosen, and which file(s) should I #include? I got lots of CE...