Arpa's blog

By Arpa, 8 years ago, In English

Hi!

I thought that unordered_set is faster than set, because set complexity is logarithmic in size and unordered_set complexity is constant time in average.(source: cplusplus.com)

But today I saw this:

TLE(>2000 MS) with unordered_set: 15494816

Accepted(155 MS) with set: 15494846

Also my other solution(witch I submitted during the contest) with unordered_map hacked :'(

Now my question is : Is unordered_set faster than set?

UPD: I have accepted(15497282).unordered_set time was better than set(15494846) : 93<155.

Only with adding this line:s.max_load_factor(0.25);s.reserve(500);.

UPD2: it seems that it is better to use a power of 2 in reserve(i.e. s.reserve(512)).(Thanks from keyvankhademi)

  • Vote: I like it
  • +9
  • Vote: I do not like it

| Write comment?
»
8 years ago, # |
  Vote: I like it -8 Vote: I do not like it

The default hash function for integers does not perform very well. I suggest trying to look up other hash function alternatives.

In my experience, both unordered set and map have worked well for strings.

  • »
    »
    8 years ago, # ^ |
      Vote: I like it +5 Vote: I do not like it

    Any hash function does not perform well unless it is chosen randomly from a big set.

    • »
      »
      »
      8 years ago, # ^ |
        Vote: I like it +10 Vote: I do not like it

      there's a difference between functions that work badly when someone want them too and that just work badly.

»
8 years ago, # |
  Vote: I like it 0 Vote: I do not like it

unordered_set is not always faster than set, it's like comparing complexity between quick sort and merge sort.

  • »
    »
    8 years ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    to be fair, quick sort is O(n lg n) in average and O(n^2) in worth case,but marge sort is O(n lg n)in average and worth case.

    But unordered_set is O(1) in average and O(n) in worth case, but set is O(lg n) in average and worth case.

    quick sort and marge sort average is equal, so it is better to use marge sort.

    But because unordered_set average complexity is better than set, it seems that unordered_set must be better.

    • »
      »
      »
      8 years ago, # ^ |
        Vote: I like it +5 Vote: I do not like it

      adding pos.reserve(500); reduces running time better then tree set for the current tests: http://codeforces.com/contest/620/submission/15496964

      i have no idea though why setting to a much higher value in place of 500 again increases time

    • »
      »
      »
      8 years ago, # ^ |
        Vote: I like it +5 Vote: I do not like it

      Yeah, sorry for the bad intepretation. But my point still stands, unordered_set complexity is unstable so the matter of is unordered_set is faster than set depends on the test case.

    • »
      »
      »
      8 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      it is better to use marge sort.

      Umm, no, it's better to use quicksort. Proof — this text from Skienna, Algorithm Design Manual:

      "What we can say is that experiments show that where a properly implemented quicksort is implemented well, it is typically 2-3 times faster than mergesort or heapsort. The primary reason is that the operations in the innermost loop are simpler."

      Basically quicksort has better cache locality. Most C++ standard library implementations use a combination of quicksort and insertion sort as their implementation for std::sort.

»
8 years ago, # |
Rev. 3   Vote: I like it +8 Vote: I do not like it

Well, that depends on the implementation of lib (And specially-crafted data will cause that. Check the UPD behind). Theoretically and most of the times, the hash ones are faster than tree ones.(And that's why hashing is quite popular in actual work.)

And as it turns out on Microsoft .Net Framework implements, the answer is yes.

15474905 the HashSet one with 218ms.

15496423 the SortedSet one with 311ms.

As for their implementation in Microsoft .Net Framework, it's already open-sourced. corefx/src/System.Collections/src/System/Collections/Generic/HashSet.cs corefx/src/System.Collections/src/System/Collections/Generic/SortedSet.cs

As for the GNU C++ inplementations(hash table implementations in <bits/hashtable.h>, and the default hash function for int is just static_cast(__val) ), I'm surprised to find a try-catch block in it(and not so easy to understand how it works).


And thank you for the heads up. I'll think twice before using unordered_set/unordered_map on GNU C++.


UPD:

The problem just reminds me of a famous security problem: Hash Collision DoS(a hot problem during 2011~2012), which, according to some reports, a lot of popular languages ware(maybe even are still) affected, including PHP<=5.3.8 , Perl, Java, ASP.NET, V8 JavaScript Engine,Ruby <= 1.8.7-p356 ,etc. A specially-crafted set of keys could trigger hash function collisions, which can degrade performance of HashMap or Hashtable by changing hash table operations complexity from an expected/average O(1) to the worst case O(n).

Some information here:https://bugzilla.redhat.com/show_bug.cgi?id=750533

(Some discussion shows that some language just adds some limits to the number of parameters to "hide" the problem inside the implements of hash table strategy (Source: http://stackoverflow.com/questions/8672985/how-was-the-hash-collision-issue-in-asp-net-fixed-ms11-100). You may want a try to hack my C# solution. )

»
8 years ago, # |
  Vote: I like it +5 Vote: I do not like it

Sometimes it is.

I have once got a problem AC by changing ordered_set to unordered_set in C++.

However it depends on the hashing. I always go with ordered set, and if it's tle and optimization, I try the unordered one.

»
8 years ago, # |
Rev. 2   Vote: I like it +11 Vote: I do not like it

Looks like someone went through the trouble to generate an extremely bad test case input for hacking 620C - Жемчужинки. Now the hack is included in the system tests as Test 42. It causes a huge number of collisions in that specific G++ implementation of unordered map.

  • a submission with unordered_map fails with TL: 15496737
  • a submission with map gets AC (only one line of code changed!): 15496747
  • a submission with unordered_map and custom hash function: AC again! 15497036. Bit slower compared to map, anyway.

To the OP: good idea to change the max load factor! Looks like a simple solution compared to writing a custom hash function.

Tweaking the load factor + pre-creating many buckets (with reserve) solves the problem because you only get bad performance when there are many items per bucket. The worst-case is when all items are in a single bucket: then lookup in the hash map is reduced to list lookup.

  • »
    »
    8 years ago, # ^ |
    Rev. 2   Vote: I like it +14 Vote: I do not like it

    Yes I'm afraid that was me.

    To sum it up, unordered_set is generally much faster than set. Set gurantees O(log n) but unordered_set is O(1) in average. However while set ensures O(log n), unordered_set can take O(n) time per insertion. This is very rare in practice unless the dataset is deliberately chosen so that unordered_set causes large number of collisions, degrading to O(n).