ssvb's blog

By ssvb, 5 weeks ago, In English

Right after the end of Facebook Hacker Cup 2021 Round 1, wjomlex posted the following comment in the discussion section under the announcement blog:

A common issue was people writing O(N^2) solutions to problem A when O(N) solutions are necessary. You won't be able to run an O(N^2) solution in 6 minutes when N = 800,000.

But is this really true? Challenge accepted!

Problem A1
Spoiler
Problem A2, solution v0

Reusing code from the problem A1 solution allows to easily create an $$$O(N^3)$$$ solution for problem A2 by just iterating over all the possible substrings, calling function "solve_a1" for each of them and accumulating the result. But we need $$$O(N^2)$$$ to satisfy the conditions of the challenge and this can be achieved by applying a minor adjustment, transforming "solve_a1" into "solve_a2l":

Solution v0 for problem A2

Unsurprisingly, this solution is way too slow and needs 28 minutes to process the practice mode data even on my desktop computer with a quad-core Intel Core i7-860 processor @2.8GHz:

Spoiler
Problem A2, solution v1 (OpenMP pragma)

A good thing is that the performance can be very easily improved by just taking advantage of more CPU cores:

Solution v1 for problem A2 (use OpenMP pragma)

Just a single line #pragma omp parallel for before a single loop helps a lot. The reduction(+:ans) part is also needed to tell OpenMP how to combine results from each loop iteration (without it we would need to store these results in a temporary buffer and then add them together in an extra $$$O(N)$$$ loop). The accumulator variable is upgraded to int64_t to avoid overflows. Additionally, the schedule(guided) part of the pragma and the reversed order of processing is there to keep all CPU cores busy and improve performance. The compiler command line needs to include -fopenmp option.

Below are benchmarks on my desktop computer, old laptop and Samsung Galaxy Tab S6 Lite tablet:

Spoiler
SolutionDeviceProcessorTime
solution v0desktop computerquad-core Intel Core i7-860 @2.8GHz28m13.998s
solution v1 (OpenMP)desktop computerquad-core Intel Core i7-860 @2.8GHz6m31.028s
old laptopdual-core Intel Core2 Duo T7200 @2.0GHz25m39.187s
android tabletocta-core 4x ARM Cortex-A73 @2.3GHz + 4x ARM Cortex-A53 @1.7GHz8m34.429s
Running it all on a homemade "cluster"

I have multiple devices at home and none of them is fast enough by itself. But what if they all work together? It isn't very difficult to make a script, which dissects the input data into separate testcases, runs them on different devices and combines their output together:

Ruby script
Results

Now the time is down to 4m45.626s and this is within the 6 minutes limit. Assigning jobs to different devices could be surely improved. Right now I'm just sorting them by the lenghts of strings and processing more difficult cases first. The old laptop was totally useless and couldn't handle even a single big testcase #17. The other devices came to the rescue in the end and finished it faster.

TL;DR;

Facebook Hacker Cup has an unusual format. The contestants have 6 minutes to run their solutions on their own computers. This provides a lot of opportunities to speed up even poor time complexity solutions by taking advantage of multiple CPU cores and multiple computers. OpenMP is easy to use.

To be continued

This is not the end, but the blog is already becoming too large and it's necessary to wrap it up. The next part will feature Clang vs. GCC, different compiler optimization options and taking advantage of SSE2/AVX2.

Spoiler
 
 
 
 
  • Vote: I like it
  • +203
  • Vote: I do not like it

»
5 weeks ago, # |
  Vote: I like it +16 Vote: I do not like it

Really nice blog, I remember people talking about doing this but not having a concrete example. Glad to see someone actually went ahead and did it.

Just a side-note: the use of a separate script sounds nonstandard (but it works around some issues with heterogeneous systems, so it's really nice). As far as I know, there's a way to use MPI (even as a hybrid with OpenMP) over a network (even though people claim heterogeneous support is broken, from what I've heard, it tends to work for Android on ARM and Linux on x86-64), which could allow for even better distribution of tasks across nodes (perhaps in terms of load balancing), though this doesn't sound as necessary for this problem.

»
5 weeks ago, # |
  Vote: I like it +23 Vote: I do not like it

See? This is what we mean when we talk about problem solving. Don't let anybody tell you what you can't do.

»
5 weeks ago, # |
  Vote: I like it +16 Vote: I do not like it

Nice. People run solutions on multiple machines in GHC but I've never heard about using a tablet. It seems that it would be enough for you to just split test cases in half between PC and tablet, right? No need for any smart assignment.

I used a less complicated set-up several times already. Split the test cases into 8 groups and run in parallel, no pragmas. It even got me to FHC finals once — I squeezed a slow $$$O(N \cdot \log^2 N)$$$ in 5 minutes.

Sadly, all of this rewards having a strong computer or using an online cluster. I would love to see this format in onsite competitions though, where all contestants have the same computing power.