RodionGork's blog

By RodionGork, 10 years ago, translation, In English

It is a well-known problem for beginners (after learning searching minimum through array): to choose M smallest elements of an array of size N.

Here is an evolution of this problem on bigdata scale:

We have terabytes of log-files, say with N records. Each record contains value for some parameter A (among other auxiliary data). We want to choose M (say, million) records with smallest values of this parameter. How to manage with it?

There are two difficulties:

  • source data could not be stored into RAM, so we could not access arbitrary element in O(1) — instead we rely on sequential processing;
  • algorithm should be parallelizable since N is about 1e12 and single pass with a single thread will took about a day (however source data are stored in distibuted file system and are accessible by fragments of about 100Mb - 1Gb so several processors may work with different fragments simultaneously).

What approaches there are?

I came up with few basic ideas:

  1. Go through source data and store them into binary heap of maximum size M — i.e. next element is skipped if is greater than maximum of the heap — and if it is otherwise stored, then the maximum is popped out of a heap to preserve the size. It will work with about N + M*log(M) but I know not how to parallelize the binary heap... So probably it will not do.

  2. If M itself is so large that could not be stored in RAM we can store elements in the list instead of binary heap. After list grows to size 2M for example, we sort it and cut to the size M throwing away unnecessary elements and proceeding. This is well paralelizable but speed is about N*log(M) I think.

  3. I like stochastic variant: just pass through data and choose all for which A is less than some threshold k. How to determine this threshold? For example we can make pre-pass and choose randomly as many records as could be stored into RAM. Then chose among them few minimal (in proportion to M/N) and assign the maximum of these minimums as threshold k. Of course we can end up with few or less than M elements (but we can artificially increase k and at the end throw away the surplus).

However all these variants are not extremely good. Could you propose something better, please?

  • Vote: I like it
  • +4
  • Vote: I do not like it

| Write comment?