2 sum problem expanded to distributed systems/map reduce

Revision en2, by Saucemaster102, 2023-04-04 20:13:26

So recently I was interviewing for a big data engineer role, I had cleared the first round which was just DSA, SQL, and discussion on the big data tech that I had worked on, in the second round I came across an interesting problem so I wanted to share it with you all, it was same as 2 sum but i was expected to use map-reduce:

Problem statement:

There are a large number of numbers residing on multiple nodes (storage + commodity compute), you have one main compute server where you can perform your computing, find all distinct pairs which equal to a sum of "target" and the count of those pairs, now you can't load all the numbers on your main compute as you do not have enough storage to do so.

find all x+y=target

The solution the interviewer expected:

approach: Map-Reduce:

send a map program to all the nodes where the map function calculates the following :

(number,count(number))

next is the reduce phase where we use the following hashing algorithm in order to assign the above-calculated info in various nodes into buckets in our main compute

The bucket a given number is assigned to, is calculated as follows: min(number, target-number)

in that way, a pair, if they have a sum equal to target, will be assigned to the same bucket and we can compute our answers easily (edge case for equal numbers but yeah that can be handled somehow)

Now I know this question has a bunch of loop holes but the I hope the intent of the question was understood.

This was the first time I came across such a problem, can someone recommend me resources to learn/practice such problems?

Tags dsa, interview, big data, map reduce

History

 
 
 
 
Revisions
 
 
  Rev. Lang. By When Δ Comment
en2 English Saucemaster102 2023-04-04 20:13:26 17 Tiny change: 'ed info into buck' -> 'ed info in various nodes into buck'
en1 English Saucemaster102 2023-04-04 19:50:56 1661 Initial revision (published)