So recently I was interviewing for a big data engineer role, I had cleared the first round which was just DSA, SQL, and discussion on the big data tech that I had worked on, in the second round I came across an interesting problem so I wanted to share it with you all, it was same as 2 sum but i was expected to use map-reduce:

Problem statement:

There are a large number of numbers residing on multiple nodes (storage + commodity compute), you have one main compute server where you can perform your computing, find all distinct pairs which equal to a sum of "target" and the count of those pairs, now you can't load all the numbers on your main compute as you do not have enough storage to do so.

find all x+y=target

The solution the interviewer expected:

approach: Map-Reduce:

send a map program to all the nodes where the map function calculates the following :

(number,count(number))

next is the reduce phase where we use the following hashing algorithm in order to assign the above-calculated info in various nodes into buckets in our main compute

The bucket a given number is assigned to, is calculated as follows: min(number, target-number)

in that way, a pair, if they have a sum equal to target, will be assigned to the same bucket and we can compute our answers easily (edge case for equal numbers but yeah that can be handled somehow)

Now I know this question has a bunch of loop holes but the I hope the intent of the question was understood.

This was the first time I came across such a problem, can someone recommend me resources to learn/practice such problems?