Romka's blog

By Romka, 6 years ago (translated into English)

Hello everyone! :)

In case you missed our e-mail: the second round of the Yandex.Algorithm Optimization track started today at 10:00 UTC+3. Like the first round, it will last for 7 days. There is a single task, and it has no complete solution that can be found in a few seconds; instead, you can try different approaches to score as many points as possible. Based on the combined results of the two rounds (see section 3 of https://contest.yandex.com/algorithm2018/rules/; the rules for combining results are the same for the algorithmic and optimization tracks), we will determine the winners of the Optimization track and the 128 future owners of t-shirts with the competition logo.

We have prepared a task for you that simulates the web crawling performed by a search robot. We invite you to experience the complexity of indexing sites and to try to optimize the crawling of the Internet; this is just one of the problems that software engineers solve at Yandex. Of course, the task has been simplified considerably, since you only have a week to solve it. Still, can you come up with a good way to assign sites to robots optimally?

Registration for the contest is still open, so you can join at any time — good luck!


»
6 years ago, # |

Hi! When are the final testing results going to be published?

Also, I really wonder what the solutions that reached 8500+ on the current standings look like. Could someone share one of those ideas, please?

Out of curiosity, one more question, about the real-world inspiration behind the problem statement. I understand that it would make sense for robots to follow some kind of repeating list of sites, to save memory, for example. However, why do the robots' paths have to be cycles in the link graph, rather than just a repeating list of arbitrary sites, not necessarily connected in a cycle?

  • »
    »
    6 years ago, # ^ |

    Final standings are published!

  • »
    »
    6 years ago, # ^ |

    Some 8300+ ideas:

    • 8150+: a greedy solution constructing cycles one by one: pick a starting site at random, then repeatedly choose the next site whose topic has the minimum current "score contribution", weighted by timeToIndex[site] / newsProb[site]. When I add a new site to a cycle, I update scoreContribution(topic) += newsProb[site] / timeToIndex[site] for every topic published on that site. Once there are enough (C = n/r/2) sites in the cycle, I use Dijkstra to close it (the exact value of C does not seem to matter much). A C++ sketch of this greedy step is given below, after the last paragraph of this comment.

    • 8250+: repeat the previous algorithm multiple times and pick the "best" solution (the "best" one is not always truly the best, since randomness introduces a lot of variance in the scores).

    • 8300+: pick the topic TM with the lowest score and run the previous greedy algorithm ignoring TM (explanation: if a topic doesn't bring any points, let the robots focus on the other topics).

    • 8350: Fix one TLE by stopping slightly earlier :)

    I suspect one could reach 8500+ by speeding up the simulation step (currently only 10-30 random starting points can be tested, because of slow list/sort/vector operations) and by using an incremental approach where a site is added only if it increases the overall score.
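
    To make the 8150+ and 8250+ ideas above concrete, here is a minimal C++ sketch of the greedy cycle construction with random restarts. All names here (n, r, adj, topicsOf, newsProb, timeToIndex, simulate) are my assumptions about the task's interface, not the contest's actual input format, and the Dijkstra step that closes a cycle is only indicated by a comment:

    #include <bits/stdc++.h>
    using namespace std;

    // Assumed globals: fill these from the actual input (and size
    // scoreContribution to the number of topics) before calling solve().
    int n, r;                              // number of sites and robots
    vector<double> newsProb, timeToIndex;  // per-site statistics (newsProb > 0)
    vector<vector<int>> topicsOf;          // topics published on each site
    vector<vector<int>> adj;               // link graph between sites
    vector<double> scoreContribution;      // accumulated contribution per topic

    // Greedy weight of appending `site` to the current cycle: prefer sites
    // whose topics currently contribute the least, scaled by indexing cost
    // per unit of news probability.
    double weight(int site) {
        double minContribution = numeric_limits<double>::infinity();
        for (int t : topicsOf[site])
            minContribution = min(minContribution, scoreContribution[t]);
        return minContribution * timeToIndex[site] / newsProb[site];
    }

    // Build one cycle of up to C sites starting from `start`, updating the
    // per-topic contributions as sites are added.
    vector<int> buildCycle(int start, int C, vector<char>& used) {
        vector<int> cycle = {start};
        used[start] = 1;
        for (int t : topicsOf[start])
            scoreContribution[t] += newsProb[start] / timeToIndex[start];
        while ((int)cycle.size() < C) {
            int cur = cycle.back(), next = -1;
            double best = numeric_limits<double>::infinity();
            for (int v : adj[cur]) {
                if (used[v]) continue;
                double w = weight(v);
                if (w < best) { best = w; next = v; }
            }
            if (next == -1) break;  // dead end: stop this cycle early
            used[next] = 1;
            cycle.push_back(next);
            for (int t : topicsOf[next])
                scoreContribution[t] += newsProb[next] / timeToIndex[next];
        }
        // The comment's author then runs Dijkstra from cycle.back() back to
        // `start` to actually close the cycle; that step is omitted here.
        return cycle;
    }

    // Placeholder for the scoring simulator: a real solution would simulate
    // the crawling process and return the expected score of `plan`.
    double simulate(const vector<vector<int>>& plan) { (void)plan; return 0.0; }

    // 8250+ variant: restart from random starting sites, keep the best plan.
    vector<vector<int>> solve() {
        mt19937 rng(12345);
        double bestScore = -1;
        vector<vector<int>> bestPlan;
        for (int iter = 0; iter < 30; ++iter) {  // 10-30 restarts fit the TL
            scoreContribution.assign(scoreContribution.size(), 0.0);
            vector<char> used(n, 0);
            vector<vector<int>> plan;
            for (int robot = 0; robot < r; ++robot)
                plan.push_back(buildCycle(rng() % n, n / r / 2, used));
            double score = simulate(plan);
            if (score > bestScore) { bestScore = score; bestPlan = plan; }
        }
        return bestPlan;
    }

    The restart loop is also where the incremental idea mentioned above would go: instead of rebuilding plans from scratch, one could tentatively add a site and keep it only if the simulated score improves.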

»
6 years ago, # |

If I didn't miss anything, you can see your final score in your last submission even before the results are published.

  • »
    »
    6 years ago, # ^ |

    Nice hacker skills! That's not the final scoreboard, but it's something.

    Now I know that my solution gets a Runtime Error on only 1 test out of 500. I guess I've got that going for me, which is nice. :D