Google Code Jam Difficulty Estimation — 2021 Qualification Round

Revision en3, by areo, 2021-04-02 15:23:43

Have you been wondering how difficult Code Jam problems are on the Codeforces scale? Me too.

I made a simple estimate for the 2021 qualification round (link), and I plan to do the same for the upcoming rounds. Here I share the process and the results, and I welcome any feedback.

Data Cleaning

I use two data sources:

  • CJ contest result data, downloaded using vstrimaitis's code (see his great blog post for details). From this, I get the list of contest participants and which problems they solved.

  • CF user data, downloaded using the CF API. From this, I get the current rating of every CF coder.
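As a rough illustration of the second data source, the sketch below turns a CF API response into a handle-to-rating mapping. It assumes the `user.ratedList` method was used, which the post does not state; the function names `fetch_rated_list` and `ratings_by_handle` and the sample handles are hypothetical.

```python
import json
import urllib.request

# Assumption: the ratings come from the public user.ratedList method.
CF_RATED_LIST_URL = "https://codeforces.com/api/user.ratedList?activeOnly=false"


def fetch_rated_list(url=CF_RATED_LIST_URL):
    """Download the list of rated users and return the decoded JSON payload."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def ratings_by_handle(payload):
    """Map each CF handle to its current rating, skipping users without one."""
    if payload.get("status") != "OK":
        raise ValueError("CF API call failed: %s" % payload.get("comment"))
    return {u["handle"]: u["rating"] for u in payload["result"] if "rating" in u}


# Offline example with a made-up payload (no network access needed).
sample = {
    "status": "OK",
    "result": [
        {"handle": "alice", "rating": 1500},
        {"handle": "bob"},  # no rating field, so this user is skipped
    ],
}
print(ratings_by_handle(sample))
```

On real data one would call `ratings_by_handle(fetch_rated_list())` once and cache the result, since the full list is large.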

I assume that many coders use the same username across different platforms. If for a given CJ contestant I find a CF user with the same name (case insensitive), I assume they are the same person. I assign to each CJ participant the rating of the corresponding CF coder, and I discard all other participants.
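The matching step described above could look roughly like this. It is a minimal sketch, not the code actually used for this post: the function name `match_by_handle`, the input shapes, and the handles and ratings in the example are all made up for illustration.

```python
def match_by_handle(cj_names, cf_ratings):
    """Return {cj_name: cf_rating} for CJ contestants that share a name with a CF user.

    The comparison is case-insensitive; contestants without a match are discarded.
    """
    # Index CF ratings by lowercased handle so lookups ignore case.
    cf_by_lower = {handle.lower(): rating for handle, rating in cf_ratings.items()}
    matched = {}
    for name in cj_names:
        rating = cf_by_lower.get(name.lower())
        if rating is not None:
            matched[name] = rating
    return matched


# Hypothetical data: "Alice" matches "alice", "carol.smith" has no CF homonym.
cj_names = ["Alice", "carol.smith"]
cf_ratings = {"alice": 1500, "bob": 2000}
print(match_by_handle(cj_names, cf_ratings))  # {'Alice': 1500}
```

A contestant who uses different names on the two platforms is silently dropped, and two different people who happen to share a name are silently merged.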

Difficulty Estimation

The formula used by CF to determine the difficulty of a problem is not public. However, the main idea is that you have a 50% probability of solving a problem whose difficulty equals your rating. Some details here. So I divide contestants into buckets 100 rating points wide (a 1450 coder and a 1549 coder both fall in the 1500 bucket), and I look for the bucket with a 50% success rate. That is my estimate of the problem's difficulty. I merge all ratings above 3000 into a single bucket, and likewise all ratings below 500, because otherwise the sample sizes would be far too small.
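The bucketing can be sketched as below, for a single problem. Two details are my assumptions for the sake of a concrete example: the tails are merged by clamping buckets to the range 500 to 3000, and "the bucket with a 50% success rate" is taken to be the lowest bucket whose success rate reaches 50%. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def bucket_of(rating, low=500, high=3000, width=100):
    """Round an integer rating to the nearest multiple of `width`, merging the tails."""
    # 1450..1549 -> 1500; anything that lands below `low` or above `high` is clamped.
    b = (rating + width // 2) // width * width
    return max(low, min(high, b))


def success_rates(contestants):
    """contestants: iterable of (rating, solved) pairs. Returns {bucket: success rate}."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for rating, ok in contestants:
        b = bucket_of(rating)
        total[b] += 1
        solved[b] += bool(ok)
    return {b: solved[b] / total[b] for b in sorted(total)}


def estimate_difficulty(rates, target=0.5):
    """Return the lowest bucket whose success rate reaches `target`, or None."""
    for b in sorted(rates):
        if rates[b] >= target:
            return b
    return None


# Toy data for one problem: (CF rating, solved the problem?)
contestants = [
    (1450, False), (1549, False),  # both fall in the 1500 bucket
    (1600, True), (1620, False),   # 1600 bucket, 50% success
    (1710, True), (1790, True),
]
rates = success_rates(contestants)
print(estimate_difficulty(rates))  # 1600
```

With real data the success rate is not guaranteed to increase monotonically with rating, so picking the bucket closest to 50% or smoothing the curve first are reasonable alternatives.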

Results

Out of the 37398 contestants who submitted something during the qualification round, 11109 share a username with a CF user. Here is their success rate on the different problems:

Estimated difficulty: A <= 500, B1 <= 500, B2 <= 500, B3 2000, C1 600, C2 1400, D1 2400, D2 2700, E1 2600, E2 3000

Estimation issues

  • Matching profiles across platforms by username is a bold assumption. I am discarding many coders, for example tourist, who competes as Gennady.Korotkevich in GCJ. And, even worse for the estimate, I am probably matching some profiles that belong to different people.

  • This was a qualification round where you just needed to score a minimum number of points to pass, with little incentive to do more. Many strong contestants didn't seem to care about solving all the problems. See LHiC for example, who solved only problems E1 and E2. This lowers the measured success rate of the other problems and inflates their estimated difficulty.

Any thoughts?

Tags codejam, gcj, difficulty, 2021
