Google Code Jam Difficulty Estimation — 2021 Qualification Round

Have you been wondering how difficult Code Jam problems are on the Codeforces scale? Me too.

I made a simple estimate for the 2021 qualification round (link), and I plan to do the same for the upcoming rounds. Here I share the process and the results, and I welcome any feedback.

Data Cleaning

I use two data sources:

- CJ contest result data, downloaded using vstrimaitis's code (see details in his great blog post). From this I get the list of contest participants and which problems they solved.
- CF user data, downloaded using the CF API. From this I get the current rating of every CF coder.

I assume that many coders use the same username across different platforms. If, for a given CJ contestant, I find a CF user with the same name (case-insensitive), I assume they are the same person. I assign each matched CJ participant the rating of the corresponding CF coder, and I discard all other participants.
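The matching step could be sketched roughly as follows. This is only an illustration, not my actual script: the function name `match_ratings`, the handles, and the ratings are all made up for the example.

```python
def match_ratings(cj_names, cf_ratings):
    """Return {cj_name: rating} for CJ contestants whose name equals
    some CF handle, ignoring case. Unmatched contestants are dropped."""
    # Index CF handles by their lowercase form for case-insensitive lookup.
    cf_by_lower = {handle.lower(): rating for handle, rating in cf_ratings.items()}
    matched = {}
    for name in cj_names:
        rating = cf_by_lower.get(name.lower())
        if rating is not None:
            matched[name] = rating
    return matched


# Hypothetical toy data: two names match (one only after lowercasing),
# the third has no CF handle with the same name and is discarded.
cj_names = ["Alice", "BOB", "carol.smith"]
cf_ratings = {"alice": 1500, "bob": 2100, "carol": 1800}
matched = match_ratings(cj_names, cf_ratings)
print(matched)
```

Note that if two CF handles differ only by case, this sketch silently keeps the last one it sees.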

Difficulty Estimation

The formula used by CF to determine the difficulty of a problem is not public. However, the main idea is that you have a 50% probability of solving a problem whose difficulty equals your rating. Some details here. So I divide contestants into buckets 100 rating points wide (a 1450 coder and a 1549 coder both fall in the 1500 bucket), and I look for the bucket with a 50% success rate. That is my estimate of the problem's difficulty. I group together all ratings above 3000, and likewise all ratings below 500, since otherwise the sample size would be way too small.
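Here is a minimal sketch of the bucketing idea, under some assumptions. The function names and the toy data are made up, and "the bucket with a 50% success rate" is read here as "the lowest bucket whose success rate reaches 50%", which is just one possible reading.

```python
from collections import defaultdict


def bucket_of(rating, width=100, low=500, high=3000):
    """Round a rating to the nearest multiple of `width`, after clamping
    it to [low, high] so that the sparse tails are grouped together."""
    clamped = max(low, min(high, rating))
    return (clamped + width // 2) // width * width


def success_rate_by_bucket(contestants):
    """contestants: iterable of (rating, solved) pairs for one problem.
    Returns {bucket: fraction of contestants in it who solved the problem}."""
    attempts = defaultdict(int)
    solves = defaultdict(int)
    for rating, solved in contestants:
        b = bucket_of(rating)
        attempts[b] += 1
        solves[b] += int(solved)
    return {b: solves[b] / attempts[b] for b in sorted(attempts)}


def estimate_difficulty(rates, target=0.5):
    """Return the lowest bucket whose success rate reaches `target`,
    or None if no bucket does (assumed reading of the 50% rule)."""
    for b in sorted(rates):
        if rates[b] >= target:
            return b
    return None


# Hypothetical results for a single problem: (CF rating, solved?)
contestants = [
    (400, False), (1450, False), (1549, False), (1580, False),
    (1610, True), (1690, True), (1720, True), (3200, True),
]
rates = success_rate_by_bucket(contestants)
difficulty = estimate_difficulty(rates)
print(rates, difficulty)
```

With real data the success rate is not necessarily monotonic in the rating, so picking the first bucket that crosses 50% can be noisy; smoothing or fitting a curve would be a natural refinement.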

Results

Out of the 37398 contestants who submitted something during the qualification round, 11109 have a CF user with the same name. Here is their success rate on the different problems:

Estimated difficulty:

- A: <= 500
- B1: <= 500
- B2: <= 500
- B3: 2000
- C1: 600
- C2: 1400
- D1: 2400
- D2: 2700
- E1: 2600
- E2: 3000

Estimation issues

- Matching profiles across platforms by username is a bold assumption. I am discarding many coders, for example tourist, who competes as Gennady.Korotkevich in GCJ. Even worse for the estimate, I am probably matching some profiles that belong to different people.
- This was a qualification round where you only needed to score a minimum number of points to pass, with little incentive to do more. Many strong contestants didn't seem to care about solving all the problems. See LHiC, for example, who solved only problems E1 and E2. This lowers the problems' success rates and inflates the difficulty estimates.

Any thoughts?

Rev.	By	When	Δ	Comment
en7	areo	2021-04-11 15:05:47	549
en6	areo	2021-04-02 21:02:01	0	(published)
en5	areo	2021-04-02 21:01:17	1108	Tiny change: 'td,dr: A 900, \' -> 'td,dr: A 900, \'
en4	areo	2021-04-02 20:37:22	2249	Tiny change: '\frac{1}{1}$\n\n\nWh' -> '\frac{1}{1+10^\frac{R-D}{400}}$\n\n\nWh'
en3	areo	2021-04-02 15:23:43	1828
en2	areo	2021-04-02 14:40:39	2
en1	areo	2021-04-02 14:40:01	1046	Initial revision (saved to drafts)
