Yet Another Rating System for CodeForces

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	161
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	152
8	SecondThread	147
9	orz	146
10	pajenegod	145

Finally, I finished the implementation of an improved version of TrueSkill rating system that EbTech named "TrueSkill from St.Petersburg".

TL;DR

Results are here. Full repository is here.

Important notes on results

Note that those results were obtained by running on the EbTech's testset (rounds only before Codeforces Round #645 and team contests excluded) and could not represent current standings. Also, this file contains only users with at least 10 contests and at least one contest in 2020 (otherwise, the file would be very large). However, you may obtain full rating by running the algorithm in your computer, it would take about a minute. Also note that results may seem not very impressive since I chose the first parameters that come into my mind and they are surely should be changed.

What is this rating system?

It is like original TrueSkill, but without its problems, like multiway ties. What is TrueSkill? It is a generalisation of Elo/Glicko to the case of many simultaneous teams competing. However, original TrueSkill was made to deal with many players, large teams and not so many ties and should not work correctly in our case. The improved version of TrueSkill was designed for the sport version of "What? Where? When?" which means it solved original TrueSkill problems and perfectly match CodeForces-like competitions.

Pros

It is mathematically perfect. It does exactly Bayesian inference on the model with many simultaneous players (just like CodeForces). Therefore, no other rating system could be more precise. However, there is a room for customisation
It is very fast. As I already mentioned, it processes almost all CodeForces history in less than a minute. The complexity of processing contest with $$$n$$$ players is $$$\mathcal{O}\left(n \log \frac{1}{\varepsilon} \right)$$$ where $$$\varepsilon$$$ is the designated precision of users' ratings.
Teams out-of-the-box. Despite the fact that results above were obtained on tests with team contests excluded, it is not the rating system issue. The only reason why I excluded team contests from the testset is that CF-loader I took from EbTech's repository doesn't process team contests and I don't want to make loader by myself. However, now team performance is computed as sum of player performances, but it could be easily changed to any other formula.
Reduced rating inflation, explicit performances, visible rating feature and more.
tourist on the top of the rating!

Cons

Some formulas lack of numerical stability. That happened because formulas from the original article contain some mistakes (at least they fail on some obvious tests like $$$(-m) \cdot \chi_{[-\varepsilon, \varepsilon]} = -\left(m \cdot \chi_{[-\varepsilon, \varepsilon]}\right)$$$) and I can't find the way they were achieved. However, they seem to be near the truth and they are numerically stable, so I believe that the could be fixed.

Further discussion

Rating system parameters

As I already mentioned, I spent not so many time choosing the best parameters for CodeForces case, so there is still some work with data ahead.

Normal vs logistic distribution

Which distribution expresses player rating and performance better? Any choice is very debatable. Currently my implementation uses normal distribution inside, but it could be easily replaced with logistic or any other distribution, the only thing you need to do is to implement operators like multiplication, division, addition and two multiplication on indicator operators (last two are more tricky than others, but still not very hard, see article appendix for explanation what to do and this file as an example for normal distributions).

Large multiway tie on the last place

Codeforces standings has an unusual artifact of a large multiway tie of a players with zero score which does not perfectly match the theory and possibly affect those players ratings too much. There are many ways to deal with this problem, e.g. exclude them from rating at all, make separate tie-$$$\varepsilon$$$ for them, etc.

Rev.	By	When	Δ	Comment
en10	nikgaevoy	2020-07-04 21:59:43	0	(published)
en9	nikgaevoy	2020-07-04 21:59:24	351	(saved to drafts)
en8	nikgaevoy	2020-07-04 21:37:27	0	(published)
en7	nikgaevoy	2020-07-04 21:36:57	101
en6	nikgaevoy	2020-07-04 21:32:20	5	Tiny change: 'case when when many' -> 'case when many'
en5	nikgaevoy	2020-07-04 21:31:51	3	Tiny change: '. Also, this file cont' -> '. Also, the file cont'
en4	nikgaevoy	2020-07-04 21:30:49	488	Tiny change: '5.jpg)\n\nPros\n' -> '5.jpg)\n\n[cut]\n\nPros\n'
en3	nikgaevoy	2020-07-04 17:28:43	264	Tiny change: 'n[cut]\n\n### Pr' -> 'n[cut]\n\n\n### Pr'
en2	nikgaevoy	2020-07-04 17:23:38	5
en1	nikgaevoy	2020-07-04 17:21:47	4673	Initial revision (saved to drafts)