Since this December, I've been exploring Codeforces data to answer the question of how best to improve your rating. Today, I'm presenting the final project to you all!
The first problem in any data-driven analysis is getting the data. Codeforces does have an API, but it's quite difficult to pull data from it at a large scale. So first, I cleaned everything up and created two big datasets, which I've published to Kaggle. If you think my analysis is garbage, you can download the datasets and try it yourself!
Dataset: The submissions and contest results for 60k recently active users
Dataset: The final standings for every Codeforces contest
With a nice API to gather data, you might expect that there would be very little work to do in order to make these datasets! You would be right, at least compared to having no API at all. But there was still a lot of work to be done. For example, since we are rate limited to one request per two seconds, I had to run the scripts for days in order to get this much data. At first I was too lazy to set up an external server and just ran them whenever my computer was open, so it took about a month. Another problem is that there are a lot of edge cases to consider. For example, I was running the script during the season of magic, and people kept changing their handles! There are also a lot of different types of contests. Did you know that rated team contests exist? Also, you might have guessed that not every participant in a contest has a rating change, but did you know that in some contests, non-participants can have their rating updated because the results table is merged with another source?
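To make the rate limit concrete, here is a minimal sketch of a throttled API client. It is not the actual script used for these datasets. The `Throttle` class and the `fetch` helper are hypothetical names introduced for illustration, and the two-second interval is taken from the limit mentioned above. The sketch assumes the usual Codeforces response shape, a JSON object with a `status` field and a `result` field.

```python
import json
import time
import urllib.parse
import urllib.request

API_ROOT = "https://codeforces.com/api/"
MIN_INTERVAL = 2.0  # seconds between requests, per the limit described above


class Throttle:
    """Hypothetical helper: spaces calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=MIN_INTERVAL, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous call, None before the first one

    def wait(self):
        """Block until at least `min_interval` has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()


def fetch(method, params, throttle):
    """Call one API method (e.g. "user.rating") and return its `result` field."""
    throttle.wait()
    url = API_ROOT + method + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    if payload.get("status") != "OK":
        raise RuntimeError(payload.get("comment", "Codeforces API call failed"))
    return payload["result"]
```

A call would look like `fetch("user.rating", {"handle": some_handle}, throttle)`, reusing one `Throttle` instance for the whole run. At one request per two seconds, 60k users need at least 120,000 seconds for a single method, which is roughly a day and a half of uninterrupted running. That is consistent with the scripts taking days.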
After getting all the data, I also needed to post-process it so that everyone has their true rating rather than the displayed rating. Unfortunately, the API does not give this value. The dataset is pretty big, and just running the script to modify the ratings took about 30 minutes :( The data analysis is similarly slow to run.
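As a rough illustration of the true-versus-displayed gap, here is a sketch of one possible conversion. It is not the script used for the dataset. It assumes the scheme Codeforces announced for new accounts: the hidden rating starts at 1400 while the displayed rating starts at 0, and the displayed value catches up through bonuses of 500, 350, 250, 150, 100 and 50 over the first six rated contests. The constants and the function name `true_rating` are assumptions for illustration, and the conversion only makes sense for accounts created under that scheme.

```python
# Assumed display bonuses for an account's first six rated contests.
# They sum to 1400, the assumed hidden starting rating.
PROMO_BONUSES = [500, 350, 250, 150, 100, 50]
INITIAL_TRUE_RATING = 1400


def true_rating(displayed_rating, rated_contests_played):
    """Estimate the hidden rating from the displayed one.

    `rated_contests_played` is the number of rated contests the account
    has finished so far. After six contests the two ratings coincide.
    """
    granted = sum(PROMO_BONUSES[:rated_contests_played])
    return displayed_rating + (INITIAL_TRUE_RATING - granted)
```

Under these assumptions, an account that has played no rated contests has a displayed rating of 0 and a true rating of 1400. An account that shows 900 after two contests has received 850 in bonuses, so its true rating is 900 + 550 = 1450.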
After getting the data, I began doing analysis and creating some charts. I looked at a lot of features, such as first solve time, rate of getting incorrect answers, and the difficulty of problems. For many of the features, I could not find great insights, but there are still quite a few interesting graphs in here! I hope you enjoy this data analysis that I've been working on for the last 3 months!
I've tried solving only very, very hard problems before (about 700 above my rating), and it mostly led to me giving up and not coding anything, so I don't think it was very effective. I've also done easy ones under my rating, and after a while I just started getting scared of coding more difficult problems. After doing this project, I've changed my practice strategy to focus on slightly hard problems (about 200 above my rating), and I feel like I'm able to solve problems a bit more consistently and have more interest and motivation.
This project was made for a course at Carnegie Mellon, though a lot of the work, mainly gathering data, was outside the scope of the class.