OpenAI Codex can solve unseen Codeforces problems

In this post, we use OpenAI Codex, the recent product based on GPT-3 and first introduced in Evaluating Large Language Models Trained on Code (Chen et al., 2021), to solve some Codeforces problems.

We use Codeforces Round 739 (Div. 3) because the current version of the Codex model was released before this round. As the model was trained on GitHub data, it may have memorized solutions to older problems.

Solving 1560A - Dislike of Threes from the statement only

This is a simple problem, and the statement kind of explains what we should implement. Let's just give the plaintext statement to Codex:

Prompt

Problem: Polycarp doesn't like integers that are divisible by 3 or have the last digit 3 in their decimal representation.

Polycarp starts to write out the positive (greater than 0) integers which he likes: 1,2,4,5,7,8,10,11,14,16,…. Output the k-th element of this sequence (the elements are numbered from 1).

Input The first line contains one integer t (1≤t≤100) — the number of test cases. Then t test cases follow. Each test case consists of one line containing one integer k (1≤k≤1000).

Output For each test case, output in a separate line one integer x — the k -th element of the sequence that was written out by Polycarp.

#include <bits/stdc++.h>
using namespace std;

Code

/*
Problem:
Polycarp doesn't like integers that are divisible by 3 or have the last digit 3
in their decimal representation. 

Polycarp starts to write out the positive (greater than 0) integers which he likes: 
1,2,4,5,7,8,10,11,14,16,…. 
Output the k-th element of this sequence (the elements are numbered from 1). 

Input
The first line contains one integer t (1≤t≤100) — the number of test cases. Then t test cases follow.
Each test case consists of one line containing one integer k (1≤k≤1000).

Output
For each test case, output in a separate line one integer x — 
the k -th element of the sequence that was written out by Polycarp.

*/

#include <bits/stdc++.h>
using namespace std;  // <-- from now on it's Codex

int main() {
	int t;
	cin >> t;
	for (int i = 0; i < t; i++) {
		int k;
		cin >> k;
		int x = 1;
		while (k > 0) {
			if (x % 3 != 0 && x % 10 != 3) {
				k--;
			}
			x++;
		}
		cout << x - 1 << endl;
	}
	return 0;
};

Demo of Codex in action

127276629

Disclaimers

It produced correct code on the second try; the first C++ submission got WA on test 2. The problem text has been cleaned up a bit, and "end in the digit 3" was changed to "have the last digit 3", because for some reason Codex thinks "end in the digit 3" means "contain the digit 3".
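To make the difference between the two readings concrete, here is a small Python sketch. It is not Codex output, and the function names are made up for illustration. It builds the sequence under both interpretations; they agree on the first 18 elements and first diverge at k = 19.

```python
def ends_in_3(x: int) -> bool:
    """Intended reading: the last decimal digit is 3."""
    return x % 10 == 3


def contains_3(x: int) -> bool:
    """Misreading: the digit 3 appears anywhere in the number."""
    return "3" in str(x)


def kth_liked(k: int, dislikes_digit=ends_in_3) -> int:
    """Return the k-th positive integer that is not divisible by 3
    and is not rejected by the given digit predicate."""
    x = 0
    while k > 0:
        x += 1
        if x % 3 != 0 and not dislikes_digit(x):
            k -= 1
    return x


if __name__ == "__main__":
    # First ten elements, matching the statement: 1,2,4,5,7,8,10,11,14,16
    print([kth_liked(k) for k in range(1, 11)])
    # The two readings first disagree at k = 19: 31 versus 40
    print(kth_liked(19), kth_liked(19, contains_3))
```

The brute-force loop mirrors the structure of the accepted C++ code above; only the digit predicate changes between the two readings.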

Solving 1560B - Who's Opposite? from the problem statement and the editorial

Here we have no hope of getting the solution straight from the problem statement, as it requires some thinking. However, if you give it the editorial, it can produce working code on the first try:

127226971

No formatting is required: just copy and paste the statement and the editorial from the Codeforces website. No inductive bias in user input was needed at all!

What you can do if this is cool or frightening to you

It is a net good to the world if everyone has access to models such as Codex or GPT-3.

If you are a machine learning researcher, or maybe just a competent person in general, try to get involved in EleutherAI, Hugging Face, or something similar. If you want to work on important things, in my opinion language models are very critical right now.

Bonus: What you can do if you own a competitive programming site

The advice below is my personal opinion.

Try to get a tester with Codex access for future rounds. I predict it could become a part of the cheating issue as soon as 2022.
Think about what the rules should say about using this kind of tool.
Prepare for the effects of competitive programming becoming less relevant to job interviews. If your financing model relies on job interview preparation, you will need to change it in a few years. (Codeforces and AtCoder seem to be in the best shape here.)

FAQ

Can language models solve harder problems?

Language models are a new and poorly understood object. Here is a gross oversimplification of the current conventional wisdom:

GPT-3 is a language model. Its main purpose is translating between different representations of the same data. It can do English to German, it can be used to describe images, and with Codex it can translate textual commands to code. Codex can solve the Div3 A from scratch because the solution is just an implementation of the text, but it needs the editorial to solve Div3 B.

As of 2021, language models are not intended for solving algorithmic puzzles or producing mathematical proofs. It is likely we will have a much more powerful Codeforces solver by say 2026, when the research manages to combine proof search methods with the GPT-3 architecture. If I manage to read enough papers on this, I might write more on my academic blog.

I can't see the demo.

If you can't access imgur, go to this link.

How to get access to Codex?

Have some academic credentials, and fill in the waitlist form at OpenAI's website with an interesting research proposal. Please do not use Codex for bad purposes such as cheating or spam.

How to integrate Codex with vim? I got access but I can only use it in the Playground.

Use vim_codex with your OpenAI keys.

What are the Codex parameters in the demo?

I used the OpenAI API from Python as follows:

API call

openai.Completion.create(engine='davinci-codex', prompt=input_prompt, best_of=1, temperature=0.01, max_tokens=300)
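As a rough sketch of how the prompt in the 1560A example could be assembled before that call: the statement goes into a C++ block comment, followed by the two preamble lines, and Codex continues from there. The helper name build_prompt and the COMPLETION_PARAMS dictionary are assumptions made for illustration; only the parameter values come from the call above.

```python
# Parameter values copied from the call above; the dict itself is illustrative.
COMPLETION_PARAMS = {
    "engine": "davinci-codex",
    "best_of": 1,
    "temperature": 0.01,
    "max_tokens": 300,
}


def build_prompt(statement: str) -> str:
    """Wrap a plain-text statement in a C++ block comment and append the
    preamble, so the model continues with the body of the program."""
    # Avoid closing the comment early if the statement happens to contain "*/".
    body = statement.strip().replace("*/", "* /")
    return (
        "/*\n"
        "Problem:\n"
        f"{body}\n"
        "\n"
        "*/\n"
        "\n"
        "#include <bits/stdc++.h>\n"
        "using namespace std;\n"
    )


if __name__ == "__main__":
    input_prompt = build_prompt("Output the k-th element of this sequence.")
    print(input_prompt)
    # With API access, the request would then be made as in the call above:
    # openai.Completion.create(prompt=input_prompt, **COMPLETION_PARAMS)
```

The near-zero temperature makes the completion close to deterministic, so the same prompt should give nearly the same code each time.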

Comments (15)

Wind_Eagle

+43

Now cheaters will try to use this to solve D2A :)

hbarp

+30

Ashishgup round div2As : Am I a joke to u ?

lethan3

+33

I wonder, if you ask it to solve A + B but with high bounds, what will the output be?

For example if you give it A, B ≤ 10^18 will it use long longs or stick with int?

dpaleka

It is very hard to use with C++, because I don't know where to put the instructions.

Prompt+code

  /*
   * Problem:
   * Find the sum of two number A and B.
   *
   * Input:
   * Input two positive integer A and B, where A, B <= 10^18. 
   *
   * Output:
   * Print A+B.
   */
// Codeforces

#include <iostream>
using namespace std;

int main() { // <-- prompt ends here
    int a, b;
    cin >> a >> b;
    cout << a + b << endl;
    return 0;
}{

Not only does it get the type wrong, but notice the trailing { too.

Can you think of an equivalent test for Python? I think it may genuinely work better in Python, for now. Or maybe I just haven't figured out how to make it write nice C++ yet.
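For reference, the range arithmetic behind the type issue can be checked in a few lines of Python. This is a sketch that assumes the usual widths on Codeforces compilers (32-bit int, 64-bit long long); the helper name fits is made up for illustration.

```python
INT32_MAX = 2**31 - 1  # assumed maximum of a C++ int (32-bit)
INT64_MAX = 2**63 - 1  # assumed maximum of a C++ long long (64-bit)

BOUND = 10**18         # upper bound on A and B from the prompt


def fits(value: int, max_value: int) -> bool:
    """Check whether value lies in the signed range ending at max_value."""
    return -max_value - 1 <= value <= max_value


if __name__ == "__main__":
    print(fits(BOUND, INT32_MAX))      # False: a single input already overflows int
    print(fits(2 * BOUND, INT64_MAX))  # True: the largest sum fits in long long
```

So int cannot even hold the inputs, while long long is enough for the sum.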

+28

Wow, thank you for trying it out! I guess the AI only focuses on the key details and disregards what it considers unnecessary, such as the bounds here.

For python maybe try the following?

Given an integer N, print out 10^N. Bounds: N ≤ 10^8.

The idea here is that if the AI is smart enough, it will print out trailing zeroes instead of actually calculating the value, which may lead to memory overflow?

Edit: Oops, that only uses about 41 MB. Can anyone think of a similar but better test that can memory overflow, but has a simple solution without brute-force?
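The 41 MB figure can be reproduced with a short estimate: 10^N needs about N·log2(10) bits as a binary integer. The sketch below uses made-up helper names and counts raw bits only; an actual Python int has some extra per-digit overhead, so real usage would be somewhat higher.

```python
import math


def bits_of_power_of_ten(n: int) -> float:
    """Approximate number of bits needed to store 10**n as a binary integer."""
    return n * math.log2(10)


def megabytes(bits: float) -> float:
    """Convert a bit count to megabytes (10**6 bytes)."""
    return bits / 8 / 1e6


if __name__ == "__main__":
    n = 10**8
    print(round(megabytes(bits_of_power_of_ten(n)), 1))  # roughly 41.5
    # Sanity check of the estimate on a size that is cheap to compute exactly.
    print((10**1000).bit_length(), round(bits_of_power_of_ten(1000)))
```

Printing the number in decimal is a separate cost: the output string has N + 1 characters, so about 100 MB for N = 10^8.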

Shy91827364

OpenAI Codex can solve unseen Codeforces problems

Spoiler

WolfBlue

That's amazing! I think there is no need to fear an AI solving significantly harder problems though. I think that making a robot that consistently solves problems at Div 2 D or above is about as hard as making a general intelligence. And if that happens then there are a lot of more important things to think about than cheating on codeforces :P.

Though for Div 3 (and div 2 if there are improvements), clearly the solution is to use only long and hard to comprehend problem statements. Atcoder will need to become more like cf and have random commentary about Baby Ehad's first words sprinkled throughout the problem statement.

We'll see about this. The capabilities of machine learning (and computers in general) are not a subset of human capabilities, nor do humans and computers find the same things difficult.

No one has had a good track record of predicting the order in which fields of human activity get automated away.

LuchkinVyacheslav

+24

Even solving Div2 B seems implausible to me: assuming this model is equivalent to Copilot, it was trained with the goal of remembering a lot of code and then outputting it with minor changes, so that the researchers don't get angry at it for copying, and can make bold claims. This implies that it isn't useful for problems that can't be solved by copying existing code from GitHub.
And I believe all current language models have zero understanding of text (which is why they are often tested on text completion, which is also just copying with minor changes), but small changes in the problem statements can lead to large changes in the solution (and recognizing this obviously requires understanding of the statement).
I think the remark that 'for some reason Codex thinks "end in the digit 3" means "contain the digit 3"' proves my point: it has seen the code for "contain the digit 3" more often than for "end in the digit 3", and can't even see that these things are different. And this was an implementation problem, which is the closest it gets to "translate statement from English to C++".

_winprn

nice blog! btw, can you show me how to submit to codeforces using the command line like you did in the imgur link? thanks in advance.

Use this fork of cf-tool, the original one was made by xalanq but they don't maintain it anymore.

Thank you!

TimeTraveler

+44

Imagine one day it beats tourist like Deep Blue beat Garry Kasparov at chess.

-is-this-fft-

+20

As of 2021, language models are not intended for solving algorithmic puzzles or producing mathematical proofs. It is likely we will have a much more powerful Codeforces solver by say 2026, when the research manages to combine proof search methods with the GPT-3 architecture.

Can you elaborate on the 2026 prediction?

As far as I can see, the problem-solving ("proof searching") part of AI is THE hard part of CP, for both humans and AI. Everything else is just kinda whatever.

Well, it's not so much of a well-founded prediction. If we were on a different site, maybe I would have put the words "hot take" before it.

For a basic proof search prototype, see Generative Language Modeling for Automated Theorem Proving (Polu and Sutskever, 2020) by the same company.

For the reason of my optimism with respect to learning to search in difficult action spaces, see Open-Ended Learning Leads to Generally Capable Agents (DeepMind, 2021).

Want to make a long bet of, say, 10€ on there being a model where you feed it only the problem statement and samples, and it AC's at least 20% of the Div1 A's in 2026? (Conditioned on the texts not being written adversarially. Also, let's give the model say three submits per problem.)