PinkieRabbit's blog

By PinkieRabbit, history, 9 months ago, In English

Hello, Codeforces!

I am one of the moderators of Luogu, the largest competitive programming online judge and online community in China. We hold online contests frequently.
To uphold academic integrity, we run an anti-plagiarism system to catch cheaters after every contest.
Traditionally, all cases of cheating looked like "person A takes person B's code and directly submits it in the contest". But as more and more people learn how to use ChatGPT and other AI language models, the situation is becoming more complex.
Now, after every sufficiently easy contest (we also hold Div. 3 and Div. 4 rounds), the easy problems that ChatGPT is capable of solving receive a number of AI-generated submissions.

I kindly ask other contest-hosting platforms: what is your solution to this problem? How do you deal with these AI-generated submissions? Should we regard them as cheating and disqualify the submitters? If so, is there a possibility of many false positives, where code that participants wrote themselves is misjudged as AI-generated?

Here on Codeforces, does anyone know how Codeforces handles this? Or how AtCoder does, since the first few problems of AtCoder Beginner Contests are easy enough? If you help run another online platform and have some experience, please comment below! Thanks to everyone who discusses constructively!

  • Vote: I like it
  • +167
  • Vote: I do not like it

»
9 months ago, # |
  Vote: I like it +67 Vote: I do not like it

Off-topic: why not make Luogu an international platform just like AtCoder? AtCoder was also only for Japanese people at first, and then it became an international platform.

  • »
    »
    9 months ago, # ^ |
      Vote: I like it -51 Vote: I do not like it

    In fact, people from anywhere can use Luogu at www.luogu.com.cn.

  • »
    »
    9 months ago, # ^ |
      Vote: I like it +77 Vote: I do not like it

    Actually Luogu, as a company, has plans to internationalize, but that depends on our CEO's decision; I'm just a student moderator. Since the community and user base would change completely overseas, the plan is currently suspended. You may not know this, but Luogu's CEO has visited Japan, where he had a friendly conversation with, and asked for advice from, AtCoder's CEO, Takahashi-kun (his handle is chokudai on Codeforces).

»
9 months ago, # |
  Vote: I like it +55 Vote: I do not like it

My opinion: using AI should not disqualify users. Why bother with it? Too much work with false positives. Let's just hope AI does not get good enough for Div. 1/Div. 2 :)

»
9 months ago, # |
  Vote: I like it +15 Vote: I do not like it

PinkieRabbit, why don't you make it available in English?

»
9 months ago, # |
  Vote: I like it +50 Vote: I do not like it

I don't think Codeforces does anything about AI-generated code, because its rules state that generated code is allowed if:

the code is generated using tools that were written and published/distributed before the start of the round.

And ChatGPT is definitely counted as one of those tools.

  • »
    »
    9 months ago, # ^ |
    Rev. 2   Vote: I like it +27 Vote: I do not like it

    That's very convincing; this rule is very tolerant. Though I wonder whether this "pre-ChatGPT" rule should be considered outdated by now?

    Also, on Luogu we have a different rule:

    比赛期间,选手可以使用自己在比赛开始前编写好的代码;禁止使用他人编写的代码,无论这些代码是否在比赛前编写完成

    which means "during the contest, participants may use their own code written before the contest, as long as it was written by themselves; it is prohibited to use code written by others, regardless of whether it was written before the contest or not."

    This rule forbids participants from directly using code templates written by others (tourist's template, for example, unless you, the reader, are tourist himself). Actually, USACO has similar rules. Whether you agree with this rule or not, Codeforces's rule and conclusion are obviously not applicable here.

    • »
      »
      »
      9 months ago, # ^ |
        Vote: I like it +16 Vote: I do not like it

      Even if the rule is out of touch with current reality, there is little you can do about ChatGPT. Tools that predict whether text was AI-generated have existed for half a year, and what? They flag the US Constitution or the Bible as AI-generated. They are very unreliable; any semi-official text is likely to be labeled as AI-generated, just because of how slick ChatGPT is.

      It will probably be worse for code, since there are far fewer individual quirks in programming and they are much more subtle; training a model to distinguish that is a hell of a job.

      • »
        »
        »
        »
        9 months ago, # ^ |
          Vote: I like it +19 Vote: I do not like it

        The situation is that ChatGPT may give very similar code for similar prompts, and that triggers our anti-plagiarism system; that is where it gets subtle. Caught participants may complain that they weren't copying others' code, only ChatGPT's. Though if we interpret the word "others" as including ChatGPT, then those participants can say nothing.

»
9 months ago, # |
Rev. 4   Vote: I like it +9 Vote: I do not like it

It's not a problem. In Div. 3 and Div. 4, most people and AI models can solve A, B, C, but only a real person can solve D, E, F. In my opinion, the problem setters can add a small trick to the problems that makes the AI models unable to solve them.

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

I have tried using ChatGPT to solve Codeforces problems in order to test its power (note: I never directly use ChatGPT to solve contest problems). ChatGPT is not powerful enough to solve even problems A and B of a Div. 3. Sometimes it may solve the question, but the probability is not high. So, as of now, there is no need to worry about ChatGPT.

»
9 months ago, # |
Rev. 2   Vote: I like it +8 Vote: I do not like it

Make problems difficult enough, or include corner cases (or something else), so that ChatGPT won't get AC.

  • »
    »
    9 months ago, # ^ |
      Vote: I like it +14 Vote: I do not like it

    The sole purpose of such rounds is to encourage newcomers to participate in contests.
    I don't see making them harder as a good solution, but maybe that's the best we have...

»
9 months ago, # |
  Vote: I like it +43 Vote: I do not like it

If ChatGPT can solve the problem correctly, then even a monkey can.

  • »
    »
    8 months ago, # ^ |
    Rev. 2   Vote: I like it +23 Vote: I do not like it

    That is not the case. Today a Luogu Monthly Div. 1 + Div. 2 round ended. Problem Div. 2 B is:

    Given a sequence $$$a_1, a_2, \ldots, a_n$$$ of length $$$n$$$ ($$$1 \le n \le {10}^5$$$), consisting of non-negative integers $$$0 \le a_i < 2^{20}$$$.

    If a subsegment $$$a_l \sim a_r$$$ ($$$1 \le l \le r \le n$$$) satisfies $$$\bigoplus_{i = l}^{r} a_i = 0$$$ (that is, the XOR-sum of the subsegment is zero), then it’s called a good subsegment.

    In one operation, you can choose a good subsegment $$$a_l \sim a_r$$$ and reverse it, i.e., the subsegment becomes $$$a_r, a_{r - 1}, \ldots, a_{l + 1}, a_l$$$ after the operation.

    You can perform the operation zero, one, or multiple times. The goal is to maximize the number of good subsegments in the end.

    And ChatGPT solves this problem.

    You can think about the problem for a while. I don't think this problem is easy enough for a monkey to solve.

    • »
      »
      »
      8 months ago, # ^ |
      Rev. 2   Vote: I like it +73 Vote: I do not like it

      I just fed this to ChatGPT. It spat out a bunch of code and a completely nonsensical explanation, and because the problem is a trick question, it just happens to be right...

      Maybe the lesson is not to give trick questions?

      If you are worried, why not just check whether the problem is solvable by ChatGPT before the contest? You can use the API, which they claim is not used as training data and is deleted after 30 days. (Of course, if someone works at OpenAI and knows your account, that's another problem.)
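
      Something along these lines would do; this is a minimal sketch assuming the current official openai Python package and an API key in the environment, and the model name and prompt wording are placeholders. The returned code still has to be compiled and judged against the intended tests by the setters.

```python
# Minimal pre-contest check: send a draft statement to the API and see whether
# the model produces a plausible solution. Assumes the official `openai`
# Python package (>= 1.0) and OPENAI_API_KEY in the environment; the model
# name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

def try_solve(statement: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a competitive programmer. "
                        "Reply with a complete C++ solution only."},
            {"role": "user", "content": statement},
        ],
    )
    # The candidate code must still be compiled and run on the real tests;
    # the API call by itself proves nothing.
    return response.choices[0].message.content

# Example: print(try_solve(open("draft_statement.txt").read()))
```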

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Can a language model do an anti-plagiarism check?

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I am not sure if that is possible with code

    • »
      »
      »
      9 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      How do you think it is done now?

      There even was a contest where participants had to write their own anti-plagiarism check: https://codeforces.com/contest/537

      • »
        »
        »
        »
        9 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Like, when it is creating sentences it is easy to detect, but when someone generates code they can easily change things, so detection wouldn't be accurate. Plus, they can still get the idea from GPT and code it up on their own.

»
8 months ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

PinkieRabbit, would you share which algorithm you implemented in the anti-plagiarism module? I googled but could not find a promising solution. We can't use Jaccard or cosine similarity, as they do not capture sequences, which leads to more false positives. I could only think of LCS on tokenized code submissions (a sketch of that idea is below). How do you deal with false positives? There are thousands of submissions, and there is a high chance that two users think of the same sequence.
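
To be concrete about the LCS-on-tokens idea, here is a minimal sketch I put together for illustration; it is not anyone's actual system, the tokenizer is deliberately naive, and in practice you would first bucket submissions (e.g. by fingerprinting) rather than compare every pair with quadratic LCS.

```python
import re

# Naive tokenizer: identifiers/keywords and numbers are collapsed to generic
# tokens, so renaming variables does not hide copied structure. A real system
# would at least keep language keywords distinct.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")

def tokenize(source: str) -> list[str]:
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")    # identifier or keyword
        elif tok[0].isdigit():
            tokens.append("NUM")   # numeric literal
        else:
            tokens.append(tok)     # operator / punctuation
    return tokens

def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic O(len(a) * len(b)) dynamic programming with a rolling row.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            cur[j] = prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1])
        prev = cur
    return prev[-1]

def similarity(src_a: str, src_b: str) -> float:
    # Dice-style normalization: 1.0 means the token sequences are identical.
    ta, tb = tokenize(src_a), tokenize(src_b)
    if not ta or not tb:
        return 0.0
    return 2 * lcs_length(ta, tb) / (len(ta) + len(tb))

# Example: flag a pair for manual review above some threshold.
# if similarity(code1, code2) > 0.9: ...
```

Even then, the threshold and the manual review are where the real false-positive handling happens, which is exactly what I'm asking about.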

»
8 months ago, # |
  Vote: I like it -13 Vote: I do not like it

I've been contemplating sharing my observations on this topic, and I believe it's time to share them anyway.

I have conducted several experiments using LLMs to assess their performance in various programming contests. Among the models I tested, GPT-4 has consistently stood out, hence I predominantly utilized it. To enhance its capabilities, I integrated an agent on top of GPT-4 that automates the processes of code writing, testing, and evaluation (a rough sketch of such a loop is at the end of this comment). Additionally, I employed prompt engineering to simulate a chain of thought.

The results have been quite astounding. I allowed the model to participate in three contests autonomously, and it achieved impressive performance ratings of +1900 and +1600 in Div. 3 and Div. 4 contests, respectively (it fully solved a Div. 4). I've also observed that the manner in which the problem statement is presented to the model significantly impacts its performance. On occasion, minor interventions are required to guide the model to the correct solution.

For those interested in replicating my findings, I recommend using ChatGPT Plus combined with a code interpreter. The outcomes should be commendable. As a side note, my current configuration of the model has a rating of +2000 on LeetCode and even achieved a +2400 performance rating in one of the rounds.

I'm sharing this to highlight the potential of such models. While many believe these models don't make a significant difference, I speculate that an advanced model beyond GPT-4 could potentially solve even Div1 problems.
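
For the curious, here is a rough sketch of what I mean by a write/test/retry loop. It is a simplified illustration rather than my exact setup; the helper names, prompts, and retry count are placeholders, and `ask_model` stands in for whatever LLM API is being called.

```python
# Simplified generate -> test -> retry loop; all details here (helpers,
# prompts, retry count) are placeholders for illustration only.
import subprocess
import tempfile

MAX_ATTEMPTS = 5

def run_on_samples(code: str, samples: list[tuple[str, str]]) -> str | None:
    """Run `code` (Python) against (input, expected_output) samples.
    Returns None on success, otherwise a description of the first failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    for inp, expected in samples:
        try:
            result = subprocess.run(["python3", path], input=inp,
                                    capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return f"Time limit exceeded on input:\n{inp}"
        if result.returncode != 0:
            return f"Runtime error:\n{result.stderr}"
        if result.stdout.strip() != expected.strip():
            return (f"Wrong answer on input:\n{inp}\n"
                    f"expected {expected!r}, got {result.stdout!r}")
    return None

def solve_with_agent(statement: str, samples, ask_model) -> str | None:
    """`ask_model(prompt) -> code` wraps whatever LLM API is used."""
    prompt = ("Solve this problem in Python. Think step by step, "
              "then output only the code.\n\n" + statement)
    for _ in range(MAX_ATTEMPTS):
        code = ask_model(prompt)
        failure = run_on_samples(code, samples)
        if failure is None:
            return code  # all samples pass; a human still submits it
        # Feed the failure back so the next attempt can try to fix it.
        prompt += f"\n\nYour previous attempt failed:\n{failure}\nPlease fix it."
    return None
```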

  • »
    »
    8 months ago, # ^ |
      Vote: I like it +16 Vote: I do not like it

    Congrats, this is the best troll I have read in weeks :)

    • »
      »
      »
      8 months ago, # ^ |
        Vote: I like it +5 Vote: I do not like it

      This is one of the reasons I didn't want to post such a comment. I'm not here to argue, but anyway, I will record a screencast of GPT-4 solving a Div. 3 or Div. 4 contest, share it publicly, and we'll see whether it's a troll.

      • »
        »
        »
        »
        8 months ago, # ^ |
          Vote: I like it +44 Vote: I do not like it

        How long does it take to solve before there is no improvement? A couple of minutes at most? If you can show a video of GPT-4 autonomously fully solving a Div. 3/4 in a live contest within 15 minutes (and not "I solved the problem in my brain as a human and now I'm hand-holding it through writing the code"), I'll gift you 20 bucks.

        • »
          »
          »
          »
          »
          8 months ago, # ^ |
            Vote: I like it +5 Vote: I do not like it

          I don't recall claiming that it can solve the contest in 15 minutes. Are you now challenging whether it can solve the problem, or are you questioning the speed at which it can solve the problem? Either way, at least the last time, it managed to ace a Div4 contest in around an hour.

          • »
            »
            »
            »
            »
            »
            8 months ago, # ^ |
            Rev. 2   Vote: I like it +34 Vote: I do not like it

            Ok, then just show me it autonomously fully solving a live contest with no time limit and I'll gift you 20 bucks. Also claim your 1000 free citations and international fame.

  • »
    »
    8 months ago, # ^ |
      Vote: I like it +25 Vote: I do not like it

    proof or it didn't happen

    No, really. You see OpenAI themselves saying that GPT-4 got a whopping 400 rating on Codeforces, and now someone comes claiming it gets to 1600+ rating? I don't (necessarily) accuse you of lying, but such bold claims must be supported by equally robust evidence.

    • »
      »
      »
      8 months ago, # ^ |
        Vote: I like it +16 Vote: I do not like it

      I agree that these are bold claims that require proof, but I don't think OpenAI is interested in or investing time in making their model more optimal for programming contests. As for the 400 rating, I'm sure it's based on the raw output from OpenAI's API. If you understand my comment, you'll see that I'm not claiming it always provides the correct solution on the first try. However, in many cases, the way the user feeds the model samples and prompts helps the model debug or fix the code. In any case, I will post the proof as soon as I can.

      • »
        »
        »
        »
        8 months ago, # ^ |
          Vote: I like it +13 Vote: I do not like it

        I admit that I was too quick to judge. Though, it still sounds to me as too good (or too bad?) to be true, so I'm eager to see your results in more detail.

  • »
    »
    8 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Ok... I don’t think this is reliable...

    My main concern is that ChatGPT may give different participants similar code, which will be caught by the system. In that case, should we accept participants' appeals claiming that they weren't cheating and were just using AI instead?

»
8 months ago, # |
  Vote: I like it 0 Vote: I do not like it

ChatGPT is definitely not strong enough to solve most of the problems. It can review your code to check for potential errors, but even then it can't fix the problem 100% of the time. In reality, I don't think there's any possible way to tell whether code is AI-generated, since people can always take the code and edit it to make it look more human afterwards (correct me if I'm wrong). The only way to be sure someone is cheating is to view their screen, but I'm not sure that's possible at such a large scale.

»
8 months ago, # |
  Vote: I like it 0 Vote: I do not like it

If I know how to solve the problem but I don't know how to implement it, that is where ChatGPT comes in.

»
8 months ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

PinkieRabbit, there is a YouTuber who uploads solutions as YouTube Shorts during contests. There are many cheaters who take help from them. Can't we stop them?

»
8 months ago, # |
  Vote: I like it +3 Vote: I do not like it

I guess we could use an AI model to compare a contestant's submission with how he/she writes code in their previous submissions, including all the previous contests they have participated in.

»
8 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Maybe it is hard to identify ChatGPT. But I think ChatGPT can't solve the problems now, so maybe we don't need to pay much attention to it. By the way, the people who use ChatGPT to solve problems are often elementary school kids? If they can't use it to solve the problem, we don't need to deal with them. But now many people use ChatGPT to write solutions! I see Luogu can identify and punish them. PS: When will the streamer broadcast Genshin Impact live? ()

»
8 months ago, # |
  Vote: I like it 0 Vote: I do not like it

ChatGPT can solve some problems that most people can solve, but I think in the future it will be able to solve difficult problems that need advanced techniques, so we must solve these issues before that happens.

»
8 months ago, # |
Rev. 2   Vote: I like it +5 Vote: I do not like it

My opinion is that using language models to generate code is not cheating. They are legitimate tools for code generation, even if their performance is lacking at the moment. All they do is lower the "skill floor" of programming.

If you think about it, higher-level languages like Python already do that: you can write powerful code without thinking about memory allocation, integer overflow, or ...

They can be used just as well in contests as in real-world problems outside contests.