purplesyringa's blog

By purplesyringa, history, 21 month(s) ago, In English

...or "How legacy systems fight obstacles no sane person would introduce in the first place".

Hello, Codeforces!

I am sure that those of you who have tried to host your own programming competitions or write your own utilities for contest management are aware of the state of affairs regarding contest hosting. There are lots of incompatible formats, no one knows exactly how stuff ought to work, Polygon is full of nondescript and seemingly contradictory options, ej-polygon never seems to work quite correctly, modifications have to be applied to jury archives to make them work on your particular installation, non-trivial problem types require constant maintenance and hacks, ad infinitum.

Personally, I stumbled upon all these problems when I tried to get my own judge system working and suddenly discovered that the difficult part is not judging complex problems, but handling artificial complexity introduced by legacy software and standards. So here I am, attempting to solve these problems.


Firstly, I introduce a new format, which I shall call the problem.xml format; as is perhaps obvious, it is based on the Polygon format. I introduced one or two special cases here and there to ensure it's 99% compatible with the archives Polygon generates presently. Unlike Polygon's format, it is completely documented, and as little leeway as possible is allowed without compromising efficiency.

This format enables almost arbitrary problem types in addition to the default ones: input/output, interactive, run-twice, and problems with graders. For example:

  • Problems with custom scorers (better known as valuers to Ejudge adopters) are supported. This means that the score of a solution is not necessarily the sum of the points of each test; rather, any relation is possible, up to negative scores, efficiency-rated programs, and "write a program that prints your username". (A toy scorer sketch follows this list.)

  • What I shall call formulaic problems, where the user solution outputs a formula or a universal algorithm that is then executed by the judge.

  • Optional compilation on each test, which would come in handy for some practical development competitions.

  • Output-only problems, meaning that the user is asked to submit a ZIP archive that contains the answers to each test.

  • (Optional support) Arbitrary strategies, which allow problemsetters to generalize the above as they see fit: run-thrice problems, compile-time testing, and CTF-like competitions are within reach with just a few lines of code.

  • Arbitration, allowing for marathon problems, which means that the points awarded to a submission may depend on the results of other submissions.
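
To give a flavour of the custom scorer idea mentioned above, here is a toy sketch. The actual scorer interface is defined in the specification; the input format and function below are invented purely for illustration.

import sys

# Hypothetical custom scorer: per-test results go in, a single score comes out,
# and nothing forces the total to be a plain sum of per-test points.
def score(results):
    """results: list of (verdict, per_test_points) pairs, one per test."""
    total = 0.0
    for verdict, points in results:
        if verdict == "OK":
            total += points
        elif verdict == "WA":
            total -= 1.0          # negative scoring is allowed
    return max(total, -10.0)      # e.g. clamp the total penalty

if __name__ == "__main__":
    # One line per test on stdin: "<verdict> <points>", e.g. "OK 5".
    parsed = []
    for line in sys.stdin:
        verdict, points = line.split()
        parsed.append((verdict, float(points)))
    print(score(parsed))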

For existing platforms that support the Polygon format or something similar, only a few modifications will be necessary to get everything but strategies and arbitration working: I am hoping that major vendors are up to the task. Strategy support is somewhat tricky to implement, so that part is up in the air for the moment (not that this diminishes the rest of the work). Arbitration should theoretically be easy to get working, but might require some modifications to the software architecture.

The draft of the specification is available here: https://github.com/imachug/problem-xml-specs. While I think that the spec is mostly finished and I'm publishing it here for better visibility, I'd be glad to listen to your input and apply changes where necessary!


Tagging some people for better visibility—your input is highly appreciated: geranazavr555, MikeMirzayanov, grphil, andrewzta, dkirienko.



»
21 month(s) ago, # |

So cool! I hope the corresponding changes to the Codeforces and Polygon systems will be made as soon as possible.

»
21 month(s) ago, # |

I, playerr17, support purplesyringa and his problem.xml format. I'm certain it will be useful for authors and problemsetters.

Imho this specification is easier to understand than the Polygon format. Polygon has an unintuitive and confusing interface, and the same goes for its standards.

Good luck with your new judge system :)

  • »
    »
    21 month(s) ago, # ^ |

    P.S. I don't hate Polygon. Polygon is a great system that has no alternatives and makes problemsetting much easier. I personally use Polygon, but I think it will be even better with this specification.

»
21 month(s) ago, # |

In fact, the motivation for this specification is not very clear. As far as I know, among the testing systems used in the Russian-speaking part of the community, only Codeforces supports the Polygon package format (and the current problem.xml in particular) most fully; the rest support either some subset of it or something entirely custom (like, for example, Yandex).

Accordingly, even if the new format is made fully backward compatible with the Polygon one, this, unfortunately, will not save you from kilometers of crutches and legacy code.

As for what happens in the non-Russian-speaking part of the community, for example at AtCoder, CodeChef, Leetcode, etc., we can only guess. It would be great if people familiar with how problems are prepared for the listed or other foreign platforms chimed in. After all, maybe they don't use Polygon at all during preparation and have their own, perhaps more successful, format.

I read the document itself, and IMHO there is a global conceptual flaw: I think it is impossible to leave writing the strategy to problemsetters, because in the general case, as far as I understand, the strategy is an almost arbitrary program; moreover, it has elevated rights during execution (because it needs to be able to run other programs). What if the strategy hits TL? ML? Tries to break the system?..

  • »
    »
    21 month(s) ago, # ^ |

    To me, it was not very clear why this was not done before. I can only see the lack of a single documented format and lots of forks and interpretations of the existing one as an utter failure of the competitive programming community; and the lack of action towards putting right what once went wrong is astounding. I use a lot of software (though mostly FOSS), and I can't name one other popular format that is both undocumented (in any way, even unofficially) and used by many vendors.

    People use Polygon: Codeforces uses it, almost everyone in the Russian CP world uses it, converters are written for every other judge, and there's no alternative with better features. If it looks like an industry standard, is used like an industry standard, and is regarded as an industry standard, then there is no reason for it not to be an industry standard.

    Kilometers of crutches and legacy code already exist; there's little we can do about it. We can, however, work towards decreasing the rate at which the pile grows. Introducing run-twice problems was a major change, introducing graders was a major change, and I have lots of ideas in mind that would be considered major changes too. And when I say "major change", I mean rewiring half the architecture. I looked at all of this and encapsulated it until no other idea I could think of would require further changes. I'm advising people to spend some time now to save themselves time later.

    It would be great if people familiar with how problems are prepared for the listed or other foreign platforms chimed in. After all, maybe they don't use Polygon at all during preparation and have their own, perhaps more successful, format.

    A friend told me CodeChef has a rather dumb Polygon analogue, little more than input and output files and a checker. Can't say much about the other platforms, but given that I haven't seen any non-standard problems on AtCoder, I don't think it's much different elsewhere. CMS, according to the examples on GitHub, seems to only support a few hard-coded types of problems, nothing Polygon doesn't. I'm interested in how Leetcode works, though: it seems to use graders that work for multiple languages, and while I dreamt of that too, I deemed it too non-trivial for this iteration.

    I read the document itself, and IMHO there is a global conceptual flaw: I think it is impossible to leave writing the strategy to problemsetters, because in the general case, as far as I understand, the strategy is an almost arbitrary program; moreover, it has elevated rights during execution (because it needs to be able to run other programs). What if the strategy hits TL? ML? Tries to break the system?..

    Of course it is possible to make it work. All you need to do is run the strategy in a sandbox, like an interactive problem, and pass commands via the standard input and output streams. I think I put it like this somewhere but can't find it now. Anyway, I don't see any security problems: a strategy merely tells the judge what to run and expects to be able to read and modify files generated while the submission is judged, but nothing else. And DoS isn't a problem either: even an interpreted language doesn't need more than a few seconds of CPU time and tens of megabytes of memory to trigger evaluation of thousands of tests, or to await and parse the results, and await asyncio.sleep(1000000) can be handled by disallowing the strategy to run idly for more than a few milliseconds when no other programs are running.
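
    To sketch what I mean (the line-based JSON protocol below is made up for illustration; the spec does not fix a wire format), the judge talks to the sandboxed strategy only through its standard streams, so the strategy needs no privileges of its own:

    import json
    import subprocess

    # Toy judge-side loop: the sandboxed strategy prints one JSON request per
    # line and reads one JSON response per line. Nothing here is from the spec;
    # it only illustrates that stdin/stdout is enough to drive the judging.
    def run_strategy(strategy_cmd):
        proc = subprocess.Popen(
            strategy_cmd,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )  # a real judge would wrap this in a sandbox with CPU/RSS limits
        for line in proc.stdout:          # requires the strategy to flush its output
            request = json.loads(line)
            if request.get("action") == "invoke":
                # The judge decides where and how to actually run the program;
                # the strategy only refers to it by name.
                response = {"verdict": "OK", "time_ms": 42}   # placeholder result
            else:
                response = {"error": "unknown action"}
            proc.stdin.write(json.dumps(response) + "\n")
            proc.stdin.flush()
        return proc.wait()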

  • »
    »
    21 month(s) ago, # ^ |

    On CodeChef you have to prepare tasks on Polygon. When the coordinator (or someone in a similar role) accepts your work, you transfer the tests to the CodeChef judge. In the judge, you can only add a test, add a checker, add the correct output for a test, and tweak a few other settings. At least, that's how it was about six months ago.

»
21 month(s) ago, # |

While I think that the spec is mostly finished and I'm publishing it here for better visibility, I'd be glad to listen to your input and apply changes where necessary!

So, if I understand correctly, you consider this specification (almost) "production-ready". But, in fact, I think it lacks some important points (which, I suspect, would in turn require rewriting at least the whole DSL part).

I agree with geranazavr555 about the impossibility of implementing strategies in the suggested way.

At the very least, I didn't see a word about what judging reports should look like (both for the jury and for the participant): which files exactly are taken into account and should be saved for future use? E.g., suppose we have a run-twice problem. Does the jury really need all 8 "result" files as output? And what about the contents of pipes? I don't think a Linux pipe by itself is enough to record everything that passes through it, which, in turn, may be more useful for examining the strategy than something like the first interactor's stderr.

The provided idea of Python strategy code (with, unfortunately, no formal spec to examine) isn't clear about when execution of interactive problems starts: there is just no await of anything. It seems we should use something like asyncio.gather, with some undefined behaviour on edge cases: e.g., events like "one process sends data and another one quits" may lead to different results depending on which process is the interactor and which is the solution.

Also, as already mentioned, some judges (e.g. Yandex Contest) have features which are not covered by the spec. I'm almost sure nobody will wait for some committee to approve changes to the standard before implementing a needed feature. And in some time we'll be back to what we have now: several judges doing the same thing in different ways. While it's possible to standardize such features later, it may be impossible to convert all "old-style" problems to a new format.

By the way, this spec is too restrictive. Just a few examples: you write that you restrict file-only I/O just because you personally think that's good. It may be fine for a regular Codeforces round, but not for educational contests: I use judges to teach programming, and I would be very disappointed if one day all judges broke my lab on working with files because some competitive programming enthusiast invented a spec which disallows that. Or you have strict naming conventions for file/class names, which seem absolutely non-user-friendly and excessive. By the way, how would the file be named if the user types the code into a field right in the judge interface?

OK, let's move further. Ejudge already has (far more simplified) "strategies", as described here (Russian only). While I don't think this design is great, it requires a "proxy strategy" to be implemented, and there is no obvious way to do that except to save such a proxy binary in each problem package (as testlib.h is now in some judges).

Another issue with the whole idea is that while you change the current problem.xml format so much that Polygon (as the main source of problems for ALL judges) still has to be rewritten to support all possible problems, you keep some strange decisions with no objective other than "compatibility": XML as the file format and some ad-hoc solutions (like putting "run-twice" in tags, or having both testsets and groups at the same time). For me, XML is a dealbreaker, and the most I'll be able to provide is yet another converter from a Polygon archive to our internal package format.

So, while the general idea seems very nice to me (honestly, I tried to do something like your idea, not as a public standard but for one judge), right now it seems far from ready, and even after that, I don't believe judges will have the motivation to implement it.

P.S. I wouldn't be so quick to advise everyone to run code untrusted by the judge at ring 2 :)

  • »
    »
    21 month(s) ago, # ^ |

    Firstly, thanks for the reply!

    At the very least, I didn't see a word about what judging reports should look like (both for the jury and for the participant): which files exactly are taken into account and should be saved for future use?

    I'm a proponent of data preservation: if you can generate stuff and have space to save it, do it. So, yes, if you have a run-twice problem, save all the generated files; they may come in useful when you rejudge a deterministic submission after the competition, get a different result, and have no idea what went wrong. For the participant, standard rules apply: show them the verdict and metrics during the contest and everything else afterwards. I think this is mentioned in 6. Valuation.

    And what about the contents of pipes? I don't think a Linux pipe by itself is enough to record everything that passes through it, which, in turn, may be more useful for examining the strategy than something like the first interactor's stderr.

    I left that implementation-defined as I'm not sure it's always feasible. I think that if you split the FIFO into two pipes and have an intermediate process that reads from one end and writes to the other while logging the data, or simply inject the logger into libc's read and write functions on the interactor's end, it would work.
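
    A minimal sketch of the first option (not part of the spec): a small relay process sits between the two programs and copies bytes across while appending them to a log file.

    import os
    import sys

    # relay.py <log file>: copy stdin to stdout, logging every byte on the way.
    # Instead of connecting the interactor and the solution with one pipe, the
    # judge connects each of them to this process.
    def relay(src_fd, dst_fd, log_path, bufsize=65536):
        with open(log_path, "ab") as log:
            while True:
                chunk = os.read(src_fd, bufsize)
                if not chunk:            # the writer closed its end
                    break
                os.write(dst_fd, chunk)
                log.write(chunk)
        os.close(dst_fd)                 # propagate EOF to the reader

    if __name__ == "__main__":
        relay(sys.stdin.fileno(), sys.stdout.fileno(), sys.argv[1])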

    The provided idea of Python strategy code (with, unfortunately, no formal spec to examine)

    I'm not quite sure what you'd expect from a formal spec. Perhaps the most complete spec would be a sample implementation; I'm working on releasing that now.

    isn't clear about when execution of interactive problems starts: there is just no await of anything

    1. Pipelines says this:

    ...Invocation is performed even if the future is not awaited.

    So when you invoke a program, it starts whenever the judge decides, which is, if you care about efficiency, either right away, or after the previous test finishes, or perhaps when a new invoker is available, or something.

    Of course, you could just prefix the invocation with await, but that would make the tests run consecutively, independently of the judge's parallelization features; or you could spawn a coroutine per test and use asyncio.gather, but that would just be more code that means the same thing.
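
    For illustration only (invoke() below is a toy stand-in for the spec's program invocation and just sleeps), here is what those two alternatives look like under plain asyncio semantics:

    import asyncio

    async def invoke(test):
        await asyncio.sleep(0.1)          # pretend to run the solution on this test
        return f"OK on test {test}"

    async def sequential(tests):
        # awaiting each invocation serializes the tests
        return [await invoke(t) for t in tests]

    async def concurrent(tests):
        # create one coroutine per test and let gather run them concurrently
        return await asyncio.gather(*(invoke(t) for t in tests))

    if __name__ == "__main__":
        print(asyncio.run(sequential(range(3))))   # ~0.3 s: one test after another
        print(asyncio.run(concurrent(range(3))))   # ~0.1 s: tests overlap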

    It seems we should use something like asyncio.gather, with some undefined behaviour on edge cases: e.g., events like "one process sends data and another one quits" may lead to different results depending on which process is the interactor and which is the solution.

    Yeah, some clarifications are certainly in order. Let's say that CF overrides everything else, and otherwise all failure verdicts of the user process override the rest. So if the user program just exits successfully and the interactor receives nothing on the pipe, it returns PE or WA, and that's alright. If the user program crashes and the interactor receives nothing, it's still PE or WA, but RE overrides WA. And if the interactor crashes and then the user program fails, you get CF, which is what you want.
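
    Formalized as a toy priority table (the verdict names are placeholders, not taken from the spec), the rule above is simply "highest priority wins":

    # CF (a checker/interactor failure) beats everything, user-process failures
    # beat checker verdicts, and OK loses to any failure.
    PRIORITY = {"CF": 3, "RE": 2, "TL": 2, "ML": 2, "WA": 1, "PE": 1, "OK": 0}

    def combine(verdicts):
        """Pick the verdict that should be reported for a single test."""
        return max(verdicts, key=lambda v: PRIORITY[v])

    assert combine(["WA", "RE"]) == "RE"   # user crash overrides the checker's WA
    assert combine(["RE", "CF"]) == "CF"   # interactor crash overrides everything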

    Also, as already mentioned, some judges (e.g. Yandex Contest) have features which are not covered by the spec.

    That's why I'm posting this here instead of going ahead and implementing yet another format immediately: to hear more about these features. So far I haven't seen anyone present any examples.

    I'm almost sure nobody will wait for some committee to approve changes to the standard before implementing a needed feature. And in some time we'll be back to what we have now: several judges doing the same thing in different ways. While it's possible to standardize such features later, it may be impossible to convert all "old-style" problems to a new format.

    I have attempted to encapsulate everything related to programming here. You can compile programs however you want; you can run programs in any order, with any arguments, in any structure. The only fixed details left are the existence of tests and the fact that the user submits a file. If you ever get tired of tests, just add a single empty test and do everything in the strategy; at least it'll work in other judges then. If you ever need the user to submit a directory or a text string, simulate that with an Archive and a plain-text file. What more do you need?

    By the way, this spec is too restrictive. Just a few examples: you write that you restrict file-only I/O just because you personally think that's good.

    I use judges to teach programming, and I would be very disappointed if one day all judges broke my lab on working with files because some competitive programming enthusiast invented a spec which disallows that.

    Don't put words in my mouth. Where did you find this?

    Or you have strict naming conventions for file/class names, which seem absolutely non-user-friendly and excessive.

    Mhm, alright, that was certainly unnecessary, I legit forgot about that part. I'll rewrite that chunk to allow arbitrary names.

    By the way, how would the file be named if the user types the code into a field right in the judge interface?

    I don't know, it's your judge, choose the name yourself? Any name works for me if the compiler likes it.

    OK, let's move further. Ejudge already has (far more simplified) "strategies", as described here (Russian only).

    Unfortunately you seemingly inserted the URL as an image instead of a link so I can't read it.

    Another issue with the whole idea is that while you change the current problem.xml format so much that Polygon (as the main source of problems for ALL judges) still has to be rewritten to support all possible problems,

    Polygon will have to be rewritten anyway if it's to support any format other than what it does now, so that's a weird argument. And even then this seems wrong: Polygon generates packages but doesn't parse them, and Polygon-generated archives are compatible. So modifications are only required if features are to be added.

    you keep some strange decisions with no objective other than "compatibility": XML as the file format ... For me, XML is a dealbreaker, and the most I'll be able to provide is yet another converter from a Polygon archive to our internal package format.

    I mean, I'm not a fan of XML either, and it's hard to find anyone who thinks it's any good, but are you proposing to replace it with a modern alternative so that whoever implements the format has to support both XML and the other alternative?

    and some ad-hoc solutions (like putting “run-twice” in tags,

    That's not my idea; ask ejudge, or perhaps the ROI organizers: that's what I found in their run-twice packages.

    or having both testsets and groups at the same time).

    They're different, though. Perhaps it'd be better to have a single testset and add an offline option to groups, but is there much difference?

    So, while the general idea seems very nice to me (honestly, I tried to do something like your idea, not as a public standard but for one judge), right now it seems far from ready,

    ...which nevertheless doesn't mean I can't try.

    and even after that, I don't believe judges will have the motivation to implement it.

    Maybe legacy systems won't rewrite what already works, but that doesn't mean they won't consult it when they decide on new features, or that modern judges won't appear. I prefer to build a system once, make it work, and use it till the end of its life, and I can't be the only one.

    P.S. I wouldn't be so quick to advise everyone to run code untrusted by the judge at ring 2 :)

    I mean, no major OS supports that without kernel patches, and those who can write a patch know it's not secure to just patch the LDT. Between you and me, though, do you see any problem with this if segment limits are set to disallow access to kernel memory?

    • »
      »
      »
      21 month(s) ago, # ^ |

      Mhm, alright, that was certainly unnecessary, I legit forgot about that part. I'll rewrite that chunk to allow arbitrary names.

      Wait though. I mentioned fixed names in very few places, and only when graders are used. Graders have to access the user code somehow, and doing that without fixed module names is difficult.

      To give one example, C#: the grader needs to access the user's code somehow, it has to use some fixed class name, and I hope you're not thinking about reflection. In this case, it is merely recommended that the user code use Solution (although I might remove this), but there's also a note that says an alias should be added if the user disregarded the rule. And I have to use a fixed name, because if there are several classes, how do you guess which one to use? What could be less restrictive than that?
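
      The same issue shows up with graders in any language. A toy Python sketch (the module and function names are invented for the example): the grader has to import the contestant's code from somewhere, and without a fixed module name there is nothing to import.

      # grader.py: relies on the submission being saved under a fixed module name.
      # If the file could be called anything, the grader would have to guess which
      # module to import, which is exactly the problem described above.
      from solution import solve   # assumes the submission is stored as solution.py

      def main():
          n = int(input())
          print(solve(n))

      if __name__ == "__main__":
          main()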

    • »
      »
      »
      21 month(s) ago, # ^ |

      Oh, really, I missed some points from the spec.

      I'm a proponent of data preservation: if you can generate stuff and have space to save it, do it.

      And that's generally not the case (I think that's the reason Codeforces will never show the whole participant output for tests). But, for instance, I'd prefer to save the whole checker output (I know cases where it's more than 1 KB and it's important), and not the participant's. It seems hard for me right now to determine the "more important" files just from the given strategies.

      And I don't want to give problemsetters the right to choose that themselves, because my drives are not infinite, and so I would need to implement resource quotas and so on.

      For the participant, standard rules apply: show them the verdict and metrics during the contest and everything else afterwards.

      I want it to be more customizable. Like showing some verdicts during the contest (like how it's done at ROI), or showing the participant the checker output (when it comes to competitive programming, we may give some solution metrics there).

      I'm not quite sure what you'd expect from a formal spec.

      Something like “The judge should run the strategy in an environment with a module which provides the following functions:

      async def user(stdin: Optional[FileLike], stdout: Optional[FileLike]) -> None
      

      ” (here, of course, the signature is arbitrary, but it doesn't matter for the example)

      ...Invocation is performed even if the future is not awaited.

      OK, I still don't understand: is the strategy some custom Python-like DSL, or is it just a script in Python? Because in the latter case, I don't know how to validate strategies at all (because what we're facing is the halting problem).

      The lack of awaits makes it unclear when execution should start. So, we can produce deadlocks:

      f, g = File(), File()
      p = Pipe()
      user(stdin=p, stdout=f)
      checker(test.input, f, test.answer)
      user(stdin=f, stdout=p)
      

      I don't know who would write such a thing, but I also don't know how to run it when we have a Turing-complete language. This concrete example can be caught, but we can make it way more complex.

      So far I haven't seen anyone present any examples.

      E.g., in Yandex Contest a run can have several scores. This is used at Yandex Cup, when we have several scoreboards at once. And I'll reply here out of order, because these things are connected:

      They're different, though. Perhaps it'd be better to have a single testset and add an offline option to groups, but is there much difference?

      Yes, but still, offline is bad :) The only reason for judges to have offline groups be really "offline" is a lack of computational resources to evaluate all submissions right when they arrive.

      However, if we do have the resources, there is no reason to postpone the judging process on such tests. So now we have several scores (one visible to the participant and one visible to the jury) instead of re-testing everything.

      What more do you need?

      Oh, now I understand what seemed strange to me at first: the defaults. There is a default for run-twice, but no default for a plain text string as the answer :) But OK, that's a weird issue.

      Don't put words in my mouth. Where did you find this?

      Right in the part about files:

      Note that in this example, we allow the user to read from both stdin and input.txt, and write to both stdout and output.txt (though not at once unless buffering is disabled). The author believes this is more correct than only supporting file I/O.

      Unfortunately you seemingly inserted the URL as an image instead of a link so I can't read it.

      Fixed now.

      so that whoever implements the format has to support both XML and the other alternative

      That whoever just doesn't need to support XML :)

      That's not my idea; ask ejudge, or perhaps the ROI organizers: that's what I found in their run-twice packages.

      I think it's an invention of the Polygon owners :) (and I still think it's not too late to drop it)

      I mean, no major OS supports that without kernel patches, and those who can write a patch know it's not secure to just patch the LDT. Between you and me, though, do you see any problem with this if segment limits are set to disallow access to kernel memory?

      Can't think of any problem now, but I wouldn't be so sure about the part “those who can write a patch know” :)

      Wait though. I mentioned fixed names in very few places, and only when graders are used. Graders have to access the user code somehow, and doing that without fixed module names is difficult.

      I see the spec changed a bit today, but I read it yesterday :) Then it's OK if naming is only used for graders.

      Before the edits, it was not only for graders, e.g.:

      The user submission MUST have file name solution.cl.

      • »
        »
        »
        »
        21 month(s) ago, # ^ |

        But, for instance, I'd prefer to save the whole checker output (I know cases where it's more than 1 KB and it's important), and not the participant's. It seems hard for me right now to determine the "more important" files just from the given strategies.

        If you don't want the problemsetter to list what to save, then where do you store this information? Supposedly it's not stored in the problem package, but then why can't you store a boolean map program name (e.g. checker/interactor/user) => save_all_data (bool) in your database?

        I want it to be more customizable. Like showing some verdicts during the contest (like how it's done at ROI), or showing the participant the checker output (when it comes to competitive programming, we may give some solution metrics there).

        I think the feedback policy ought to support that. It supports both ICPC-style and all-tests feedback a la IOI and ROI. Perhaps the only required modification would be to add a flag to show or hide the checker comment. Would that work for you? (At the moment, on Codeforces this is done using a meta-tag hack.)

        Something like “The judge should run the strategy in an environment with a module which provides the following functions:

        So basically a signature in Python instead of English. That's going to be easier.

        OK, I still don't understand: is the strategy some custom Python-like DSL, or is it just a script in Python?

        The latter.

        Because in the latter case, I don't know how to validate strategies at all (because what we're facing is the halting problem).

        Do you need to? We face the halting problem each day, and we solve it with time and memory limits. As I stated elsewhere:

        even an interpreted language doesn't need more than a few seconds of CPU time and tens of megabytes of memory to trigger evaluation of thousands of tests, or to await and parse the results, and await asyncio.sleep(1000000) can be handled by disallowing the strategy to run idly for more than a few milliseconds when no other programs are running.

        The lack of awaits makes it unclear when execution should start. So, we can produce deadlocks:

        It says in Pipelines that if a file that is currently being written to is passed to a program, execution is delayed until the process writing to that file finishes (which also implies two programs can't write to the same file at once, but that's reasonable). So, in your example:

        f, g = File(), File()
        p = Pipe()
        # f is not being written to, so first invocation of user starts right away
        user(stdin=p, stdout=f)
        # f is being written to, so the checker is run after the first user program
        checker(test.input, f, test.answer)
        # f is being written to, so second invocation of user starts after the first user program
        user(stdin=f, stdout=p)
        

        So, basically, when the first program starts, there's nothing attached to the write end of the pipe and so it's supposed to terminate immediately, perhaps with RE. This might not be the best resolution, but perhaps a warning should be emitted if a pipe is read from but never written to.

        As a side note, it seems to me that the linearity requirement (delay if anything is writing to the file when the function is called) guarantees termination.

        E.g., in Yandex Contest a run can have several scores. This is used at Yandex Cup, when we have several scoreboards at once.

        It would perhaps be more reasonable to have a verdict that says the solution is alright and stores arbitrary data which is later passed as-is to a scorer and back to the judge. The checker would output a string (the several scores in your case), and then the scorer would parse the strings of each test and merge them.

        To play the devil's advocate: this is exactly what the verdict comment is about: a comment can be attached to the PT (or even OK) test verdict, which the scorer will pick up (damn, I forgot about that one though, will add another scorer type), and then return back to the judge.
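
        As a toy illustration of this (the comment format and the function below are invented for the example; the spec only guarantees that the comment string reaches the scorer unchanged): the checker attaches something like "board_a=3 board_b=7" to each test's verdict, and the scorer sums the components per scoreboard.

        from collections import defaultdict

        # Toy sketch of passing several scores through per-test verdict comments.
        def merge_scores(tests):
            """tests: list of (verdict, comment) pairs, one per test."""
            totals = defaultdict(float)
            for verdict, comment in tests:
                if verdict not in ("OK", "PT"):
                    continue
                for part in comment.split():
                    name, value = part.split("=")
                    totals[name] += float(value)
            return dict(totals)

        print(merge_scores([("OK", "board_a=3 board_b=7"),
                            ("PT", "board_a=1 board_b=0"),
                            ("WA", "")]))
        # -> {'board_a': 4.0, 'board_b': 7.0}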

        Yes, but still, offline is bad :) The only reason for judges to have offline groups be really "offline" is a lack of computational resources to evaluate all submissions right when they arrive.

        I believe that's how it originated, and the idea is seeing less traction now, but I'd argue that many participants perceive it as part of the challenge and use different strategies when solving problems with offline tests. The nagging doubt in the back of your mind that says your code might just fail stops you from trying things unless you're sure they work, shifting the competition toward a more theoretical end. And then there's MOSH with its own complexities.

        However, if we do have the resources, there is no reason to postpone the judging process on such tests.

        Sure, provided you have the resources. Although that's the common case, we see Codeforces struggle with it despite constant maintenance, we often have offline tests on heavy tasks at the Open Olympiad, some places physically don't have Internet access (say, to AWS), and then there's the amazing "let's set up an invoker on Snark's laptop for this 30-person training" thingy. The lack of resources is more common than we'd all like it to be. You just can't ignore that.

        So now we have several scores (one visible to the participant and one visible to the jury) instead of re-testing everything.

        Still, wouldn't it be great if the judge could prioritize pretests over system tests? And could you please elaborate on "re-testing everything"? I think there's no reason to re-run pretests.

        Oh, now I understand what seemed strange to me at first: the defaults. There is a default for run-twice, but no default for a plain text string as the answer :) But OK, that's a weird issue.

        I was contemplating adding plain-text answer support too, but then decided that making the judge show a textarea field for the text type ought to be enough.

        Right in the part about files:

        Perhaps you misinterpreted the comment. I didn't restrict anything: what this quote meant is "I think that only allowing the user to read from stdin and write to stdout is wrong, and if you agree with me, here's how you can make input.txt and output.txt work too". You can use stdin/stdout only, or you can use input.txt/output.txt only, or you can make both work: it's your choice.
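
        A minimal sketch of the "make both work" option (the file names and the preference rule below are my own choices for the example, not the spec's): the judge feeds the test file to the program's stdin and also drops a copy named input.txt into its working directory, then picks up the answer from whichever channel was used.

        import shutil
        import subprocess

        def run_with_both_io(cmd, test_input, workdir):
            # Make input.txt available for file I/O...
            shutil.copy(test_input, f"{workdir}/input.txt")
            # ...and feed the same data to stdin for stream I/O.
            with open(test_input, "rb") as stdin, \
                 open(f"{workdir}/stdout.txt", "wb") as stdout:
                subprocess.run(cmd, stdin=stdin, stdout=stdout, cwd=workdir, check=False)
            # The solution may have written to either channel; prefer the file if present.
            try:
                with open(f"{workdir}/output.txt", "rb") as f:
                    data = f.read()
                    if data:
                        return data
            except FileNotFoundError:
                pass
            with open(f"{workdir}/stdout.txt", "rb") as f:
                return f.read()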

        Fixed now.

        Damn, I read that page, and somehow I missed the parallels with strategies. You certainly can implement a proxy strategy, and it would need to be saved (though probably not as a binary but as source code?). Whose responsibility would that be, though? If we consider interactive scorers a legacy feature, it could be the responsibility of an ejudge-to-Polygon converter. If we don't (and I'm honestly not sure), it could be added to the spec as another scorer type.

        That whoever just doesn't need to support XML :)

        Do you propose creating a new format, completely unrelated to the existing ones? Because writing a converter to Polygon would be trivial, but you'd also need to develop the problems somehow, and if you want to use new features, you can't use Polygon because its format doesn't support them. I'm betting on praying for Polygon to support what I described here so we don't need to create yet another Polygon alternative from scratch (and it being closed-source doesn't help either).

        But also, XML's your dealbreaker, really? I mean, yes, it's bad, but there are so many more important details than what characters you use as field separators.

        I think it's an invention of the Polygon owners :)

        I don't think so. Archives generated by Polygon nowadays don't require the run-twice tag and specify this elsewhere. This tag shouldn't be used as a decisive factor nowadays, it's just for compatibility with some existing packages.

        (and I still think it's not too late to drop it)

        Should we, though? It's, like, one conditional, it's unlikely to cause problems, and it lets us import a few percent more problems without manual patches.

        Can't think of any problem now, but I wouldn't be so sure about the part “those who can write a patch know” :)

        By the by, there was this patch that didn't make it to mainline, which seems to do exactly this. They also support nested rings (i.e. both ring 1 and ring 2 can be used) and interrupt handling, but we don't need either of those.

        I see the spec changed a bit today, but I read it yesterday :) Then it's OK if naming is only used for graders. Before the edits, it was not only for graders, e.g.:

        That was a failure on my part: that section should only have applied to grader-enabled problems.

        • »
          »
          »
          »
          »
          21 month(s) ago, # ^ |

          If you don't want the problemsetter to list what to save, then where do you store this information?

          Right now it's hardcoded, just like the strategies are.

          Supposedly it's not stored in the problem package, but then why can't you store a boolean map

          In fact, I still worry that showing all the data (for run-twice, for example) will overload the interface, but OK, some hardcoding will remain here.

          Would that work for you?

          Maybe. In fact, I just remembered how it's implemented in Ejudge (though I didn't check): the problem contains only the tests, and the scoring options may be redefined at the contest level, so we can use the same problem in different contests under different rules. Maybe such a spec should just be moved out of the problem description.

          Do you need to?

          I would be glad if I had such an opportunity. Right now I think it's possible to make a strategy which may (possibly unintentionally) DoS the judge. But I can't come up with an example off the top of my head.

          So, basically, when the first program starts, there's nothing attached to the write end of the pipe and so it's supposed to terminate immediately, perhaps with RE. This might not be the best resolution, but perhaps a warning should be emitted if a pipe is read from but never written to.

          I wrote my comment not to receive some ad-hoc solution (which, of course, exists), but to say that imho it should be stated in the spec.

          If “f is not being written to” is a sufficient condition to run the first user invocation immediately, then in the default interactive strategy the user will be run before the interactor, and in the worst case it will cause RE (because at the moment the user starts, nothing is writing to its input pipe), and in the best case the user will consume a bit of the wall-clock limit, which is also not perfect.

          And could you please elaborate on "re-testing everything"? I think there's no reason to re-run pretests.

          Here I don't understand how it will work with strategies. Do we have to save some "intermediate state" when the online tests end and restart from the same state (by state I mean the current global namespace of the Python program with the strategy) for the offline ones? Or should we run the strategy twice, separately for the two sets?

          Perhaps you misinterpreted the comment.

          Yes, exactly, now I realized. Sorry for that.

          Do you propose creating a new format, completely unrelated to the existing ones?

          Yes. If Polygon doesn't support such a format, you still won't be able to use new features. If it does, the format doesn't matter: they can simply change it.

          But also, XML's your dealbreaker, really? I mean, yes, it's bad, but there are so many more important details than what characters you use as field separators

          Yes. In fact, I have to maintain a converter from whatever Polygon produces to our internal format, but I definitely will not support problem export in XML (because it requires writing more than one line of code). And it may lead to a lot of manual work for whoever wants to move problems from one judge to another.

          Should we, though?

          In fact, I think the answer is obvious if there is no XML (since such problems would have to be converted anyway).

          • »
            »
            »
            »
            »
            »
            21 month(s) ago, # ^ |

            so we can use the same problem in different contests under different rules.

            That's neat! Supposedly you should be able to modify quite a lot:

            • Statement & title, e.g. Codeforces allows this in mashups,
            • Scoring settings,
            • Tests, e.g. if you don't have write access to the problem but have an erroneously passing solution.

            There certainly are other use cases I missed. Either way, there are so many possibilities that I'd say it would be reasonable to enable, let's say, problem forking, a la GitHub, instead of merely patching settings at contest level.

            I would be glad if I had such an opportunity. Right now I think it's possible to make a strategy which may (possibly unintentionally) DoS the judge. But I can't come up with an example off the top of my head.

            I believe this is no different from what we have now. You could make the strategy invoke a program in an infinite loop, but you can add thousands of tests to a problem too, and that'd be the same thing. The solution is limiting the number of invocations. You could make a run-a-thousand-times strategy, but then you could make the checker or the user program create thousands of threads, and the solution is, again, the same: limit tasks.

            I wrote my comment not to receive some ad-hoc solution (which, of course, exists), but to say that imho it should be stated in the spec.

            This is stated in the spec, but perhaps I should make a different point. The examples I wrote in the spec should "just work", and the existing text should be enough to explain to implementors how stuff is expected to work. How they transform that into code is their problem; it is heavily implementation- and platform-dependent, and at the same time quite obvious.

            If “f is not being written to” is a sufficient condition to run the first user invocation immediately, then in the default interactive strategy the user will be run before the interactor, and in the worst case it will cause RE (because at the moment the user starts, nothing is writing to its input pipe), and in the best case the user will consume a bit of the wall-clock limit, which is also not perfect.

            Yes, it will run immediately. But the same will happen in a simple judge: either the interactor or the user program will be the first to start. It's not a problem in that case, so it shouldn't be a problem with strategies.

            In POSIX, when the problem is interactive, the management process typically keeps the write ends of the pipe open before spawning children. Supposedly the same would happen here: the write end is kept open and only closed after the writable end of the pipe is passed to a child process.
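
            A minimal sketch of that trick with plain os.pipe and subprocess (not the spec's API; the inline programs are stand-ins for the user and the interactor):

            import os
            import subprocess
            import sys

            read_fd, write_fd = os.pipe()

            # Start the "user" first; it blocks on read() instead of hitting EOF,
            # because the parent still holds its copy of the write end open.
            user = subprocess.Popen(
                [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.readline())"],
                stdin=read_fd,
            )

            # Only now start the "interactor" that writes into the pipe.
            interactor = subprocess.Popen(
                [sys.executable, "-c", "print('hello from the interactor')"],
                stdout=write_fd,
            )

            # Close the parent's copies so EOF propagates once the writer exits.
            os.close(read_fd)
            os.close(write_fd)
            interactor.wait()
            user.wait()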

            Here I don't understand how it will work with strategies.

            And here I thought I got it all fleshed out. Bummer. Let's try the following then. Supposedly we could require that code within with test does not modify anything but test.verdict and test.comment, e.g. via test.rate or calling programs. Then the with test blocks can be skipped for tests that were judged before. This is not restrictive, because anything can be passed via the test comment. And if the scorer uses data that is not visible to the jury, then perhaps it's a bad scorer.
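
            To make that concrete, a toy judge-side sketch (not from the spec): if a with test block only ever produces a verdict and a comment, the judge can persist those two fields after the online phase and replay them later instead of re-running the block.

            judged = {}   # test index -> (verdict, comment); persisted by the judge in reality

            def judge_test(index, run_block):
                """run_block actually invokes programs and returns (verdict, comment)."""
                if index not in judged:        # not judged during the online phase
                    judged[index] = run_block()
                return judged[index]           # replayed as-is during the offline phase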

            Yes. If Polygon doesn't support such a format, you still won't be able to use new features. If it does, the format doesn't matter: they can simply change it.

            (and all quotes below)

            So here's what I think about compatibility. Polygon is, so to speak, the default tool for problemsetters. I can make a somewhat better tool for the new standard, but there's the problem of familiarity and the baby duck syndrome. And if we want people to get used to it, we need it to support Polygon-style exporting too, so it'd be a perfect copy of Polygon plus something.

            Of course, it'd be much better if geranazavr555 or someone else on the Polygon development team reassured us that they are contemplating implementing the proposal. The reason I proposed a compatible format is that I saw their reluctance here and there considering other changes before, and I wanted to simplify their work; alas, this didn't seem enough.

            Honestly, I feel like banging my head against the wall now. You ask for more radical changes, which I would happily support, but I have to rely on other people to make it work, and they seem to think even this is too radical. It feels like I tried everything and nothing worked, and it's all due to people's characters rather than any technical reason, and I could make it work but they wouldn't let me, and... sorry, got carried away here.

            • »
              »
              »
              »
              »
              »
              »
              21 month(s) ago, # ^ |

              This is not restrictive, because anything can be passed via the test comment.

              Hmm, OK, that idea looks good to me. The most important thing here is to preserve the possibility of implementing something like group dependencies, but if we can use test.verdict and modify some global state outside of with test:, it should work.

              they seem to think even this is too radical

              I think there are two different directions of change, which are orthogonal. The first one is functionality (like strategies), and implementing it may take months (and possibly require rewriting everything from scratch). The second one is syntactic sugar (like XML versus anything modern, deprecation of outdated features, etc.), and it's relatively easy to implement (it's unlikely that any "core" code needs to be modified at all).

              Anyway, my point was not "this is all useless", but just "it doesn't seem production-ready to me" (and, in fact, we found some issues, which is good!). I still think the overall idea is great.

      • »
        »
        »
        »
        21 month(s) ago, # ^ |

        Yes, but still, offline is bad :) The only reason for judges to have offline groups be really "offline" is a lack of computational resources to evaluate all submissions right when they arrive.

        Besides, how else are you going to add system tests during the contest without anyone noticing? :)

      • »
        »
        »
        »
        21 month(s) ago, # ^ |

        Not sure what you are talking about regarding run-twice in tags, but I think it's just used as a tag and doesn't have any meaning.

        For now, I think run-twice is mentioned in Polygon's problem.xml in two places:

        1. <judging cpu-name="Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz" cpu-speed="3600" input-file="" output-file="" run-count="2" treat-points-from-checker-as-percent="true">
          
        2.     <interactor>
                  <source path="files/interact.cpp" type="cpp.g++17"/>
                  <binary path="files/interact.exe" type="exe.win32"/>
                  <runs>
                      <run>1</run>
                      <run>2</run>
                  </runs>
              </interactor>
          

        And this is meaningful, while a tag is just a tag.

        • »
          »
          »
          »
          »
          21 month(s) ago, # ^ |

          Yes, I know that, and both of these are mentioned and used in the specification. Like I said in the very comment you replied to,

          Archives generated by Polygon nowadays don't require the run-twice tag and specify this elsewhere. This tag shouldn't be used as a decisive factor nowadays, it's just for compatibility with some existing packages.

          so this is only for compatibility with ROI packages created before Polygon added support for the feature.