Problem Database — Idea for a CP-related project

3 years ago, # ^ |

+82

Um... what here can be interpreted as against Codeforces policies? Proposing a paid job?

→ Reply

3 years ago, # |

+133

I would say it's very hard. Even the subproblem to parse two statements and say if they are similar already requires shooting some state-of-the-art cannon at it. Not to mention the software engineering effort required to create such a database of problems...

The only way in which I see it remotely doable, is to manually categorize thousands of problems based on difficulty, algorithms / operations used and give the admin option to search through some problems which roughly fall in the same bucket.

Also, if this tool became publicly available it would have some serious implications for the entire competitive programming...

→ Reply

3 years ago, # ^ |

← Rev. 2 →

+35

Yes, but I thought "shooting some state-of-the-art cannon" is not that hard in ML world, but I know nothing about ML and very much might be wrong. There already were some attempts (actually, let's ping AlexSkidanov) in SOLVING cp problems using ML, categorizing sounds much easier and much more ML-like for me, but again, I know nothing about this field.

to manually categorize thousands of problems based on difficulty, algorithms / operations used

We already have huge amounts of data: the problemsets of Codeforces and other platforms, with manually set tags and editorials.

→ Reply

sslotin

3 years ago, # ^ |

+87

NEAR failed with that and pivoted. They are now yet another crypto startup.

→ Reply

3 years ago, # ^ |

+23

I know that they are doing blockchain now. My point is that 3 years ago people with some real knowledge of ML tried to do something harder, so maybe it's not as impossible as one might think.

→ Reply

3 years ago, # ^ |

+31

It's a common misconception. Either you can use something fully out-of-the-box, or you're going to suffer some serious pain.

In the world of transfer learning, you'd start with a huge pretrained model that's capable of doing something (like recognizing objects), and your goal as an ML engineer is to fine-tune it to your application. But this isn't easy in our case, as the ability of reading is insufficient to solve competitive programming problems ~~as I demonstrate by not being red~~.

Starting with Codeforces database and our tags, one could create a fairly simple system which searches in some range of problem difficulties for tasks which have the largest number of similar tags. But the sad truth is that 'DP + 2200' is probably not going to be useful to you as a contest admin, because you'd have to manually search through a ton of problems (so the problems would have to be more carefully categorized for it to be useful). For anything beyond that, the difficulty of creating such a system rises exponentially, I'm afraid :(
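To make the "fairly simple system" concrete, here is a minimal sketch of a tag-overlap search inside a rating range. The `Problem` record, the sample data and the function name `similar_by_tags` are made up for illustration; they are not Codeforces API objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    # Hypothetical record: name, difficulty rating, set of tags.
    name: str
    rating: int
    tags: frozenset


def similar_by_tags(query_tags, lo, hi, problems, top=10):
    """Rank problems rated in [lo, hi] by the number of tags shared with query_tags."""
    query = frozenset(query_tags)
    scored = [
        (len(query & p.tags), p)
        for p in problems
        if lo <= p.rating <= hi
    ]
    # Drop problems with no tag in common, then sort by overlap (ties by name).
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda sp: (-sp[0], sp[1].name))
    return [p.name for _, p in scored[:top]]


problems = [
    Problem("A", 2200, frozenset({"dp", "trees"})),
    Problem("B", 2300, frozenset({"dp", "bitmasks", "trees"})),
    Problem("C", 1500, frozenset({"dp"})),
    Problem("D", 2100, frozenset({"greedy"})),
]
print(similar_by_tags({"dp", "trees", "bitmasks"}, 2000, 2400, problems))
```

With only a handful of coarse tags, many problems tie on the overlap score, which is exactly why the result list stays long unless the categorization is finer.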

→ Reply

BullyMaguire

3 years ago, # ^ |

-94

if you dont know anything about ML then you might need to think twice before suggesting projects related to it

→ Reply

BlueTulpis

3 years ago, # ^ |

and you should think twice before commenting something

→ Reply

BullyMaguire

3 years ago, # ^ |

-47

oh really. my downvotes are from umnik idol worshippers, not from level headed people

→ Reply

3 years ago, # ^ |

+42

Fun fact: almost every time I have seen someone saying "I got downvoted because X" it's fairly clearly at least partially wrong.

→ Reply

BullyMaguire

3 years ago, # ^ |

-32

cant resist the need to seek contribution at any opportunity can u

→ Reply

3 years ago, # ^ |

← Rev. 2 →

+21

case in point (more or less).

A comment that gets about +15 points has basically zero effect on my contribution. So no, I write such comments only because I want to say something.

→ Reply

AlexSkidanov

3 years ago, # ^ |

+66

One thing we've done back then was pay people to rewrite statements into short mathematical form. I might be able to excavate that dataset.

On the full descriptions with "Anna was given a string for her birthday" what you want is probably not possible. With short mathematical descriptions something like finding 10 most similar problems could be within reach.

But to your point, back then we might have been somewhat delusional about what's possible, so our attempts from 2017 are no indication of what is possible today.

→ Reply

afedor

3 years ago, # ^ |

← Rev. 3 →

+69

Maybe it should parse not statements, but solutions. This way no experts are required to manually label problems.

When statements are similar, problems can still be quite different. However, similar solutions mean similar problems.

→ Reply

OleschY

3 years ago, # |

← Rev. 2 →

+39

Something like OEIS for CP-Problems would be really cool!

The defining structure of a problem is some function $$$P$$$ with $$$P(input)=output$$$.

I imagine a database of problems and solutions. We feed it some $$$input$$$ and some $$$output$$$ and it checks and returns all problems $$$P$$$ for which $$$P(input)=output$$$.

Spoiler

→ Reply

3 years ago, # ^ |

← Rev. 2 →

+19

This has a fundamental problem of assuming that equivalent tasks have the same input and output. I'd say that will be true only in a few cases; predominantly, only the idea of the solution will be the same.

→ Reply

OleschY

3 years ago, # ^ |

← Rev. 3 →

How do you mean? Do you mean the general formatting of the input and output or do you mean the explicit content of some input and output? The former I agree, that's what I commented in the spoiler. For the latter, I don't propose matching the content of inputs and outputs in the database. I propose a database of functions $$$P$$$ and we check for each $$$P$$$ in the database whether it yields $$$P(input)=output$$$. (Which is of course computation-intensive. But after all, I'm just daydreaming.)

Or maybe I just misunderstand your concern?

→ Reply

3 years ago, # ^ |

Ah, sorry, I didn't see your spoiler :)

→ Reply

epsilon_573

3 years ago, # ^ |

Maybe this is where categorisation can help. You can reduce the computation by tag filtering. Also if someone is trying to steal a problem, they can always shuffle up the input variables. That will be hard to resolve.

→ Reply

pllk

3 years ago, # ^ |

← Rev. 2 →

+34

I believe this would be a good way to create the database, and not too difficult to implement well. Indeed, OEIS would be a good model for this project.

To make the search more efficient, we could also have some canonical inputs, such as some fixed list of numbers for all problems whose input is a list of numbers.
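One possible shape for the canonical-input idea, as a sketch: precompute each stored solution's outputs on a fixed set of lists and index by that tuple, so a lookup becomes a dictionary access instead of running every solution. The constant `CANONICAL_INPUTS`, the function names and the toy solutions are all assumptions for illustration.

```python
# Fixed inputs shared by every problem whose input is a list of numbers (assumed values).
CANONICAL_INPUTS = ([3, 1, 4, 1, 5, 9, 2, 6], [1, 2, 3, 4], [5, -2, -2, 7])


def fingerprint(solve):
    """Outputs of a solution on the canonical inputs, as a hashable tuple."""
    return tuple(solve(list(a)) for a in CANONICAL_INPUTS)


def build_index(solutions):
    """Map each fingerprint to the names of the solutions that produce it."""
    index = {}
    for name, solve in solutions.items():
        index.setdefault(fingerprint(solve), []).append(name)
    return index


solutions = {
    "sum": sum,
    "max": max,
    "sum via generator": lambda a: sum(x for x in a),
    "distinct count": lambda a: len(set(a)),
}
index = build_index(solutions)


def candidate(a):
    # A new solution written differently but computing the same function as "sum".
    total = 0
    for x in a:
        total += x
    return total


print(index.get(fingerprint(candidate), []))
```

This only works when outputs are hashable and the solutions are deterministic; problems with several valid answers would need a checker instead of exact equality.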

→ Reply

randuser_8889997773

3 years ago, # |

← Rev. 3 →

-96

are you joking? 1000$?

First of all, NLP projects are pretty hard, which is why lots of companies pay far more for help with NLP problems. Secondly, if someone can create such a thing, why in the world should they waste it on CP? You could use such a tool for many more things and become rich. At this point, with the available methods, doing this job for 1k$ is just a joke. Just considering the required hardware makes it even more of a dream.

→ Reply

Mohamed.Sobhy

3 years ago, # ^ |

+48

he is not rich

→ Reply

3 years ago, # ^ |

+51

I'm saying that I'm ready to pay $1000 with my own money. Maybe platforms with more funds will also be willing to pay for such a project, I can't say for them.

→ Reply

mtw

3 years ago, # ^ |

+33

You can start from simpler things. And no, no one will be doing this for 1k$, you can look at um_nik's offer as some indicator that people need it.

→ Reply

rgnerdplayer

3 years ago, # ^ |

+45

Imagine registering a new account just to criticize somebody.

→ Reply

dpaleka

3 years ago, # |

← Rev. 2 →

+17

If the solutions/submissions are publicly available, I'd bet it's much easier to detect isomorphic AC solutions than the texts. Do we have formal verification experts here?

→ Reply

3 years ago, # ^ |

I would prefer not to implement a model solution to a problem before approving it, and then I would prefer not to send it to an open project for everyone to see.

→ Reply

dpaleka

3 years ago, # ^ |

The first issue is a tough one :)

The second issue can be solved by running the queries locally. Probably even without downloading the whole database, and probably even without having interpretable codes in the database in the first place.

→ Reply

nocriz

3 years ago, # ^ |

Ask the problem proposer to write the model solution

→ Reply

xiaolaogou

3 years ago, # |

← Rev. 2 →

If some machine can interpret the intrinsic meaning of the problem, then it almost equals to that machine will be able to solve the problem.

Comparing the solutions (source code of different problem statements), I think it will be more feasible.

→ Reply

Lalalalllalalalaaaaaa

3 years ago, # ^ |

+11

tourist is the name of that machine

→ Reply

gultai4ukr

3 years ago, # ^ |

If some machine can interpret the intrinsic meaning of the problem, then it almost equals to that machine will be able to solve the problem.

Not really, the second is much harder than the first, but I definitely agree that even the first is already far beyond what we can expect from AI in the near future.

→ Reply

sslotin

3 years ago, # |

← Rev. 2 →

+148

I teach machine learning to undergrad students and I've actually worked in the ranking team of a search engine. I've had a lot of students with CP background, and naturally I tried to nudge them towards picking a CP-related ML course project like this one. This has never worked a single time in 3 years, but I don't lose hope :)

So I've thought about this problem for a while. For a search engine, there are three types of data that can be used:

Problem statements are freely available, but I'm fairly certain that using just them won't work, at least not with reasonable accuracy (maybe with a $10M budget for model development and data labeling it will).
Short AtCoder-style formal statements and editorials may actually work, but they are not always available and in most cases need to be processed by hand.
Problem solutions are probably the best approach here as it is fairly easy to check if two programs solve the same or similar problems, and so the search can be done very accurately with the source code instead of the problem statement. I think it is possible to negotiate with main OJ admins to get them (on the terms that they become available to only a limited group of people developing the search engine and not become public).

Creating a model for searching by a solution source should be relatively easy, and what we can do to generalize it to search by a problem statement is to train a separate model that (inaccurately, because the data is scarce) turns a short problem statement into the same vector representation as the average source code of its solution.
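As a toy illustration of searching by solution source in a vector space: the "embedding" below is just a bag of tokens with cosine similarity, which is a stand-in I am assuming for the sketch and not the learned model described above. The corpus snippets and function names are invented.

```python
import math
import re
from collections import Counter


def embed(source):
    """Toy vector: counts of identifiers and punctuation characters (digits ignored)."""
    return Counter(re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]", source))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def nearest(query_source, corpus, k=1):
    """Names of the k stored solutions whose vectors are closest to the query."""
    q = embed(query_source)
    ranked = sorted(corpus, key=lambda name: cosine(q, embed(corpus[name])), reverse=True)
    return ranked[:k]


corpus = {
    "prefix sums": "for i in range(n): pref[i+1] = pref[i] + a[i]",
    "binary search": "while lo < hi:\n mid = (lo + hi) // 2\n if a[mid] < x: lo = mid + 1\n else: hi = mid",
}
query = "for j in range(m): s[j+1] = s[j] + b[j]"
print(nearest(query, corpus))
```

A real system would replace `embed` with a trained encoder, and the second model mentioned above would map a short statement into the same space so that `nearest` can be reused unchanged.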

In terms of funding, I think it's a bad idea to offer a small amount. In my opinion, the project should be either crowdfunded with a reasonable market-salary budget (in the 10-20k range) or fully crowdsourced (that is, nobody gets paid, but a lot of people need to help with data labeling and other maintenance stuff).

→ Reply

Kyou_mo_kawaii

3 years ago, # ^ |

it is fairly easy to check if two programs solve the same or similar problems

Can you point me to some resources or what keywords to google? I only found results that state that the problem is undecidable.

→ Reply

Ritwin

3 years ago, # ^ |

I believe sslotin meant that it's easy to check in a way similar to the plagiarism checker. We're not checking that they do the exact same thing, just that the solutions are similar enough. For example, if one problem takes input $$$K\ N$$$ instead of $$$N\ K$$$, the overall solution could be exactly the same, except that

std::cin >> N >> K;

becomes

std::cin >> K >> N;

Also, there can be very different-looking problems that have very similar solutions. I remember this happened in a round a few months ago, but I can't recall which one. Other than the input/output format, one problem could be directly translated into the other. However, if you just look at the problem statement / input / output, you won't see any correlation.
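A minimal sketch of that kind of "similar enough" check, assuming a very crude approach: rename identifiers by order of first appearance and compare the token sequences with `difflib`. The keyword set `KEEP` and the function names are my own placeholders, not how the Codeforces plagiarism checker works.

```python
import re
from difflib import SequenceMatcher

# Tokens that keep their spelling; everything else identifier-like gets renamed (assumed list).
KEEP = {"int", "for", "if", "while", "return", "std", "cin", "cout"}


def normalize(source):
    """Tokenize and replace each identifier with v0, v1, ... in order of first appearance."""
    names = {}
    tokens = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEEP:
            tok = names.setdefault(tok, f"v{len(names)}")
        tokens.append(tok)
    return tokens


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the normalized token sequences are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


first = "int N, K; std::cin >> N >> K;"
second = "int K, N; std::cin >> K >> N;"
other = "int x; std::cin >> x; std::cout << x * x;"
print(similarity(first, second), similarity(first, other))
```

The renaming makes the swapped-input pair compare as identical, while unrelated code scores lower. It says nothing about whether two programs compute the same function, which is the undecidable part.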

→ Reply

subscriber

3 years ago, # |

+19

You want to automate part of your work as a cp admin. Understandable, but if you think about it, there are not a lot of other cp admins who would also benefit from this solution. For other people like coaches or teachers it might help, but it wouldn't reduce their work dramatically.
If you're determined to pursue this idea, I'd suggest looking into a more generic solution (for papers, patents, news?), with a custom application for cp problem search. In that case it would be possible to get funding, and most likely such tech would be worth millions.

→ Reply

rav.gupta

3 years ago, # |

+13

It would most likely be a multi-hop reasoning model, where each solution (algorithm) is represented by a vector. Look into the OpenAI GPT-3 model and its capabilities (link to GPT-3). It would also be a good idea to check the neural network theorem solver (link to the article about the theorem solver). Maybe a combination of the two would suffice to represent each solution as a vector. Similar solutions would be closer in the vector space.

→ Reply

3 years ago, # |

+66

Reading some of the comments, I think any approach that involves making an AI that can read and understand problems is right out. That's simply not feasible.

The best that can be done is a respectable search engine with some domain-specific optimizations (modern search engines contain ML components, of course, but that is not really what I mean here). We have some advantages here. There is a lot of standard terminology and the search space is small: we can probably get our hands on about 20k problems. If there is a more-or-less good open-source search toolkit (we don't need anything close to the state of the art), then this might actually be doable, but it would still require a dedicated team and a lot of time and money. But at this rate, I'm not sure if you'd actually get better results than googling and heavy filtering.

I think editorials are the best thing to search. Editorials often contain relatively formal and simple problem statements (i.e. no Vasya finding queries in the left pocket).

Here's the thing though. Very often we have problems that haven't exactly been proposed before, but when you read them you feel you have seen everything before. Here's an example of such a problem (and I felt the same way solving this problem). You'd want to avoid creating this feeling, but then you need some level of intelligence to distill the problem into classical subproblems and somehow understand that the transition is simple enough for it to be considered a repeat of that problem.

→ Reply

bicsi

3 years ago, # |

← Rev. 2 →

+89

Creating a CP problem aggregator was one of my biggest personal projects, and one idea of mine was to do that as well (amongst others like problem recommender, contest creator, custom interesting contest formats like lockout, etc.)

Contrary to what others believe, I actually think it’s doable; however, my idea would be to use submissions instead of statements to infer similarity between problems. I have done some work on this in the past years, but nowadays I just don’t have the time to integrate it into the platform. I might do that once I free up some time. In the meantime, if someone is willing to put effort into the project and wants to discuss ideas and implementation, feel free to PM me.

The platform at its current state is at https://binarylift.com (it has some bugs, so you might prefer to check out the old slower version at https://competitive.herokuapp.com)

→ Reply

TiredOfABCs

3 years ago, # ^ |

where is the option to add CF handle in binarylift.com ?

→ Reply

bicsi

3 years ago, # ^ |

Once you’ve created an account, you can go to your profile and link your OJ accounts there. The submissions should be updated in some time (it might take a while nowadays as there are quite a few accounts on the platform and it doesn’t scale extraordinarily well).

→ Reply

kostka

3 years ago, # |

+30

I'm interested in supporting such a project as well (financially too!).

→ Reply

yistarostin

3 years ago, # |

I guess it might be possible to implement such a model only in a world where authors don’t write problem legends… Of course, such a world doesn’t exist, as problem legends are great and essential.

→ Reply

nikgaevoy

3 years ago, # |

+18

Isn't that the type of problem that tags are supposed to deal with? It is much easier to make better tags in, say, CF, than to make a full search engine of this kind.

Also, if you want just a full-text search engine, without the internal meaning thing, then in that case, you can use any search engine, like Google, really. Recall how search in cplusplus.com works, for example. If you want a full-text search over problems with some specific tag, which seems to be the main use case, it is also possible with proper database design and some search line magic.
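For what it's worth, the "full-text search within a tag" case could look roughly like the sketch below: a plain inverted index over statement words, filtered by tag afterwards. The problem ids, statements and function names are invented for the example, and a real setup would more likely use the database's built-in full-text search.

```python
import re
from collections import defaultdict


def build_word_index(problems):
    """Map each lowercase word to the set of problem ids whose statement contains it."""
    index = defaultdict(set)
    for pid, (text, _tags) in problems.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index[word].add(pid)
    return index


def search(query, tag, problems, index):
    """Ids of problems that contain every query word and carry the given tag."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(pid for pid in hits if tag in problems[pid][1])


# Hypothetical data: id -> (statement text, tags).
problems = {
    "P1": ("Find the shortest path in a weighted graph", {"graphs", "shortest paths"}),
    "P2": ("Count paths in a grid", {"dp"}),
    "P3": ("Shortest path on a grid with obstacles", {"graphs", "bfs"}),
}
index = build_word_index(problems)
print(search("shortest path", "graphs", problems, index))
```

Note that "path" and "paths" are different words here; stemming and synonyms are where the "search line magic" would have to come in.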

And, tbh, I doubt that it is possible to do much more than that.

I also feel that we really need something like new e-maxx, with good modern algorithms, good reliable code in modern C++, where usability, extensibility and readability are preferred over brevity (not the case of e-maxx at all), and example problems on each topic (just like in e-maxx).

I think this is the thing that may be worth paying for, because it is clear to me that crowdsourcing is simply impossible here (well, except for some basic topics, but they are not the point), and that no one would really do it all by himself.

→ Reply