speedy03's blog

By speedy03, 9 years ago, In English

The contest is over. I guess it's OK to discuss the problems here.

It appears Ontology was not very difficult for most people. However, I don't even have a smart way to parse the parenthesized expression. All I did was generate a huge hash and traverse the relevant elements, which apparently resulted in TLE. I wonder what kind of tree structure is recommended?

I had a similar issue finding an appropriate data structure to represent the problem in Wombats. Even though I tried randomized optimization, it was hard for me to construct a stuffed-animal removal sequence.

What categories of algorithms do these two problems belong to?


»
9 years ago, # |

Wombats can be solved with max-flow. Build a graph of dependencies (taking i forces us to take j) and solve a closure problem on it.
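
In case the reduction is unfamiliar, here is a rough Python sketch of the standard maximum-weight closure construction via min-cut (this is not the commenter's code; the node weights and the dependency list are hypothetical placeholders, and networkx is only used to keep the min-cut part short):

import networkx as nx

def max_weight_closure(weight, depends_on):
    """weight: node -> value; depends_on: pairs (i, j) meaning taking i forces us to take j."""
    G = nx.DiGraph()
    S, T = "source", "sink"
    G.add_node(S); G.add_node(T)
    positive_total = 0
    for v, w in weight.items():
        if w > 0:
            G.add_edge(S, v, capacity=w)
            positive_total += w
        elif w < 0:
            G.add_edge(v, T, capacity=-w)
    for i, j in depends_on:
        G.add_edge(i, j)                          # no 'capacity' attribute = infinite capacity
    cut_value, (source_side, _) = nx.minimum_cut(G, S, T)
    return positive_total - cut_value, source_side - {S}

# e.g. max_weight_closure({1: 5, 2: -3}, [(1, 2)]) == (2, {1, 2})

The best achievable value is the total positive weight minus the min cut, and the source side of that cut is an optimal set of items to take.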

There are a lot of ways to solve Ontology. I did it offline: storing a map of prefixes (with hashes instead of the original strings) for each subtree and answering all the queries in a single DFS.

  • »
    »
    9 years ago, # ^ |

    I did it the same way for the Ontology problem, but instead of storing hashes I stored the strings, so I got TL on the stress test.

  • »
    »
    9 years ago, # ^ |

    I feel stupid. Not once in my life have I seen the closure problem, and googling it, it seems to be a fairly standard problem. I guess there's always more to learn.

  • »
    »
    8 years ago, # ^ |

    For Ontology, can you elaborate a little on how you answered the queries? I tried your approach, and during the DFS I merged all of the children's unordered_maps into the parent's and then answered the queries for that parent. This worked on the first 75 test cases but timed out on the last 25.

»
9 years ago, # |

As for me, the most difficult problems were the machine learning ones. I got the max score on all the other problems.

BTW, 3 hours before the end of the contest I was 2nd (in the prize section). Then suddenly some rejudge happened and I dropped to 5th place, just 10 points away from 3rd, and 3 hours was really not enough for me to make them up. All I want to say is that doing a rejudge 3 hours before the end of a week-long contest is not a reasonable thing to do.

P.S. To tell the truth, if I had used some standard machine learning algorithm (a neural network, SVM, or something else) instead of implementing some algorithmic ideas, I would probably have gotten a much higher score.

  • »
    »
    9 years ago, # ^ |

    Using standard ML algorithms took me to the top 10 on both problems, though I doubt my solutions have any practical potential.

    As for the rest of the contest, Wombats was too hard for me. I even saw I_love_Tanya_Romanova's comment about using max-flow before the contest was extended, but I still did not manage to find a solution.

    • »
      »
      »
      9 years ago, # ^ |

      What were your standard algorithms? For instance, in the Labeler problem I tried to learn binary classifiers for every topic based on bigrams, but it brought me only slightly more than 100 points. As for Duplicate, it looks like the idea of locality-sensitive hashing and/or some metric for text comparison plus learning weights (how important each field is) might work, but I did not have time to submit it. I spent more time on the algorithmic problems. (I managed to solve all of them, but I guess it took me more time than it took you guys.)

      Btw, here is my code for the Ontology problem. I used persistent segment trees to solve it.

      • »
        »
        »
        »
        9 years ago, # ^ |

        I participated only in the 8-hour version of the contest, so my approach for Labeler is pretty straightforward.

        For every word in the input, count its frequency and the frequency of every topic over all questions where this word occurs (so you know how frequent a given word is, and how frequent a given topic is for that word). Then throw away the 40-50 most frequent words (they are "I", "is", "the", etc., and don't provide any information), and for every remaining word/topic pair calculate the value occurrences[topic][word] / entries[word]; this value is the score that the topic gets for every occurrence of the word in a query.

        After that, for every query you just sum up the scores of all topics over all words in the query and pick the top 10. This gives a bit more than 118 points.
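
        A rough sketch of that scoring scheme (the variable names and the training_questions iterable are invented for illustration; tokenisation and I/O are omitted):

from collections import Counter, defaultdict

entries = Counter()                         # word -> number of questions containing it
occurrences = defaultdict(Counter)          # word -> Counter of topics seen alongside it

for words, topics in training_questions:    # hypothetical iterable of (words, topics) pairs
    for w in set(words):
        entries[w] += 1
        occurrences[w].update(topics)

stop_words = {w for w, _ in entries.most_common(50)}   # "I", "is", "the", ... carry no signal

def top_topics(query_words, k=10):
    score = Counter()
    for w in query_words:
        if w in stop_words or w not in entries:
            continue
        for topic, cnt in occurrences[w].items():
            score[topic] += cnt / entries[w]            # = occurrences[topic][word] / entries[word]
    return [t for t, _ in score.most_common(k)]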

      • »
        »
        »
        »
        9 years ago, # ^ |

        I did not solve 'Sorted set', by the way, so my overall score is not that good.

        Algorithm for Duplicate: I transform each question into a string: "question_text" + names of topics + "view_count" + "follow_count" + "age". Each pair of questions (in both the training and the testing data) is represented as the concatenation of the strings of the individual questions. Then I convert the strings into Tf-idf vectors and run Logistic Regression on them. 140.60 points code.
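
        A minimal sklearn sketch of that pipeline (train_pair_texts, train_labels and test_pair_texts are hypothetical lists holding the concatenated strings described above; all parameter tuning is omitted):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_pair_texts)      # one concatenated string per question pair
clf = LogisticRegression()
clf.fit(X_train, train_labels)                             # 1 = duplicate, 0 = not a duplicate

X_test = vectorizer.transform(test_pair_texts)
duplicate_probability = clf.predict_proba(X_test)[:, 1]   # works directly on the sparse matrices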

        Algorithm for Labeler: each string is converted into a Tf-idf vector. I considered a constant number (250) of potential labels for each question. After reading all questions, all useless labels (those not appearing in any set of labels) are removed. Then Logistic Regression is applied in a OneVsRest strategy (n_labels * (n_labels - 1) / 2 binary classifiers are trained to decide between each pair of labels). After that, some number of labels (based on the estimated probabilities) are written as output. Strangely enough, the best number of them was exactly one label per question. 130.0 points code.
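
        A corresponding multilabel sketch with sklearn's OneVsRestClassifier (train_texts, train_label_lists and test_texts are hypothetical; note that the pairwise-classifier count quoted above actually corresponds to a one-vs-one scheme, so this only shows the plain OneVsRest wiring):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_label_lists)            # one binary column per surviving label

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)
proba = clf.predict_proba(vectorizer.transform(test_texts))
best_label = mlb.classes_[proba.argmax(axis=1)]     # one label per question, as the comment suggests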

        In both problems, I played with the parameters (of Tf-idf, LogR, and feature selection) to achieve maximum scores, without regard to any 'common sense'. Python's sklearn library provides good implementations of these algorithms, so the code is short and easy to write (during the initial 8-hour contest I got 134.67 and 122.63, respectively).

        By the way, Logistic Regression was used mainly because it supports the sparse matrix representation returned by TfidfVectorizer and also makes it possible to estimate probabilities for each class.

        • »
          »
          »
          »
          »
          9 years ago, # ^ |

          Good, that was exactly what I meant. But to me, using a neural network for Labeler seems more logical (than Logistic Regression). I will try it now and see the result.

    • »
      »
      »
      9 years ago, # ^ |

      I got 130 (21st place) on Duplicate and 120 (9th place) on Labeler.

      And the sad thing is that a totally random solution gets 100 on Duplicate.

»
9 years ago, # |

Ontology can be solved online in O(n) with a persistent trie and the Euler tour technique.

  • »
    »
    9 years ago, # ^ |

    Understanding and coding that solution should be the definition of masochism, but a great one nonetheless :D

  • »
    »
    9 years ago, # ^ |

    Mind sharing the code please? Thanks

    • »
      »
      »
      9 years ago, # ^ |

      I hadn't coded it when I wrote that comment, but well... Just for you, enjoy :) #Z5Qqn9

      But right now it takes too much memory and doesn't pass all the tests :(
      I believe a persistent array in the nodes of the trie could make it fit, but I definitely don't want to write that :)

  • »
    »
    9 years ago, # ^ |

    I did it with a persistent segment tree. How do you do it in O(n)?

    • »
      »
      »
      9 years ago, # ^ |

      In an Euler tour, all descendants of a node form a contiguous subsegment. With a persistent trie you can answer queries on prefixes of that order, so the answer is get(r) - get(l - 1).
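
      A compact sketch of that idea (the Euler-tour ordering and all input handling are assumed to be done elsewhere; names are illustrative, not from the commenter's code). One immutable trie version is kept after every inserted question, and a query over Euler positions [l, r] becomes a difference of two prefix counts:

class Node:
    __slots__ = ("children", "count")
    def __init__(self, children=None, count=0):
        self.children = children if children is not None else {}
        self.count = count                  # number of inserted strings passing through this node

EMPTY = Node()

def insert(node, s, i=0):
    """Return a new trie version with string s added; old versions stay valid."""
    new = Node(dict(node.children), node.count + 1)
    if i < len(s):
        child = node.children.get(s[i], EMPTY)
        new.children[s[i]] = insert(child, s, i + 1)
    return new

def count_prefix(node, p, i=0):
    """How many inserted strings start with prefix p in this version."""
    if i == len(p):
        return node.count
    child = node.children.get(p[i])
    return count_prefix(child, p, i + 1) if child else 0

# versions[k] = trie containing the first k questions in Euler-tour order
versions = [EMPTY]
for q in questions_in_euler_order:          # hypothetical list of question strings
    versions.append(insert(versions[-1], q))

def query(l, r, prefix):                    # a topic's questions occupy positions l..r (1-based)
    return count_prefix(versions[r], prefix) - count_prefix(versions[l - 1], prefix)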

»
9 years ago, # |

However, I don't even have a smart way to parse the parenthesized expression

Here's a short sketch (assuming the flattened input is already split into tokens, and parent is a map from topic to its parent):

stack<vector<string>> ST;
ST.push({});                                    // collects the topics at the current nesting level
for (const string& tok : tokens) {
    if (tok == "(") ST.push({});                // the children of the last topic follow
    else if (tok == ")") {                      // children list finished: attach them
        vector<string> kids = ST.top(); ST.pop();
        for (const string& x : kids) parent[x] = ST.top().back();
    } else ST.top().push_back(tok);             // a topic name
}

As for the main problem: sort the queries by the distance of their topic from the root; build a trie for the current topic; merge it with the children's tries (at each step merging the smaller trie into the larger one); then answer the query.
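
A rough sketch of the same offline idea as a recursive small-to-large merge over the topic tree, using a plain map of prefix counts instead of an explicit trie (children, questions and queries are assumed to be built while parsing; in practice you would key the map by prefix hashes, as mentioned above, to keep memory down):

from collections import Counter

def answer_queries(root, children, questions, queries, n_queries):
    answers = [0] * n_queries

    def dfs(v):                                  # returns prefix -> count over v's subtree
        cnt = Counter()
        for q in questions[v]:                   # questions attached directly to topic v
            for i in range(1, len(q) + 1):
                cnt[q[:i]] += 1
        for u in children[v]:
            child_cnt = dfs(u)
            if len(child_cnt) > len(cnt):        # small-to-large: always merge the smaller
                cnt, child_cnt = child_cnt, cnt  # map into the larger one
            for key, c in child_cnt.items():
                cnt[key] += c
        for query_id, prefix in queries[v]:      # queries asked about topic v
            answers[query_id] = cnt[prefix]
        return cnt

    dfs(root)                                    # deep trees would need an explicit stack
    return answers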

I think Schedule is a more curious problem. I came up with the proper comparator (t1 + p1·t2 > t2 + p2·t1), but how do you prove its correctness? Or is there an obvious O(n^2 or n^3) solution?

»
9 years ago, # |

I've solved the Wombats problem without using flows. The solution is dynamic programming.

Let's divide our tetrahedron into layers. For example, in the image above, the first layer consists of the highlighted vertices, the second of the two zeroes on the bottom level and the -2 on the blue level, and the third layer of the only red 1.

So, the layers are 2-dimensional pyramids with heights N, N - 1, ..., 1. Let's see which subsets of vertices can be taken from one layer. Say our layer is a 2D pyramid with height H; then we can divide it into H columns: the first column has size H, the second H - 1, and so on. Every possible subset can then be described by a non-increasing sequence H ≥ a_1 ≥ a_2 ≥ a_3 ≥ ... ≥ a_H, where a_i means that we have taken the a_i topmost vertices from the i-th column. Let's call such a sequence a state of that layer.

The state of the next (smaller) layer depends only on the state of the previous (larger) layer, and the only condition that has to hold (in addition to being non-increasing) is a_i ≤ b_i for every 1 ≤ i ≤ H, where a is the state of the layer of height H and b is the state of the layer of height H + 1.

So, we get a dynamic programming solution with S_1 + S_2 + S_3 + ... + S_N states, where S_k denotes the number of possible states of layer k. I suppose that S_k grows exponentially, so we have O(S_N) states overall. With another (very simple) DP we can find that S_12 < 10^6, so this is efficient enough.
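
As a sanity check on that bound, here is a small sketch of such a counting DP (it assumes, as implied above, that column i of a layer of height H has H + 1 - i vertices, so a_i ≤ H + 1 - i on top of the non-increasing condition):

from functools import lru_cache

def layer_states(H):
    """Count the sequences H >= a_1 >= ... >= a_H >= 0 with a_i <= H + 1 - i."""
    @lru_cache(maxsize=None)
    def go(i, prev):                       # choosing a_i; the previous value was prev
        if i > H:
            return 1
        return sum(go(i + 1, v) for v in range(min(prev, H + 1 - i) + 1))
    return go(1, H)

print(layer_states(12))                    # should come out below 10**6, matching S_12 < 10^6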

The transition of this DP is a bit tricky, but not hard, so I will omit it. I can say that the transitions can be done in O(N^2·S_N) time overall.

P.S. My code: ideone. Its complexity contains an additional factor that can be reduced, for example by using a trie for fast state lookup.