speedy03's blog

By speedy03, 9 years ago, In English

The contest is over. I guess it's OK to discuss the problems here.

It appears Ontology is not very difficult for most people. However, I don't even have a smart way to parse the parenthesized expression. All I did was generate a huge hash and traverse the relevant elements, which apparently resulted in TLE. I wonder what kind of tree structure is recommended?

I have a similar issue with finding an appropriate data structure to represent the problem in Wombats. Even though I tried randomized optimization, it is hard for me to construct a stuffed-animal removal sequence.

What categories of algorithms do these two problems belong to?


»
9 years ago

Wombats can be solved with max-flow. Build a graph of dependencies (taking i forces us to take j) and solve the closure problem for it.
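
For readers unfamiliar with the closure problem mentioned here, below is a minimal sketch of the standard maximum-weight closure reduction to min cut, using a plain Edmonds-Karp max-flow. The function name max_weight_closure and the (value, deps) input format are my own, and the Wombats-specific modelling (which item forces which) is not shown.

from collections import deque

def max_weight_closure(value, deps):
    # value[i] is the weight of item i; (i, j) in deps means "taking i forces taking j".
    n = len(value)
    S, T = n, n + 1
    INF = float('inf')
    cap = [dict() for _ in range(n + 2)]
    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)                  # residual (reverse) edge
    gain = 0
    for i, w in enumerate(value):
        if w > 0:
            gain += w
            add_edge(S, i, w)                    # positive items hang off the source
        elif w < 0:
            add_edge(i, T, -w)                   # negative items connect to the sink
    for i, j in deps:
        add_edge(i, j, INF)                      # a dependency edge must never be cut
    flow = 0
    while True:                                  # Edmonds-Karp: shortest augmenting paths
        parent = {S: None}
        q = deque([S])
        while q and T not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if T not in parent:
            break
        bottleneck, v = INF, T
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = T
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
    return gain - flow                           # best total value of a valid closed subset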

There are a lot of ways to solve Ontology; I did it offline, storing a map of prefixes (with hashes instead of the original strings) for each subtree and answering all queries in a single DFS.
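
A rough sketch of that offline idea with small-to-large merging of the per-subtree maps; the children / questions / queries_at layout, the rolling hash, and the function name solve are my assumptions (hash collisions are ignored for brevity), not the commenter's actual code.

import sys
sys.setrecursionlimit(1 << 20)

def solve(children, questions, queries_at, num_queries):
    # children[v]: child topics of v; questions[v]: question strings attached to topic v;
    # queries_at[v]: list of (prefix, query index) asked about the subtree of v.
    MOD, BASE = (1 << 61) - 1, 131
    def hashes(s):                         # rolling hashes of all prefixes of s
        h = 0
        for ch in s:
            h = (h * BASE + ord(ch)) % MOD
            yield h
    ans = [0] * num_queries
    def dfs(v):
        cnt = {}                           # hash of prefix -> number of questions with it
        for s in questions[v]:
            for h in hashes(s):
                cnt[h] = cnt.get(h, 0) + 1
        for c in children[v]:
            sub = dfs(c)
            if len(sub) > len(cnt):        # small-to-large: always merge into the bigger map
                cnt, sub = sub, cnt
            for h, k in sub.items():
                cnt[h] = cnt.get(h, 0) + k
        for prefix, idx in queries_at[v]:
            h = 0
            for ch in prefix:
                h = (h * BASE + ord(ch)) % MOD
            ans[idx] = cnt.get(h, 0)
        return cnt
    dfs(0)                                 # topic 0 assumed to be the root
    return ans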

  • »
    »
    9 years ago

    I did it the same way for the Ontology problem, but instead of storing hashes I stored the strings themselves, so I got TLE on the stress test.

  • »
    »
    9 years ago

    I feel stupid. Never in my life had I seen the closure problem before, and googling it, it seems to be a fairly standard problem. I guess there's always more to learn.

  • »
    »
    8 years ago

    For Ontology, can you elaborate a bit on how you answered the queries? I tried your approach: during the DFS I merged all of the children's unordered_maps into the parent's and then answered the queries for that parent. This worked on the first 75 test cases but timed out on the last 25.

»
9 years ago

For me, the most difficult problems were the machine learning ones. I got the maximum score on all the others.

BTW, 3 hours before the end of the contest I was 2nd (in the prize section). Then suddenly some rejudge happened and I dropped to 5th place, just 10 points away from 3rd, and 3 hours was really not enough for me to win them back. All I want to say is that running a rejudge 3 hours before the end of a week-long contest is not a reasonable thing to do.

P.S. To tell the truth, if I had used some standard machine learning algorithm (a neural network, an SVM, or something else) instead of implementing some algorithmic ideas, I would probably have gotten a much higher score.

  • »
    »
    9 years ago

    Using standard ML algorithms took me to the top 10 on both problems, though I doubt my solutions have any practical potential.

    As for the rest of the contest, Wombats was too hard for me. I even saw I_love_Tanya_Romanova's comment about using max-flow before the contest was extended, but I still did not manage to find a solution.

    • »
      »
      »
      9 years ago

      What were your standard algorithms? For instance, in the Labeler problem I tried to learn binary classifiers for every topic based on bigrams, but it brought me only slightly more than 100 points. As for Duplicate, it looks like an idea with locality-sensitive hashing and/or some metric for text comparison, plus learned weights (how important each field is), might work; however, I did not have time to submit it. I spent more time on the algorithmic problems. (I managed to solve all of them, but I guess it took me more time than it took you guys.)

      By the way, here is my code for the Ontology problem. I used persistent segment trees to solve it.

      • »
        »
        »
        »
        9 years ago

        I participated only in the 8-hour version of the contest, so my approach for Labeler is pretty straightforward.

        For every word in the input, count its frequency and the frequency of every topic over all questions in which this word occurs (so you know how frequent a given word is, and how frequent a given topic is for that word); then throw away the 40-50 most frequent words (they are "I", "is", "the", etc., and provide no information), and for every remaining word/topic pair calculate occurrences[topic][word] / entries[word]: this value is the score of the given topic for each occurrence of the given word in a query.

        After that, for every query you just sum up the scores of all topics over all words in the query and pick the top 10. This gives a bit more than 118 points.
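
        A minimal sketch of that scoring scheme, assuming the training data is available as (word list, topic list) pairs; the names training_questions, by_word, and top10_topics are mine:

        from collections import Counter, defaultdict

        entries = Counter()              # entries[word]: occurrences of word over all questions
        by_word = defaultdict(Counter)   # by_word[word][topic]: the occurrences[topic][word]
        for words, topics in training_questions:   # count from the text above, keyed by word
            for w in words:
                entries[w] += 1
                for t in topics:
                    by_word[w][t] += 1

        stop = {w for w, _ in entries.most_common(50)}   # most frequent words carry no signal

        def top10_topics(query_words):
            score = Counter()
            for w in query_words:
                if w in stop:
                    continue
                for t, c in by_word[w].items():
                    score[t] += c / entries[w]   # score of topic t per occurrence of word w
            return [t for t, _ in score.most_common(10)]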

      • »
        »
        »
        »
        9 years ago

        I did not solve 'Sorted set', by the way, so overall my points are not good.

        Algorithm for Duplicate: I transform each question into a string: "question_text" + the names of its topics + "view_count" + "follow_count" + "age". Each pair of questions (in both the training and the testing data) is represented as the concatenation of the strings of the individual questions. Then I convert the strings into Tf-idf vectors and run Logistic Regression on them. 140.60 points code.

        Algorithm for Labeler: Each string is converted into a Tf-idf vector. I considered a constant number (250) of potential labels for each question. After reading all questions, all useless labels (those not appearing in any label set) are removed. Then Logistic Regression is applied in a OneVsRest strategy (n_labels * (n_labels - 1) / 2 binary classifiers are trained to decide between each pair of labels). After that, some number of labels (chosen by estimated probability) are written as output. Strangely enough, the best number turned out to be exactly one label per question. 130.0 points code.

        In both problems, I played with the parameters (of Tf-idf, LogR and feature selection) to achieve maximum scores, without regard to any 'common sense'. Python's sklearn library provides good implementations of these algorithms, so the code is short and easy to write (during the initial 8-hour contest I got 134.67 and 122.63 respectively).

        By the way, Logistic Regression was used mainly because it supports the sparse matrix representation returned by TfidfVectorizer and also lets you estimate probabilities for each class.
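
        For reference, a bare-bones sketch of this kind of sklearn pipeline for the Duplicate problem; the helper as_text, the field names, and the train_pairs / train_labels / test_pairs variables are my guesses, and the linked code above is the authoritative version:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        def as_text(q):     # hypothetical helper: glue one question's fields into a string
            return " ".join([q["question_text"], " ".join(q["topics"]),
                             str(q["view_count"]), str(q["follow_count"]), str(q["age"])])

        train_docs = [as_text(a) + " " + as_text(b) for a, b in train_pairs]
        vec = TfidfVectorizer()
        X = vec.fit_transform(train_docs)            # sparse Tf-idf matrix
        clf = LogisticRegression()
        clf.fit(X, train_labels)                     # 1 = duplicate, 0 = not a duplicate

        test_docs = [as_text(a) + " " + as_text(b) for a, b in test_pairs]
        pred = clf.predict(vec.transform(test_docs))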

        • »
          »
          »
          »
          »
          9 years ago

          Good, that was exactly what I meant. But to me, using a neural network for Labeler seems more logical than Logistic Regression. I will try it now and see the result.

    • »
      »
      »
      9 years ago

      I got 130 (21st place) on Duplicate and 120 (9th place) on Labeler.

      And the sad thing is that a totally random solution gets 100 on Duplicate.

»
9 years ago

Ontology can be solved online in O(n) with a persistent trie and the Euler tour technique.

  • »
    »
    9 years ago

    Understanding and coding that solution should be the definition of masochism, but a great one nonetheless :D

  • »
    »
    9 years ago

    Mind sharing the code please? Thanks

    • »
      »
      »
      9 years ago

      I hadn't coded it when I wrote that comment, but well... Just for you, enjoy :) #Z5Qqn9

      But now it takes too much memory and doesn't pass all the tests :(
      I believe a persistent array in the trie nodes could help it fit, but I definitely don't want to write that :)

  • »
    »
    9 years ago

    I did it with a persistent segment tree. How do you do it in O(n)?

    • »
      »
      »
      9 years ago

      In an Euler tour, all descendants of a node form a contiguous subsegment. With a persistent trie you can answer the queries on prefixes of that sequence, so the answer is get(r) - get(l - 1).
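
      A compressed sketch of that idea (the Node layout and names are mine, and input handling is omitted): version i of the trie contains the first i questions in Euler-tour order, and every node stores how many inserted strings pass through it.

      class Node:
          __slots__ = ("cnt", "nxt")
          def __init__(self, cnt=0, nxt=None):
              self.cnt = cnt              # inserted strings whose path goes through this node
              self.nxt = dict(nxt or {})  # children are copied, so older versions stay intact

      EMPTY = Node()

      def insert(root, s):                # returns a NEW root; the old version remains valid
          new_root = Node(root.cnt + 1, root.nxt)
          node, prev = root, new_root
          for ch in s:
              child = node.nxt.get(ch, EMPTY)
              new_child = Node(child.cnt + 1, child.nxt)
              prev.nxt[ch] = new_child
              node, prev = child, new_child
          return new_root

      def count(root, prefix):            # how many inserted strings start with prefix
          node = root
          for ch in prefix:
              if ch not in node.nxt:
                  return 0
              node = node.nxt[ch]
          return node.cnt

      # versions[i] = trie after the first i questions in Euler-tour order; a subtree query
      # (segment l..r of the tour, prefix p) is count(versions[r], p) - count(versions[l - 1], p).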

»
9 years ago

However, I don't even have a smart way to parse the parenthesized expression

Here's a sketch in C++ (tokens is the tokenized expression and parent[] is the result, both from the surrounding context):

stack<vector<string>> ST;
ST.push({});                                      // top level of the expression
for (const string& topic : tokens) {
    if (topic == "(") ST.push({});                // start a new child list
    else if (topic == ")") {                      // the list is finished; its parent is the
        vector<string> kids = ST.top(); ST.pop(); // last topic pushed before the '('
        for (const string& x : kids) parent[x] = ST.top().back();
    }
    else ST.top().push_back(topic);               // a plain topic name
}

As for the main problem: sort the queries by the distance of their topic from the root; build a trie for the current topic; merge it with the children's tries (at each step merging the smaller trie into the larger one, as sketched below); then answer the queries for that topic.
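
The merge step of that small-to-large approach might look roughly like this; the dict-based node layout {'cnt', 'next'} is my own assumption:

def merge(big, small):
    # Fold trie `small` into trie `big` (the caller picks `big` as the one with more nodes)
    # and return `big`.  Each node is assumed to be {'cnt': int, 'next': {char: node}}.
    big['cnt'] += small['cnt']
    for ch, child in small['next'].items():
        if ch in big['next']:
            merge(big['next'][ch], child)
        else:
            big['next'][ch] = child
    return big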

I think Schedule is a more curious problem. I came up with the right comparator (t_1 + p_1·t_2 > t_2 + p_2·t_1), but how do you prove its correctness? Or is there an obvious O(n^3) solution?

»
9 years ago

I've solved the Wombats problem without using flows. The solution is dynamic programming.

Let's divide our tetrahedron into layers. For example, in the image above, the first layer consists of the highlighted vertices, the second of the two zeroes on the bottom level and the -2 on the blue level, and the third layer of the single red 1.

So the layers are 2-dimensional pyramids with heights N, N - 1, ..., 1. Let's see which subsets of vertices can be taken from a single layer. Say our layer is a 2D pyramid of height H; then we can divide it into H columns: the first column has size H, the second H - 1, and so on. Every feasible subset can then be described by a non-increasing sequence H ≥ a_1 ≥ a_2 ≥ ... ≥ a_H, where a_i means we have taken the a_i topmost vertices of the i-th column. Let's call such a sequence a state of that layer.

The state of the next (smaller) layer depends only on the state of the previous (larger) layer, and the only condition that must hold (in addition to being non-increasing) is a_i ≤ b_i for every 1 ≤ i ≤ H, where (a_i) is the state of the layer of height H and (b_i) is the state of the layer of height H + 1.

So we get a dynamic programming solution with S_1 + S_2 + ... + S_N states, where S_k denotes the number of possible states of the layer of height k. S_k grows roughly exponentially, so this is O(S_N) states overall. With another (very simple) DP we can find that S_12 < 10^6, so this is efficient enough.
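
That "very simple" counting DP could look like the sketch below; the per-column cap a_i ≤ H + 1 - i is my reading of the column sizes described above.

from functools import lru_cache

def num_states(H):
    # Count non-increasing sequences a_1 >= a_2 >= ... >= a_H >= 0 with a_i <= H + 1 - i
    # (column i of a layer of height H contains H + 1 - i vertices).
    @lru_cache(maxsize=None)
    def go(i, prev):                 # prev = value chosen for a_{i-1}
        if i > H:
            return 1
        return sum(go(i + 1, a) for a in range(min(prev, H + 1 - i) + 1))
    return go(1, H)

print(num_states(12))                # 742900, i.e. S_12 < 10^6 as claimed above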

The transition of this DP is a bit tricky, but not hard, so I will omit it. I can say that the transition can be done in O(N^2 · S_N) time overall.

P.S. My code: ideone. Its complexity contains an additional factor that can be reduced, for example by using a trie for fast state lookup.