Блог пользователя ayushmishra.iit

Автор ayushmishra.iit, история, 9 лет назад, По-английски

I have a string of length n. I could have queries like query(i, j) which should return the hash of the sub-string starting at i and ending at j inclusive. This query should be answered in O(1). And a preprocessing of o(n) or o(n log n) is allowed initially.

Can anyone give me idea how to do such thing?

  • Проголосовать: нравится
  • -4
  • Проголосовать: не нравится

»
9 лет назад, # |
Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

Apparently, it's problem from a live contest. It wouldn't be appropriate for me to answer.

»
9 лет назад, # |
Rev. 3   Проголосовать: нравится -20 Проголосовать: не нравится

This is related to a problem of the ongoing codechef june challenge CHSTR . This might be a pure coincidence, but please do not ask questions pertaining to live contest problems.

  • »
    »
    9 лет назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    I'm sorry then. I didn't know it was.

  • »
    »
    9 лет назад, # ^ |
      Проголосовать: нравится +68 Проголосовать: не нравится

    What are you talking about? Substring hash is very well-known and basic algorithm, it is OK to talk about it despite coincidence with any ongoing contest. Will you try to ban any discussion of segment tree if it will be used in some problems from Long challenge?

    • »
      »
      »
      9 лет назад, # ^ |
      Rev. 2   Проголосовать: нравится -12 Проголосовать: не нравится

      I was just putting forth my views. It seemed a bit odd to me that what this post is asking is really similar to what the problem asks. It could, or couldn't be a coincidence.

      And, this guy is, indeed, attempting the very same problem: Submission History

»
9 лет назад, # |
  Проголосовать: нравится +34 Проголосовать: не нравится

I can do it with O(1) preprocessing time: Define a hash function which hashes every string to 0. Then the hash-function is given by: f(i, j) = 0.

»
9 лет назад, # |
Rev. 6   Проголосовать: нравится +27 Проголосовать: не нравится

You can have a polynomial hash of all prefixes. That is, you store:

F[i] = [ a1*B^(i-1) + a2*B^(i-2) ... ai*B^0 ] % MOD

This is the most common hashing function and you can calculate all F[i] from 1 to N in O(N) time. Then suppose we want to get H(i,j) that returns the hash of the sequence in interval [i; j]. Well we have

F[j] =    [ a1*B^(j-1) + a2*B^(j-2) ... aj*B^0 ] % MOD
F[i-1] =  [ a1*B^(i-2) + a2*B^(i-3) ... a[i-1]*B^0 ] % MOD

Now it's easy to see that:

F[i-1]*B^(j-i+1) = [ a1*B^(j-1) + a2*B^(j-2) ... a[i-1]*B^(j-i+1) ] % MOD

So finally, we can subtract that from F[j] and we get the proper hash. The final answer, if I'm not mistaken, is

H(i,j) = ( F[j]-F[i-1]*B^(j-i+1) ) % MOD

Having all F[] calculated and keeping a precomputed array with the powers of B will give you O(N) preprocessing time and O(1) query time.

  • »
    »
    9 лет назад, # ^ |
    Rev. 2   Проголосовать: нравится -11 Проголосовать: не нравится

    Deleted

  • »
    »
    9 лет назад, # ^ |
      Проголосовать: нравится +7 Проголосовать: не нравится

    Searching for such kind of explanation ! Thanks a lot Enchom

  • »
    »
    9 лет назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    Can you focus on what values we should take as B and MOD?

    • »
      »
      »
      9 лет назад, # ^ |
        Проголосовать: нравится +15 Проголосовать: не нравится

      As you probably know, B is your base and MOD is the modulo you work with. There are a few rules I have when choosing such values. The idea of hashing is to give you a fairly random number out of a string, and with minor changes in the string to produce very different hash number. Here are the rules I use:

      1) Make sure B and MOD are both prime numbers. This helps alot in producing "random" hashes out of strings and is just generally a must.

      2) Make your MOD sufficiently large. I usually use 10^9+7 or 10^9+9 as those are large known primes. The reason I use those values and not larger is because it allows multiplying without overflowing long long

      3) If you have a small alphabet, make B larger than your largest value. I don't think this is necessary and I do not know if it helps at all, but I've been told it's nice.

      A famous hashing technique is getting rid of MOD and using unsigned long long, as overflow will work as modulo. This is an amazing technique and works both fast and well, with the sad exception that there are found tests that break it and give alot of collisions. So while it looks like a nice alternative, in lots of online competitions anti-hash tests will exist.

      And finally I want to focus on an important thing when hashing, and that is avoiding collisions. Suppose you have one hash modulo 10^9+7, and in a problem you generate 100,000 hashes of different strings, and then want to check if there are duplicates. Surprisingly, according to the birthday problem your chance of having a collision among those 100,000 hashes is actually huge. In such cases, using double or triple hashing might be a good idea, even though it might significantly slow down the runtime.