ayushmishra.iit's blog

By ayushmishra.iit, history, 6 years ago, In English

I have a string of length n. I could have queries like query(i, j) which should return the hash of the sub-string starting at i and ending at j inclusive. This query should be answered in O(1). And a preprocessing of o(n) or o(n log n) is allowed initially.

Can anyone give me idea how to do such thing?

 
 
 
 
  • Vote: I like it
  • -4
  • Vote: I do not like it

»
6 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

Apparently, it's problem from a live contest. It wouldn't be appropriate for me to answer.

»
6 years ago, # |
Rev. 3   Vote: I like it -20 Vote: I do not like it

This is related to a problem of the ongoing codechef june challenge CHSTR . This might be a pure coincidence, but please do not ask questions pertaining to live contest problems.

  • »
    »
    6 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I'm sorry then. I didn't know it was.

  • »
    »
    6 years ago, # ^ |
      Vote: I like it +68 Vote: I do not like it

    What are you talking about? Substring hash is very well-known and basic algorithm, it is OK to talk about it despite coincidence with any ongoing contest. Will you try to ban any discussion of segment tree if it will be used in some problems from Long challenge?

    • »
      »
      »
      6 years ago, # ^ |
      Rev. 2   Vote: I like it -12 Vote: I do not like it

      I was just putting forth my views. It seemed a bit odd to me that what this post is asking is really similar to what the problem asks. It could, or couldn't be a coincidence.

      And, this guy is, indeed, attempting the very same problem: Submission History

»
6 years ago, # |
  Vote: I like it +34 Vote: I do not like it

I can do it with O(1) preprocessing time: Define a hash function which hashes every string to 0. Then the hash-function is given by: f(i, j) = 0.

»
6 years ago, # |
Rev. 6   Vote: I like it +27 Vote: I do not like it

You can have a polynomial hash of all prefixes. That is, you store:

F[i] = [ a1*B^(i-1) + a2*B^(i-2) ... ai*B^0 ] % MOD

This is the most common hashing function and you can calculate all F[i] from 1 to N in O(N) time. Then suppose we want to get H(i,j) that returns the hash of the sequence in interval [i; j]. Well we have

F[j] =    [ a1*B^(j-1) + a2*B^(j-2) ... aj*B^0 ] % MOD
F[i-1] =  [ a1*B^(i-2) + a2*B^(i-3) ... a[i-1]*B^0 ] % MOD

Now it's easy to see that:

F[i-1]*B^(j-i+1) = [ a1*B^(j-1) + a2*B^(j-2) ... a[i-1]*B^(j-i+1) ] % MOD

So finally, we can subtract that from F[j] and we get the proper hash. The final answer, if I'm not mistaken, is

H(i,j) = ( F[j]-F[i-1]*B^(j-i+1) ) % MOD

Having all F[] calculated and keeping a precomputed array with the powers of B will give you O(N) preprocessing time and O(1) query time.

  • »
    »
    6 years ago, # ^ |
    Rev. 2   Vote: I like it -11 Vote: I do not like it

    Deleted

  • »
    »
    6 years ago, # ^ |
      Vote: I like it +7 Vote: I do not like it

    Searching for such kind of explanation ! Thanks a lot Enchom

  • »
    »
    6 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Can you focus on what values we should take as B and MOD?

    • »
      »
      »
      6 years ago, # ^ |
        Vote: I like it +15 Vote: I do not like it

      As you probably know, B is your base and MOD is the modulo you work with. There are a few rules I have when choosing such values. The idea of hashing is to give you a fairly random number out of a string, and with minor changes in the string to produce very different hash number. Here are the rules I use:

      1) Make sure B and MOD are both prime numbers. This helps alot in producing "random" hashes out of strings and is just generally a must.

      2) Make your MOD sufficiently large. I usually use 10^9+7 or 10^9+9 as those are large known primes. The reason I use those values and not larger is because it allows multiplying without overflowing long long

      3) If you have a small alphabet, make B larger than your largest value. I don't think this is necessary and I do not know if it helps at all, but I've been told it's nice.

      A famous hashing technique is getting rid of MOD and using unsigned long long, as overflow will work as modulo. This is an amazing technique and works both fast and well, with the sad exception that there are found tests that break it and give alot of collisions. So while it looks like a nice alternative, in lots of online competitions anti-hash tests will exist.

      And finally I want to focus on an important thing when hashing, and that is avoiding collisions. Suppose you have one hash modulo 10^9+7, and in a problem you generate 100,000 hashes of different strings, and then want to check if there are duplicates. Surprisingly, according to the birthday problem your chance of having a collision among those 100,000 hashes is actually huge. In such cases, using double or triple hashing might be a good idea, even though it might significantly slow down the runtime.