rezzaque's blog

By rezzaque, history, 8 years ago, In English

Hello everyone! I encountered a problem at work. I have to search from 1000000 strings that are built from only 0's and 1's if any of them is at least %80 percent similar to an arbitrarily given string. By similarity I mean hamming distance. The thing is linear searching is no good enough to make it in at least 0.2 seconds. All strings are of the same length which is 64 characters. Any help or suggestion is welcome, have a nice day!

  • Vote: I like it
  • +4
  • Vote: I do not like it

| Write comment?
»
8 years ago, # |
  Vote: I like it +2 Vote: I do not like it

store strings as 64-bit numbers, Hamming distance between strings a and b will be popcount(a xor b). I believe this should fit in 0.2s.

  • »
    »
    8 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    But the strings are given as bits, how do I store them as 64 bit numbers?

    • »
      »
      »
      8 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      I mean, if the strings are just 1s and 0s, then it is simply a number in binary. Convert your strings into decimal numbers (64 bit integer type) using basic math. For example, say your dictionary of a million strings contains this string: "0011111000100011001100001110100011001000110011101000011110100111"

      This converts to 4477476230895929255. All 64 bit numbers will fit inside an unsigned long long in C/C++. Then, just do as CountZero said and do the xor trick to sweep through all million numbers. Should be very speedy.