Блог пользователя rezzaque

Автор rezzaque, история, 8 лет назад, По-английски

Hello everyone! I encountered a problem at work. I have to search from 1000000 strings that are built from only 0's and 1's if any of them is at least %80 percent similar to an arbitrarily given string. By similarity I mean hamming distance. The thing is linear searching is no good enough to make it in at least 0.2 seconds. All strings are of the same length which is 64 characters. Any help or suggestion is welcome, have a nice day!

  • Проголосовать: нравится
  • +4
  • Проголосовать: не нравится

»
8 лет назад, # |
  Проголосовать: нравится +2 Проголосовать: не нравится

store strings as 64-bit numbers, Hamming distance between strings a and b will be popcount(a xor b). I believe this should fit in 0.2s.

  • »
    »
    8 лет назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    But the strings are given as bits, how do I store them as 64 bit numbers?

    • »
      »
      »
      8 лет назад, # ^ |
        Проголосовать: нравится 0 Проголосовать: не нравится

      I mean, if the strings are just 1s and 0s, then it is simply a number in binary. Convert your strings into decimal numbers (64 bit integer type) using basic math. For example, say your dictionary of a million strings contains this string: "0011111000100011001100001110100011001000110011101000011110100111"

      This converts to 4477476230895929255. All 64 bit numbers will fit inside an unsigned long long in C/C++. Then, just do as CountZero said and do the xor trick to sweep through all million numbers. Should be very speedy.