Unfortunately, most derivations of the chance of polynomial hashing collision are invalid, wrong, or misleading, and finding reliable public sources with proofs is incredibly difficult. This article is a formal analysis of the method. The goal of this article is to complement well-known empirical facts with theory, provide boundaries on the probability of collision, justify common choices, and advise against certain popular parameters.
There is no code in this article, and it's mostly theoretical. It's more of a summa of everything we know about polynomial hashing in integers rather than a guide for beginners. You'll probably find something of interest still. I do, however, describe some methods of cracking hashes, which I can provide code and additional explanation for if someone asks in the comments and some general guidelines in the last section of the present article.
Table of contents
- The concept of polynomial hashing
- Classical justification of the coprimality requirement
- The pitfall
- Probability of size-bounded collision
- Probability of value-bounded collision
- Back to basics
- Conditions of surjectivity
- Handling lengths
- Side note on coprimality
- The two scenarios
- Reversal
- Thue-Morse sequence
- Attempt to reuse Thue-Morse sequence for other moduli
- Birthday paradox
- Lower bounds for anti-hash tests
- Particularities
- Key takeaways