sslotin's blog

By sslotin, 2 years ago, In English

I've got an interesting math problem.

Here is a binary search variant that picks the element to compare against at random instead of always picking the middle one:

int lower_bound(int x) {
    int l = 0, r = n; // [l, r]
    while (l < r) {
        int m = l + rand() % (r - l); // <- note that r is never picked because there's no point
        if (a[m] >= x)
            r = m;
        else
            l = m + 1;
    }
    return l;
}

Assuming that the keys and queries are also drawn independently at random, as $$$n \to \infty$$$, how many times more comparisons does the random binary search perform compared to the normal one?

Edit: the binary search used to return a value instead of an index. That is not correct when the lower bound doesn't exist.

»
2 years ago, # |

I will use $$$1$$$-indexing in what follows. I'll also count only element comparisons: the $$$l < r$$$ comparisons are cheap, and counting them too just gives twice our count minus one, which doesn't change the ratio.

Firstly note that we can assume that the underlying array is $$$1, \ldots, n$$$ (since we can compress elements, and for the sake of simplicity we can assume that the array consists of distinct elements). I suspect that the result should remain the same even when there are duplicate elements.

Let $$$a(l, r, q)$$$ be the expected number of element comparisons this algorithm does when you have a query $$$q$$$ in an interval $$$[l, r]$$$.

Then note that $$$a(l, r, q) = a(1, r - l + 1, q - l + 1)$$$.

Using this, if $$$b(n, q) = a(1, n, q)$$$, we get the recurrence $$$b(n, q) = 1 + \frac{1}{n - 1}\left(\sum_{i = 1}^{q - 1} b(n - i, q - i) + \sum_{i = q}^{n - 1} b(i, q)\right)$$$, since the split point is uniform over the $$$n - 1$$$ choices.

We want $$$c(n) = \mathbb{E}_q[b(n, q)] = \frac{1}{n}\sum_{q = 1}^n b(n, q)$$$, and the recurrence on summing for $$$q = 1$$$ to $$$n$$$ becomes $$$c(n) = 1 + \frac{\sum_{i = 1}^{n - 1} i c(i)}{n(n - 1)/2}$$$.

Let $$$d(n) := n c(n) + 1$$$. We then have $$$\frac{d(n + 1)}{n + 1} - \frac{d(n)}{n} = \frac{2n - 1}{n(n + 1)} = \frac{3}{n + 1} - \frac{1}{n}$$$, which telescopes.

So we have $$$c(n) = 2H_n - 2$$$, and the answer to the problem is asymptotically $$$\frac{2 \ln n}{\log_2 n} = 2 \ln 2 \approx 1.386$$$.

Edit: I believe that the binary search originally given in the post works correctly if and only if $$$a[n - 1] = \infty$$$.

»
2 years ago, # |

I thought it would be fun to solve it for a general probability distribution $$$P_n(i)$$$ over the choices: out of $$$\{1, 2, \ldots, n\}$$$, $$$P_n(i)$$$ is the probability of picking $$$i$$$. In the above problem, $$$P_n(i) = \frac{1}{n-1}$$$ for $$$i \in [1, n)$$$ and $$$0$$$ otherwise.

The equation for the number of comparisons for $$$n, q$$$ becomes $$$b(n, q) = 1 + \sum_{i=1}^{q-1}P_n(i)\,b(n-i, q-i) + \sum_{i=q}^{n-1}P_n(i)\,b(i, q)$$$, and the expected number of comparisons for $$$n$$$ is $$$\mathbb{E}(n) = \frac{1}{n}\sum_{q=1}^{n}b(n, q)$$$.

Simplifying gives

$$$\mathbb{E}(n) = 1 + \dfrac{1}{n}\sum_{i=1}^{n-1}(P_n(i)+P_n(n-i))i\mathbb{E}(i) = 1 + \dfrac{1}{n}\sum_{i=1}^{n-1}Q_n(i)i\mathbb{E}(i)$$$

Here $$$Q_n(i) = P_n(i)+P_n(n-i)$$$. Note that $$$Q_n$$$ is symmetric, i.e., $$$Q_n(i) = Q_n(n-i)$$$ for all $$$i \in [1, n-1]$$$.

[UPD: what follows is not quite rigorous as pointed out]

Using this we can argue that the normal binary search is optimal. We have the trivial bounds $$$O(1)\leq O(\mathbb{E}(i))\leq O(i)$$$, which give $$$O(i)\leq O(i\mathbb{E}(i))\leq O(i^2)$$$, so $$$i\mathbb{E}(i)$$$ is concave upwards. From this, putting any weight $$$Q_n(i) > 0$$$ on $$$i > \lceil\frac{n}{2}\rceil$$$ is sub-optimal, and since $$$Q_n$$$ is symmetric, an optimal distribution also has $$$Q_n(i) = 0$$$ for $$$i < \lfloor\frac{n}{2}\rfloor$$$.

  • »
    »
    2 years ago, # ^ |

    The claim that $$$O(i) \le O(i \mathbb{E}(i)) \le O(i^2) \implies i\mathbb{E}(i)$$$ is concave upwards sounds a bit dubious to me (it doesn't hold for arbitrary functions).

    However, it can be shown that for uniformly random queries (i.e., integers uniformly chosen from $$$[0, n]$$$) and the corrected version of binary search in the original post ($$$r = n$$$ instead of $$$r = n - 1$$$), we have $$$\mathbb{E}(n) \ge \frac{f(n)}{n}$$$, where $$$f$$$ is the binary entropy function (in the naming of https://oeis.org/A003314; this property is proved in https://epubs.siam.org/doi/10.1137/0117001, which is not open access but fairly simple to prove). This bound is sharp as well.

    $$$f$$$ is a piecewise linear function with breakpoints at the powers of $$$2$$$, where its value is precisely $$$n \log_2 n$$$.

    • »
      »
      »
      2 years ago, # ^ |

      Here's a complete proof of why normal binary search is optimal (no matter which midpoint you pick). I'll rederive using the correct binary search, which is as follows:

      int lower_bound(int x) {
          int l = 0, r = n; // [l, r)
          while (l < r) {
              int m = rng(l, r); // random number in [l, r)
              if (a[m] >= x) r = m;
              else l = m + 1;
          }
          return l;
      }
      

      I will assume that the queries are uniformly random from $$$[0, n]$$$ (rather than from $$$[0, n - 1]$$$, since there are $$$n + 1$$$ possible answers). I'll also revert to $$$0$$$-based indexing.

      For a given $$$q$$$, the number of steps becomes $$$b(n, q) = 1 + \sum_{i = 0}^{q - 1} P(n, i) b(n - i - 1, q - i - 1) + \sum_{i = q}^{n - 1} P(n, i) b(i, q)$$$.

      Taking the average from $$$q = 0$$$ to $$$n$$$, we get $$$c(n) = 1 + \frac{1}{n + 1} \sum_{i = 0}^{n - 1} Q(n, i) (i + 1) c(i)$$$, where $$$Q(n, i) = P(n, n - i - 1) + P(n, i)$$$. Taking advantage of the fact that $$$Q(n, i) = Q(n, n - i - 1)$$$, we can write this as $$$c(n) = 1 + \frac{1}{2(n + 1)} \sum_{i = 0}^{n - 1} Q(n, i) ((i + 1) c(i) + (n - i) c(n - i - 1))$$$

      We can lower bound this expression by $$$1 + \frac{(k_n + 1) c(k_n) + (n - k_n) c(n - k_n - 1)}{n + 1}$$$, where $$$k_n = \arg \min_k (k + 1) c(k) + (n - k) c(n - k - 1)$$$. Note that this bound is achievable, by putting all probability mass at either $$$k_n$$$ or at $$$n - k_n - 1$$$.

      Let $$$d(n) = (n + 1) c(n)$$$. Then we have $$$d(n) = n + 1 + \min_{k = 0}^{n - 1} (d(k) + d(n - 1 - k))$$$.

      Lemma: If $$$f$$$ is convex on the integers, then for $$$a \le b$$$ and $$$k \ge 0$$$ we have $$$f(a) + f(b) \le f(a - k) + f(b + k)$$$.

      Proof: Use convexity to represent $$$a$$$ and $$$b$$$ as convex combinations of $$$a - k$$$ and $$$b + k$$$.

      Claim: $$$d(n)$$$ is convex over $$$\mathbb{Z}_{\ge 0}$$$.

      Proof: It is sufficient to show that $$$d(n) \le \frac{d(n - 1) + d(n + 1)}{2}$$$ for $$$n \ge 1$$$. This is trivially true for $$$n = 1$$$. Now let's suppose that $$$d$$$ is convex on $$$[0, n]$$$ for some $$$n \ge 1$$$. Note that by the lemma, for all $$$1 \le i \le n + 1$$$, $$$d(i) = i + 1 + d(\lfloor (i - 1)/2 \rfloor) + d(\lceil (i - 1)/2 \rceil)$$$. The inductive step is now straightforward, and we are done.

      Note that by the proof of the previous claim, we directly have that $$$k_n$$$ is the midpoint of the range (and when there are two midpoints, any of them works), and we are thus done.

      Edit: from this recurrence, using induction it is also fairly straightforward to show that this function is piecewise linear, with the slopes changing at $$$n = 2^k - 1$$$, where $$$d(2^k - 1) = k \cdot 2^k$$$ (equivalently, $$$c(2^k - 1) = k$$$). More precisely, we can show that if $$$n + 1$$$ is not a power of 2, then $$$2d(n) = d(n - 1) + d(n + 1)$$$.

»
2 years ago, # |

My level is not very advanced, but I know a lot about generating random numbers. To pick a random number from l to r inclusive, the usual formula is start + rand() % (end - start + 1).

But here it's m = l + rand() % (r - l); why didn't you add 1 inside the parentheses?

  • »
    »
    2 years ago, # ^ |

    Because picking the last element doesn't really do anything, as the comparison won't shrink the search range.

    Think of it as randomly selecting between the non-empty splits of an $$$n$$$-element array: there are exactly $$$(n-1)$$$ of them, not $$$n$$$.

    • »
      »
      »
      2 years ago, # ^ |

      Ah. I think the culprit is that your lower_bound returns a value, not an index, and therefore assumes the last element is always larger than (or equal to) x. The usual lower_bound doesn't. Might be a good idea to add a comment about this to the post or something

    • »
      »
      »
      2 years ago, # ^ |

      Better now, but lower_bound still never returns n if all elements are less than x.

      • »
        »
        »
        »
        2 years ago, # ^ |

        Oh, for fuck's sake.

        Never thought that there would be a day I make two errors in a binary search.

        Especially after just writing a 20-page article on it.