div4only's blog

By div4only, history, 14 months ago, In English

Given an $$$1$$$-indexed integer array $$$a=[a_1,\,a_2,\,a_3,...,\,a_n]$$$ and a fixed windows size $$$k$$$,define sliding window $$$I_{j,\,(j \geq k)} := [a_{j-k+1}, a_{j-k+2}, ..., a_j]$$$. For each $$$j \geq k$$$, we need to answer a query:

How many sliding windows in $$$\{I_k, I_{k+1}, ..., I_{j-1}\}$$$ are less or equal than $$$I_j$$$ in the alphabetical order?

For example, $$$a=[1, 2, 1, 3]$$$ and $$$k=2$$$:

(1) For $$$j=2$$$, we should answer $$$0$$$.

(2) For $$$j=3$$$, we should answer $$$1$$$ as $$$[1,2] < [2,1]$$$ in alphabetical order.

(3) For $$$j=4$$$, we should answer $$$1$$$ as $$$[1,2] < [1,3]$$$ in alphabetical order.

Suffix array + LCP array can solve it in $$$O(nlogn)$$$ offline. But how about solving online? For example, what if $$$a$$$ is an unbounded datastream instead of an array? In the datastream case, you have to process $$$a_j$$$ and $$$I_j$$$ before reading $$$a_{j+1}$$$.

  • Vote: I like it
  • +16
  • Vote: I do not like it

»
14 months ago, # |
  Vote: I like it +10 Vote: I do not like it

With hashing+order statistics you can solve it in $$$O(n\log(n)\log(k))$$$. The order statistics tree gives you the position in the sorted sequence of all $$$I_j$$$ in $$$O(n\log(n)\cdot\texttt{cmp})$$$, where $$$\texttt{cmp}$$$ is the time needed to compare two sequences. With hashing you can compare two sequences in $$$O(\log(k))$$$ by binary searching the longest common prefix and then comparing the next index.

  • »
    »
    14 months ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    But here is a problem: The hash value of a window might be exponentially large, and you may need $$$O(k)$$$ time to maintain such a large number.

    • »
      »
      »
      14 months ago, # ^ |
        Vote: I like it +6 Vote: I do not like it

      Hash functions are surjective. There's no point in making it bijective for the reasons of performance. As a downside — collisions are possible. That's why you should minimize the chance of a collision (random moduli or base).