Generalizing Segment Trees with Rust

Правка en12, от EbTech, 2020-04-12 23:56:40

This blog post outlines the design of a very general data structure for associative range queries, in the Rust programming language.

In the "real world", self-balancing binary search trees can be augmented to handle a variety of range queries. However, for contest problems, statically allocated variants are much easier to code and usually suffice. The contest community has come to know these data structures as segments trees. Here, I will generalize most segment trees that you can find in the wild into one polymorphic data structure, that can easily be copy-pasted during online competitions. I will call it an ARQ tree. ARQ is pronounced "arc", which has a similar meaning to "segment", but also stands for "associative range query". It supports highly customizable range queries, the main requirement being that the aggregation operation must be associative.

Associativity and Semigroups

Let's begin with an array $$$a_1, a_2, a_3, \ldots, a_n$$$. Each $$$a_i$$$ belongs to a semigroup $$$(S, +)$$$, where + stands for any associative binary operation on $$$S$$$.

Associative Law: +: S x S -> S such that (a + b) + c = a + (b + c) for all a, b, c in S.

Because + is associative, we can drop the parentheses and talk about range aggregates of the form a_l + a_{l+1} + ... a_r.

The ARQ Problem

In the Associative Range Query problem, we wish to support two types of operations: - Given bounds l and r, compute the aggregate a_l + a_{l+1} + ... a_r - Given bounds l and r, and a function f: S -> S, replace a_i with f(a_i) for all l <= i <= r.

In typical instances where computing a + b or f(a) take O(1) time, we wish to support each operation in O(log n) time.

Identity and Monoids

The segment tree is often presented as operating on monoids instead of semigroups. A monoid is simply a semigroup with a special identity element:

Identity Law: id in S such that a + a + id = id + a = a for all a in S.

Here's how you would represent Semigroup and Monoid as traits in Rust. The programmer must verify that each function satisfies its corresponding law:

trait Semigroup {
    fn compose(&self, other: &Self) -> Self;
}

trait Monoid: Semigroup {
    fn identity() -> Self;
}

In practice the trait bound Monoid turns out to be equivalent to Semigroup + Clone, and either of the two would suffice for our purposes. A semigroup can trivially be extended into a monoid by adding an identity element, and a monoid can implement Clone via composition with the identity element:

impl<T: Semigroup + Clone> Semigroup for Option<T> {
    fn compose(&self, other: &Self) -> Self {
        match self {
            Some(ref a) => match other {
                Some(ref b) => Some(a.compose(b)),
                None => self.clone()
            },
            None => other.clone()
        }
    }
}

impl<T: Semigroup + Clone> Monoid for Option<T> {
    fn identity() -> Self {
        None
    }
}

impl<T: Monoid> Clone for T {
    fn clone(&self) -> Self {
        self.compose(T::identity())
    }
}

Arq Tree API v1: pointwise updates

Given that Monoid + Semigroup is equivalent to Monoid, the choice between the two is an implementation detail, which may have tradeoffs in performance and ergonomics depending on the application. Personally, I found it easier to work with the Monoid trait. Our first API will only support pointwise updates instead of range updates:

pub struct ArqTree<T> {
    val: Vec<T>
}

impl<T: Monoid> ArqTree<T> {
    pub fn modify(&mut self, pos: usize, f: &dyn Fn(&T) -> T) {
        // implement modify
    }
    
    pub fn query(&self, l: usize, r: usize) -> T {
        // implement query
    }
}

I won't provide a full implementation: you may use other segment tree guides as a reference. In summary, we build a complete binary tree on top of our array. modify(pos, f) will replace a[pos] by f(a[pos]), then recompute each of the ancestors of pos by applying + to its two children. This will work with no restrictions on the function f. Its time complexity is comprised of one application of f and O(log n) applications of +.

Shortcomings of the v1 API

Our simple v1 API can't support efficient range updates! In order to update an entire range efficiently, we will need to apply f lazily, storing it in internal nodes of the tree. If multiple updates are performed, we may have to store a composition of updates for postponed application. While one may implement a composition operation f \circle g which simply calls applies g first and then f, this makes the cost of function application no longer O(1)!

Thus, we must switch from function pointers to an implicit, composable representation of our update operator. The composition of "add 5" and "add 7" is not "add 5 and then 7", it's "add 12": we store the number 12 instead of the function.

To recap, now we have a monoid (S, +) of array elements, as well as a monoid (F, \circle) representing the set of update functions that we're interested in. Why is F a monoid? Well it's easy to check that function composition is associative, making it at least a semigroup. Then we again can choose whether to have F: Monoid or F: Semigroup + Clone. For F, I found the latter to be more ergonomic.

However, it's not enough to have two monoids! The sets F and M also interact, because F acts on elements of M to produce new elements of M. While we're at it, I'm actually not too happy with the Semigroup and Monoid traits. There's more than one way for a type, say 32-bit integers, to be a monoid: the operation could be addition, multiplication, minimum, maximum, leftmost non-identity value, etc. With this design, we'd have to wrap our i32s in distinct wrappers for each Monoid implementation.

But remember that a struct is just a collection of data, and a struct impl is a collection of functions, and possibly some associated type and constants. Typically, functions inside an impl take a special self element, and are called methods, but this is not strictly necessary. So we can instead define a trait that packages two types S and F, alongside functions that act on these types!

Arq Tree API v2: range updates

pub struct ArqTree<T: ArqSpec> {
    app: Vec<Option<T::F>>,
    val: Vec<T::S>,
}

impl<T: ArqSpec> ArqTree<T> {
    pub fn modify(&mut self, l: usize, r: usize, f: &T::F) {
        // implement modify
    }
    
    pub fn query(&mut self, l: usize, r: usize) -> T::S {
        // implement query
    }
}

pub trait ArqSpec {
    type F: Clone;
    type M;

    /// For eager updates, compose() ho be unimplemented!(). For lazy updates:
    /// Require for all f,g,a: apply(compose(f, g), a) = apply(f, apply(g, a))
    fn compose(f: &Self::F, g: &Self::F) -> Self::F;
    /// For eager updates, apply() can assume to act on a leaf. For lazy updates:
    /// Require for all f,a,b: apply(f, op(a, b)) = op(apply(f, a), apply(f, b))
    fn apply(f: &Self::F, a: &Self::M) -> Self::M;
    /// Require for all a,b,c: op(a, op(b, c)) = op(op(a, b), c)
    fn op(a: &Self::M, b: &Self::M) -> Self::M;
    /// Require for all a: op(a, identity()) = op(identity(), a) = a
    fn identity() -> Self::M;
}

This version supports the previous setting of pointwise updates as well. In that case, identity() and op() must satisfy their respective monoid laws, but apply() can apply any arbitrary function, and compose() can remain unimplemented or even crash because my implementation will never call compose() when the modify operations use l == r.

However, if we plan to do range modifications, in which l < r, then f might get applied to an internal node of the tree! To ensure consistency, we require two additional laws:

Composition Law: (f \circle g)(a) = f(g(a)) for all f, g in F, a in S

Distributive Law: f(a + b) = f(a) + f(b) for all f in F, a, b in S

The composition law implies that F is a semigroup, and the distributive law ensures consistent interactions between F and S throughout the tree!

Static Implementation

GitHub link

To keep this blog post focused on the abstraction and general API, I won't go much implementation detail. This is a binary indexed arq tree based on a very cool blog post by Al.Cash, so I recommend reading that if you're interested!

Dynamic Implementation

GitHub link

And it turns out, a dynamically allocated version of the data structure can be initialized in O(1) instead of O(n) time, enabling implicit representation of array sizes exceeding 10^18! This data structure also has a persistence flag that can be toggled. When persistence is turned off, one should commit to one view of the data structure, and assume that all other views will be destroyed.

Теги rust, #segment tree, #lazy propagation, #persistence, monoid

История

 
 
 
 
Правки
 
 
  Rev. Язык Кто Когда Δ Комментарий
en30 Английский EbTech 2020-04-14 07:01:18 30 Tiny change: 's _segments trees_. H' -> 's _segment trees_. H'
en29 Английский EbTech 2020-04-14 05:55:59 4 Tiny change: 't as with _S_, we can c' -> 't as with $S$, we can c'
en28 Английский EbTech 2020-04-14 05:55:06 17 Tiny change: ', just as before, can choos' -> ', just as with _S_, we can choos'
en27 Английский EbTech 2020-04-14 05:46:13 5 Tiny change: 'push()`/`pop()` API ma' -> 'push()`/`pull()` API ma'
en26 Английский EbTech 2020-04-14 04:59:38 3732 Tiny change: 'ongs to a semigroup $(S, +)$;' -> 'ongs to a _semigroup_ $(S, +)$;'
en25 Английский EbTech 2020-04-14 03:10:53 63
en24 Английский EbTech 2020-04-14 03:09:03 169 Tiny change: 'pagation, based on ' -> 'pagation, which I like to call an ARQBIT. It's based on '
en23 Английский EbTech 2020-04-14 02:53:27 1126 Advanced Usage section
en22 Английский EbTech 2020-04-13 23:23:05 23 Tiny change: ' ARQ tree based on ' -> ' ARQ tree with lazy propagation, based on '
en21 Английский EbTech 2020-04-13 23:20:44 647
en20 Английский EbTech 2020-04-13 23:05:34 304
en19 Английский EbTech 2020-04-13 22:38:38 1708 Tiny change: '}\n~~~~~\nIn pract' -> '}\n~~~~~\n\n### Equivalence\n\nIn pract'
en18 Английский EbTech 2020-04-13 02:31:13 18
en17 Английский EbTech 2020-04-13 01:11:47 2 Tiny change: ' S \times X \rightarr' -> ' S \times S \rightarr'
en16 Английский EbTech 2020-04-13 01:10:47 81
en15 Английский EbTech 2020-04-13 01:08:33 1742 Tiny change: ')$, where + stands fo' -> ')$, where $+$ stands fo' (published)
en14 Английский EbTech 2020-04-13 00:43:22 5983
en13 Английский EbTech 2020-04-13 00:19:49 705 Tiny change: 'a_{l+1} + ... a_r$.\n\n' -> 'a_{l+1} + \ldots + a_r$.\n\n'
en12 Английский EbTech 2020-04-12 23:56:40 741 Tiny change: ' language.\n\nIn the' -> ' language. $\sqrt{3}$\n\nIn the'
en11 Английский EbTech 2019-07-25 22:44:43 67
en10 Английский EbTech 2019-07-25 22:43:58 6 Tiny change: 'sentation]\n//! (http://co' -> 'sentation](http://co'
en9 Английский EbTech 2019-07-25 22:43:12 969
en8 Английский EbTech 2019-07-25 22:38:33 821
en7 Английский EbTech 2019-07-25 22:31:36 710
en6 Английский EbTech 2019-07-25 22:19:38 2061
en5 Английский EbTech 2019-07-25 22:01:00 1493
en4 Английский EbTech 2019-07-16 10:30:56 2 Tiny change: 'ing law:\n~~~~~\nt' -> 'ing law:\n\n~~~~~\nt'
en3 Английский EbTech 2019-07-16 10:30:18 1217
en2 Английский EbTech 2019-07-16 06:20:45 1916 Tiny change: 'usize, f: fn(&T) -> T' -> 'usize, f: &dyn Fn(&T) -> T'
en1 Английский EbTech 2019-07-16 05:23:54 1106 Initial revision (saved to drafts)