Bekh's blog

By Bekh, history, 3 years ago, In English

Hello,

I was implementing a simple Trie when I found a behavior that I didn't understand. I followed the program using a debugger, the issue seems to be in the insert function. I explained the issue in code comments.

struct Trie{
    struct Node{
        int id, nxt[26];
        Node(int id) : id(id)
        {
            memset(nxt, -1, sizeof(nxt));
        }
    };
    vector<Node> v;
    vector<bool> leaf;
    void init()
    {
        v.clear();
        leaf.clear();
        addNode();
    }

    int addNode()
    {
        v.push_back(v.size());
        leaf.push_back(false);
        return (int)(v.size()) - 1;
    }

    int getSizeMinusOne()
    {
        return (int)(v.size()) - 1;
    }

    void insert(const string &s)
    {
        int cur = 0;
        for (char c : s)
        {
            c -= 'a';
            if (v[cur].nxt[c] == -1)
            {
                // Note I only use one of the following NOT both at the same time
                
                // Approach (1)
                // v[cur].nxt[c] doesn't get updated. Stays -1 and RTEs later on due to out of boundary access
                v[cur].nxt[c] = addNode();

                // Approach (2)
                // works fine
                addNode();
                v[cur].nxt[c] = getSizeMinusOne();
            }
            cur = v[cur].nxt[c];
        }
        leaf[cur] = true;
    }
};

I also found out that both approaches seem to work find when changing the array nxt to a vector instead of an array. Could someone explain why this happens when nxt is an array? And why approach (2) works fine even if nxt is an array?

  • Vote: I like it
  • 0
  • Vote: I do not like it

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by Bekh (previous revision, new revision, compare).

»
3 years ago, # |
  Vote: I like it +16 Vote: I do not like it

Bad stuff happens when a vector resizes. Saving addNode() to an intermediate variable and then setting v[cur].nxt[c] equal to that variable suffices (or what you did in approach 2).

  • »
    »
    3 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Could you provide more insight on what kind of bad stuff happen? When do the bad stuff effects fade and I can safely use the vector again? Why is the program not affected by the bad stuff if nxt is a vector instead of an array (Or is it just a coincidence?)?

    • »
      »
      »
      3 years ago, # ^ |
      Rev. 2   Vote: I like it +16 Vote: I do not like it

      I asked ecnerwala about something similar a while back, I think the same reasoning applies (although I can't say that I understand it that well).

      Unfortunately, before c++17, the index could get resolved first

      So it first computes v[cur].nxt[c] and stores that pointer

      And then calls addNode()

      But then that pointer is now invalid

      Ubsan should be able to catch this

      Pretty much in c++, evaluation order is essentially entirely up to the compiler

      It is true that in c++17, they guarantee rhs is evaluated before lhs, but i assume you want this code to work without that

      • »
        »
        »
        »
        3 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Aha, got you. And yes, I'm using C++14. Calling v.reserve() with a big amount before running the code works fine so this is indeed the issue.
        And it seems to evaluate the rhs first when nxt is a vector for some reason lol (Tried generating many random strings and inserting them to make sure pointers are indeed getting invalidated).
        Thank you for the explanation.