
twoslow's blog

By twoslow, history, 8 months ago, In English

Hello, Codeforces community!

  • I have a 2D sparse matrix of 0s and 1s of size NxN, where N = 16 million, i.e. 16,000,000.
  • The number of non-zeros (NNZ) in the matrix is ~450 million, i.e. 450,000,000.

What is the most efficient way to store such a matrix, given that I'll get queries like:

  • What are the non-zero elements in row number x?
  • What are the non-zero elements in column number y?

Currently I'm storing the matrix in CSR format, which takes an integer array of size NNZ and another array of size N + 1.

This takes:

  • (450,000,000 × 4 bytes) / (1024 × 1024) ≈ 1716 MB of memory for row storage, and the same amount again for column storage.

Overall, the structure takes about 3.5 GB of memory.

I want to compress it so that memory consumption is reduced while queries stay fast. I'm looking for some kind of serialization or compression technique that cuts the memory footprint without hurting the time complexity.
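For reference, a minimal sketch of the CSR layout described above (struct and function names are my own, not from the post); the row-pointer array has N + 1 entries and the column-index array has NNZ entries, which matches the memory figures:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal CSR sketch. row_ptr has N + 1 entries; col_idx has NNZ entries.
struct CSR {
    std::vector<int64_t> row_ptr;  // row_ptr[r] .. row_ptr[r+1] delimit row r
    std::vector<int32_t> col_idx;  // column index of each non-zero

    // Non-zero columns of row r: O(1) to locate, O(output) to scan.
    std::pair<const int32_t*, const int32_t*> row(int r) const {
        return {col_idx.data() + row_ptr[r], col_idx.data() + row_ptr[r + 1]};
    }
};

// Build a CSR from (row, col) pairs with a counting sort over rows.
CSR build_csr(int n, const std::vector<std::pair<int, int>>& nz) {
    CSR m;
    m.row_ptr.assign(n + 1, 0);
    for (const auto& [r, c] : nz) m.row_ptr[r + 1]++;             // count per row
    for (int r = 0; r < n; r++) m.row_ptr[r + 1] += m.row_ptr[r]; // prefix sums
    m.col_idx.resize(nz.size());
    std::vector<int64_t> next(m.row_ptr.begin(), m.row_ptr.end() - 1);
    for (const auto& [r, c] : nz) m.col_idx[next[r]++] = c;       // scatter columns
    return m;
}
```

Column queries need the same structure built on the transpose (CSC), which is where the second ~1.7 GB comes from.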

Any help regarding this is most welcome.


»
8 months ago, # |
Rev. 2

Have you tried run-length encoding? Another way is to store just the index of the next one in each row/column, then drop the extra zeros.

  • »
    »
    8 months ago, # ^ |

I haven't tried RLE; I'll read about it. The CSR format doesn't store 0s: I use two arrays, one of size N and another of size NNZ.

»
8 months ago, # |

You could get away with 3 bytes instead of 4 bytes, because the column and row indices are < 16,000,000 < 16,777,216 = 2^24.
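A minimal sketch of such a packed array (the struct name is illustrative), valid while every index stays below 2^24:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Store each index in 3 little-endian bytes instead of 4.
// Only valid while all values are < 2^24 = 16,777,216.
struct Packed24 {
    std::vector<uint8_t> bytes;  // 3 bytes per stored index

    void push(uint32_t v) {      // precondition: v < (1u << 24)
        bytes.push_back(v & 0xFF);
        bytes.push_back((v >> 8) & 0xFF);
        bytes.push_back((v >> 16) & 0xFF);
    }
    uint32_t get(size_t i) const {  // read the i-th 24-bit value
        size_t p = 3 * i;
        return bytes[p] | (uint32_t(bytes[p + 1]) << 8)
                        | (uint32_t(bytes[p + 2]) << 16);
    }
    size_t size() const { return bytes.size() / 3; }
};
```

That alone would shave ~25% off the 3.5 GB, at the cost of slightly slower element access.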

  • »
    »
    8 months ago, # ^ |

This would have been a great optimisation, but unfortunately the number of rows keeps growing, at a rate of about 0.1 million per month, so 3-byte indices would no longer be a good solution after ~10 months.

    • »
      »
      »
      8 months ago, # ^ |

There are some ways to get around that. For example, you could split your matrix into chunks whose local indices fit in 3 bytes, or maybe even 2 bytes.
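The chunking idea can be sketched like this (names and the 2-byte chunk width are my own choices): map a global column index to a (chunk, local) pair, where chunks are 2^16 wide so the local part always fits in 2 bytes no matter how large N grows:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Chunks are 64K columns wide, so a local index fits in 16 bits.
constexpr uint32_t CHUNK = 1u << 16;

// Global column index -> (chunk id, 2-byte local index).
inline std::pair<uint32_t, uint16_t> split(uint32_t col) {
    return {col / CHUNK, uint16_t(col % CHUNK)};
}

// Reassemble the global index from its two parts.
inline uint32_t join(uint32_t chunk, uint16_t local) {
    return chunk * CHUNK + local;
}
```

A full structure would then store, per row, a short list of (chunk id, run length) pairs plus the 16-bit locals, which also tends to compress well because non-zeros cluster.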