Tutorial on SIMD vectorisation to speed up brute force

Revision en3, by maomao90, 2022-01-02 13:33:21

I decided to write a blog on this as I was doing a problem on local judge and I decided to try to speed up my brute force code. However, it was quite difficult to find resources on SIMD vectorisation, so I decided to try to compile some of the resources I found together to hopefully allow more people to learn to scam brute force solutions

Thanks to iLoveIOI for proof reading.

Introduction

SIMD stands for single instruction, multiple data. SIMD allows us to give vector instructions which will allow the code to run faster. Vector instructions are instructions that handle short (2-16) vectors of integers / floats / characters in a parallel way by making use of the extra bits of space to do operations simultaneously.

The most common form of vectorisation is making use of pragmas such as

#pragma GCC optimize("O3,unroll-loops")
#pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")

This form of vectorisation is known as auto-vectorisation as the compiler vectorises the code for you. However, for more complicated examples, the compiler might be unable to detect what to vectorise automatically, so in this case we have to vectorise our code automatically by using SIMD vectorisation.

Basics

To make use of SIMD, we have to add the following code at the top of the code.

#include <nmmintrin.h>

#pragma GCC target("avx2")

Data types

There are three 128 bit data types for integers, floats and doubles.

__m128i i; // stores 4 integers since each integer is 32 bit and 128/32=4
__m128 f; // stores 4 floats since each float is 32 bit and 128/32=4
__m128d d; // stores 2 doubles since each double is 64 bit and 128/64=2

Note that __m128i can also be used to store 32 characters.

Intrinsics

In order to do operations on these data types, we will make use of SIMD intrinsics.

All the intrinsic functions have the following form _mm_instruction_datatype, for example, _mm_add_epi32(a, b) is a function to add 32-bit integers from $$$a$$$ and $$$b$$$ together.

Some examples of data-types are

  • si128 — signed 128-bit integer

  • epi8, epi32, epi64, epu8, epu32, epu64 — vector of signed 8-bit integers (character), signed 32-bit integers, signed 64-bit integers, together with the unsigned variations.

  • ps — 32-bit floats

  • pd — 64-bit doubles

In the following examples, I will be focusing on signed 32-bit integers (i.e. epi32) as it is most commonly used.

Loading

To make use of our __m128i data structure, we first have to load data into it.

__m128i zero = _mm_setzero_si128(); // set everything to 0
__m128i eight = _mm_set1_epi32(8); // set the vector of 4 integers to be equal to 8

__m128i pi = _mm_setr_epi32(1, 4, 1, 3); // NOTE: set and setr are opposites of each other
// mm_setr_epi32(3, 1, 4, 1) -> first value == 3, second value == 1, third value == 4, forth value == 1
// mm_set_epi32(3, 1, 4, 1) -> first value == 1, second value == 4, third value == 1, forth value == 3

int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};
__m128i a0 = _mm_load_si128((__m128i*) &arr[0]); // [0, 1, 2, 3]
__m128i a4 = _mm_load_si128((__m128i*) &arr[4]); // [4, 5, 6, 7]

Arithmetic

// Continue from previous code block
__m128i sum = _mm_add_epi32(a0, a4); // [4, 6, 8, 10] (sum i-th element of a0 with i-th element of a4)
__m128i sb = _mm_sub_epi32(pi, a0); // [1, 3, -1, 0] (subtract i-th element of a0 from i-th element of pi)
__m128i mx = _mm_max_epi32(pi, a0); // [3, 1, 4, 3] (maximum of the i-th element of pi and i-th element of a0)

__m128i mul = _mm_mul_epi32(sum, mx); // [12, 32] (4*3=12, 8*4=32) (more details below)

Most of the simpler arithmetic operations work by doing the arithmetic operation on each index from 0 to 3. However, for _mm_mul_epi32, since the multiplication of two 32-bit integers will become 64-bit integers, we can only store 2 64-bit integers in the result. Hence, _mm_mul_epi32 works by multiplying the 0-th index and 2-th index.

Reordering

Sometimes we might need to change the order of the 4 numbers

// Continue from previous code block

Tags brute-force, simd, scam

History

 
 
 
 
Revisions
 
 
  Rev. Lang. By When Δ Comment
en30 English maomao90 2024-04-07 12:38:41 746
en29 English maomao90 2023-07-31 12:10:52 18 Tiny change: 'ation.\n\n[cut]\n\nSyntax' -> 'ation.\n\nSyntax'
en28 English maomao90 2023-07-31 12:07:15 9 Tiny change: 'ation.\n\nSyntax' -> 'ation.\n\n[cut]\n\nSyntax'
en27 English maomao90 2022-01-04 15:10:01 0 (published)
en26 English maomao90 2022-01-04 15:08:59 445 Added mullo (saved to drafts)
en25 English maomao90 2022-01-03 15:44:16 0 (published)
en24 English maomao90 2022-01-03 15:43:41 1045 Added conclusion
en23 English maomao90 2022-01-03 14:09:40 19
en22 English maomao90 2022-01-03 14:01:41 454
en21 English maomao90 2022-01-03 13:54:48 141 More minor edits
en20 English maomao90 2022-01-03 13:49:50 342 Minor edits
en19 English maomao90 2022-01-03 11:54:29 19
en18 English jamessngg 2022-01-03 11:51:18 14
en17 English maomao90 2022-01-03 11:46:17 3 Tiny change: ' 3 spaces 4 intege' -> ' 3 spaces as 4 intege'
en16 English maomao90 2022-01-03 11:43:21 14367 Added more examples and references
en15 English maomao90 2022-01-03 10:10:08 2425 Added logical arithmetic
en14 English maomao90 2022-01-03 06:44:08 13 Added u to loadu and storeu
en13 English maomao90 2022-01-03 06:07:16 4 Tiny change: 'nly store 2 64-bit in' -> 'nly store two 64-bit in'
en12 English maomao90 2022-01-03 04:26:21 18 Tiny change: 'pi32(sum, (1 << 8) - 1);\n }' -> 'pi32(sum, 0b11111111);\n }'
en11 English iLoveIOI 2022-01-03 03:40:25 8
en10 English iLoveIOI 2022-01-03 03:39:44 28 Tiny change: '128 bits. Similar t' -> '128 bits. **Note that b is before a** Similar t'
en9 English iLoveIOI 2022-01-03 03:32:52 239
en8 English maomao90 2022-01-02 18:23:36 3815 Added examples and made minor edits
en7 English maomao90 2022-01-02 16:44:58 2611 Finalised syntax section
en6 English maomao90 2022-01-02 15:35:09 1080 Finish reordering
en5 English maomao90 2022-01-02 14:16:50 1267 Completed 2 functions of reordering
en4 English jamessngg 2022-01-02 13:40:06 38
en3 English maomao90 2022-01-02 13:33:21 2 Tiny change: 'de block\n~~~~~' -> 'de block\n\n~~~~~'
en2 English maomao90 2022-01-02 13:28:26 3049 Finish loading and arithmetic
en1 English maomao90 2022-01-02 10:52:20 1383 Initial revision (saved to drafts)