Curiosity about SSE Instructions

#	User	Rating
1	tourist	3843
2	jiangly	3705
3	Benq	3628
4	orzdevinwang	3571
5	Geothermal	3569
5	cnnfls_csy	3569
7	jqdai0815	3530
8	ecnerwala	3499
9	gyh20	3447
10	Rebelz	3409

#	User	Contrib.
1	maomao90	171
2	awoo	164
3	adamant	162
4	TheScrasse	159
5	nor	153
5	maroonrk	153
7	-is-this-fft-	152
8	Petr	146
9	orz	145
10	pajenegod	144

Hi, today I bring to you a question about something different, and perhaps completely useless anyways... :P

After user logicmachine's brilliant SSE solution to some problem (Link), I wonder how much SSE instructions really help program run speed. From the comments in that thread, it helps cut the speed by a factor of approximately 1/4, but is it really true?

I personally am not familiar with these things, but if they will really cut the speed of a program, that would be something curious to look into (maybe fit the TL with O(N^2) for N=1e5 ;))

Now you think, just get better, who needs these magic tricks, they're not even fair :P. But there are always some cases where I have some program, and it's barely over TL, and microoptimizations like these would (maybe?) bring it down to the TL. There's been some case where I've changed all the ints to shorts just to fit the TL... xD

So just for fun, if anyone can provide some light on SSE instructions (do they really help?), that would be pretty cool!

Thanks,

minimario

#include <immintrin.h> int dist2(const char* s, const char* t, unsigned int len) { __m256i a, b, c, e, f, g, h, x, x2, y, y2, b2, c2; int i = 0, d = len; for (; i < len;) { int j = 0; x = _mm256_set1_epi32(0); x2 = _mm256_set1_epi32(0); int lim = len, lim2 = i + 64 * 120; if (lim2 < lim) lim = lim2; for (; i < lim; i += 64) { b = _mm256_loadu_si256((void*)(s + i)); c = _mm256_loadu_si256((void*)(t + i)); y = _mm256_cmpeq_epi8(b, c); b2 = _mm256_loadu_si256((void*)(s + i + 32)); c2 = _mm256_loadu_si256((void*)(t + i + 32)); y2 = _mm256_cmpeq_epi8(b2, c2); x = _mm256_add_epi8(x, y); x2 = _mm256_add_epi8(x2, y2); } signed char z[64]; _mm256_storeu_si256((void*) z, x); _mm256_storeu_si256((void*) (z + 32), x2); for (j = 0; j < 64; j++) { d += z[j]; } } return d; } int dist(const char* s, const char* t, unsigned int len) { int i = 0, d = 0; while (len % 64) { len--; d += s[len] != t[len]; } d += dist2(s, t, len); return d; }

Comments (3)

Write comment?

gongy

8 years ago, # |

I'm really interested in this, too. It seems like pretty awesome knowledge to have (even if not always useful) but I"m not sure how to go about learning it.

→ Reply

slycelote

If a large part of your algorithm can be vectorized, then you can indeed get a 4 times speedup. Source: recently fixed a bug at my job that lead to a non-SSE matrix multiplication at one point.

microtony

← Rev. 2 →

+10

To compute the Hamming Distance of two strings we can use this function:

If we make 200000 calls with len = 200000, it takes 8.196 seconds.

int dist(const char* s, const char* t, unsigned int len) {
    int i, d = 0;
    for (i = 0; i < len; i++) {
        d += s[i] != t[i];
    }
    return d;
}

This AVX implementation only takes 0.538 seconds: (Edit: this uses AVX2)

minimario's blog