

# Vectorization of Philox CBRNG on AVX2, AVX512 and Extending Agner FOG's Vector Class Library for ARM NEON Support

Yigit Demirag

Advisor: Dr. Sandro Wenzel





## Outline

- 1. Vectorization of Philox CBRNG \*
- 2. ARM NEON Support for Agner FOG's Vector Class Library \*\*

\* http://www.thesalmons.org/john/random123/ \* \* http://www.agner.org/optimize/vectorclass.pdf





Part I: Vectorization of Philox CBRNG • Project Description

- CBRNGs are widely used at CERN, especially in MC Simulations in GEANT4 and ROOT.
- PRNGs are deterministic algorithms in form of

uint\_64t someRandomNumber = CBRNG(uint64\_t key, uin64\_t counter)

• Philox is a SP Network. S box is a simple Feistel function with 72 rounds 64-bit [XOR, MUL]

 $L' = B_k(R) = mullo(R, M) \ R' = F_k(R) \oplus L = mulhi(R, M) \oplus k \oplus L$ 





## Part I: Vectorization of Philox CBRNG • The Mulhilo Bottleneck

• The main problem in vectorization: 64-bit widening multiplication causing bottleneck.

```
static __inline__ uint64_t mulhilo64(uint64_t a, uint64_t b, uint64_t* hip) {
    __uint128_t product = ((__uint128_t)a)*((__uint128_t)b);
    *hip = product>>64;
    return (uint64_t)product;
}
```

- There is no 64b x 64b -> 128b arithmetic as a vector instruction. Nor is there a vector mulhi type instruction (high word result of multiply).
- Haswell's MULX instruction was a scalar candidate (No effect on register flags, flexible register use).
- Karatsuba Multiplication Algorithm ? (Efficient for much bigger multiplications.)
- Or a new multiplication algorithm designed for SIMD?





## **Part I: Vectorization of Philox CBRNG**







## Part I: Vectorization of Philox CBRNG • The New Algorithm

|                                              | m256i x,m256i y,m256i *lo, _<br>= _mm256_set1_epi64x(0xffffffff);                                                                  | _m256i *hi) {                                       |
|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|
| m256i xh<br>m256i yh                         | <pre>= _mm256_shuffle_epi32(x, 0xB1);<br/>= _mm256_shuffle_epi32(y, 0xB1);</pre>                                                   |                                                     |
| m256i w0<br>m256i w1<br>m256i w2<br>m256i w3 | <pre>= _mm256_mul_epu32(x, y);<br/>= _mm256_mul_epu32(x, yh);<br/>= _mm256_mul_epu32(xh, y);<br/>= _mm256_mul_epu32(xh, yh);</pre> | // x0h*y0l, x1h*y0l                                 |
| m256i w0l<br>m256i w0h                       | <pre>= _mm256_and_si256(w0, lomask);<br/>= _mm256_srli_epi64(w0, 32);</pre>                                                        | //(* Not required at AVX5                           |
| m256i s1<br>m256i s11<br>m256i s1h           | <pre>= _mm256_add_epi64(w1, w0h);<br/>= _mm256_and_si256(s1, lomask);<br/>= _mm256_srli_epi64(s1, 32);</pre>                       |                                                     |
| m256i s2<br>m256i s21<br>m256i s2h           | <pre>= _mm256_add_epi64(w2, s11);<br/>= _mm256_s1li_epi64(s2, 32);<br/>= _mm256_srli_epi64(s2, 32);</pre>                          | //(* Not required at AVX5                           |
| m256i hi1<br>hi1                             | <pre>= _mm256_add_epi64(w3, s1h);<br/>= _mm256_add_epi64(hi1, s2h);</pre>                                                          |                                                     |
|                                              | <pre>= _mm256_add_epi64(w0l, s2l);<br/>= _mm512i_mullo_epi64(x,y);</pre>                                                           | //(* Not required at AVX5<br>//alternative AVX512 i |
| <pre>*hi = hi1; *lo = lo1; }</pre>           |                                                                                                                                    |                                                     |

*Alternative* instruction existed only on AVX512, we could have advantage of vectorization at Xeon Phi.





## Part I: Vectorization of Philox CBRNG • AVX512(MIC) Implementation

Philox will benefit from larger vector registers in AVX512. Our goal was to provide an implementation that can be used and

benchmarked on the next generation hardware, specifically on Intel's Knights Landing(KNL) architecture.

- FOG's VCL supports only KNL, but OpenLab contribution for Knights Corner(KNC) exists.
- I vectorized Philox library both for KNC and KNL, via both library and macros. But, since the new vectorized multiplication algorithm requires specific KNL instructions, we haven't benchmarked this work yet.





## Part-II : ARM Support for Agner FOG's VCL • Project Description

- We wanted to provide an easy-to-use ARM NEON support for Agner FOG's Vector Class Library.
- The aim was to compile the same code for both Intel's SSE and ARM architecture.
- We concentrated on single and double precision floating point operations of ARM NEON (v7 and v8), which

has 128-bit registers that can be used as 64bx2 or 32bx4 vector operations.

An ARM processor uses ~ 5-15 watts, a traditional computer can use up to 80 watts. Server farms have on the order of tens of thousands of computers and so this translates to significant savings. However, in order for this to work, parallel computing for the right type of computation or analysis must be exploited. Luckily, HEP data analysis is highly parallelizable.





#### 1 #define MAX\_VECTOR\_SIZE 128

|    | <pre>#include</pre> |                                                                    |
|----|---------------------|--------------------------------------------------------------------|
|    | <pre>#include</pre> |                                                                    |
|    |                     |                                                                    |
|    |                     |                                                                    |
|    | Vec4f te            | est1(Vec4f a, Vec4f b){                                            |
|    |                     |                                                                    |
|    |                     | Vec4f c = b + b;                                                   |
| 10 |                     | c = c - 2;                                                         |
| 11 |                     | c++; a; a /= b;                                                    |
| 12 |                     | a = a * c;                                                         |
| 13 |                     |                                                                    |
| 14 |                     | <pre>Vec4fb tmp(true,true,false,false);</pre>                      |
| 15 |                     | Vec4fb d = $(a \le b);$                                            |
| 16 |                     | $d = a^{(b \& c)};$                                                |
| 17 |                     |                                                                    |
| 18 |                     | //FLOATING POINT OPERATIONS                                        |
| 19 |                     | b = max(a,b);                                                      |
| 20 |                     | Vec4f m = $sqrt(b)$ ;                                              |
| 21 |                     | Vec4f k = $abs(m)$ ;                                               |
| 22 |                     | $k = mul_add(b,m,c); //b*m +c$                                     |
| 23 |                     |                                                                    |
| 24 |                     | <pre>double t = horizontal_add(c);</pre>                           |
| 25 |                     | <pre>float ext = m.extract(3);</pre>                               |
| 26 |                     | return $2 * (a*a + b) + c;$                                        |
| 27 | }                   |                                                                    |
| 28 |                     |                                                                    |
| 29 |                     |                                                                    |
|    | int mair            | n(){                                                               |
| 31 |                     | Vec4f a(3.1, 2.7, 2.3, 4.9); //CONSTRUCTOR                         |
| 32 |                     | <pre>Vec4f b(1.0); //CONSTRUCTOR WITH BROADCASTING</pre>           |
| 33 |                     | <pre>Vec4fb booltest (true,false,false,true);</pre>                |
| 34 |                     |                                                                    |
| 35 |                     | <pre>Vec4f r = test1(a,b); //TEST OF VECTOR OPS</pre>              |
| 36 |                     | <pre>std::cout &lt;&lt; r.size() &lt;&lt; std::endl; //PRINT</pre> |
| 37 |                     |                                                                    |
| 38 |                     | <pre>float values[4];</pre>                                        |
| 39 |                     | r.store(values); //STORE VECTOR INTO MEMORY                        |
| 40 |                     |                                                                    |
| 41 |                     | <pre>for(int i=0;i&lt;5;++i)</pre>                                 |
| 42 |                     | <pre>std::cout&lt;&lt; values[i] &lt;&lt; std::endl;</pre>         |
| 43 |                     |                                                                    |
| 44 |                     | return 0:                                                          |
| 45 | 3                   |                                                                    |
|    |                     |                                                                    |

arm-none-linux-gnueabi-g++

-C -mfloat-abi=softfp -mfpu=neon -O3



| 89e0: | e24dd008 | sub sp, sp     | , #8                 |
|-------|----------|----------------|----------------------|
| 89e4: | e24dd020 | sub sp, sp     | #32                  |
| 89e8: | eddd0b0c | vldr d16, [    | sp, #48] ; 0x30      |
| B9ec: | eddd1b0e | vldr d17, [    | sp, #56] ; 0x38      |
| 89f0: | f3fb2560 | vrecpe.f32     | q9, q8               |
| 89f4: | f2408ff2 | vrecps.f32     | q12, q8, q9          |
| 89f8: | f3488df2 | vmul.f32       |                      |
| B9fc: | e28d101c | add r1, sp     | , #28                |
| Ba00: | f2402ff8 | vrecps.f32     | q9, q8, q12          |
| Ba04: | f2c76f50 | vmov.f32       | q11, #1 ; 0x3f800000 |
| Ba08: | e981000c | stmib r1, {r   | 2, r3}               |
| Ba0c: | f2404de0 | vadd.f32       | q10, q8, q8          |
| Ba10: | edddab08 | vldr d26, [:   | sp, #32]             |
| Ba14: | edddbb0a | vldr d27, [:   | sp, #40] ; 0x28      |
| Ba18: | f2c0cf50 | vmov.f32       | q14, #2 ; 0x40000000 |
| Balc: | f3428df8 | vmul.f32       | q12, q9, q12         |
| Ba20: | f26aade6 | vsub.f32       | q13, q13, q11        |
| Ba24: | f2644dec | vsub.f32       | q10, q10, q14        |
| Ba28: | f2444de6 | vadd.f32       | q10, q10, q11        |
| Ba2c: | f34a8df8 | vmul.f32       | q12, q13, q12        |
| Ba30: | ed9f7b0a | vldr d7, [p    | c, #40]              |
| Ba34: | f3482df4 | vmul.f32       | q9, q12, q10         |
| Ba38: | f2420fe0 | vmax.f32       | q8, q9, q8           |
| Ba3c: | f3422df2 | vmul.f32       | q9, q9, q9           |
| Ba40: | f2422de0 | vadd.f32       | q9, q9, q8           |
| Ba44: | f3e229c7 | vmul.f32       | q9, q9, d7[0]        |
| 8a48: | f2424de4 | vadd.f32       | q10, q9, q10         |
| Ba4c: | f4404adf | vst1.64 {d20-d | 21}, [r0 :64]        |
| Ba50: | e28dd020 | add sp, sp     | , #32                |
| Ba54: | e28dd008 | add sp, sp     | , #8                 |
| Ba58: | e12fff1e | bx lr          |                      |
| Ba5c: | e1a00000 | nop            | ; (mov r0, r0)       |
| Ba60: | 40000000 | andmi r0, r0   | , r0                 |
| Ba64: | 00000000 | andeq r0, r0   | , r0                 |
|       |          |                |                      |

#### 00000000000400af0 < Z5test15Vec4fS >:

| 0000000000000000 | 10 - | _25 | ces. |    | CC. | +15 |    |      |     |                           |
|------------------|------|-----|------|----|-----|-----|----|------|-----|---------------------------|
| 400af0:          | 0f   | 58  | 05   | f9 | 00  | 00  | 00 | addp | )S  | 0xf9(%rip) <b>,</b> %xmm0 |
| 400af7:          | 0f   | 28  | d9   |    |     |     |    | mova | ips | %xmm1,%xmm3               |
| 400afa:          | 0f   | 58  | d9   |    |     |     |    | addp | )S  | %xmm1,%xmm3               |
| 400afd:          | 0f   | 28  | dØ   |    |     |     |    | mova | ips | %xmm0 <b>,</b> %xmm2      |
| 400b00:          | 0f   | 58  | 1d   | c9 | 00  | 00  | 00 | addp | )S  | 0xc9(%rip) <b>,</b> %xmm3 |
| 400b07:          | 0f   | 5e  | d1   |    |     |     |    | divp | s   | %xmm1,%xmm2               |
| 400b0a:          | 0f   | 58  | 1d   | cf | 00  | 00  | 00 | addp | )S  | 0xcf(%rip) <b>,</b> %xmm3 |
| 400b11:          | 0f   | 59  | d3   |    |     |     |    | mulp | )S  | %xmm3 <b>,</b> %xmm2      |
| 400b14:          | 0f   | 28  | c2   |    |     |     |    | mova | ips | %xmm2 <b>,</b> %xmm0      |
| 400b17:          | 0f   | 59  | d2   |    |     |     |    | mulp | )S  | %xmm2 <b>,</b> %xmm2      |
| 400b1a:          | 0f   | 5f  | c1   |    |     |     |    | maxp | )S  | %xmm1 <b>,</b> %xmm0      |
| 400b1d:          | 0f   | 58  | dØ   |    |     |     |    | addp | )S  | %xmm0 <b>,</b> %xmm2      |
| 400b20:          | 0f   | 58  | d2   |    |     |     |    | addp | )S  | %xmm2 <b>,</b> %xmm2      |
| 400b23:          | Øf   | 58  | d3   |    |     |     |    | addp | )S  | %xmm3,%xmm2               |
| 400b26:          | 0f   | 28  | c2   |    |     |     |    | mova | ips | %xmm2 <b>,</b> %xmm0      |
| 400b29:          | c3   |     |      |    |     |     |    | reto | 1   |                           |
| 400h2a:          | 66   | Øf  | 1 f  | 44 | 00  | 00  |    | nonw | /   | 0x0(%rax.%rax.1)          |

#### g++ -C -msse4.2 -O3 -ftree-vectorize -

std=c++11 -fabi-version=0





## Part-II : ARM Support for Agner FOG's VCL • Completed Parts for ARM v7

| Types /<br>Number of<br>Functions | Construction | Loading Data<br>to Vectors | Reading Data<br>from Vectors | Arithmetic<br>Operators | Logic<br>Operators | Floating<br>Point<br>Functions | Permute,<br>blend, lookup<br>and gather<br>functions | Conversation<br>between<br>vector types |
|-----------------------------------|--------------|----------------------------|------------------------------|-------------------------|--------------------|--------------------------------|------------------------------------------------------|-----------------------------------------|
| Agner FOG's<br>Vector Lib         | 5            | 7                          | 9                            | 13                      | 18                 | 10                             | 3                                                    | 19                                      |
| New<br>ARM<br>Extension           | 5            | 6*                         | 8*                           | 13                      | 18                 | 4**                            | 0**                                                  | 2**                                     |

# **Current Progress**

- \* Not implementable in ARM v7
- \*\* Not yet completed





# **Thank You!**

*I'm sorry for not being able to participate, I am on a Boeing 737-800 and* 

flying to Ankara for registration of my grad school at this very moment.

Yigit Demirag yigitdemirag@gmail.com yigitdemirag.com

