

# **Intel® Parallel Studio XE**

Hans Pabst, Developer Product Division

# Future Challenges in Tracking and Trigger Concepts

2<sup>nd</sup> International Workshop, July 8<sup>th</sup>

# **Motivation: Performance**





Software & Services Group Developer Products Division



# Intel® Many Integrated Core (MIC) Co-Processor Architecture



## **Knights Ferry Software Development Platform**

- Up to 32 cores, 128 threads
- 512-bit SIMD support
- Fully coherent cache
- Up to 2 GB GDDR5 memory
- Latest Intel SW developer products



## First Intel® MIC product

- Codenamed "Knights Corner"
- Planned for production on Intel's 22 nm 3D Tri-Gate transistor technology



Software & Services Group Developer Products Division



# **One source base, tuned to many targets**







# **Programming Models and Libraries**





Software & Services Group Developer Products Division



# **Performance Optimization Steps**

## Shown steps enable to scale forward to many-core co-processors.

#### Make use of SIMD

extensions, e.q. Intel® AVX.

Vectorization

#### Intel<sup>®</sup> Compiler

- Optimization hints
- #pragma simd

### Intel<sup>®</sup> Cilk Plus

- Array notation
- Elemental fn.

### Intel® ArBB

- Unified model for SIMD and threads

## Intel<sup>®</sup> Libraries

Identify fixed functionality and employ optimized code, threads, and (with Intel® MKL) multiple nodes.

## Baseline

Recompilation of the existing code.

### Intel<sup>®</sup> Compiler

- Performance comparison with other compilers.

### Intel® IPP

- Multi-media
- etc.

## Intel® MKL

- Statistics (VSL) - BLAS
- etc.

### - Intel TBB - Intel Cilk Plus

Intel<sup>®</sup> Parallel

**Building Blocks** 

Multithreading Achieve scalability

across multiple

nodes.

cores, sockets, and

**Intel® Compiler** 

- Auto/guided par.

- Intel ArBB

- OpenMP\*

#### Intel<sup>®</sup> Cluster Studio

- Cluster tools
- MPI

The order of the steps is suggested to be based on a performance analysis.



Software & Services Group **Developer Products Division** 



# **Intel® Parallel Studio XE**

| Phase                       | Tool                                | Usage                                                                                      | Benefit                                                                                                          |  |
|-----------------------------|-------------------------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|--|
| Compilation<br>Debugging    | Intel®<br>Composer<br>XE            | C/C++ and Fortran<br>Compiler, Performance<br>Libraries and parallel<br>programming models | Strong step towards<br>higher performance (ad-<br>hoc and in the future),<br>additional robustness and<br>safety |  |
| Verification<br>Correctness | Intel®<br>Inspector<br>XE           | Debugging (memory<br>access and thread usage)<br>for better code /<br>application quality  | Higher productivity, early<br>or continuous quality<br>assurance (GUI+CLI)                                       |  |
| Analysis<br>Tuning          | Intel®<br>VTune™<br>Amplifier<br>XE | Profiler to inspect<br>hardware events<br>(counter), scalability etc.                      | Avoids work based on<br>guesses, combines ease<br>of use with deep insight<br>(GUI+CLI)                          |  |

*Powerful compilers. Verification and performance analysis tools supporting continuous integration.* 



Software & Services Group Developer Products Division



# Intel® Compiler

Intel<sup>®</sup> C/C++ Compiler Version 12 Intel<sup>®</sup> Fortran Compiler Version 12



Software & Services Group Developer Products Division



# **SIMD Vectorization**





Software & Services Group Developer Products Division



# **SIMD Vectorization**







# **Basic Vectorization – Switches [1]**

## {L&M} -x<extension> {W}: /Qx<extension>

- Targeting Intel® processors specific optimizations for Intel® processors
- Compiler will try to make use of all instruction set extensions up to and including <extension>; for Intel® processors only!
- Processor-check added to main-program
- Application will not start (will display message), in case feature is not available

## {L&M}: -m<extension> {W}: /arch:<extension>

- No Intel processor check; does not perform Intel-specific optimizations
- Application is optimized for and will run on both Intel and non-Intel processors
- Missing check can cause application to fail in case extension not available

## {L&M}:-ax<extension> {W}:/Qax<extension>

- Multiple code paths a 'baseline' and 'optimized, processor-specific' path(s)
- Optimized code path for Intel® processors defined by <extension>
- Baseline code path defaults to -msse2 (Windows: /arch:sse2); can be modified by -m or -x (/Qx or /arch) switches



Software & Services Group **Developer Products Division** 



# **Basic Vectorization – Switches [2]**

The default is -msse2 (Windows: /arch:sse2)

- Activated implicitly for -O2 or higher
- Implies the need for a target processor with Intel® SSE2
- Use -mia32 (/arch:ia32) for 32-bit processors without SSE2 (e.g. Intel<sup>®</sup> Pentium<sup>™</sup> 3) to adjust baseline code path

Special switch -xHost (Windows: /QxHost)

- Compiler checks host processor and makes use of latest instruction set extension available
- Avoid for builds being executed on multiple, unknown platforms Multiple extensions can be used in combination: -ax<ext1>,<ext2> (Windows: /Qax<ext1>,<ext2>)
- Can result in more than 2 code paths (incl. baseline code path)
- Use -mia32 (/arch:ia32) for 32-bit processors without SSE2 (e.g. Intel® Pentium<sup>™</sup> 3) to adjust baseline code path



Software & Services Group Developer Products Division



# **Vectorization – More Switches/Directives**

**Disable vectorization** 

- Globally via switch: {L&M}: -no-vec {W}: /Qvec-
- For a single loop: directive **novector** 
  - Disabling vectorization here means not using packed SSE/AVX instructions. The compiler still might make use of the corresponding instruction set extensions.

Enforcing vectorization for a loop (overwrite compiler heuristics) #pragma vector always

- will enforce vectorization even if the compiler thinks it is not profitable to do so (e.g due to non-unit strides or alignment issues)
- Will not enforce vectorization if the compiler fails to recognize this as a semantically correct transformation
- Using directive #pragma vector always assert will print error message in case the loop cannot be vectorized and will abort compilation



Software & Services Group Developer Products Division



# **Vectorization Report**

- Provides details on vectorization success & failure
  - L&M: -vec-report<n>, n=0,1,2,3,4,5
  - W: /Qvec-report<n>, n=0,1,2,3,4,5

```
35: subroutine fd( y )
36: integer :: i
37: real, dimension(10), intent(inout) :: y
38: do i=2,10
39: y(i) = y(i-1) + 1
40: end do
41: end subroutine fd
```

```
novec.f90(38): (col. 3) remark: loop was not vectorized: existence of
vector dependence.
novec.f90(39): (col. 5) remark: vector dependence: proven FLOW
dependence between y line 39, and y line 39.
novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized: existence of
vector dependence
```



Software & Services Group Developer Products Division



# **Diagnostic Level of Vectorization Switch** L&M: -vec-report<N> W: /Qvec-report<N>

| Ν | Diagnostic Messages                                                       |
|---|---------------------------------------------------------------------------|
| 0 | No diagnostic messages; same as not using switch and thus default         |
| 1 | Report about vectorized loops- default if switch is used but N is missing |
| 2 | Report about vectorized loops and non-vectorized loops                    |
| 3 | Same as N=2 but add add information on assumed and proven dependencies    |
| 4 | Report about non-vectorized loops                                         |
| 5 | Same as N=4 but add detail on why vectorization failed                    |

## Note:

 In case inter-procedural optimization (-ipo or /Qipo) is activated and compilation and linking are separate compiler invocations, the switch needs to be added to the link step



Software & Services Group Developer Products Division



# **Compiler as a Tool**



## Simplifies programmer effort in application tuning



Software & Services Group Developer Products Division



# **Vectorization Example**

```
void mul(NetEnv* ne, Vector*
   rslt
 Vector* den, Vector* flux1,
 Vector* flux2, Vector* num
{
  float *r, *d, *n, *s1, *s2;
  int i;
  r=rslt->data;
  d=den->data;
  n=num->data;
  s1=flux1->data;
  s2=flux2->data;
  for (i = 0; i < ne->len; ++i)
```

r[i] = s1[i]\*s2[i] + n[i]\*d[i]; GAP Messages (simplified):

- Use a local variable to store the upper-bound of loop at line 29 (variable: ne->len) if the upper-bound does not change during execution of the loop
- 2. Use "#pragma ivdep" to help vectorize the loop at line 29, if these arrays in the loop do not have cross-iteration dependencies: r, s1, s2, n, d
- -> Upon recompilation, the loop will be vectorized



}

Software & Services Group Developer Products Division



## **User-Mandated Vectorization**

User-mandated vectorization is based on a new SIMD Directive

- The SIMD directive provides additional information to compiler to enable vectorization of loops (at this time only inner loop)
- Supplements automatic vectorization but differently to what traditional directives like IVDEP, VECTOR ALWAYS do, the SIMD directive is more a command than a hint or an assertion: The compiler heuristics are completely overwritten as long as a clear logical fault is not being introduced

Relationship similar to OpenMP versus automatic parallelization:





Software & Services Group Developer Products Division



# **SIMD Directive Notation**

C/C++: #pragma simd [clause [,clause] ...] Fortran: !DIR\$ SIMD [clause [,clause] ...]

Without any additional clause, the directive enforces vectorization of the (innermost) loop

```
Example:
void add_fl(float* a, float* b, float* c, float* d, float* e, int n)
{
    #pragma simd
    for (int i=0; i<n; i++)
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}</pre>
```

Without the SIMD directive, vectorization will fail (too many pointer references to do a run-time overlap-check).



Software & Services Group Developer Products Division



# Intel® Parallel Building Blocks

Intel Cilk Plus Intel TBB Intel ArBB



Software & Services Group Developer Products Division



# Intel® Parallel Building Blocks (Intel PBB)

- Programming models in Intel PBB
  - Composable, choice to mix and match
  - Portable, and performance portable
  - Productive, and safe
- Parallel patterns

A parallel pattern is a commonly occurring combination of task distribution and data access.

- 1. A small number of patterns can support a wide range of applications.
- 2. Supporting "good" patterns directly leads to **higher productivity**.
- 3. In addition, a useful subset of "good" patterns are structured and deterministic.



Software & Services Group Developer Products Division



# **Parallel Patterns**

Stencil



Partition



Gather/scatter





Scan and Recurrence

Pipeline

Speculative

selection

• Map

Reduce

 Pack & Expand



Nest



 Search & Match





Software & Services Group Developer Products Division



# **Parallel Patterns in Intel PBB**

# Cilk Plus

- cilk\_spawn: nesting (fork-join)
- Hyperobjects: reduction
- cilk\_for, elemental functions: map
- Array notation: scatter, gather

# Threading Building Blocks

- parallel\_invoke, task-graph: nesting (fork-join)
- parallel\_for, parallel\_foreach: map
- parallel\_do: workpile (map + incr. task addition)
- parallel\_reduce, parallel\_scan: reduce, scan
- parallel\_pipeline: pipeline

## Array Building Blocks

- Elemental functions: map
- Collective operations: reduce, scan
- Permutation operations: pack, scatter, gather





# Intel® Cilk<sup>™</sup> Plus

Elemental Functions and Array Sections Intel Cilk Plus, est. 2010



Software & Services Group Developer Products Division



# **Elemental Functions and Array Sections**

- Elemental functions ("kernels")
  - Properties apply to guarantee vectorization
- Array notation (sections/slices)
  - [start:size], or
  - [start:size:stride]



Software & Services Group Developer Products Division



# **Example: Element-wise Addition**

```
declspec(vector) void kernel(int& result, int a, int b)
  result = a + b;
void sum(const int* begin, const int* begin2,
         std::size t size, int* out)
  cilk_for (std::size_t i = 0; i < size; ++i) {</pre>
    kernel(out[i], begin[i], begin2[i]);
void sum2(const int* begin, const int* begin2,
          std::size t size, int* out)
  kernel(out[0:size], begin[0:size], begin2[0:size]);
```



Software & Services Group Developer Products Division



# Intel® Array Building Blocks Intel Arbb



Software & Services Group Developer Products Division



# **ArBB Overview**

| C++ Library     | <ul><li>Embedded Language</li><li>Dynamic Compiler</li></ul>                                     |  |
|-----------------|--------------------------------------------------------------------------------------------------|--|
| Vector-parallel | <ul> <li>Containers and parallel operations</li> <li>Implicit use of SIMD and threads</li> </ul> |  |
| Scalable        | <ul> <li>Scalable across diff. SIMD widths</li> <li>Scalable across multiple cores</li> </ul>    |  |
| Deterministic   | <ul> <li>Independent of #cores utilized</li> <li>No sync. objects, no data races</li> </ul>      |  |
| Safe            | <ul> <li>Separate memory space (data)</li> <li>No pointers, by-value semantic</li> </ul>         |  |



Software & Services Group Developer Products Division



# **Programming Interfaces**





Software & Services Group Developer Products Division



# **Development Cycle and Tools**



Roadmap: "emulation mode" and native code execution will be aligned (same precision, debugging native code).



Software & Services Group Developer Products Division



# **Vector Processing and Elemental Functions**

# Vector Processing





**Elemental Processing** 

```
void e(f32& result, f32 a, f32 b)
{
    result = a * b;
}
```

```
dense<f32> result, a, b;
call(f)(result, a, b);
```

```
"Kernel"
```

Elemental processing is naturally embedded into a more general vector processing context allowing e.g. scatter etc.



Software & Services Group Developer Products Division



# **ArBB Function?**

# void function(signature)

- Void
- Signature

Types

- Multiple results possible
- Via argument list

# **Outputs/in-outs**

- By "non-const reference" Inputs
- By "const reference", or
- By-value
- Intel ArBB Types

# arbb::call()

In a narrow sense an "ArBB function" does not exist since it is regular C++ code.



Software & Services Group Developer Products Division



# **JIT Compiler and Code Generation**





Software & Services Group Developer Products Division



# **JIT Compiler and Code Generation**





Software & Services Group Developer Products Division



# **Scalar Types**

| Туре              | Description                    | C++                                               |
|-------------------|--------------------------------|---------------------------------------------------|
| f32, f64          | 32/64 bit floating point       | float, double                                     |
| i8, i16, i32, i64 | 8/16/32 bit integer (signed)   | signed char,<br>short, int                        |
| u8, u16, u32, u64 | 8/16/32 bit integer (unsigned) | unsigned char,<br>unsigned short,<br>unsigned int |
| boolean           | Boolean value (true/false)     | bool                                              |
| usize, isize      | Index type                     | size_t (ssize_t)                                  |



Software & Services Group Developer Products Division



### **Operations**





Software & Services Group Developer Products Division



## **Data Management**



#### Segregated storage and memory management

- Transparent on remote execution
- No pointers, logically "by value"



Software & Services Group Developer Products Division



## **Data Management**



- Type of container elements
  - Scalar type (Intel ArBB)
  - Collection of scalar types (structured, heterogeneous)
  - Nested structure of the above types
- "Arrays of Structures" (AoS) are stored as "Structures of Arrays" (SoA) internally



Software & Services Group Developer Products Division



## **Example: Mandelbrot**

```
int max count = 4711;
void mandel(i32& d, std::complex<f32>
  i32 i;
  std::complex<f32> z = 0.0f;
 for (i = 0, i < max count, i++) {</pre>
   if (abs(z) >= 2.0f) {
     break;
    } end if;
    z = z * z + c;
  } end for;
  d = i;
}
void doit(dense<i32,2>& d, const dense<std::complex<f32>,2>& c)
  map(mandel)(d, c);
```

```
bind(pos, c pos, cols, rows);
bind(dest, c_dest, cols, rows);
call(doit)(dest, pos);
```



Software & Services Group **Developer Products Division** 





Color Legend: ArBB State (Type) ArBB Behavior

## **High-Level Optimizations**



- Intel® ArBB operators: separately written, but <u>fused code</u> across call boundaries (inlining, despite of modularization)
  - Higher arithmetic intensity per task
  - Higher memory locality
  - Less scheduling overhead
  - Less synchronization



Software & Services Group Developer Products Division



## **Example: Matrix-Vector Multiplication**

#### How to do it in C++?



}

Software & Services Group Developer Products Division



## **Example: Matrix-Vector Multiplication**

*It is often possible to eliminate loops entirely. Express what to do, instead of how to do it.* 



Software & Services Group Developer Products Division



## **Example: Matrix-Vector Multiplication**

... But loops are just fine in cases where the loop body contains enough parallel work

} \_end\_for;



Software & Services Group Developer Products Division



## **Example: Query Value Locations**

```
dense<boolean> hits = in == value;
dense<usize> pos = indices(0, in.length());
result = pack(pos, hits);
```



Software & Services Group Developer Products Division



## **Intel® Array Building Blocks**

## Download Intel ArBB and try it!

http://intel.com/go/arbb/

## Documentation, articles, and user forum

http://software.intel.com/en-us/articles/intel-array-building-blocks-documentation/ http://software.intel.com/en-us/articles/intel-array-building-blocks-kb/all/ http://software.intel.com/en-us/forums/intel-array-building-blocks/



Software & Services Group Developer Products Division



# **Questions?**



Software & Services Group Developer Products Division





# Software



Software & Services Group Developer Products Division



## **Optimization Notice**

#### **Optimization Notice**

Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the "Intel Compiler User and Reference Guides" under "Compiler Options." Many library routines that are part of Intel compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20110307



Software & Services Group Developer Products Division



## **Legal Disclaimer**

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. \*Other names and brands may be claimed as the property of others.

Copyright © 2011. Intel Corporation.

#### http://intel.com/software/products



Software & Services Group Developer Products Division



# **Parallel Patterns**

**Backup Slides** 



Software & Services Group Developer Products Division



## (Serial) Sequence



A serial sequence is executed in the exact order given:

$$B = f(A);$$
  
 $C = g(B);$   
 $E = p(C);$   
 $F = q(A);$ 



Software & Services Group Developer Products Division



## Superscalar Sequence (Task Graph)



Developer writes "serial" code:

$$B = f(A);$$
  

$$C = g(B);$$
  

$$E = f(C);$$
  

$$F = h(C);$$
  

$$G = g(E,F);$$
  

$$P = p(B);$$
  

$$Q = q(B);$$
  

$$R = r(G, P, Q);$$

- However, tasks only need to be ordered by data dependencies
- Depends on limiting scope of data dependencies
- Variants: fork-join, general DAG



Software & Services Group Developer Products Division



## Map (Embarrassing Parallelism, SPMD)



**Examples:** gamma correction and thresholding in images; color space conversions; Monte Carlo sampling; ray tracing.  Map replicates a function over every element of an index set (which may be abstract or associated with the elements of an array).

A = map(f,B);

 This replaces one specific usage of iteration in serial programs: processing every element of a collection with an independent operation.



Software & Services Group Developer Products Division



## Reduction



**Examples:** averaging of Monte Carlo samples; convergence testing; image comparison metrics; subtask in matrix operations. *Reduce* combines every element in a collection into one element using an associative operator.

b = reduce(f,B);

- For example, *reduce* can be used to find the sum or maximum of an array.
- There are some variants that arise from combination with *partition* and *search*



Software & Services Group Developer Products Division



## **Partition (Geometric Decomposition)**



**Examples:** JPG and other macroblock compression; divideand-conquer matrix multiplication; coherency optimization for conebeam recon.

- Partition breaks an input collection into a collection of collections
- Useful for divide-andconquer algorithms
- Variants:
  - Uniform
  - Non-uniform
  - Overlapping (read-only)

#### • Issues:

- How to deal with boundary conditions?
- Partitioning does't move data, it just provides an alternative "view" of its organization



Software & Services Group Developer Products Division



## Stencil



- *Stencil* applies function to all neighbourhoods of an array
- Neighbourhoods given by set of relative offsets
- Optimized implementation requires blocking and sliding windows
- Boundary conditions on array accesses need to be considered

**Examples:** image filtering including convolution, median, anisotropic diffusion; simulation including fluid flow, electromagnetic, and financial PDE solvers, lattice QCD



Software & Services Group Developer Products Division



## **Nesting: Recursive Composition**





Software & Services Group Developer Products Division



## (Serial) Selection



The condition is evaluated first, then one of two tasks is executed based on the result.

> IF (c) { f } ELSE { g }



Software & Services Group Developer Products Division



## **Speculative Selection**



Examples: collision culling; ray tracing; clipping; discrete event simulation; search Both sides of a conditional and the condition are evaluated in parallel, then the unused branch is cancelled.

f } ELSE { g

- Effort in cancelled task "wasted"
- Use only when a computational resource would otherwise be idle, or tasks are on critical path



Software & Services Group Developer Products Division



## Recurrences



**Examples:** infinite impulse response filters; sequence alignment (Smith-Waterman dynamic programming); matrix factorization

- Recurrences arise from the data dependency pattern given by nested loop-carried dependencies.
- nD recurrences can be parallelized over n-1 dimensions by Lamport's hyperplane theorem
- Execution of parallel slices can be performed either via iterative map, wavefront parallelism, or polyhedral decomposition



Software & Services Group Developer Products Division



## **Partitioned (Blocked) Recurrences**



- Implementation can use partitioning for higher performance
- When combined with the "pipeline" pattern recurrences implement "wavefront" computation.
- Polyhedral theory generalizes this, can be used to drive recursive decompositions



Software & Services Group Developer Products Division







- Scan computes *all* partial reductions
- Allows parallelization of many 1D recurrences
- Requires an associative operator
- Requires 2n work over serial execution, but lg n steps

**Examples:** integration, sequential decision simulations in financial engineering, can also be used to implement pack



Software & Services Group Developer Products Division



## **Pipeline**

- Tasks can be organized in chain with local state
- Useful for serially dependent tasks like codecs
- Whole chain applied like map to collection or stream
- Implementation of many sub-patterns may be optimized for pipeline execution when inside this pattern



**Examples:** codecs with variable-rate compression; video processing; spam filtering.



Software & Services Group Developer Products Division



Pack



- Pack allows deletion of elements from a collection and elimination of unused space
- Useful when fused with map and other patterns to avoid unnecessary output

**Examples:** narrow-phase collision detection pair testing (only want to report valid collisions), peak detection for template matching.



Software & Services Group Developer Products Division



## Expand



**Examples:** broad-phase collision detection pair testing (want to report potentially colliding pairs); compression and decompression.

- Expand allows element of map operation to insert any number of elements (including none) into its output stream
- Useful when fused with map and other patterns to support variable-rate output



Software & Services Group Developer Products Division



## Search/Match



**Examples:** computation of metrics on segmented regions in vision; computation of web analytics

- Searching and matching fundamental capabilities; may depend indirectly on sorting or hashing
- Use to select data for another operation, by creating a (virtual) collection or partitioned collection.
- Example: category reduction reduces all elements in an array with the same "label", and is the form used in Google's mapreduce



Software & Services Group Developer Products Division



## Gather

- Map + Random Read
  - Read from a random (computed) location in an array
  - When used inside a map or as a collective, becomes a parallel operation
  - Views into arrays, but no global pointers
  - Write-after-read semantics for kernels to avoid races





Software & Services Group Developer Products Division



## **!Scatter**

#### Map + Random Write

- Write into a random (computed) location in an array
- When used inside a map, becomes a parallel operation
- Race conditions possible when there are duplicate write addresses ("collisions")
- To obtain deterministic scatter, need a deterministic rule to resolve collisions

# A B C D F Examples: marking pairs in collision detection; handling database update transactions. C A F B 1 5 0 2 4



Software & Services Group Developer Products Division



## **\*Permutation Scatter**

#### **Option 1:** Make collisions *illegal*

- Only guaranteed to work if no duplicate addresses
- Danger is that programmer will use it when addresses do in fact have collisions, then will depend on undefined behaviour
- Similar safety issue as with out-of-bounds array accesses.
- Can test for collisions in "debug mode"





Software & Services Group Developer Products Division



## **!Atomic Scatter**

## **Option 2:** Resolve collisions atomically but *non-deterministically*

- Use of this pattern will result in non-deterministic programs
- Structured nature of rest of patterns makes it possible to test for race conditions





Software & Services Group Developer Products Division



## **Merge Scatter**

**Option 3:** Use an associative operator to combine values upon collision

- Problem: as with reduce, depends on programmer to define associative operator
- Gives non-deterministic read-modify-write when used with non-associative operators
- Due to structured nature of other patterns, can still provide tool to check for race conditions.





Software & Services Group Developer Products Division



## **Priority Scatter**

**Option 4:** Assign every parallel element a priority

- NOTE: Need hierarchical structure of other patterns to do this
- **Deterministically** determine "winner" based on priority
- When converting from serial code, priority can be based on original ordering, giving results consistent with serial program
- Efficient implementation is similar to hierarchical z-buffer...





Software & Services Group Developer Products Division



## **Other Parallel Patterns**

- Fork-join: special case of nesting
- Workpile: extension of map where tasks can be added dynamically
- Branch-and-bound: non-deterministic search where other branches can be terminated once once a "good enough" solution found
- **Incremental graph update:** propagation of updates through DAG or graph (latter may not terminate, however...)
- Graph rewriting: can be used to implement functional languages.

Transactions: non-deterministic database updates



Software & Services Group Developer Products Division

