

(GSI-Scientific Computing/Panda Collaboration)

### **FairRoot Developers:**

An International Accelerator Facility for Research with Ions and Antiprotons

## **Core Team:**

- Mohammad Al-Turany (SC/CBM/PANDA)
- Denis Bertini (SC/CBM/R3B)
- Florian Uhlig (SC/CBM)
- Radek Karabowicz (SC/PANDA)
- Dmytro Kresan (SC/R3B)
- Tobias Stockmanns (PANDA)

The project started at the end of 2003.

New students working on online issues:

- Andreas Herten, FZJ (PANDA)
- Dennis Klein, KOSI (SC)

HADES

PANDA

CBM

10/29/12

**Volker Friese** 

**Olaf Hartman** 

Ilse König

#### **FairRoot : Timeline**





# Using GPUs for Online applications in FairRoot













### Which hardware? Which Software?

- Whatever hardware we choose now will be obsolete in 5 years.
- The question *"How to parallelize the online reconstruction?"* should have the highest priority; after that, we can implement the code on any suitable hardware that is available at that time.
- In FairRoot we decided to use the hardware that is currently most suitable for the problem we are trying to solve (experiment online tracking).

### Why NVIDIA?

- Supports a true cache hierarchy in combination with on-chip shared memory
- Supports ECC, which detects and corrects errors before the system is affected; it protects register files, shared memory, L1 and L2 caches, and DRAM
- Limited, but increasing, support of C++
- Concurrent kernel execution
- The Tesla product family is designed from the ground up for parallel computing and offers exclusive computing features

### Why NVidia? Architectural differences

#### ATI

- Adopts very long instruction word (VLIW) processors to carry out computations in a vector-like fashion (the performance of programs largely depends on their packing ratio)
- The L1 cache on the HD 5870 can only be used to cache image objects and constants.
- Only image objects and constants use the L2 in HD 5870

#### NVidia

- Uses multi-threaded execution to run code in a Single-Instruction-Multiple-Thread (SIMT) fashion and exploits thread-level parallelism to achieve high performance
- The L1 cache is configurable to different sizes and can be disabled by setting a compiler flag.
- All global memory accesses go through the L2 in GTX 580



#### Why NVidia? GPUDirect:

#### Without GPUDirect

Same data copied three times:

1. GPU writes to pinned sysmem1
2. CPU copies from sysmem1 to sysmem2
3. InfiniBand driver copies from sysmem2

#### With GPUDirect

Data is copied only twice: sharing pinned system memory makes the sysmem-to-sysmem copy unnecessary.





#### Why NVidia? GPUDirect RDMA:

Enables a direct path for communication between the GPU and a peer device using standard features of PCI Express.



RDMA for GPUDirect within the Linux Device Driver Model

http://developer.download.nvidia.com/compute/cuda/5\_0/rc/docs/GPUDirect\_RDMA.pdf



#### Why NVidia? Dynamic Parallelism :

Adds the capability for the GPU to generate new work for itself.

Not available with:

- All ATI cards
- OpenCL with all cards
- Older NVidia cards



**Dynamic Parallelism** GPU Adapts to Data, Dynamically Launches New Threads



#### **Dynamic Parallelism** Makes GPU Computing Easier & Broadens Reach



### Why CUDA?

- Open Source
- CUDA is an architecture designed to let you do your work, rather than forcing your work to fit within a limited set of performance libraries.
- The software development environment is reasonable and straightforward
- Limited, but increasing support of C++





### CUDA has very useful open-source infrastructure libraries

- Thrust: Allows you to program parallel architectures using an interface similar to the C++ Standard Template Library (STL).
- CUBLAS: Basic Linear Algebra Subprograms.
- CURAND: Simple and efficient generation of high-quality pseudorandom and quasirandom numbers.
- CUSPARSE: A set of basic linear algebra subroutines used for handling sparse matrices.
- CUFFT: Fast Fourier Transform (FFT) library.

### What about OpenCL on NVIDIA?

- OpenCL is portable, but it is not fully performance portable (several papers state exactly this, including across GPU vendors)
- Concurrent kernel execution is only available with CUDA and NVIDIA (see http://devgurus.amd.com/thread/159535)
  - Simultaneous execution of small kernels utilizes the whole GPU.
  - Kernel execution can overlap with device-to-host memory copies.
- In CUDA it is possible to reconfigure memory



### Activities@GSI for PANDA: Oleksiy Rybalchenko & Mohammad Al-Turany

- We are porting the code developed at Giessen for FPGAs to GPUs
- Work is ongoing on:
  - Comparing performance, scalability, re-usability, etc. with other hardware and/or software techniques on the market
  - We plan to build a GPU-based computing node prototype that can perform tracking on the fly.



#### Porting track finder/Fitter to CUDA

- The original code is optimized for FPGA (the code is designed to work on FPGA)
- Lookup tables are used for the mathematical functions
- Redesign the code into many functions (kernels)
- Use the standard mathematical libraries delivered by NVIDIA
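The difference between the two ways of evaluating a mathematical function can be sketched in plain C++ (host code only, no GPU needed). This is an illustration, not the Giessen code or its port: the table size and all function names are assumptions, and `std::sin` stands in for the device math library that a CUDA kernel would call.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical table resolution; the real FPGA table size is not given.
constexpr std::size_t kTableSize = 1024;
constexpr double kTwoPi = 6.283185307179586;

// FPGA-style approach: precompute sin() once and read it back by index.
std::array<double, kTableSize> makeSinTable() {
    std::array<double, kTableSize> table{};
    for (std::size_t i = 0; i < kTableSize; ++i) {
        table[i] = std::sin(kTwoPi * static_cast<double>(i) / kTableSize);
    }
    return table;
}

// Nearest-entry lookup for an angle in [0, 2*pi).
double sinFromTable(const std::array<double, kTableSize>& table, double angle) {
    assert(angle >= 0.0 && angle < kTwoPi);
    const std::size_t index =
        static_cast<std::size_t>(angle / kTwoPi * kTableSize + 0.5) % kTableSize;
    return table[index];
}

// GPU-style approach: call the math library directly (std::sin here; a CUDA
// kernel would call the device version of sin() or sinf() instead).
double sinFromLibrary(double angle) { return std::sin(angle); }
```

With nearest-entry lookup the table result can differ from the library result by up to half a table step (about 0.003 for 1024 entries), which is one reason a port may prefer the library call when arithmetic is cheap.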


#### Profiler output for GPU time for each kernel

Gpu Time Summary Plot



#### The results are comparable:

| Step                       | CPU results | GPU results |
|----------------------------|-------------|-------------|
| Found tracks in first step | 1615        | 1609        |
| Collect similar tracks     | done        | done        |
| Number of tracks           | 158         | 151         |
| Recalculate tracks         | done        | done        |
| Number of tracks           | 14          | 14          |

#### Using Texture memory for field maps:

#### Track propagation (RK4) using PANDA dipole Field



#### Speedup: up to a factor of 175
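For readers unfamiliar with RK4 track propagation, the sketch below shows the numerical scheme in plain C++. It is not the FairRoot implementation. The uniform 1 T field in `fieldAt` is a stand-in for the PANDA dipole field map (which the slides keep in texture memory), and all type and function names are hypothetical. On a GPU one would typically run one such propagation per thread.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

struct TrackState {
    Vec3 position;   // metres
    Vec3 direction;  // unit vector along the momentum
};

// Stand-in for the field map lookup: uniform 1 T field along y (assumption).
Vec3 fieldAt(const Vec3& /*position*/) { return {0.0, 1.0, 0.0}; }

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

double norm(const Vec3& v) {
    return std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}

// Equations of motion with path length s as the independent variable:
// dr/ds = t, dt/ds = k * (q/p) * (t x B), with p in GeV/c and B in tesla.
TrackState derivative(const TrackState& s, double chargeOverP) {
    constexpr double k = 0.299792458;  // GeV/c per (T * m)
    const Vec3 bend = cross(s.direction, fieldAt(s.position));
    TrackState d;
    d.position = s.direction;
    d.direction = {k * chargeOverP * bend[0], k * chargeOverP * bend[1],
                   k * chargeOverP * bend[2]};
    return d;
}

// Returns s + h * d, component by component.
TrackState advance(const TrackState& s, const TrackState& d, double h) {
    TrackState out = s;
    for (int i = 0; i < 3; ++i) {
        out.position[i] += h * d.position[i];
        out.direction[i] += h * d.direction[i];
    }
    return out;
}

// One classical fourth-order Runge-Kutta step of length h.
TrackState rk4Step(const TrackState& s, double chargeOverP, double h) {
    const TrackState k1 = derivative(s, chargeOverP);
    const TrackState k2 = derivative(advance(s, k1, 0.5 * h), chargeOverP);
    const TrackState k3 = derivative(advance(s, k2, 0.5 * h), chargeOverP);
    const TrackState k4 = derivative(advance(s, k3, h), chargeOverP);
    TrackState out = s;
    for (int i = 0; i < 3; ++i) {
        out.position[i] += h / 6.0 * (k1.position[i] + 2.0 * k2.position[i] +
                                      2.0 * k3.position[i] + k4.position[i]);
        out.direction[i] += h / 6.0 * (k1.direction[i] + 2.0 * k2.direction[i] +
                                       2.0 * k3.direction[i] + k4.direction[i]);
    }
    return out;
}

// Propagates a track over pathLength metres in equal RK4 steps.
TrackState propagate(TrackState s, double chargeOverP, double pathLength, int steps) {
    assert(steps > 0);
    const double h = pathLength / steps;
    for (int i = 0; i < steps; ++i) s = rk4Step(s, chargeOverP, h);
    return s;
}
```

In the uniform stand-in field, a 1 GeV/c track with unit charge moves on a circle of radius 1/0.299792458 m (about 3.34 m), which gives a simple analytic check of the stepper.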


Some NVIDIA-specific features can change performance by large factors, depending on the problem at hand!

Track + vertex fitting (PANDA): the same program code and the same hardware, just using pinned memory instead of mem-copy. The values are CPU time / GPU time.

| Tracks/Event    | 50  | 100 | 1000 | 2000 |
|-----------------|-----|-----|------|------|
| GPU             | 3.0 | 4.2 | 18   | 18   |
| GPU (Zero Copy) | 15  | 13  | 22   | 20   |
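The effect of zero copy alone can be read off the table by dividing its two rows. The short sketch below does this arithmetic. It assumes that both rows were measured against the same CPU reference time, which the slide implies but does not state; the struct and function names are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct SpeedupRow {
    int tracksPerEvent;
    double withMemCopy;   // CPU time / GPU time, explicit copies
    double withZeroCopy;  // CPU time / GPU time, pinned (zero-copy) memory
};

// Values copied from the table above.
constexpr std::array<SpeedupRow, 4> kMeasured = {{
    {50, 3.0, 15.0}, {100, 4.2, 13.0}, {1000, 18.0, 22.0}, {2000, 18.0, 20.0}}};

// Assuming a common CPU time, the ratio of the two speedups is the factor by
// which the GPU time shrinks when zero-copy memory is used.
double zeroCopyGain(const SpeedupRow& row) {
    assert(row.withMemCopy > 0.0);
    return row.withZeroCopy / row.withMemCopy;
}
```

Under that assumption the gain is 5 at 50 tracks per event and about 1.1 at 2000 tracks per event, which suggests that transfer overhead dominates for small events and matters much less once the kernels have enough work.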






### At this time:



- CUDA is fully integrated into the FairRoot build system
- CMake creates shared libraries for the CUDA part
- FairCuda is a class which wraps CUDA-implemented functions so that they can be used directly from ROOT CINT or compiled code
- Similar to FairCuda we could provide FairOpenCL, but so far we have not seen a single working OpenCL algorithm for any experiment. As soon as this changes, we can easily support it.
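The wrapping idea can be illustrated with a minimal sketch. This is not the actual FairCuda interface: the class name, the method, and the C-linkage function are hypothetical, and the function body is a CPU stand-in so that the example compiles without the CUDA toolkit. In a real build that function would live in a `.cu` file, be compiled by nvcc, and launch a kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in (assumption) for a function that would normally be compiled
// by nvcc and copy the data to the device, run a kernel, and copy it back.
extern "C" void incrementArrayOnDevice(float* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) data[i] += 1.0f;
}

// Hypothetical wrapper in the spirit of FairCuda: callers see an ordinary
// C++ class and never include CUDA headers themselves.
class CudaWrapperSketch {
public:
    std::vector<float> incrementArray(std::vector<float> values) const {
        if (!values.empty()) {
            incrementArrayOnDevice(values.data(), values.size());
        }
        return values;
    }
};
```

Keeping the CUDA code behind a plain C interface means that only the shared library built by CMake depends on nvcc, while interpreted macros and compiled user code link against a normal C++ class.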



### **Event source simulation**





### Event source simulations (Dennis Klein & Mohammad Al-Turany)





### Finally

Whatever we use now, the valuable part (figuring out how to decompose our problem into massively parallel code for simple, small cores with very little memory) is going to be fairly easily portable between programming models.





### **Backup and Discussion**



#### **ZeroMQ**

- A socket library that acts as a concurrency framework.
- Faster than TCP, for clustered products and supercomputing.
- Carries messages across inproc, IPC, TCP, and multicast.
- Connect N-to-N via fanout, pubsub, pipeline, request-reply.
- Asynchronous I/O for scalable multicore message-passing apps.
- 30+ languages including C, C++, Java, .NET, Python.
- Most OSes including Linux, Windows, OS X, PPC405/PPC440.
- Large and active open source community.
- LGPL free software with full commercial support from iMatix.








#### **Kepler Memory Hierarchy**

- Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data
- In addition to the L1 cache, Kepler introduces a 48KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit.





### Comparing apples with apples (Commodity Cards)

| ATI card | Memory<br>(GB)<br>GDDR5 | Price<br>(Euro) | NVIDIA card | Memory<br>(GB)<br>GDDR5 | Price<br>(Euro) |
|----------|-------------------------|-----------------|-------------|-------------------------|-----------------|
| HD 5870  | 1.0                     | 250             | GTX 480     | 1.5                     | 250             |
| HD 6970  | 2.0                     | 280             | GTX 580     | 1.5                     | 300             |
| HD 7970  | 3.0                     | 400             | GTX 680     | 2.0                     | 400             |



#### Comparing oranges with oranges (Professional Cards)

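#### **ATI FirePro**

| Card  | GB  | Memory<br>Bandwidth<br>(GB/sec) | PCIe   | Price<br>(Euro) |
|-------|-----|---------------------------------|--------|-----------------|
| V7800 | 2.0 | 89.6                            | 2.0x16 | 550             |
| V8800 | 2.0 | 89.6                            | 2.0x16 | 1040            |
| V9800 | 4.0 | 89.6                            | 2.1x16 | 2700            |

#### **NVIDIA Quadro**

| Card | GB  | Memory<br>Bandwidth<br>(GB/sec) | PCIe   | Price<br>(Euro) |
|------|-----|---------------------------------|--------|-----------------|
| 4000 | 2.0 | 89.6                            | 2.0x16 | 550             |
| 5000 | 2.5 | 120                             | 2.0x16 | 1500            |
| 6000 | 6.0 | 144                             | 2.0x16 | 3500            |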

#### **NVIDIA Tesla**

| Card  | GB  | Memory<br>Bandwidth<br>(GB/sec) | PCIe   | Price<br>(Euro) |
|-------|-----|---------------------------------|--------|-----------------|
| C2070 | 6.0 | 144                             | 2.0x16 | 2000            |
| M2050 | 3.0 | 144                             | 2.0x16 | 2200            |
| M2070 | 6.0 | 144                             | 2.0x16 | 2500            |
| M2090 | 6.0 | 144                             | 2.1x16 | 2800            |



**RDMA:** Eliminate CPU bandwidth and latency bottlenecks using direct memory access (DMA) between GPUs and other PCIe devices, resulting in significantly improved MPISendRecv efficiency between GPUs and other nodes

