GPU Architectures and GPGPU Computing
- May 10, 2020
- Posted by: vsinghal
- Category: AI Hardware
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #AlwaysUpskilling
Last Friday, our AI Lab Researcher Darshan C G presented a wonderful overview of “GPU Architectures and General-Purpose GPU Computing”.
Performance Improvements are often based on parallelism techniques, which are found everywhere :-
- Pipelining, Instruction-Level Parallelism.
- Vector Processing.
- Array processors/MPP.
- Multiprocessor Systems.
- Multicomputers/cluster computing.
- Graphics Processing Units (GPUs) and other Accelerators.
How do we handle Data Level Parallelism :-
- Vector Processors: early style of Data Parallel Compute (SIMD Style).
- Modern Processors: MMX (Multimedia Extensions), AVX (Advanced Vector Extensions).
- GPUs: multiple SIMD cores in each Streaming Multiprocessor.
Vector Processors :-
- Vector Registers: Each register is a fixed length bank holding a single vector.
- Functional units are also vectorized
- VMIPS has 8 vector registers, and each vector register holds 64 elements, each 64 bits wide.
Some other aspects of Vector Processors :-
- A vector instruction passes a lot of parallel work to the hardware.
- The Functional units can be: Fully Parallel, or a combination of parallel and Pipelined units.
- Work for Compilers: Loop Vectorization, Dependency handling (see the sketch below).
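As a rough illustration (my own, not from the talk) of what loop vectorization and dependency handling mean in practice: the first loop below has no loop-carried dependency, so a vectorizing compiler can map strips of it onto vector registers and vector functional units; the second cannot be vectorized directly.

```cuda
/* DAXPY: y = a*x + y. Every iteration is independent, so the compiler can
   load 64-element strips of x and y into vector registers (as in VMIPS)
   and issue one vector multiply-add instead of 64 scalar operations. */
void daxpy(long n, double a, const double *x, double *y)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* This loop carries a dependency (y[i] needs y[i-1]), so it cannot be
   vectorized without first being restructured. */
void running_sum(long n, double *y)
{
    for (long i = 1; i < n; i++)
        y[i] = y[i] + y[i - 1];
}
```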
- The figure presented earlier depicts a GPU with an array of 128 Scalar Processor cores, organized as 16 multithreaded Streaming Multiprocessors (SMs).
- Each SM has 8 SPs.
- Two SMs together are arranged as an independent processing unit called a Texture Processor Cluster (TPC).
- GPU Architectures: Tesla → Fermi → Kepler → Maxwell → Pascal → Volta → Turing → Ampere
Early GPUs accelerated the Logical Graphics Pipeline.
Note: The program used to draw or shade something on the screen is called a Shader. Shaders run on the GPU.
GPUs: massive Multi-threading
- Cover the latency of memory loads and texture fetches from DRAM.
- Support fine-grained parallel graphics shader (and general-purpose compute) programming models.
- Simplify the parallel programming model: the programmer writes a serial program for a single thread.
First Generation GPUs:
- GeForce 256, introduced in 1999.
- Contained fixed-function vertex and pixel shaders, configured with OpenGL and Microsoft DX7.
- GeForce 3: the first programmable vertex processor, executing vertex shaders.
Tesla Architecture :-
We come back to the GeForce 8800 GPU, with 128 SPs organised as 16 SMs.
- External DRAM control and fixed-function Raster Operation Processors (ROPs) perform colour and depth frame-buffer operations.
- The interconnection network carries computed pixel fragment colors and depth values from SPs to the ROPs.
-> The input assembler collects vertex work.
-> Vertex work distributor distributes vertex work packets to the various TPCs.
-> The TPCs execute vertex/geometry shader Programs.
-> The output data is written to on chip buffers.
- Each TPC has two SMs, each SM has:
- Eight Scalar Processor (SP) cores,
- Two Special Function Units (SFUs),
- A multithreaded instruction fetch and issue unit (MT Issue),
- A 16 KB read/write shared memory.
- Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units.
- The SM uses its two SFUs for transcendental functions (see the sketch below).
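As a loose illustration of how those units get exercised (my sketch, not from the original slides): ordinary multiply-add arithmetic compiles onto the SP cores' MAD units, while the fast transcendental intrinsics such as __expf() and __sinf() are evaluated by the SFUs.

```cuda
// Hypothetical kernel: the fused multiply-add maps onto the SP cores' MAD units,
// while the fast-math intrinsics __expf/__sinf are handled by the SFUs.
__global__ void mad_and_sfu(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = fmaf(2.0f, x[i], 1.0f);   // multiply-add: 2*x + 1
        out[i] = __expf(v) + __sinf(v);     // transcendentals: SFU work
    }
}
```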
GPU execution model :-
- SIMT architecture is similar to SIMD Design(which applies one instruction to multiple data lanes).
- The difference is that SIMT applies one instruction to multiple independent threads in parallel.
- A SIMT instruction controls the execution and branching behaviour of one thread.
- Each SM's multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps (see the sketch below).
- Each SM manages a pool of 24 warps, with a total of 768 threads.
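To make the SIMT idea concrete, here is a small illustrative kernel (my own, not from the talk): every thread runs the same instruction stream, but when the 32 threads of a warp take different sides of a branch, the hardware executes the two paths one after the other (warp divergence).

```cuda
// Illustrative only: even-indexed threads take path A, odd-indexed threads path B.
// Inside a warp the two paths are serialized; warps whose 32 threads all agree
// on the branch pay no penalty.
__global__ void simt_branch(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] % 2 == 0)
            out[i] = in[i] * 2;   // path A
        else
            out[i] = in[i] + 1;   // path B
    }
}
```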
Fermi GTX 480 GPU :-
This has :-
- 16 SMs, total 512 CUDA Cores.
- Each SM has 32 SPs and 32768 32-bit registers, divided logically across the executing threads.
- Each thread is limited to no more than 64 registers (a sketch of how to influence per-thread register usage follows this list).
- A warp therefore has access to 64 × 32 registers, each 32 bits wide.
- Each SM has 4 SFUs; each SP has one FP unit and one integer ALU.
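As an aside (my addition, not from the talk), per-thread register usage is something the programmer can influence. Below is a hypothetical sketch using the __launch_bounds__ qualifier, which tells the compiler the launch configuration to optimize register allocation for; nvcc's -maxrregcount=N flag is the blunter, whole-compilation alternative.

```cuda
// __launch_bounds__(256, 4) promises at most 256 threads per block and asks for
// at least 4 resident blocks per SM, nudging the compiler to keep per-thread
// register usage low enough to reach that occupancy.
__global__ void __launch_bounds__(256, 4)
scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.5f * in[i];
}
```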
Fermi SM :-
-> Each SM has 16 load/store units.
-> Each lane has 2048 registers (the 32768-entry register file divided across the lanes).
-> Each SM has 4 SFUs; each SP has one FP unit and one integer ALU.
Fermi Memory Hierarchy :-
-> Shared memory enables threads to cooperate, facilitates reuse of on-chip data, and reduces off-chip traffic.
-> Each SM has 64 KB of on-chip memory that can be configured either as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache (see the sketch below).
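A short sketch of both ideas (kernel name and sizes are my own, assuming a 256-thread block): a kernel that stages a tile of its input in __shared__ memory so each DRAM fetch is reused by the whole block, plus the cudaFuncSetCacheConfig() call that requests the 48 KB-shared / 16 KB-L1 split for it.

```cuda
// Each block loads a 256-element tile into on-chip shared memory once,
// then every thread reuses the whole tile without touching DRAM again.
__global__ void sum_tile(const float *in, float *out)
{
    __shared__ float tile[256];                // assumes 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                           // wait until the tile is filled

    float s = 0.0f;
    for (int j = 0; j < blockDim.x; ++j)       // on-chip data reuse
        s += tile[j];
    out[i] = s;
}

// Host side: ask for the 48 KB shared / 16 KB L1 configuration for this kernel.
// cudaFuncSetCacheConfig(sum_tile, cudaFuncCachePreferShared);
```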
GPUs are found everywhere, and they have started finding wide usage in several domains where workloads have become compute-intensive.
- Mobile GPUs: ARM Mali, Adreno GPUs (Qualcomm) – accelerate Graphics as well as Compute Tasks.
- NVIDIA in the Embedded Space: Jetson TX1/Nano/AGX Xavier, targeted at Deep Learning tasks.
- NVIDIA Drive: for Implementing Autonomous Cars and ADAS Functionality powered by Deep Learning (Tesla Cars).
Nvidia Jetson Series
-> The TK1 SoC incorporates a quad-core 2.32 GHz 32-bit ARM CPU and an integrated Kepler GPU.
-> The CPUs share a 2 MB L2 cache.
-> The GPU has 192 cores and a 128 KB L2 cache.
Some tips :-
- To go fast, use multiple processors
- To be efficient and fast, use GPUs
- To be efficient and go really fast, use multiple GPUs (see the sketch below)
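A minimal illustrative sketch of the last tip (the function and kernel names are mine): enumerate the visible GPUs with cudaGetDeviceCount(), select each one with cudaSetDevice(), and hand it its own slice of the data.

```cuda
#include <cuda_runtime.h>

__global__ void scale_chunk(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Split an N-element job evenly across all visible GPUs.
void scale_on_all_gpus(const float *host_in, float *host_out, int N)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    int chunk = (N + ndev - 1) / ndev;

    for (int d = 0; d < ndev; ++d) {
        int offset = d * chunk;
        int n = (offset + chunk <= N) ? chunk : N - offset;
        if (n <= 0) break;

        cudaSetDevice(d);                       // subsequent calls target GPU d
        float *buf;
        cudaMalloc(&buf, n * sizeof(float));
        cudaMemcpy(buf, host_in + offset, n * sizeof(float), cudaMemcpyHostToDevice);
        scale_chunk<<<(n + 255) / 256, 256>>>(buf, n);
        cudaMemcpy(host_out + offset, buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(buf);
    }
}
```

In practice each device's work would be overlapped using streams and asynchronous copies; the loop above stays sequential for clarity.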
GPUs for training :-
What's new in Volta: Tensor Cores
-> A new instruction performs a 4x4x4 mixed-precision FMA (matrix multiply-accumulate) operation per clock, giving a 12X increase in throughput for the Volta V100 compared to the Pascal P100.
Mixed Precision Training
Pascal vs Volta :-
Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations.
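For flavour, here is a hedged sketch (names and layouts are my own, not from the slides) of how a single warp drives a Tensor Core through CUDA's WMMA API: FP16 inputs, FP32 accumulation, one 16x16x16 tile per mma_sync() call. It needs a Volta-or-newer GPU, compilation with e.g. -arch=sm_70, and a one-warp launch such as tile_gemm<<<1, 32>>>(A, B, D).

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile D = A * B, with half-precision
// inputs and single-precision accumulation, i.e. the Tensor Core operation
// exposed at 16x16x16 granularity.
__global__ void tile_gemm(const half *A, const half *B, float *D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);           // start from a zero accumulator
    wmma::load_matrix_sync(a, A, 16);         // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);           // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
```

Libraries such as cuBLAS reach the quoted GEMM speedups by tiling entire matrices with operations like this one.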
CUDA Programming :-
CUDA stands for Compute Unified Device Architecture. It is an extension of the C programming language with special constructs for supporting parallel computing.
From the CUDA programmer's perspective, the CPU is a host that dispatches parallel jobs to GPU devices.
CUDA Program Structure
- Host code for the host device (CPU).
- Device code for the GPU.
- Any C program is valid CUDA host code.
- In general, a CUDA program (host + device code) cannot be compiled by a standard C compiler; it needs the NVIDIA compiler (nvcc).
Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
- Declare and allocate host and device memory.
- Initialize host data.
- Transfer data from the host to the device.
- Execute one or more kernels.
- Transfer results from the device to the host.
A first CUDA C program :-
Host Code :-
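The original host-code listing was shown as a slide image; below is a minimal sketch along the lines of the five steps above (names such as vecAdd and the problem size are my own). The kernel it launches appears under Device Code; both parts live in one .cu file compiled with nvcc, e.g. nvcc vecadd.cu -o vecadd.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n);  // defined under Device Code

int main(void)
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    // 1. Declare and allocate host and device memory.
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // 2. Initialize host data.
    for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // 3. Transfer data from the host to the device.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 4. Execute the kernel: one thread per element, 256 threads per block.
    vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);

    // 5. Transfer results from the device to the host.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);            // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```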
Device Code :-
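And the corresponding device code, again as an illustrative reconstruction rather than the original listing: a __global__ function in which each thread adds one pair of elements.

```cuda
// Device code: each thread computes one element of the result.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the last partial block
        c[i] = a[i] + b[i];
}
```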
Related CellStrat blogs on AI Hardware :-
- Compiler Level Optimizations for Accelerated Deep Learning – (https://bit.ly/CS-CLO)
- Deep Learning Inference Accelerators for Edge Computing – (https://bit.ly/3aSA7eJ)
Interested in our “AI Hardware and Accelerated Computing” research program? Please feel free to call us at +91-9742800566!