## Research/Blog

# Compiler Level Optimizations for Accelerated Deep Learning

- April 30, 2020
- Posted by: vsinghal
- Category: AI Hardware Deep Learning

*#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #AlwaysUpskilling*

Last Saturday (25th Apr ’20), our AI Lab Researcher **Darshan G.** presented a fabulous hands-on workshop on Tensor Fusion, Accelerated Linear Algebra (XLA) and many other techniques to speed up Deep Learning computations.

The following discussion covers many of these techniques.

### TensorFlow Graph Concepts :-

TensorFlow (v1.x) programs generate a DataFlow (directed, multi –) **Graph**. These are represented by Device independent intermediate program representation (IR = Intermediate Representation).

TensorFlow v2.x uses a mix of imperative ( Eager ) execution mode and graphs functions.

These Graph nodes represent operations “Ops” (such as Add, MatMul, Conv2D, …). This represents Abstract device-, execution backend-, and language-independent API.

The Graph is in turn implemented by Op Kernels written in C++, specialized on <Type, Device> platform.

Graph edges represent “data” flowing between ops :-

- Tensors (ref counted, n dimensional array buffers in device memory).
- Control dependencies : A –>B means A must finish before B can run.
- Resource handles to state (e.g. variables, input data pipelines).

### Grappler: TensorFlow Concept :-

Grappler, a TensorFlow concept, performs graph optimizations through a top level driver called the Metagraph.

So Grappler is a Default graph optimization system in the TF runtime

- Re-writes graphs to improve out of the box TensorFlow performance
- Provides a plugin infrastructure to register custom optimizers/rewriter

The main goals of the Grappler are :-

- Automatically improve TF performance through graph simplifications & high level optimizations that benefit most target HW architectures (CPU/GPU/TPU/mobile etc.)
- Reduce device peak memory usage to enable larger models to run
- Improve hardware utilization by optimizing the mapping of graph nodes to compute resources.

### Remapper optimizer: Op fusion

The Remapper replaces commonly occurring subgraphs with optimized **fused “monolithic” kernels**.

Examples of patterns fused:

- Conv2D + BiasAdd + <Activation>
- Conv2D + FusedBatchNorm + <Activation>
- Conv2D + Squeeze + BiasAdd
- MatMul + BiasAdd + <Activation>

Fusing ops in this way together provides several performance advantages:

- Completely eliminates Op scheduling overhead.
- Increases opportunities for ILP, vectorization etc.
- Improves temporal and spatial locality of data access.

The fused kernels are convenient and improve the performance.

Here is some sample code around fusion concept.

### Kernel Fusion

This is an optimization technique applied to a group of GPU kernels to increase efficiency by decreasing execution time, power consumption.

Note that GPU kernels cannot be scheduled once launched in device. Kernel fusion can rearrange and schedule the kernels from the host side.

Kernels using the same or different data array(s) can be replaced with a single kernel call. The new kernel aggregates the code segments of the separate kernels.

Let’s look at overall architecture of TensorFlow with respect to backend customizations.

TF can run in these different forms :-

### Accelerated Linear Algebra (XLA)

XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.

When a TensorFlow program is run, all of the operations are executed individually by the TensorFlow executor. Each TensorFlow operation has a precompiled GPU kernel implementation that the executor dispatches to.

XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization.

**Fusion **is XLA’s single most important optimization. Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance.

(Ref : https://www.tensorflow.org/xla)

The results are significant improvement in speed and memory usage as shown in the chart below :-

Now let’s see the TF-Level Block Diagram :

As seen in the diagram above, some operations can be optimized with Accelerated Linear Algebra (XLA), while others are executed by TensorFlow Core as usual.

XLA in one picture :-

The XLA Graph is converted to **Intermediate Representation (IR)** which generates device-centric code.

TensorFlow in one picture :-

Above diagram shows TensorFlow directly running Kernel Ops.

Now how does the TF graph gets changed when it is optimized with XLA :-

#### tf2xla: Symbolic Graph Execution :-

In the figure above, the complex Softmax operation (which is heavy on computation) is broken down into simpler lower level operations.

#### TensorFlow with XLA :-

The input language to XLA is called “HLO IR”, or just **HLO (High Level Optimizer)**.

XLA takes graphs (“computations”) defined in HLO and compiles them into machine instructions for various architectures. XLA is modular in the sense that it is easy to slot in an alternative backend to target some novel HW architecture. The CPU backend for x64 and ARM64 as well as the NVIDIA GPU backend are in the TensorFlow source tree.

#### HLO IR :-

HLO is a language in itself.

->Sample HLO ops :-

- Elementwise math
- Add, Tanh, Map

- Specialized math for neural nets
- Dot, Convolution, Reduce

- Re-organize data
- Reshape, Broadcast, Concat, Tuple

- Data transfer
- Parameter, Constant

Sample data types in HLO :-

- Primitive types
- PRED
- F16
- F32

- Composite types
- array: F32[2,3], F16[]
- tuple:

- TUPLE(F32[16], F16)

Let’s see how the **ReLU HLO** might work :-

The XLA compilation framework is invoked on subgraphs of TensorFlow computations.

The goals of XLA are :-

- Improved execution speed
- Improved tensor buffer memory
- Reduce reliance on custom
- Much improved mobile footprint
- Improved portability

### MLIR (Multi-Level Intermediate Representation) :-

The TensorFlow ecosystem contains a number of compilers and optimizers that operate at multiple levels of the software and hardware stack.

**MLIR **is a uniform architecture for compiler infrastructure. The gradual adoption of MLIR will simplify every aspect of the TensorFlow stack.

### GEMM :-

GEMM (GEneral Matrix Multiplication) is considered to be the core computational kernel in Deep Learning being used in Fully Connected Layers and Convolutional Layers. GEMM is a linear algebra routine used for Compiler-Level Matrix Multiplication Optimization for Deep Learning.

Several optimized versions of this has been developed for GPU computing architectures.

Below is some code for GEMM Optimization – Tiling :

### GPU Kernel Optimizations :-

Here are some techniques for GPU Kernel Optimizations :-

- Memory Access Coalescing.
- Optimising Reduction Kernels.
- Kernel Fusion.
- Thread Corscening.
- Asynchronous Execution.

We will address these via other articles on the CellStrat Research blog.

### DNN Pruning and Sparse Matrix Operations (Sparse Matrix Vector Multiplication: SpMV) :-

Several works have been proposed over the years that focus on pruning the weights of a neural network.

This essentially implies that the weight matrices are now sparse in nature.

Implementations for sparse matrix operations would be beneficial in this context.

Below is an example of Sparse Matrix Vector Multiplication.

#### CSR Format Example :-

### Distributed Training :-

In the first figure above, different parts of the network are placed on different devices. In the second figure, parts of data are passed to different devices with all devices having the complete network.

#### Parallelism :-

Here we split sets of layers across multiple devices. In the figure above, Layer 0,1,2 and layer 3,4,5 are on different devices

In the figure above, we split individual layers across multiple devices. Both devices compute different parts of Layer 0,1,2,3,4,5.

The figure above shows combined effect of Inter and Intra Parallelism.