## Research/Blog

# Deep Learning Inference Accelerators for Edge Computing

- April 3, 2020
- Posted by: vsinghal
- Category: AI Hardware, Auto and Manufacturing, Deep Learning, Driverless Cars, IoT, Retail, Robotics

*#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #AlwaysUpskilling*

**Minutes from Saturday 28th March 2020 AI Lab Meetup :-**

Last Saturday **CellStrat AI Lab Researcher Darshan C G** (Project Assistant, Dept. of Electronics Systems Engineering, IISc, Bengaluru) presented a superb session on **Deep Learning Inference Accelerators**.

A Deep Neural Network (DNN) for image classification looks like this :-

Inference Accelerators (Hardware-facing APIs) on the Edge can speed up DL inference dramatically.

There are five critical factors used to evaluate inference accelerators:

- *Throughput*: The volume of output within a given period, often measured in inferences/second or samples/second.
- *Efficiency*: The throughput delivered per unit of power, often expressed as performance/watt.
- *Latency*: The time to execute one inference, usually measured in milliseconds. Low latency is critical to delivering rapidly growing, real-time inference-based services.
- *Accuracy*: A trained neural network's ability to deliver the correct answer.
- *Memory usage*: The host and device memory that must be reserved to run inference on a network, which depends on the algorithms used. This constrains which networks, and which combinations of networks, can run on a given inference platform. It is particularly important for systems where multiple networks are needed and memory resources are limited.
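As a rough sketch, the first three of these factors can be measured for any inference callable with plain Python timing. The `run_inference` function below is a hypothetical stand-in for a real model call:

```python
import time

def run_inference(sample):
    # Hypothetical stand-in for a real model call.
    return sum(sample) / len(sample)

def benchmark(samples, n_warmup=10):
    # Warm-up runs avoid measuring one-time setup costs.
    for s in samples[:n_warmup]:
        run_inference(s)
    start = time.perf_counter()
    for s in samples:
        run_inference(s)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / len(samples) * 1000   # mean time per inference
    throughput = len(samples) / elapsed          # inferences per second
    return latency_ms, throughput

latency_ms, throughput = benchmark([[1.0, 2.0, 3.0]] * 1000)
print(f"latency: {latency_ms:.4f} ms, throughput: {throughput:.0f} inf/s")
```

Efficiency would then be throughput divided by measured power draw, which requires hardware-specific instrumentation.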

Let’s look at two very popular Deep Learning Inference Accelerators – (1) Nvidia’s TensorRT and (2) TensorFlow Lite.

__Nvidia TensorRT__ :-

Nvidia TensorRT is a highly optimized programmable inference accelerator :-

**TensorRT performance** is depicted by this chart :-

Let’s look at **TensorRT Deployment** :-

There are many aspects to achieving an **Optimized Inference Engine** in TensorRT. These are depicted below.

How can we optimize TensorRT :-

TensorRT uses several optimisation methodologies to accelerate inference:

*Layer and Tensor Fusion:*

Each op is mapped to a kernel for execution. Launching many separate kernels creates overhead, so layer and tensor fusion is implemented to overcome this.

The fusion can be done in 3 ways:

- Vertical Fusion.
- Horizontal Fusion.
- Layer Elimination.
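As a toy illustration of vertical fusion (plain Python, not TensorRT internals), a linear op, a bias-add and a ReLU can be fused into a single pass, so only one "kernel" runs and no intermediate tensors are materialized:

```python
def linear(x, w):
    # Three separate "kernels" below mean three launches
    # and two intermediate buffers.
    return [xi * w for xi in x]

def bias_add(x, b):
    return [xi + b for xi in x]

def relu(x):
    return [max(0.0, xi) for xi in x]

def fused_linear_bias_relu(x, w, b):
    # Vertical fusion: one pass over the data, one "kernel launch",
    # no intermediate tensors between the three ops.
    return [max(0.0, xi * w + b) for xi in x]

x, w, b = [1.0, -2.0, 3.0], 2.0, 0.5
unfused = relu(bias_add(linear(x, w), b))
fused = fused_linear_bias_relu(x, w, b)
assert unfused == fused  # same result, fewer launches
```

Horizontal fusion is analogous, but merges sibling layers that read the same input into one kernel.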

Behind each of these fusions and eliminations are advanced operations, one of which is Accelerated Linear Algebra.

These fusions may not be applicable to every layer in the network; TensorRT handles this automatically.

With layer fusion, popular pre-trained models get compressed as follows :-

The following TensorBoard visualization depicts how part of the TensorFlow computation graph is replaced with a single TensorRT node :-

*Precision Calibration:*

TensorRT also performs precision calibration. Precision is the number of bits used to represent a number. Reduced precision decreases memory usage and compresses the model, so the model can be stored more easily on edge devices. This technique is called quantization, and it is one of the most effective compression techniques, used in almost all inference accelerators.
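The core idea of quantization can be sketched in plain Python: map floating-point values onto an 8-bit integer grid via a scale and a zero-point. This toy version is illustrative of INT8 schemes in general, not TensorRT's actual calibration algorithm:

```python
def quantize_int8(values):
    # Affine (asymmetric) quantization: real ≈ scale * (q - zero_point).
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [scale * (qi - zero_point) for qi in q]

weights = [-1.0, -0.5, 0.0, 0.75, 1.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
# Each restored value is within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

Each FP32 value now occupies one byte instead of four, at the cost of a bounded rounding error.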

Inference at INT8 and FP16 precision outperforms FP32 in images processed per second.

*Kernel Auto-Tuning:*

There are several ways to implement the Convolution (Matrix Multiplication, Winograd Transformation, FFT, etc.).

Each implementation works well on a different architecture. TensorRT picks the implementation best suited to the particular architecture so that execution is fast: it selects, from a library of kernels, the one that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size and other parameters.
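This choice is visible even in one dimension: direct and FFT-based convolution produce identical results, but their relative speed depends on signal and filter sizes, which is exactly the kind of trade-off an auto-tuner evaluates. A small NumPy sketch:

```python
import numpy as np

signal = np.random.default_rng(0).standard_normal(64)
kernel = np.array([0.25, 0.5, 0.25])

# Implementation 1: direct (sliding-window) convolution.
direct = np.convolve(signal, kernel)

# Implementation 2: FFT-based convolution (pointwise multiply
# in the frequency domain, then transform back).
n = len(signal) + len(kernel) - 1
fft_based = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)

# Both compute the same result; which is faster depends on
# sizes and hardware — that decision is what auto-tuning automates.
assert np.allclose(direct, fft_based)
```

For small filters the direct method usually wins; for large ones the FFT route amortizes better.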

*Dynamic Tensor Memory:*

This reduces memory footprint and improves memory re-use. Also, it manages memory allocation for each tensor only for the duration of its usage.

A complete solution stack from **TensorFlow to TensorRT** looks like this :-

Let’s review the process of **deploying Tensorflow Models with TensorRT**

We import, optimize and deploy TensorFlow models with the TensorRT Python API using the following process :-

1. Start with a frozen TensorFlow model.
2. Create a model parser.
3. Optimize the model and create a runtime engine.
4. Perform inference using the optimized runtime engine.
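Those four steps might look roughly like the sketch below using the TF-TRT Python API. This is a hedged sketch, not a definitive implementation: module paths and class names vary across TensorFlow/TensorRT versions, and actually running it requires an Nvidia GPU with TensorRT installed.

```python
def build_and_run_trt_engine(saved_model_dir, sample_input):
    """Sketch of the import -> optimize -> deploy flow with TF-TRT.

    The import is kept inside the function so the sketch can be read
    (and the function defined) without TensorRT installed.
    """
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # Steps 1-2: start from a saved (frozen) model and create the converter.
    converter = trt.TrtGraphConverterV2(input_saved_model_dir=saved_model_dir)

    # Step 3: optimize the model and build the runtime engine.
    trt_func = converter.convert()

    # Step 4: perform inference using the optimized runtime engine.
    return trt_func(sample_input)
```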

The TensorRT build looks like this :-

Now let’s look at our second inference accelerator, TFLite.

__TensorFlow Lite__ :-


TFLite is a more general-purpose inference accelerator for edge computing than TensorRT: the former can run on many kinds of devices, whereas the latter is Nvidia-centric.

Features of TFLite :-

- Light Weight.
- Low-Latency.
- Privacy.
- Reduced Power Consumption.
- Efficient Model Format.
- Pre-trained models.

The table below gives components of TFLite :-

The TFLite architecture can be depicted as :-

The performance of TFLite can be depicted as :-

A few points here :-

- TFLite can use a GPU delegate; NNAPI can also serve as a delegate.
- Not all operations are supported by GPU backend.

Let’s discuss the **Optimization Techniques** used in TFLite :-

1) Quantization

2) Weight Pruning

3) Model Topology Transforms

Here’s why Quantization is useful :

- All available CPU platforms are supported.
- Reduces latency and inference cost.
- Low memory footprint.
- Allows execution using fixed-point operations.
- Optimizes models for special-purpose hardware accelerators (TPU).

The TFLite model can be created by this process :-

Internally, TFLite follows this workflow :-

**TFLite converter** can be called like this :-

Let’s see code to create a simple Keras model and convert it with TFLite :-
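A minimal version of such code, assuming TensorFlow 2.x, might look like the sketch below (the layer sizes are arbitrary, and a real workflow would train the model before converting):

```python
def keras_to_tflite(output_path="model.tflite"):
    # Sketch assuming TensorFlow 2.x is installed; the import is kept
    # inside the function so it can be defined without TensorFlow.
    import tensorflow as tf

    # 1. Build (and normally train) a simple Keras model.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # 2. Convert it with the TFLite converter.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()

    # 3. Write the flatbuffer to disk for deployment on the edge device.
    with open(output_path, "wb") as f:
        f.write(tflite_model)
    return len(tflite_model)
```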

*What if we wish to convert MobileNet to TFLite :-*
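A possible sketch, assuming TensorFlow 2.x and network access to download the pre-trained ImageNet weights:

```python
def mobilenet_to_tflite(output_path="mobilenet_v2.tflite"):
    # Sketch assuming TensorFlow 2.x; downloading the weights
    # requires internet access on first run.
    import tensorflow as tf

    # Load a pre-trained MobileNetV2 with ImageNet weights.
    model = tf.keras.applications.MobileNetV2(weights="imagenet")

    # Convert the Keras model to a TFLite flatbuffer.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_model)
```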

We can also execute TFLite conversion from **command line** :-

The Quantization code is given by :-
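In the TensorFlow 2.x API, post-training quantization is typically enabled via the converter's `optimizations` flag. A hedged sketch, where the saved-model directory is a hypothetical path:

```python
def convert_with_quantization(saved_model_dir,
                              output_path="model_quant.tflite"):
    # Post-training quantization sketch, assuming TensorFlow 2.x.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Optimize.DEFAULT enables post-training quantization of weights,
    # shrinking the model and speeding up integer-capable hardware.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_quant_model = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_quant_model)
```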

Let’s discuss what is driving deep learning inference at the edge nowadays :-

Factors driving this trend:

- Smaller, faster models.
- On-device accelerators.
- Demand to move ML capability from the cloud to the device.
- Smart-device growth that demands bringing ML to the edge.

Benefits of On-device ML:

- High Performance.
- Local Data Accessibility.
- Better Privacy.
- Works Offline.

Let’s review some edge devices :-

*Coral :-*

Features :-

- Mendel OS.
- Edge TPU Compiler.
- Mendel Development Tool.

*Raspberry Pi :-*

Features :-

- Small size, low cost.
- Works just like a computer.
- Raspbian OS.

Options to deploy TensorFlow and TFLite on edge devices :-

- Compile TensorFlow from source.
- Install TensorFlow from pip.
- Use the TFLite interpreter directly.

**CellStrat AI Lab** is India’s most advanced AI Lab. Participate in our Global Online Code Jam on COVID-19 use cases **this Saturday 4th April 2020** :-

__CellStrat AI Lab meetup__ :-

*Topic* : **Global Code Jam on COVID-19 AI-based solutions**

*Date* : **Saturday 4th April 2020, 10:30 AM – 5:00 PM**

*Session Leader* : **Dr Purnendu Sekhar Das**

*Register* : https://bit.ly/2J24EKJ

See you this Saturday for the AI Lab meetup in BLR! Let's disrupt the world with AI, together!

Questions ? Call me at **+91-9742800566** !

Best Regards,

Vivek Singhal

Co-Founder & Chief Data Scientist, CellStrat

+91-9742800566