
# Pose Estimation with OpenPose

- June 9, 2020
- Posted by: vsinghal
- Category: Computer Vision Retail

*#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #WhereLearningNeverStops*

Last Saturday, our AI Lab Researcher **Niraj Kale** presented an excellent session on **OpenPose**, an algorithm to efficiently detect the 2D pose of multiple people in an image.

A pose skeleton represents the orientation of a person in a graphical format. Each coordinate in the skeleton is known as a keypoint. A valid connection between two keypoints is known as a limb.
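These notions can be captured in a tiny data structure. A minimal sketch in Python (the keypoint names and coordinates here are illustrative, not the actual COCO keypoint set):

```python
# A pose skeleton as keypoints (named 2D coordinates) plus limbs
# (valid connections between keypoint pairs). Values are illustrative.
keypoints = {
    "nose": (120, 80),
    "neck": (120, 110),
    "right_shoulder": (95, 112),
    "left_shoulder": (145, 112),
}

# A limb is a valid connection between two keypoints.
limbs = [("nose", "neck"), ("neck", "right_shoulder"), ("neck", "left_shoulder")]

def limb_vectors(keypoints, limbs):
    """Return the 2D displacement vector of each limb."""
    return {
        (a, b): (keypoints[b][0] - keypoints[a][0],
                 keypoints[b][1] - keypoints[a][1])
        for a, b in limbs
    }

print(limb_vectors(keypoints, limbs)[("nose", "neck")])  # (0, 30)
```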

## Applications of Pose Estimation :-

- Activity recognition
  - Detecting if a person has fallen
  - Tracking workout regimes, sports techniques, and dance activities
  - Understanding full-body sign language
- Security and surveillance
- Motion capture and augmented reality
  - CGI applications in movies
- Training robots
  - A robot can be made to follow the trajectory of a human pose skeleton performing an action
- Motion tracking in gaming consoles

## Approaches for Multi-person Pose Estimation :-

- Top-down approach
  - Detect each person first
  - Estimate the parts for each detection
  - Calculate the pose for each person

- Bottom-up approach
  - Detect all parts in the image
  - Group the parts belonging to each person

Drawbacks of the top-down approach :-

- If the person detector fails (as often happens for partially visible people in close proximity), there is no way to recover from the failure
- Runtime is proportional to the number of people in the image

## Architecture of OpenPose :-

Initially, the feature map F is extracted by a VGG-19 backbone. This is then input to two parallel branches, B_{1} and B_{2}.

The first branch predicts a set of confidence maps, with each map representing a particular part of the human pose skeleton.

The second branch predicts a set of part affinity fields (PAFs), which represent the degree of association between parts.

Simultaneously inferring these bottom-up representations of detection and association encodes enough global context for a greedy parse to achieve high-quality results.

### Steps involved in human pose estimation using OpenPose :-

The figure above illustrates the overall pipeline of the method.

The system takes, as input, a color image of size w × h (Fig. a) and produces the 2D locations of anatomical keypoints for each person in the image (Fig. e).

First, a feedforward network predicts a set of 2D confidence maps S of body part locations (Fig. b) and a set of 2D vector fields L of part affinity fields (PAFs), which encode the degree of association between parts (Fig. c).

The set S = (S_{1}, S_{2}, …, S_{J}) has *J* confidence maps, one per part, where S_{j} ∈ R^{w×h}, j ∈ {1…J}. The set L = (L_{1}, L_{2}, …, L_{C}) has *C* vector fields, one per limb, where L_{c} ∈ R^{w×h×2}, c ∈ {1…C}. We refer to part pairs as limbs for clarity, but some pairs are not human limbs (e.g., the face). Each image location in L_{c} encodes a 2D vector.

Finally, the confidence maps and the PAFs are parsed by greedy inference (Fig. d) to output the 2D keypoints for all people in the image.
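To make the notation concrete, the two outputs can be held in plain numpy arrays of the shapes stated above. A minimal sketch (the image size is illustrative):

```python
import numpy as np

w, h = 64, 48          # input image width and height (illustrative)
J, C = 19, 19          # COCO configuration: 19 confidence maps, 19 limb types

# S: one w×h confidence map per body part, S_j ∈ R^{w×h}
S = np.zeros((J, h, w))
# L: one w×h vector field per limb, L_c ∈ R^{w×h×2};
# each image location in L_c encodes a 2D vector.
L = np.zeros((C, h, w, 2))

print(S.shape, L.shape)  # (19, 48, 64) (19, 48, 64, 2)
```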

## New Architecture :-

The figure above shows the architecture of the multi-stage CNN. The first set of stages predicts PAFs L^{t}, while the last set predicts confidence maps S^{t}. The predictions of each stage and their corresponding image features are concatenated for each subsequent stage. Convolutions of kernel size 7 from the original approach (original architecture above) are replaced with three layers of 3×3 convolutions whose outputs are concatenated.
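The trade-off behind swapping one 7×7 kernel for a stack of three 3×3 kernels can be checked with a little arithmetic. A sketch (the channel count is an assumption for illustration):

```python
# Receptive field and parameter count: one 7x7 convolution versus a
# stack of three 3x3 convolutions (channel count `ch` is illustrative).
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked convs of size `kernel` (stride 1)."""
    rf = 1
    for _ in range(layers):
        rf += kernel - 1
    return rf

ch = 128
params_7x7 = 7 * 7 * ch * ch            # one 7x7 layer
params_3x3 = 3 * (3 * 3 * ch * ch)      # three 3x3 layers

print(stacked_receptive_field(7, 1))    # 7
print(stacked_receptive_field(3, 3))    # 7  -- same receptive field
print(params_3x3 / params_7x7)          # 27/49 ≈ 0.55 -- roughly half the parameters
```

The stack covers the same receptive field with fewer parameters and more non-linearities, which is one reason the newer architecture is both faster and more accurate.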

### Confidence Maps :-

### Keypoint IDs for the COCO dataset :-

### Example from COCO dataset :-

**S** will have elements **S_{1}, S_{2}, S_{3}, …, S_{19}**.

**S_{1}** corresponds to the confidence map for the keypoint id of 0, which refers to the nose. The confidence map might then look as follows.

The figure above shows a very simplified diagram of a single confidence map, where each cell in the table corresponds to a pixel in the original image of dimensions w × h. The value in each cell represents the confidence that a nose is present at that pixel.
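The same idea can be simulated in a few lines of numpy: a toy single-part confidence map with a Gaussian peak, from which the keypoint is recovered as the argmax (the grid size, peak position, and σ below are illustrative):

```python
import numpy as np

# Toy confidence map for one part (e.g. the nose) on a small 6x8 grid.
# A Gaussian peak marks the most likely pixel; values are confidences.
h, w = 6, 8
true_pos = np.array([3.0, 5.0])          # (row, col) of the nose, illustrative
sigma = 1.0

rows, cols = np.mgrid[0:h, 0:w]
dist2 = (rows - true_pos[0]) ** 2 + (cols - true_pos[1]) ** 2
S_nose = np.exp(-dist2 / sigma ** 2)     # equals 1.0 exactly on the true pixel

# Recover the keypoint as the argmax of the map.
peak = tuple(int(i) for i in np.unravel_index(np.argmax(S_nose), S_nose.shape))
print(peak)  # (3, 5)
```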

### Part Affinity Field (PAF) Maps :-

**C**, the total number of limbs, depends on the dataset that OpenPose is trained with.

For the COCO dataset, **C** = 19. The figure below shows the different part pairs.

### Simultaneous detection and association :-

- The initial stage is a fine-tuned VGG-19 network.
- This network generates the feature maps F that are input to the first stage.
- The first stage generates the part affinity fields (PAFs), L^{1} = φ^{1}(F), where φ^{1} refers to the CNN for inference at Stage 1.
- In each subsequent stage, the predictions from the previous stage and the original image features F are concatenated and used to produce refined predictions,

  L^{t} = φ^{t}(F, L^{t−1}), ∀ 2 ≤ t ≤ T_{P}

  where φ^{t} refers to the CNNs for inference at Stage t, and T_{P} to the total number of PAF stages.

- After T_{P} iterations, the process is repeated for confidence map detection, starting from the most updated PAF prediction,

  S^{T_{P}} = ρ^{T_{P}}(F, L^{T_{P}})

  S^{t} = ρ^{t}(F, L^{T_{P}}, S^{t−1}), ∀ T_{P} < t ≤ T_{P} + T_{C}

  where ρ^{t} refers to the CNNs for inference at Stage t, and T_{C} to the total number of confidence map stages.
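The control flow of these recurrences can be sketched with stand-in functions. In the sketch below, `phi` and `rho` are dummy projections standing in for the CNN stages φ^{t} and ρ^{t}, and the feature map size and stage counts are illustrative, not the real ones:

```python
import numpy as np

def phi(features):            # PAF stage: a dummy projection, NOT a real CNN
    return features.mean(axis=-1, keepdims=True).repeat(2, axis=-1)

def rho(features):            # confidence-map stage: another dummy projection
    return features.mean(axis=-1, keepdims=True)

F = np.random.rand(48, 64, 8)           # VGG-19 feature map (illustrative size)

# PAF stages: L^1 = phi^1(F); then L^t = phi^t(F, L^{t-1}) for t = 2..T_P.
L = phi(F)
for t in range(2, 4):                   # T_P = 3 PAF stages, illustrative
    L = phi(np.concatenate([F, L], axis=-1))

# Confidence-map stages start from the final PAF prediction:
# S^{T_P} = rho^{T_P}(F, L^{T_P}); then S^t = rho^t(F, L^{T_P}, S^{t-1}).
S = rho(np.concatenate([F, L], axis=-1))
for t in range(2):                      # T_C = 2 further stages, illustrative
    S = rho(np.concatenate([F, L, S], axis=-1))

print(L.shape, S.shape)                 # (48, 64, 2) (48, 64, 1)
```

The essential point the sketch shows is the wiring: every stage re-reads the original features F alongside the previous stage's prediction.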

### Loss function :-

An L2 loss is applied between the estimated predictions and the groundtruth maps and fields at each stage:

f_{L}^{t} = Σ_{c=1}^{C} Σ_{p} W(p) · ‖L_{c}^{t}(p) − L_{c}^{*}(p)‖_{2}^{2}

f_{S}^{t} = Σ_{j=1}^{J} Σ_{p} W(p) · ‖S_{j}^{t}(p) − S_{j}^{*}(p)‖_{2}^{2}

- The notation **p** represents a single pixel location in a **w** × **h** image.
- The * notation next to the sets **S** and **L** means that they are the ground truth.
- The output of **S_{j}(p)** is a 1-dimensional value: the confidence score for body part **j** at image location **p**.
- The output of **L_{c}(p)** is a 2-dimensional vector: the directional vector for limb **c** at image location **p**.
- In the OpenPose paper, **J**, the total number of body parts, is 19. Also, **C**, the total number of "limbs" or body-to-body connections, is 19.
- **W(p)** represents the weighting function mentioned previously. **W(p) = 0** when the annotation is missing at an image location **p**. The mask is used to avoid penalizing true positive predictions during training.

The overall loss function sums the branch losses over all stages :-

f = Σ_{t=1}^{T_{P}} f_{L}^{t} + Σ_{t=T_{P}+1}^{T_{P}+T_{C}} f_{S}^{t}
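The masked L2 loss for a single branch and stage can be sketched in numpy. The sizes below are tiny and illustrative; the real loss sums this quantity over all stages:

```python
import numpy as np

# Weighted L2 loss for one branch at one stage: W(p) = 0 where the
# annotation is missing, so unlabeled people do not penalize the
# network's true positive predictions.
def branch_loss(pred, gt, W):
    """Sum over parts j and pixels p of W(p) * ||pred_j(p) - gt_j(p)||^2."""
    return np.sum(W[None, :, :] * (pred - gt) ** 2)

J, h, w = 3, 4, 4                        # tiny illustrative sizes
gt = np.zeros((J, h, w))
pred = np.ones((J, h, w))                # every prediction off by 1

W = np.ones((h, w))
W[0, :] = 0.0                            # pretend the top row is unannotated

# 3 parts * (16 - 4 masked pixels) * 1^2 = 36
print(branch_loss(pred, gt, W))  # 36.0
```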

### Multi-person :-

### Confidence Maps for part detection :-

We first generate individual confidence maps **S**^{*}_{j,k} for each person *k*. Let **x**_{j,k} ∈ R^{2} be the groundtruth position of body part *j* for person *k* in the image. The value at location **p** ∈ R^{2} in **S**^{*}_{j,k} is defined as

S^{*}_{j,k}(p) = exp( −‖p − x_{j,k}‖_{2}^{2} / σ^{2} )

where σ controls the spread of the peak. The groundtruth confidence map to be predicted by the network is an aggregation of the individual confidence maps via a max operator,

S^{*}_{j}(p) = max_{k} S^{*}_{j,k}(p)

- The maximum (not the average) of the confidence maps is used,
- so that the peaks for different people's parts remain distinct.
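The effect of max-aggregation versus averaging can be seen on two toy per-person maps (the grid size and peak positions below are illustrative):

```python
import numpy as np

# Groundtruth map aggregation: taking the pixelwise MAX over per-person
# maps keeps nearby peaks distinct, while averaging would flatten them.
def gaussian_map(h, w, center, sigma=1.0):
    rows, cols = np.mgrid[0:h, 0:w]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / sigma ** 2)

h, w = 5, 12
m1 = gaussian_map(h, w, (2, 3))          # person 1's nose, illustrative
m2 = gaussian_map(h, w, (2, 6))          # person 2's nose, close by

S_max = np.maximum(m1, m2)               # the aggregation the paper uses
S_avg = (m1 + m2) / 2                    # the alternative it avoids

# Under max, both true peaks keep full height; averaging roughly halves them.
print(S_max[2, 3], S_max[2, 6])          # 1.0 1.0
```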

### Part Affinity field for part association :-

Consider a single limb shown in the figure below. Let **x**_{j1,k} and **x**_{j2,k} be the groundtruth positions of body parts *j*_{1} and *j*_{2} of limb *c* for person *k* in the image. If a point **p** lies on the limb, the value at **L**^{*}_{c,k}(p) is a unit vector that points from *j*_{1} to *j*_{2}; for all other points, the vector is zero-valued.

To evaluate f_{L} in the overall loss equation above during training, we define the groundtruth PAF, **L**^{*}_{c,k}, at an image point **p** as

L^{*}_{c,k}(p) = **v** if **p** lies on limb (c, k); **0** otherwise

Here **v** = (x_{j2,k} − x_{j1,k}) / ‖x_{j2,k} − x_{j1,k}‖_{2} is the unit vector in the direction of the limb.

The groundtruth part affinity field averages the affinity fields of all people in the image,

L^{*}_{c}(p) = (1 / n_{c}(p)) · Σ_{k} L^{*}_{c,k}(p)

where n_{c}(p) is the number of non-zero vectors at point p across all people.
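This definition can be sketched directly for one limb of one person. The limb endpoints below are illustrative, and the simple distance threshold stands in for the paper's limb-width criterion:

```python
import numpy as np

# Groundtruth PAF for one limb of one person: the unit vector v along
# the limb at points lying on it, and the zero vector elsewhere.
x_j1 = np.array([2.0, 2.0])              # e.g. elbow of person k (illustrative)
x_j2 = np.array([2.0, 8.0])              # e.g. wrist of person k (illustrative)
v = (x_j2 - x_j1) / np.linalg.norm(x_j2 - x_j1)   # unit limb direction

def paf_value(p, width=1.0):
    """L*_{c,k}(p): v if p lies on the limb, the zero vector otherwise."""
    d = p - x_j1
    along = d @ v                              # distance along the limb axis
    perp = abs(d @ np.array([-v[1], v[0]]))    # perpendicular distance
    on_limb = 0 <= along <= np.linalg.norm(x_j2 - x_j1) and perp <= width
    return v if on_limb else np.zeros(2)

print(paf_value(np.array([2.0, 5.0])))   # [0. 1.] -- on the limb
print(paf_value(np.array([6.0, 5.0])))   # [0. 0.] -- off the limb
```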

### Multi-person parsing using PAFs :-

Initially, the set of body part detection candidates D_{J} is obtained:

D_{J} = { **d**_{j}^{m} : for j ∈ {1…J}, m ∈ {1…N_{j}} }

where N_{j} is the number of candidates for part j, and **d**_{j}^{m} ∈ R^{2} is the location of the m-th detection candidate for body part j.

These body part candidates then need to be associated with the other body parts belonging to the same person.

For each limb type, the association is a maximum weight bipartite graph matching problem:

max_{Z_{c}} E_{c} = max_{Z_{c}} Σ_{m ∈ D_{j1}} Σ_{n ∈ D_{j2}} E_{mn} · z_{j1j2}^{mn}

s.t. ∀ m: Σ_{n} z_{j1j2}^{mn} ≤ 1, and ∀ n: Σ_{m} z_{j1j2}^{mn} ≤ 1

where E_{c} is the overall weight of the matching for limb type *c*, Z_{c} is the subset of Z for limb type *c*, and E_{mn} is the part affinity between parts **d**_{j1}^{m} and **d**_{j2}^{n} defined in Eq. 10. The two constraints enforce that no two edges share a node, i.e., no two limbs of the same type (e.g., left forearm) share a part. We can use the Hungarian algorithm to obtain the optimal matching.
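As a concrete sketch of this matching step, assuming SciPy is available, `scipy.optimize.linear_sum_assignment` implements the Hungarian algorithm; the score matrix below is made up, not a real PAF integral:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Candidate association scores E_mn for ONE limb type: rows are j1
# candidates (e.g. necks), columns are j2 candidates (e.g. right
# shoulders). Values are illustrative.
E = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
])

# The Hungarian algorithm maximizes total score; SciPy minimizes cost,
# so negate E. The one-to-one matching guarantees that no two limbs of
# the same type share a part.
rows, cols = linear_sum_assignment(-E)
matches = [(int(m), int(n)) for m, n in zip(rows, cols)]
print(matches)   # [(0, 0), (1, 1), (2, 2)]
```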

### Optimization of body part detection :-

First, a minimal number of edges is chosen to obtain a spanning tree skeleton, as shown in the graph matching figure above (c).

The matching problem is then decomposed into a set of bipartite matching subproblems, with the matching determined independently for adjacent tree nodes, as shown in the graph matching figure above (d).

This gives minimal greedy inference with a good approximation of the global solution at a fraction of the computational cost.

With these two relaxations, the optimization decomposes simply as:

max_{Z} E = Σ_{c=1}^{C} max_{Z_{c}} E_{c}

### Common failure cases :-

OpenPose fails in these examples :-