Code is available on GitHub.
I taught this AI to dance to Drake's "In My Feelings" using PoseNet and TensorFlow.js.
I have used the pose detection model called PoseNet, implemented in TensorFlow.js. The original repository can be found at https://github.com/tensorflow/tfjs-models/tree/master/posenet.
I have modified the code to enable the following:
The input video, output video and background image can be found in the ‘video’ directory.
The SSD network is trained end-to-end to learn classification and localization simultaneously (multi-task learning). Training starts with the positive and negative samples obtained from the matched priors. The global objective is to start with the positive samples as predictions and train the network to regress those predictions closer to the matched ground-truth boxes. At each step, for each positive sample in each grid cell, it generates the predicted box offsets and class confidences as output.
After each training step, it keeps the priors for which the global loss function is reduced and re-adjusts them to better match the ground-truth boxes. The model converges when either the difference between the priors and the ground-truth boxes is close to zero or the loss curve remains steady for a long time (i.e., it is no longer decreasing). In this post, I would like to explain three important concepts that one needs to understand in order to train an object detection model with SSD.
The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf), where the localization loss measures the mismatch between the predicted bounding box and the ground truth box, and the confidence (or classification) loss measures the error in assigning class labels to predicted boxes. The total objective loss is defined in Equation 1.
\[ L\left (x, c, l, g \right) = \frac{1}{N}\left (L_{conf}\left ( x, c \right) + \alpha L_{loc}\left (x, l, g \right)\right)\;\;\;(1) \]
Here, \( x \) is 1 if the prior is matched to the determined ground truth box and 0 otherwise, \( N \) is the number of matched priors, \( l \) is the predicted bounding box, \( g \) is the ground-truth bounding box, \( c \) is the class, \( L_{conf} \) is the confidence loss, \( L_{loc} \) is the localization loss, and α is the weight for the localization loss. Usually, Smooth-L1 loss is used for localization over \( l \) and \( g \), and Softmax loss is used for optimizing the confidence loss over the multiple class confidences \( c \). Details of the derivation of the loss function are described in the original SSD paper.
L2-regularization is a method used for reducing model complexity and overfitting on the training data. During training, a few weights can grow too large and thus influence the output prediction heavily by overpowering the rest of the weights. This makes the model lose its generalization ability and overfit the training data. L2-regularization penalizes large weights and makes the model learn smaller but consistent weights throughout all the layers, instead of clustering around a few activations. L2-regularization is added to the loss function as in the following equation:
\[ L = \frac{1}{N}\sum_{i=1}^{N}L_{i} + \lambda \sum_{i}^{ }\sum_{j}^{ }W_{i,j}^{2}\;\;\;(2) \]
Here, \( N \) is the number of training samples, \( L_{i} \) is the loss for sample \( i \), the second term sums the squares of every weight \( W_{i,j} \) of each weight matrix \( W \), and λ is the regularization hyperparameter that controls the strength of the regularization.
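As a quick illustration of Equation 2 (a plain-Python sketch with hypothetical helper names, not code from my actual model), the regularized loss can be computed like this:

```python
def l2_penalty(weight_matrices, lam):
    # second term of Eq. 2: sum of squares of every weight W_ij
    # across all layers, scaled by the regularization strength lambda
    return lam * sum(w * w for W in weight_matrices for row in W for w in row)

def regularized_loss(losses, weight_matrices, lam):
    # first term of Eq. 2 (mean data loss) plus the L2 penalty
    return sum(losses) / len(losses) + l2_penalty(weight_matrices, lam)
```

A larger `lam` pulls the weights more strongly towards zero, trading training accuracy for generalization.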
Batch normalization is another important regularization method, which accelerates model training by allowing it to use higher learning rates. During training, a large dataset is passed in batches due to memory constraints. The distributions of the batches usually differ from each other, which significantly affects the behavior of the model. This change in input distribution is called covariate shift. Neural networks change the weights of each layer at each iteration on each mini-batch of input. Due to covariate shift, layer activations change quickly to adapt to their changing inputs. Since the activations of a previous layer are passed as input to the next layer, eventually all the layers are affected inconsistently by each batch. This results in a longer time for model convergence, or even oscillating around a local minimum forever. One way to avoid this is to normalize the activations of each layer to have zero mean and unit variance. This helps each layer learn on a more stable distribution of inputs across all the batches, which speeds up model training. However, forcing the activations of all layers to a fixed mean and variance can limit the underlying representation of the input data. Therefore, instead of keeping the mean and variance fixed, batch normalization allows the model to learn parameters γ and β that can convert the mean and variance to any value based on the input data and model architecture. The equations for calculating γ and β are explained in the original paper.
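To make the idea concrete, here is a minimal training-mode sketch of batch normalization in numpy (the `batch_norm` helper is mine; it omits the running statistics that a real implementation keeps for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch (rows = samples, columns = features) to zero mean
    and unit variance per feature, then rescale with the learnable gamma
    and shift with the learnable beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardized activations
    return gamma * x_hat + beta              # learned scale and shift
```

With γ = 1 and β = 0 the output is simply the standardized batch; during training the model is free to learn other values.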
The majority of the literature reports the time complexity of deep learning models as the total time taken by the model for training and inference when run on specific hardware. This differs from traditional algorithms, where time complexity is defined by the order of the input size using big-O notation. Standard deep learning models perform millions of matrix multiplications, convolutions, products and summations, which is computationally very expensive. However, most of these calculations can be performed in parallel, because at each layer of the network thousands of identical artificial neurons perform the same computation. This uniformity can be leveraged by using a Graphics Processing Unit (GPU), thanks to its parallel computing architecture, large number of computational units and higher memory bandwidth. I implemented SSD MobileNet during my master’s thesis; the paper can be found here. In my work, I observed that training my SSD model on an Nvidia GeForce GTX 1070 Ti GPU sped up total training time by 10 times compared to training on a five-core CPU. Although the model complexity discussed here is for an object detection model, it is very much applicable to any kind of neural-network-based model.
Apart from calculating running time on specific hardware, I have also made an approximation of the asymptotic complexity to get an idea of which segment of the model takes up the largest amount of time during training and inference. The forward pass of the base CNN is dominated by the cost of matrix multiplication. The number of multiplications to perform depends on the size of the input channel (or image resolution), the number of convolutional layers, the number of filters and the size of each filter in each layer, and the size of the output feature map. Except for the input layer, the sizes of the input and output channels are determined by the number of neurons in each layer. In addition, at each layer there is an activation function; since the ReLU activation performs an element-wise calculation, it runs in time linear in the number of neurons in each layer. The filters and their channel combinations are the parameters or weights to be learned during training, so the runtime of a convolutional model scales linearly with the total number of learnable weights. The total time complexity of the convolutional layers for the forward pass is as follows:
\[ \begin{align} \begin{split}\label{eq:1} time_{forward} & = time_{convolution} + time_{activation}\\ & = O(\sum_{l=1}^{L} n_{l-1}\cdot (f\cdot f)\cdot n_{l}\cdot (m_{l}\cdot m_{l})) + O(L\cdot n_{c})\\ & = O(weights) \end{split} \end{align}\;\;\;(3) \]
Here, \( l \) is the index of a convolutional layer, \( L \) is the total number of layers, \( n_{l-1} \) is the number of input channels of the \( l \)-th layer, \( f \) is the filter height and width, \( n_{l} \) is the number of filters in the \( l \)-th layer, \( m_{l} \) is the size of the output feature map and \( n_{c} \) is the number of neurons in a layer, which in practice varies per layer. For training, there is also the cost of back-propagation, which computes the gradient of every parameter using the chain rule. Thus, \( time_{backprop} \) is also \( O(weights) \).
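To make the convolutional term of Equation 3 concrete, here is a small counting sketch (the layer shapes in the test are made-up examples, not any particular network):

```python
def conv_forward_multiplies(layers):
    """Count multiply operations for one forward pass through the conv
    layers, following the first term of Eq. 3. Each layer is a tuple
    (in_channels n_{l-1}, filter_size f, num_filters n_l, out_map_size m_l)."""
    total = 0
    for n_in, f, n_out, m in layers:
        # n_in * f*f multiplies per filter position, n_out filters,
        # m*m positions in the output feature map
        total += n_in * (f * f) * n_out * (m * m)
    return total
```

Doubling the number of filters in every layer roughly doubles this count, which is the intuitive reading of \( O(weights) \).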
Apart from the above, there are additional computations like batch normalization, dropout, and regression and classification using the sigmoid function, but they take up only 5–10% of the total training time, as tested with the Chromium Event Profiling Tool. Once the model is trained and all the weights are learned, inference just performs sums of products of the weights in linear time. The memory complexity of the model is linear in the number of weights to be stored, thus also \( O(weights) \).
Accuracy of the trained models is measured for both classification and localization. The accuracy metric used is called Mean Average Precision, or mAP. AP is calculated for each class from the area under the precision-recall curve. Precision measures the percentage of the model's predictions that are correct, and recall measures the percentage of all the possible ground-truth boxes that are detected.
\[ \begin{align} \begin{split}\label{eq:pre_recall} Precision = \frac{TruePositive}{TruePositive + FalsePositive} \\ Recall = \frac{TruePositive}{TruePositive + FalseNegative} \end{split} \end{align} \;\;\;(4)\]
Here, \( TruePositive \) is the number of predictions that have IoU greater than 0.5 with a ground-truth box, \( FalsePositive \) is the number of predictions with IoU less than 0.5 with every ground-truth box, and \( FalseNegative \) is the number of ground-truth boxes that are not detected by the model.
The predictions are first sorted by confidence score from highest to lowest. Then, 11 confidence thresholds, called ranks, are chosen such that the recalls at those confidence values take the 11 values from 0 to 1 at intervals of 0.1, i.e. 0, 0.1, 0.2, … 0.9 and 1.0. Average Precision, or AP, is then computed as the average of the maximum precision values at these 11 recall levels. Therefore, AP for class \( c \) is defined by Equation 5, where \( P(r) \) is the precision value at one of the 11 recalls \( r \).
\[ AP_{c} = \frac{1}{11}\sum_{r\,\in\, \left \{ 0,\, 0.1,\, \ldots,\, 1.0 \right \}} \max\left(P(r)\right) \;\;\;(5)\]
Then mAP is simply the average of APs over all the classes as shown in Equation 6, where \( C \) is the total number of classes and \( AP_{c} \) is \( AP \) for class \( c \).
\[ mAP = \frac{1}{C}\sum_{c=1}^{C}AP_{c}\;\;\;(6) \]
The greater the mAP, the better the model is in terms of accuracy.
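For illustration, Equations 5 and 6 can be sketched in a few lines of Python (the helper names are mine; for each recall level \( r \) I take the maximum precision among points whose recall is at least \( r \), the usual 11-point interpolation):

```python
def average_precision(precisions, recalls):
    """11-point interpolated AP (Eq. 5): at each recall threshold r in
    {0.0, 0.1, ..., 1.0}, take the max precision over all points whose
    recall is >= r, then average the 11 values."""
    total = 0.0
    for i in range(11):
        r = i / 10
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11

def mean_average_precision(ap_per_class):
    """mAP (Eq. 6): the average of the per-class APs."""
    return sum(ap_per_class) / len(ap_per_class)
```

A perfect detector, whose precision stays at 1.0 all the way to recall 1.0, scores an AP of exactly 1.0.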
I hope you have found this article useful. Please feel free to comment below with any questions, concerns or doubts you have. Also, your feedback on how to improve this blog and its contents would be highly appreciated. I would really appreciate it if you could cite this website should you wish to reproduce or distribute this article in whole or in part in any form.
You can learn of new articles and scripts published on this site by subscribing to this RSS feed.
SSD is a one-step framework that learns to map raw image pixels directly to bounding box coordinates and class probabilities, solving classification and regression together in a single global step, hence the name “single shot”. Fig. 1 illustrates the architecture of SSD. It incorporates a deep convolutional neural network as the base feature extractor, similar to Faster R-CNN. However, the fully-connected layers of the extractor are discarded and replaced by a set of auxiliary convolutional layers, shown in blue. These auxiliary layers are responsible for extracting features at multiple scales and progressively decrease the size of the input at each subsequent layer. Instead of generating anchor boxes on a single feature map through a separate proposal stage as in Faster R-CNN, SSD generates anchor boxes directly on the multi-scale feature maps coming from the base CNN.
In SSD (ECCV 2016), the authors used six auxiliary layers producing feature maps of size 8 × 8, 6 × 6, 4 × 4, 3 × 3, 2 × 2 and 1 × 1 respectively. Thanks to the multiple scales, the generated anchor boxes better cover all the diverse shapes of objects, from the very small to the very large. Features extracted by each region proposal or anchor from each feature map are simultaneously passed to a box regressor and a classifier. Finally, the network outputs the coordinate offsets of predictions relative to the anchor boxes, the objectness score (here, confidence score) of each predicted box, and the probabilities of the object inside a predicted box being each of the \( C \) classes. For instance, for a feature map of size \( f = m \times n \), \( b \) anchors per cell and \( C \) total classes, SSD will output a total of \( f \times b \times (4 + C) \) values for that feature map. Anchor boxes are generated with different aspect ratios and scales similar to Faster R-CNN; however, because they are applied on feature maps at multiple scales, objects with a greater variety of sizes are detected more accurately.
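The output size per feature map is simple arithmetic; here is a one-liner sketch (the helper name is mine) together with the classic PASCAL VOC setting of 21 classes as a hypothetical example:

```python
def ssd_outputs_per_map(m, n, boxes_per_cell, num_classes):
    # each cell of an m x n feature map predicts `boxes_per_cell` anchors;
    # each anchor carries 4 box offsets plus `num_classes` class scores
    return m * n * boxes_per_cell * (4 + num_classes)
```

For an 8 × 8 map with 6 anchors per cell and 21 classes this gives 9600 values for that map alone, which hints at why post-processing is needed to prune predictions.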
Integrated with some pre- and post-processing algorithms like non-maximum suppression and hard negative mining, data augmentation, and a larger number of carefully chosen anchor boxes, SSD significantly outperforms Faster R-CNN in terms of accuracy on the standard PASCAL VOC and COCO object detection datasets, while being three times faster.
As mentioned earlier, the SSD framework is a single feed-forward convolutional neural network that performs both object localization and classification in a single forward pass. The whole network is composed of six major parts or algorithms, and each of them plays a crucial role in producing the final output.
Base feature extractor
Convolutional Box Predictor
Priors generation
Matching priors to ground-truth boxes
Hard Negative Mining
Post-processing – Non-maximum suppression
The SSD architecture is based on a deep CNN that works as the feature extractor. Deep CNNs have shown high performance in image classification on many benchmark datasets, which indicates that the features learned by these networks can also be very useful for more sophisticated computer vision tasks like object detection. For real-time inference, MobileNet is a widely used deep CNN that is at least 10 times faster than contemporary regular deep CNNs like VGGNet-16, when run on both GPU and CPU.
In the MobileNet model, each standard convolutional layer is replaced by two separate layers, a depthwise convolution followed by a pointwise convolution, together called a depthwise separable convolution. A standard convolution layer both filters and combines inputs, no matter how many channels it has, into a new set of single-channel outputs in one step. The depthwise separable convolution splits this process into two layers: a separate layer that filters each input channel independently and a separate layer that combines the filtered channels into a single output channel. This factorization drastically reduces computation and model size, resulting in a very fast model suitable for real-time performance.
(a) regular convolution (b) depthwise separable convolution
Fig 2. Regular VS depthwise separable convolution (Source)
In Fig. 2(a), regular convolution with 3 × 3 filters on a 3-channel input generates a single-channel output pixel in a single step. On the other hand, the depthwise separable convolution in Fig. 2(b) first applies a 3 × 3 filter to each of the three channels separately, generating an output that also has 3 channels, where each channel gets its own set of weights. In the next step, a pointwise convolution is applied, which is the same as a regular convolution but with a 1 × 1 kernel. The pointwise convolution simply adds up all the output channels of the depthwise convolution as a weighted sum in order to create new features.
Although the end results of regular and depthwise separable convolution are similar, a regular convolution has to perform many more mathematical operations and learn more weights than the latter. For instance, the number of multiplications required at each output position to convolve 32 3 × 3 filters with a 16-channel input using regular convolution is the following:
\[ 16 × 32 × 3 × 3 = 4608 \]
Whereas the number of multiplications required by the depthwise separable convolution is the following (about 7 times fewer):
\[ 16 × 3 × 3 + 16 × 32 × 1 × 1 = 656 \]
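These two counts can be reproduced with a tiny sketch (per-output-position multiply counts only; additions and the spatial dimensions of the feature map are ignored, and the helper names are mine):

```python
def regular_conv_multiplies(in_ch, out_ch, k):
    # one output position: every one of the out_ch filters sees
    # all in_ch input channels through a k x k window
    return in_ch * out_ch * k * k

def depthwise_separable_multiplies(in_ch, out_ch, k):
    # depthwise step: one k x k filter per input channel;
    # pointwise step: a 1 x 1 regular convolution across channels
    return in_ch * k * k + in_ch * out_ch
```

With `in_ch=16`, `out_ch=32`, `k=3` this reproduces the 4608 vs 656 figures above.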
The original MobileNet architecture uses a regular convolutional layer as the first layer, followed by 13 depthwise convolution layers, each followed by a pointwise convolution layer. Batch Normalization (BN) and the ReLU activation function are applied after each convolution. After the convolutional layers, there is an average pooling layer followed by a fully-connected layer linked to a softmax classifier for classification. The complete architecture of MobileNet is described in the original paper.
For the object detection task, the original base extractor is extended into a larger network by removing and adding some layers. From the MobileNet architecture, the last fully-connected layer is discarded and a set of auxiliary convolutional layers, called convolutional box predictor layers, is added. These layers produce multiple feature maps of sizes 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1 in successive order, which are stacked after the 38 × 38 feature map coming from the last convolutional layer of MobileNet. All six layers extract features from images at multiple scales and progressively decrease the size of the input to each subsequent layer. Performing convolution at multiple scales enables detecting objects of various sizes. For example, the 19 × 19 feature map performs better at detecting small objects, whereas the 3 × 3 map performs better for larger objects. Each of the box predictor layers is connected to two heads: a regressor to predict bounding boxes and a sigmoid classifier to classify each of the detected boxes. Let’s consider an input image as shown in Fig. 3. The image has three classes of objects, namely “person”, “car” and “vase”.
Fig. 4 shows the SSD architecture, which consists of the base feature extractor and the auxiliary predictor layers. Each layer predicts some bounding boxes for the class “person”. Eventually, the predictions from all the layers are accumulated in the post-processing layer, which provides the final prediction for “person” as the output. The same procedure follows for each of the classes present in the input image.
Priors or anchor generation is one of the most important tasks and can significantly affect the overall model performance. Priors are a collection of pre-defined boxes overlaid on the input image, as well as on the predictor feature maps, at different spatial locations, scales and aspect ratios; they act as reference points from which the model predicts the ground truth boxes. Ground truth boxes are the boxes annotated by humans while labeling the dataset. Prior boxes are really important because they provide a strong starting point for the bounding box regression algorithm, as opposed to starting predictions from completely random coordinates. Each feature map is divided into a number of grid cells, and each grid cell is associated with a set of priors, or default bounding boxes, of different dimensions and aspect ratios. Each prior predicts exactly one bounding box. In the following example, there are priors of six aspect ratios for each grid cell in each of the six feature maps. As shown in Fig. 5, each grid cell of a 5 × 5 feature map has six priors, which mostly match the classes or objects we want to detect. For example, the red priors are most likely to capture the car, whereas the blue priors are there to detect persons and vases of both smaller and larger sizes. Also, feature maps represent the dominant features of the images at different scales, so generating priors on multiple-scale feature maps increases the likelihood of an object of any size eventually being detected, localized and appropriately classified.
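A simplified prior-generation sketch is shown below (the helper name, the center-form layout and the scale/aspect-ratio formula are illustrative assumptions, not SSD's exact scheme, which also mixes in an extra scale per cell):

```python
def generate_priors(map_size, scale, aspect_ratios):
    """Center-form priors (cx, cy, w, h) in [0, 1] image coordinates:
    one box per aspect ratio, centered on every cell of a
    map_size x map_size grid."""
    priors = []
    for i in range(map_size):
        for j in range(map_size):
            cx = (j + 0.5) / map_size  # cell centers, not corners
            cy = (i + 0.5) / map_size
            for ar in aspect_ratios:
                w = scale * ar ** 0.5  # wider boxes for ar > 1,
                h = scale / ar ** 0.5  # taller boxes for ar < 1
                priors.append((cx, cy, w, h))
    return priors
```

A 5 × 5 map with six aspect ratios per cell yields 150 priors for that map alone.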
Next, each ground truth box is matched to the prior having the biggest Jaccard Index, or Intersection over Union (IoU), with it. In addition, unassigned priors having IoU greater than a threshold, usually 0.5, are also matched to ground truth boxes. At this point, all the matched priors are called positive training examples and the rest are considered negative examples. In Fig. 6, the blue prior is a positive example since it has IoU greater than 0.5 with the ground truth box of the person, shown in black, and the red prior is a negative example because it does not have IoU greater than 0.5 with any of the ground truth boxes.
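A simplified sketch of this matching step (hypothetical helpers; for brevity it only applies the IoU threshold and omits the rule that each ground truth box also claims its single best prior):

```python
def iou(box_a, box_b):
    """Jaccard index of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_priors(priors, ground_truths, threshold=0.5):
    """Label each prior positive (True) if its best IoU over all
    ground-truth boxes clears the threshold, negative otherwise."""
    labels = []
    for p in priors:
        best = max((iou(p, gt) for gt in ground_truths), default=0.0)
        labels.append(best >= threshold)
    return labels
```

A prior far away from every ground truth box has IoU 0 and becomes a negative example.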
Among the large number of priors, only a few will be matched as positive samples and the rest will be left as negatives. From the training point of view, this creates a class imbalance between positive and negative samples, which may bias the training towards negatives. To avoid this, only some of the negative examples are kept, chosen so that the training set has a negative-to-positive ratio of about 3:1; in practice, the negatives with the highest confidence loss (the ‘hardest’ ones) are preferred. This process is called hard negative mining. The reason negative samples are kept at all is that the network also needs to learn, and be explicitly told, about the features that lead to incorrect detections.
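Here is a minimal sketch of that selection (the helper name and the `(index, loss)` pair format are my own illustrative choices):

```python
def hard_negative_mine(positive_indices, negative_losses, ratio=3):
    """Keep at most `ratio` * num_positives negatives, preferring those
    with the highest confidence loss ('hardest' negatives).
    `negative_losses` is a list of (prior_index, confidence_loss) pairs."""
    num_neg = ratio * len(positive_indices)
    ranked = sorted(negative_losses, key=lambda item: item[1], reverse=True)
    return [idx for idx, _ in ranked[:num_neg]]
```

With one positive and ratio 3, only the three hardest negatives survive into the training batch.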
At the prediction stage, during both training and inference, every prior will have a set of bounding boxes and labels. As a result, there will be multiple predictions for the same object. The non-maximum suppression algorithm is used to filter out duplicate and redundant predictions and keep the top K most accurate ones. It sorts the predictions by their confidence score for each class, then repeatedly picks the prediction from the top and removes all the other predictions that have IoU greater than 0.5 with the picked one. In this way, it ends up with at most one hundred top predictions per class. This ensures that only the most likely predictions are kept by the model, while the noisier ones are discarded. Fig. 7 shows the complete SSD architecture with the non-maximum suppression layer, which filters out the less accurate predictions and outputs the one with the highest confidence score.
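The greedy procedure just described can be sketched as follows (the helper names are mine; the IoU function is repeated here so the snippet stands alone):

```python
def iou(box_a, box_b):
    # Jaccard index of two (x_min, y_min, x_max, y_max) boxes
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5, top_k=100):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop every
    remaining box that overlaps it by more than iou_threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order and len(keep) < top_k:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[i], boxes[best]) <= iou_threshold]
    return keep
```

Two near-identical boxes on the same object collapse to the single higher-scoring one, while a box on a different object survives.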
Now that we know all the individual components, it is time to train the whole framework. As mentioned earlier, because SSD is a one-step, feed-forward, end-to-end learning model, all of these components are trained simultaneously at each iteration. The training process is explained in the next part, Training Single Shot Multibox Detector.
Convolutional neural networks, or CNNs, are deep neural networks considered the most representative model of deep learning. CNNs extend traditional neural networks by combining both fully-connected hidden layers and locally-connected convolutional layers. In traditional neural networks, each neuron in a hidden layer is fully connected to all neurons in the preceding layer. However, for large 2D input images, full connectivity becomes computationally very expensive. It also does not take advantage of the local correlations of pixels in real-world images. Inspired by the biology of human vision, CNNs exploit the spatial correlation of objects present in real-world images and restrict the connectivity of each node to a local area known as its receptive field. According to LeCun, Bengio and Hinton in Deep Learning (Nature 2015), CNNs use four key ideas that greatly reduce the number of parameters to be learned and improve the approximation accuracy compared to traditional DNNs: local connections, shared weights, pooling and the efficient use of many layers.
A typical CNN architecture is demonstrated in Fig. 1.
The hidden layers in a CNN can be of three types: convolutional, pooling and fully-connected. Each neuron in the input layer takes one image pixel as an input variable and passes it to the small number of adjacent neurons in the next layer it is connected to. The image region covered by these adjacent neurons is called the receptive field, as mentioned earlier. Each neuron in a hidden layer takes a receptive field from the previous layer as input and transforms it into many different representations by applying convolution and pooling filters. Filters are 2D matrices of weights, analogous to the weight parameters of the traditional DNN described in the previous section. Each filter yields a different representation of the output of the previous layer, called a feature map. In Fig. 1, the stacked images in the convolutional and pooling layers are the feature maps, also called channels or layer depths. Filter weights are shared across a single feature map, while different shared filter weights are used for different feature maps. Activation functions are applied to each of the feature maps individually in each layer.
Convolutional filters convolve a weight matrix with the values of a receptive field of neurons in order to extract useful features or patterns. These filters learn more high-level features as the network gets deeper. At the beginning of training, the weights are randomly initialized; during training, they are adjusted through backpropagation to learn the features that are most representative for correct prediction.
In Fig. 2, an input matrix in light blue is convolved with a filter matrix in dark blue and generates a feature map in green by taking the sum of the element-wise product of each cell of the filter and the corresponding cell of the area of the input matrix covered by the filter. In this convolution, no zero-padding on the input matrix and a stride of 1 for the filter are used.
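The operation in Fig. 2 can be sketched in plain Python as follows (strictly speaking this is cross-correlation, which is what deep learning libraries implement and call convolution; the helper name is mine):

```python
def convolve2d(image, kernel):
    """'Valid' 2D convolution with stride 1 and no zero-padding: slide the
    kernel over the image and take the sum of element-wise products at
    each position. Both inputs are square lists of lists."""
    k = len(kernel)
    out_size = len(image) - k + 1
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            out[i][j] = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k)
            )
    return out
```

A 3 × 3 input with a 2 × 2 kernel produces a 2 × 2 feature map, matching the shrinking effect of a stride-1, no-padding convolution.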
While the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. In other words, pooling layers compress the responses of a receptive field into a single value to produce more robust feature representations. A typical neuron in a pooling layer (MaxPool) takes the maximum of a local patch of cells in a feature map, resulting in a lower-dimensional representation of that feature map. As a result, locally connected pooling neurons take features from patches that are already shifted by more than one row or column, thereby creating invariance to small shifts and distortions in new, unseen images.
In Fig. 3, a 5 × 5 feature map is reduced to a 3 × 3 map by taking the maximum value of each of the nine 3 × 3 patches using a stride of 1. Pooling filters have no learnable parameters, so they do not take part in backpropagation.
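A max-pooling sketch that reproduces the reduction in Fig. 3 (helper name mine; defaults chosen to match the 5 × 5 → 3 × 3 example):

```python
def max_pool(feature_map, pool_size=3, stride=1):
    """Max pooling: each output cell is the maximum of a
    pool_size x pool_size patch, with the window moved by
    `stride` rows/columns between patches."""
    n = len(feature_map)
    out_size = (n - pool_size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(pool_size) for b in range(pool_size)
            )
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]
```

On a 5 × 5 map holding the values 1 to 25 in row-major order, each output cell is simply the bottom-right value of its patch.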
CNNs have the great advantage of very deep feature extraction through a hierarchical representation of images, from pixels to high-level semantic features. In CNNs, features extracted by a convolutional layer followed by a pooling layer are passed as input to the next layer, along with all the different transformations learned by the feature maps. More importantly, CNNs learn this representation automatically from the given data, without any domain-expert-defined rules. Along with multi-layer non-linear mappings through activations, many hidden patterns of the input data can be discovered, which makes CNNs vastly more expressive than traditional computer vision algorithms. Therefore, CNNs are currently the most common model used as the base feature extractor in computer vision tasks including image classification, object detection, face recognition, etc.
Neural networks are used for three main categories of learning, namely supervised, unsupervised and reinforcement learning. Training a neural network involves the combination of several individual algorithms and helper components: a base architecture, an optimization algorithm such as gradient descent, convolution and pooling operations, a softmax classifier, etc. In this article, three of the most important components are explained.
The loss function, also called a cost function, quantifies the capacity of a network to approximate the true function that produces the desired output from the given data. In supervised learning, the desired output is the ground truth label or class \( y \) of each sample in the dataset. In a single forward pass, the network takes as inputs the weights, biases and examples from the training set and predicts a label \( \hat{y} \) for each sample. The difference between the predicted labels and the ground truth labels is then calculated, which is the total loss of the network. Although many loss functions exist, all of them essentially penalize the difference, or error, between the predicted \( \hat{y} \) for a given sample and its actual label \( y \). Mean squared error, or MSE, is a commonly used loss function. For \( m \) samples in the training set, the equation for the total loss over the weight parameters, \( J(w_{1},…w_{n}) \), using MSE is shown in Equation 1.
\[ Loss(y,\hat{y}) = J(w_{1},…w_{n}) = \frac{1}{m}\sum_{i=1}^{m}\left ( y_{i}-\hat{y_{i}} \right )^{2}\;\;\;(1) \]
The purpose of training a neural net is to minimize the loss by adjusting the parameters: weights, biases, etc. The smaller the loss, the more accurately the true function is approximated. Gradient descent algorithms are the most commonly used methods for loss function optimization. At the beginning of training, the network parameters are randomly initialized, so the loss is very high. Gradient descent calculates the first-order derivative of the loss function with respect to each of the parameters, which indicates the direction in which each parameter value has to be adjusted in order to reduce the loss. Ideally, it finds the global minimum of the loss curve, the point at which the loss is smallest. Denoting all model parameters, including the weights, by θ, gradient descent finds, for each \( \theta_{j} \) and each input vector \( X_{0} \) to \( X_{n} \), the derivative of the total loss \( J(\theta_{j0},…\theta_{jn}) \) and updates each parameter using Equation 2.
\[ \theta_{j} := \theta_{j} \;-\; \alpha \frac{\partial }{\partial \theta_{j}} J(\theta_{j0},…\theta_{jn})\;\;\;(2) \]
Here, α is the learning rate, which defines the step size in the direction in which the parameters are adjusted. The algorithm performs this operation iteratively until either the minimum is reached or the model converges.
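The update rule of Equation 2 can be sketched in a few lines (helper names mine; the gradient is supplied as a function, standing in for backpropagation):

```python
def gradient_descent_step(theta, gradients, alpha):
    # Eq. 2: move each parameter theta_j against its gradient,
    # scaled by the learning rate alpha
    return [t - alpha * g for t, g in zip(theta, gradients)]

def minimize(loss_grad, theta, alpha=0.1, steps=100):
    # iterate updates; loss_grad maps the parameter list to its gradient list
    for _ in range(steps):
        theta = gradient_descent_step(theta, loss_grad(theta), alpha)
    return theta
```

For example, minimizing \( J(\theta) = \theta^2 \) (gradient \( 2\theta \)) drives θ toward the global minimum at zero.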
The loss surface can be convex or even highly non-convex with tens of millions of parameters, depending on model complexity. Fig. 1 and Fig. 2 show examples of gradient descent on both convex and non-convex loss surfaces with only two parameters \( \theta_{0} \) and \( \theta_{1} \). The black line shows the gradient path starting from an initial point and flowing all the way down to the global minimum on the convex surface and to a local minimum on the non-convex surface.
Calculating the derivatives of millions of parameters with respect to the total loss and finding the optimal point on a high-dimensional convex or non-convex surface is analogous to finding a needle in a haystack. Doing this naively, one parameter at a time, is computationally infeasible. That’s where backpropagation comes to the rescue. Backpropagation is a fast way of calculating all the parameter gradients in a single forward and backward pass through the network, rather than requiring a separate pass for every single parameter. It uses the chain rule of partial derivatives to compute gradients of a continuously differentiable multivariate function in a neural network. The chain rule enables gradient descent algorithms to decompose a parameter derivative into a product of its individual functional components. With backpropagation, a backward pass takes roughly the same amount of computation as a forward pass, which makes gradient descent on deep neural networks feasible by speeding up training dramatically.
Fig. 3(b) illustrates backpropagation in action in a neural network with two hidden layers, layer 1 and layer 2, as shown in Fig. 3(a). After a forward pass (shown with black arrows) the loss function \( J(a^{[2]},y) \) is evaluated. Then, in the backward pass (shown with red arrows), the partial derivative of the loss \( J(a^{[2]},y) \) with respect to each of the four parameters \( W^{[1]} \), \( b^{[1]} \), \( W^{[2]} \) and \( b^{[2]} \) is calculated using the chain rule. For example, to calculate the partial derivative with respect to \( W^{[1]} \), we first need \( \frac{\partial ( J(a^{[2]},y) )}{\partial z^{[1]}} \), the derivative of the loss with respect to \( z^{[1]} \) (the weighted sum function in layer 1). In turn, finding \( \frac{\partial ( J(a^{[2]},y) )}{\partial z^{[1]}} \) requires \( \frac{\partial ( J(a^{[2]},y) )}{\partial a^{[1]}} \), where \( a^{[1]} \) is the layer 1 activation function, and the derivative through \( a^{[1]} \) comes from the derivatives of the functions \( z^{[2]} \) and \( a^{[2]} \) in layer 2. In this way, backpropagation propagates the total loss backward all the way down to each parameter, which is then adjusted accordingly using gradient descent. It therefore gives a detailed insight into how changing the weights and biases changes the overall behaviour of the network, i.e. the change in total network loss.
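To make the chain rule concrete, here is a minimal sketch of one forward and one backward pass through a tiny one-neuron-per-layer version of such a network. All values are illustrative, and the analytic gradient is checked against a numerical one:

```python
import numpy as np

# Tiny network: z1 = W1*x, a1 = sigmoid(z1), z2 = W2*a1, J = (z2 - y)^2.
# Shapes and values are illustrative, not from the figures.
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

x, y = 0.5, 1.0
W1, W2 = 0.8, -0.3

# Forward pass
z1 = W1 * x
a1 = sigmoid(z1)
z2 = W2 * a1
J = (z2 - y) ** 2

# Backward pass: propagate dJ backwards with the chain rule
dJ_dz2 = 2 * (z2 - y)
dJ_dW2 = dJ_dz2 * a1              # dz2/dW2 = a1
dJ_da1 = dJ_dz2 * W2              # dz2/da1 = W2
dJ_dz1 = dJ_da1 * a1 * (1 - a1)   # sigmoid'(z1) = a1*(1 - a1)
dJ_dW1 = dJ_dz1 * x               # dz1/dW1 = x

# Sanity check against a numerical gradient for W1
eps = 1e-6
J_plus = (W2 * sigmoid((W1 + eps) * x) - y) ** 2
numeric = (J_plus - J) / eps
print(abs(dJ_dW1 - numeric) < 1e-4)  # True
```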
The above three components are combined and run over many iterations during the training phase. After each iteration the global loss of the neural network is reduced from the previous iteration, indicating that the network is being trained better and better.
This was the last part of the series Neural Network Demystified. I hope you have found this article and the series useful. Please feel free to comment below about any questions, concerns or doubts you have. Also, your feedback on how to improve this blog and its contents will be highly appreciated. I would really appreciate it if you could cite this website should you like to reproduce or distribute this article in whole or part in any form.
You can learn of new articles and scripts published on this site by subscribing to this RSS feed.
Artificial Neural Networks (ANNs) are motivated by the nervous system of the human brain, where approximately 86 billion neurons are interconnected with \( 10^{14} \) to \( 10^{15} \) synapses. In the human nervous system, each neuron receives input signals from its dendrites and produces output signals along its axon. Comparatively, ANNs are a much smaller and simplified computing model of the nervous system, with tens to as many as thousands of neurons interconnected as processing elements, which process information by their dynamic state response triggered by external inputs (i.e. user-given data). As seen in Fig. 1, just as a biological neuron has dendrites to receive signals, a cell body to process them, and an axon to send signals out to other neurons through axon branches, the artificial neuron has a number of input channels, a processing stage called f(x), and one output analogous to axon branches that can fan out to multiple other artificial neurons. ANNs are comprised of stacks of layers, where each layer contains a pre-defined number of neurons just like the one in the figure. Neurons from each layer are connected by edges, which are similar to synapses. Axons are analogous to the output obtained from a single layer.
Any shallow neural network has at least three layers, namely an input layer, hidden layer(s) and an output layer. Hidden layers are intermediate layers between the input and output layer. In Fig. 2, the network has 7 neurons altogether. These neurons are stacked in 3 fully connected layers, meaning each neuron in one layer is connected to every neuron in the following layer. The number of layers in the network, the number of neurons in each layer, and how they are connected (for example, fully connected or locally connected) are all usually pre-defined while designing the network architecture. The first layer, consisting of 3 neurons, is the input layer; these neurons do not participate in any calculation, but simply propagate the input to the next hidden layer with their associated weights and bias. For simplicity, the bias term is omitted in this network. The hidden layer calculates the weighted sum at each neuron and passes the result as a 3 × 1 vector to the ReLU activation function. The output of ReLU is used as input to the output layer. In the output layer, a sigmoid activation function is used to transform the resultant vector into a categorical or numerical value, which is the final output of the network.
In Fig. 2, \( h_{1}^{[1]} \), \( h_{2}^{[1]} \) and \( h_{3}^{[1]} \) are the neurons of hidden layer 1. From the network, they are calculated individually from the weighted sums of \( x_{1} \), \( x_{2} \) and \( x_{3} \) as follows,
\( h_{1}^{[1]} = w_{1}\cdot x_{1}+w_{4}\cdot x_{2}+w_{7}\cdot x_{3} \)
\( h_{2}^{[1]} = w_{2}\cdot x_{1}+w_{5}\cdot x_{2}+w_{8}\cdot x_{3} \)
\( h_{3}^{[1]} = w_{3}\cdot x_{1}+w_{6}\cdot x_{2}+w_{9}\cdot x_{3} \)
Now \( h_{1}^{[1]} \), \( h_{2}^{[1]} \) and \( h_{3}^{[1]} \) are passed to the corresponding ReLU activation functions \( A_{1}^{[1]} \), \( A_{2}^{[1]} \) and \( A_{3}^{[1]} \) for non-linear transformation. The weighted sum of the activations generates the final output \( \hat{y} \). \( w_{10} \), \( w_{11} \) and \( w_{12} \) are the weights associated with \( A_{1}^{[1]} \), \( A_{2}^{[1]} \) and \( A_{3}^{[1]} \) respectively. Therefore, the function for calculating \( \hat{y} \) is defined in the following equation,
\( \hat{y} \) = \( f\left ( x_{1},x_{2}, x_{3} \right ) \) = \( Sigmoid(w_{10}\times ReLU({h_{1}}^{[1]}) + w_{11}\times ReLU({h_{2}}^{[1]}) + w_{12}\times ReLU({h_{3}}^{[1]})) \)
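The whole forward pass of the Fig. 2 network can be sketched in plain Python. The weights and inputs below are made up, since the figure does not give numeric values:

```python
import math

# Illustrative inputs and weights for the Fig. 2 network (made up values).
x1, x2, x3 = 1.0, 2.0, -1.0
w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.5, -0.4, 0.3]
(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12) = w

# Hidden-layer weighted sums, exactly as in the equations above
h1 = w1*x1 + w4*x2 + w7*x3
h2 = w2*x1 + w5*x2 + w8*x3
h3 = w3*x1 + w6*x2 + w9*x3

relu = lambda v: max(0.0, v)
sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

# y_hat = Sigmoid(w10*ReLU(h1) + w11*ReLU(h2) + w12*ReLU(h3))
y_hat = sigmoid(w10*relu(h1) + w11*relu(h2) + w12*relu(h3))
print(0.0 < y_hat < 1.0)  # True: the sigmoid output is always in (0, 1)
```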
Deep neural networks (DNNs) are essentially ANNs with many hidden layers, ranging from 3 to as many as 152 (ResNet). The more layers stacked up between the input and output layer, the deeper the network is considered. The deeper the network, the more complex the representations it can learn from the data. The hidden layers learn data representations in hierarchical order, which means that the output of a hidden layer is passed as input to the next consecutive hidden layer, and so on.
Many recent breakthrough results have been obtained using very deep models. However, stacking up layer after layer does not always guarantee better model accuracy. In Deep residual learning for image recognition (CVPR 2016), the authors report that as network depth increases, accuracy saturates after a point. Deeper models also require computing a larger number of parameters, which makes them computationally expensive and slower. Fig. 3 shows a deep neural network with \( n \) hidden layers.
Next part is Neural Network Demystified Part III – Gradient Descent and Backpropagation.
I hope you have found this article useful. Please feel free to comment below about any questions, concerns or doubts you have. Also, your feedback on how to improve this blog and its contents will be highly appreciated. I would really appreciate it if you could cite this website should you like to reproduce or distribute this article in whole or part in any form.
You can learn of new articles and scripts published on this site by subscribing to this RSS feed.
Although the original intention of designing neural networks was to simulate the human brain in order to solve general learning problems in a principled way, in practice neural network algorithms work in a much simpler way than a human brain. Current neural networks can be compared to statistical models whose higher-level purpose is to map inputs to their corresponding outputs. Through many complex matrix computations going on under the hood, they try to find correlations by approximating an unknown function f(x) = y between any input x and any output y. In the process of learning, they strive to discover the most accurate function f, which is the best approximation for transforming x into y. These networks can be either linear or non-linear. Non-linear networks can approximate any function to an arbitrary degree of accuracy. For this reason, they are also called “universal approximators”. In this tutorial, let us learn about the building blocks of a non-linear shallow neural network, followed by the architecture of a standard deep neural network.
An artificial neuron (derived from the neuron, the basic computational unit of the brain), also referred to as a perceptron, is a mathematical function that takes one or more inputs and returns their weighted sum as output. The weighted sum is passed through an activation function to transform the otherwise linear function into a non-linear one. This is why neural networks perform well at deriving very complex non-linear functions. Weights are values associated with each input; they are the parameters the network learns from the input data in order to accurately map inputs to the desired output.
In Fig. 1, input \( X \) is a vector containing three elements \( x_{1} \), \( x_{2} \) and \( x_{3} \). Each element has a weight, \( w_{1} \), \( w_{2} \) and \( w_{3} \) respectively, associated with it. In addition to the weights, there is also a bias term \( b \) which helps the function better fit the data. The neuron \( f(x) \) calculates the weighted sum using Equation 1. \( f(x) \) is also denoted by \( z \) interchangeably.
\[ f(x) = z = w_{1}\cdot x_{1} + w_{2}\cdot x_{2} + w_{3}\cdot x_{3} + b \;\;\;(1) \]
The resultant output is then passed to an activation function A(f(x)) which returns the final output of the neuron.
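Equation 1 followed by an activation can be sketched in a few lines of Python; the weights, bias and inputs below are illustrative:

```python
import math

# A single artificial neuron computing Equation 1, with made-up
# weights, bias and inputs, followed by a sigmoid activation A(f(x)).
x1, x2, x3 = 2.0, -1.0, 0.5
w1, w2, w3, b = 0.4, 0.3, -0.2, 0.1

z = w1*x1 + w2*x2 + w3*x3 + b     # f(x) = z, the weighted sum
a = 1.0 / (1.0 + math.exp(-z))    # activation A(f(x))
print(z, a)
```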
An activation function is a function applied to a neuron to instill non-linear properties in the network. A relationship is considered linear if a change in one variable produces a constant change in the subsequent variables. A non-linear relationship between variables, on the other hand, is one where a change in one variable does not necessarily produce a constant chain of changes in the subsequent variables, although they can still impact each other in an unpredictable or irregular manner. Without the injection of some non-linearity, a large chain of linear algebraic equations will eventually collapse into a single equation after simplification. Therefore, the remarkable capacity of neural networks to approximate any convex or non-convex function is a direct result of their non-linear activation functions. In the small visual example in Fig. 2, we can see that when data points form non-linear patterns, a straight line cannot fit them accurately and misses some data points, whereas a non-linear function captures the difficult pattern efficiently and fits all data points.
Every activation function takes the output vector of f(x) as input and performs a pre-defined element-wise operation on it. Among the many activation functions, three are most commonly used.
Sigmoid
The sigmoid non-linearity function has the following mathematical form,
\[ sigmoid(f(x)) = \frac{\mathrm{1} }{\mathrm{1} + e^{-f(x) }}\;\;(2) \]
It takes a real value as input and squashes it between 0 and 1 so that the range does not become too large. However, for large positive or negative activations the function saturates, and the gradients in these regions get very close to zero (i.e. no significant weight update happens there), causing the “vanishing gradient” problem. Fig. 3 shows the sigmoid function graph, which flattens out in the positive and negative domains as values get bigger.
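The saturation behaviour is easy to verify numerically: the sigmoid gradient \( sigmoid(z)(1 - sigmoid(z)) \) peaks at 0.25 for z = 0 and all but vanishes for large |z|:

```python
import math

# Numerical illustration of sigmoid saturation: the gradient
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) is largest at z = 0
# and nearly zero for large |z|, which is what stalls weight updates.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0))    # 0.25, the maximum
print(sigmoid_grad(10))   # ~4.5e-05, effectively vanished
```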
Hyperbolic Tangent
The hyperbolic tangent or tanh non-linearity function has the following mathematical form,
\[ tanh(f(x)) = \frac{e^{f(x)}-e^{-f(x)}}{e^{f(x)}+e^{-f(x)}}\;\;(3) \]
It takes a real value as input and squashes it between -1 and 1. However, like the sigmoid function, it suffers from the “vanishing gradient” problem in the positive and negative domains for the same reason. Fig. 4 shows the graph of the hyperbolic tangent function.
Rectified Linear Unit (ReLU)
The ReLU has the following mathematical form,
\[ ReLU(f(x)) = max(0,f(x))\;\;(4) \]
ReLU has gained huge popularity as an activation function due to its edge over sigmoid and tanh in a couple of ways. It takes a real value as input and maps it to the range 0 to +infinity. Because it does not saturate in the positive domain, it avoids the vanishing gradient problem there, which also accelerates the convergence of stochastic gradient descent. Another big advantage of ReLU is that it involves computationally cheaper operations than the expensive exponentials of sigmoid and tanh. However, ReLU saturates in the negative domain, which causes it to discard all negative values. Thus it may not be suitable for capturing patterns in all datasets and architectures. One solution to this problem is LeakyReLU.
Instead of zeroing out negative activations, LeakyReLU applies a small negative slope of around 0.01. Therefore, LeakyReLU has the following mathematical form, where \( \alpha \) is the slope.
\[ LeakyReLU\left ( f\left(x \right ) \right ) = \left\{\begin{matrix} f(x)\;\;for\;\;f(x) \geq 0\\ \alpha f(x)\;\; for\;\;f(x) < 0 \end{matrix}\right. \;\;(5) \]
Fig. 5 shows the ReLU graph, where the function increases linearly in the positive domain and is zero elsewhere.
Fig. 6 shows the LeakyReLU graph, where the function increases linearly in the positive domain but also allows gradients to flow by a restricted amount in the negative domain. Here, \( \alpha \) is 0.1.
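The difference between ReLU and LeakyReLU can be seen in a two-line sketch (the slope of 0.1 below matches the α used in Fig. 6):

```python
# ReLU zeroes out negative activations entirely, while LeakyReLU
# (here with an illustrative slope alpha = 0.1) lets a small
# gradient flow in the negative domain.
def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.1):
    return z if z > 0 else alpha * z

print(relu(-5.0), leaky_relu(-5.0))   # 0.0 vs -0.5
print(relu(3.0), leaky_relu(3.0))     # both 3.0
```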
Next part is Neural Network Demystified Part II – Deep Neural Network.
I hope you have found this article useful. Please feel free to comment below about any questions, concerns or doubts you have. Also, your feedback on how to improve this blog and its contents will be highly appreciated. I would really appreciate it if you could cite this website should you like to reproduce or distribute this article in whole or part in any form.
You can learn of new articles and scripts published on this site by subscribing to this RSS feed.
Conducting quality research in any field is hard, but not as hard as explaining that research to a room full of people with non-technical backgrounds, and in just 3 minutes! As a person who is never afraid of taking on challenges, I couldn’t afford to lose the opportunity to participate in the prestigious 3-Minute Thesis Competition and put myself to one of the most difficult tests I had taken so far. The biggest challenge was not presenting my thesis in front of such a large audience from all the different disciplines, but presenting it in such a palatable manner that my audience could comprehend and enjoy the flavour of my work. Then there was the challenge of compressing my entire thesis journey of almost two years into just 3 minutes. However, where there is a challenge or test, there is the chance to shine and come out as a gladiator. My motivation to participate in this competition was to achieve both. During my graduate research assistantship in IDIR, I was frequently pushed to the front to explain the workings of my AI models to our industrial clients in layman’s terms, without making them lose interest half-way. To be honest, I have found that a more difficult task than presenting my work to people from the same or similar fields. No matter how satisfied our clients said they were, I always felt that I could do better, that I was not shining as brightly as I am capable of in this job, until my performance in the 3-Minute Thesis Competition.
I spent almost a week coming up with a crisp and solid script, which would include a little humour, a little twist and all the highlights of my thesis and the contributions made over the past two years. After lots of brainstorming and practicing the presentation quite a few times, I was ready for the battle. Although I was not among the top three winners, I was one of the top five presenters out of all 16 participants. I also received huge appreciation from my peers and the general audience for my successful attempt to share such highly technical work with people from different academic backgrounds in such a comprehensive and coherent way.
After that competition, I really felt that I had shone in my own way. More importantly, I gained huge confidence that I can be a compelling public speaker too, where necessary. I would recommend anyone reading this to consider taking part in such competitions whenever you feel you are not good enough at something. Give it your best shot and you will be amazed by your own capabilities! Let the inner gladiator in you wake up and stand out in the crowd.
Now, if you have 3 more minutes, I request you to spend them on learning about my master’s thesis. I will really appreciate any kind of feedback or suggestion from you in this regard.
Please check my 3-Minute Thesis Competition video recording and the video transcript below.
Original Thesis: Real-time Automated Weld Quality Analysis from Ultrasonic B-scan using Deep Learning
3-MT Competition Title: Weld Quality Control Using Artificial Intelligence
100 years back when Henry Ford first made cheap, reliable cars, people said, “But, what’s wrong with a horse?” But with time, as cars became more affordable, they became not just essential but also the reflections of ourselves, our emotions and pride. But how many of us know that this reflection of ours is made of over 30,000 unique parts? Yes it’s a miracle that cars don’t break down more often! And how many of us know that on average, a brand new car is built every 16 seconds. In the time it takes you to put milk in a cup of tea, a whole car rolls off a production line somewhere in the world.
One of the essential jobs in car manufacturing is spot welding. It is the process of joining two or more metal parts by melting them at a small spot through heat. Most industries use spot welding robots for large-scale car manufacturing. However, the quality inspection of the welded parts is still manual. In fact, industries take one sample from a batch of welded parts and tear it apart to check if it is properly welded. But this process is destructive and expensive. Not only that, it lets many weld samples pass unnoticed. A better, non-destructive alternative is to capture the whole welding process as an ultrasonic image, like the one behind me, and monitor the images during production. However, just imagine how tiresome it would be for a person to keep staring at thousands of boring images like this for hours on end. Thankfully, we have machines willing to do all our mundane tasks without any complaint.
This is where my research comes into play. I have trained an AI algorithm which can identify patterns visible in the weld image and decide, just like a human operator, whether the sample should be passed or welded again. During training, each time the algorithm gives a decision, it receives feedback on how close its decision is to the operator’s decision. Then, in the next trial, it makes some numerical adjustments by itself and tries to give a better decision. This trial and error continues until it starts to act just like a human. After practicing on thousands of images, it is now ready to give correct decisions on the quality of new, unseen weld samples in production within just 200 milliseconds.
The impact of my research is huge. Previously robots were blind; now they can see how well they are welding. It also frees humans from another repetitive, no-brainer job and gives them more time to spend on more interesting tasks. More importantly, it will let industries save lots of time and money and make cars safer, more durable and more affordable. A prototype of my algorithm is being tested in the production facility of one of the top car manufacturers in the world, and so far they are quite satisfied with the performance.
Now I am very excited about the outcome of my research, not because of the reasons I just mentioned but because my work has the potential to allow more and more people express themselves through their cars. How excited are you feeling right now?
Thank you very much!
Before delving deep (pun intended!), let’s go through a quick overview of what a neural network is. Let’s consider figure 1.
It is a graphical representation of a linear model drawn with nodes and edges. It has three inputs \( x_{1} \), \( x_{2} \) and \( x_{3} \) with their associated weights \( w_{1} \), \( w_{2} \) and \( w_{3} \) respectively. There is usually an additional bias term \( b \) for each of the inputs; for simplicity, let’s assume there is no bias term in this model. The inputs and weights are mapped to the output \( \hat{y} \), which can be a class label (classification task) or a real-valued result (regression task). The output \( \hat{y} \) is calculated from the following formula,
\[ \hat{y} = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} \]
Since it computes a linear formula, we can refer to this model as a linear model. Unfortunately, in most cases real-world data is hardly linearly separable. Therefore, we often need to add some non-linearity to our classification or regression model. Non-linearity adds new dimensions to the feature space, which turns out to be helpful because linearly inseparable data often becomes separable when transformed to higher dimensions. To add non-linearity to this linear model, let’s add an intermediate layer with three nodes or neurons between the input and output. This intermediate layer is called a hidden layer.
In figure 2, each input now has three edges, one towards each of the hidden nodes \( h_{1} \), \( h_{2} \) and \( h_{3} \) in hidden layer 1. This model can be labelled a neural network, considering each hidden node as a neuron. The output \( \hat{y} \) is now calculated by the following,
\[ \hat{y} = w_{10}h_{1}^{[1]} + w_{11}h_{2}^{[1]} + w_{12}h_{3}^{[1]} \]
Theoretically, there can be as many intermediate hidden layers as we like. In figure 3, there are \( n \) hidden layers. The calculation proceeds the same way as in the model with a single hidden layer, although the number of parameters (i.e. weights) to compute grows rapidly.
Now, let’s go back to our initial question. How many hidden layers are needed to call a neural network a deep neural network? In other words, how deep is deep enough?
In order to answer this question, we need to understand the linear algebra taking place under the hood. In figure 4, each of the hidden nodes is calculated only from the edges coming towards it (incoming edges of the same color).
Therefore,
\( h_{1}^{[1]} = w_{1}x_{1} + w_{4}x_{2} + w_{7}x_{3} \)
\( h_{2}^{[1]} = w_{2}x_{1} + w_{5}x_{2} + w_{8}x_{3} \)
\( h_{3}^{[1]} = w_{3}x_{1} + w_{6}x_{2} + w_{9}x_{3} \)
Combining all three, at the output node we have,
\( \hat{y} = w_{10}h_{1}^{[1]} + w_{11}h_{2}^{[1]} + w_{12}h_{3}^{[1]} \)
However, \( h_{1}^{[1]} \), \( h_{2}^{[1]} \) and \( h_{3}^{[1]} \) are just linear combinations of the input features. Expanding the above equation, we end up with the following,
\( \hat{y} = (w_{10}w_{1} + w_{11}w_{2} + w_{12}w_{3})x_{1} + (w_{10}w_{4} + w_{11}w_{5} + w_{12}w_{6})x_{2} + (w_{10}w_{7} + w_{11}w_{8} + w_{12}w_{9})x_{3} \)
It gives us a complex set of weight constants multiplied with original inputs. Substituting each group of weights with a new weight gives us the following,
\( \hat{y} = W_{1}x_{1} + W_{2}x_{2} + W_{3}x_{3} \)
which is exactly the same form as our very first equation from the linear model! We can repeat this with as many hidden layers as we want and a huge set of weights; after all the complex calculations, our model will eventually collapse into nothing but a linear model.
Let’s look at this from the algebraic perspective for the model in figure 5.
We are basically multiplying multiple matrices in a chain throughout the entire model. If we consider the first hidden layer as a vector \( H_{1} \) with three components \( h_{1}^{[1]} \), \( h_{2}^{[1]} \) and \( h_{3}^{[1]} \), represent all the weights between the input and \( H_{1} \) as a weight matrix \( W_{1} \), and \( x_{1} \), \( x_{2} \), \( x_{3} \) as the input vector \( X \), then we get \( H_{1} = W_{1}^{T}X \).
\( H_{1} \) is a 3×1 vector obtained by multiplying the transposed 3×3 weight matrix \( W_{1} \) by the 3×1 input vector \( X \). That is how the values of the first hidden layer neurons are found. To find the second hidden layer neurons, we multiply the transposed second-layer weight matrix \( W_{2} \) by the resultant vector \( H_{1} \).
As we can see, two 3×3 matrices can be combined into a single 3×3 matrix by calculating their matrix product, which still gives the same shape for \( H_{2} \). We can test this with a different number of neurons in the second hidden layer. Say that instead of 3, \( H_{2} \) has two neurons; then the expected shape of the \( H_{2} \) vector would be 2×1.
Thus, the shape of \( H_{2} \) still holds and the values for each of its neurons are correctly calculated. Now, let’s calculate the final output by adding the weight matrix consisting of the weights between the final layer (i.e. the output) and \( H_{2} \). We assume that there are three neurons in \( H_{2} \).
Once again, our large chain of matrix multiplications has collapsed into a single 3-valued vector. In fact, if we train both the linear model and this 2-hidden-layer neural net and stop at the same loss value, we will see that the effective input-to-output mapping learned by the two models would match, even after performing a huge number of calculations. That means just stacking up layer after layer in a network is not sufficient to add non-linearity to the model. As long as non-linearity is not added, the model is no better than a linear model, and thus cannot be called ‘deep’ no matter how many layers we have in between.
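This collapse is easy to verify numerically. In the NumPy sketch below, the weight matrices are random and purely illustrative; the two-layer linear map always equals a single combined matrix:

```python
import numpy as np

# Demonstration that stacking purely linear layers collapses into one:
# the two-step map W2^T (W1^T x) equals the single-step map (W1 W2)^T x,
# because W2^T W1^T = (W1 W2)^T. Shapes and values are illustrative.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 3))
W2 = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

two_layers = W2.T @ (W1.T @ x)
one_layer = (W1 @ W2).T @ x   # a single combined weight matrix

print(np.allclose(two_layers, one_layer))  # True
```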
How to add non-linearity then? By adding a non-linear transformation layer which is facilitated by a non-linear activation function. In other words, in the computation graph in figure 6, we can imagine each hidden layer having two nodes in place of each single node.
The green node calculates the input coming through the incoming edges, which is a linear function. The output of this node is then passed to the next red node, which transforms that linear equation into a non-linear one. Finally, the output of the red node is passed to the next hidden layer. So, the activation acts as the transition point between two hidden layers. Adding this non-linear transition point is the only way to stop a neural network condensing back into a shallow linear model. There are many activation functions for this transformation, such as sigmoid, tanh and ReLU. Even if we have an activation function between two intermediate layers but not elsewhere, the consecutive layers without one will collapse into a single layer, by the same principle discussed above. For this reason, most neural networks have activation functions (mostly ReLU) between all the hidden layers, with a linear output for regression and a sigmoid or softmax function for classification before the final output.
Let’s compare this new network in figure 6 to the previous network in figure 5 from a linear algebra perspective. As an activation function, we add ReLU between H1 and H2. For each component of the resultant matrix \( W_{1}^{T}X \), ReLU takes the maximum of zero and that component. In other words, in the positive domain the original value is retained, while in the negative domain the value is set to zero. Because this activation function is applied element-wise to the resultant matrix, there is no way to express \( f(W_{1}^{T}X) \) as a matrix product in linear algebra. Thus, that portion of the transformation chain cannot be collapsed into a simpler, linear function. As a result, the complexity of the network remains and does not get simplified into a linear combination of the inputs.
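A small NumPy sketch makes this concrete; the matrices and input below are illustrative:

```python
import numpy as np

# With an element-wise ReLU inserted between the layers, the chain no
# longer collapses: W2^T relu(W1^T x) is not expressible as a single
# matrix product such as (W1 W2)^T x. Values are illustrative.
W1 = np.array([[1.0, 2.0, 0.0],
               [0.0, 1.0, 1.0],
               [1.0, 0.0, 1.0]])
W2 = np.eye(3)
x = np.array([1.0, -1.0, 0.0])

nonlinear = W2.T @ np.maximum(0, W1.T @ x)   # ReLU applied element-wise
collapsed = (W1 @ W2).T @ x                  # the would-be linear collapse

print(np.allclose(nonlinear, collapsed))  # False: they differ
```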
However, the final layer and hidden layer 2 (H2) can still be collapsed into a single linear function, because no non-linearity is added between them.
Therefore, the answer to our initial question is: for a neural network to be considered a deep neural network, some kind of non-linearity or complexity must be passed between the layers in hierarchical order (i.e. the output of a hidden layer’s activation function is passed as input to the next hidden layer). With this non-linearity preserved, a neural network with even a single hidden layer can be considered ‘deep’, whereas without it, layer after layer stacked onto the network will not be enough to make it a ‘deep neural net’.
I hope you have found this article useful. Please feel free to comment below about any questions, concerns or doubts you have. Also, your feedback on how to improve this blog and its contents will be highly appreciated. I would really appreciate it if you could cite this website should you like to reproduce or distribute this article in whole or part in any form.
You can learn of new articles and scripts published on this site by subscribing to this RSS feed.
In addition to the above-mentioned resources, there are some extremely important areas AI shares and extends concepts with. The three most useful of them are Linear Algebra, Calculus (extending to multivariate calculus), and Probability and Statistics. A complete understanding of these areas will help you grasp many involved topics in neural networks, deep learning, computer vision, etc. Below are some great books and online resources you should check out in order to master them.
The best way to learn a concept is to apply it to solve a real problem with hands-on programming. In the fields of AI, you will have to work with large amounts of data and, at a more advanced level, big data. Therefore, a decent grasp of software engineering, including parallel programming and how to work with CPUs, GPUs and TPUs (Tensor Processing Units), is equally important. Below are some resources where you can gain firm knowledge about these topics.
I hope you have found this article useful. Please feel free to comment below about any questions, concerns or doubts you have. Also, your feedback on how to improve this blog and its contents will be highly appreciated.
You can learn of new articles and scripts published on this site by subscribing to this RSS feed.
This article is copyrighted. Please cite this website should you like to reproduce or distribute this article in whole or part in any form.