This post is a continuation to my previous post Object Detection with Single Shot Multibox Detector. If you have not read the first part, I recommend you to read that first for a better understanding. In this post, along with some important concepts regarding training, I have talked about my observation of the model complexity which I tested during my master’s thesis. Please note that, there are very few resources available regarding the timecomplexity analysis of neural network models. However, in this post, you will find extensive discussion on this topic based on my empirical observations.
SSD network is trained endtoend for multilearning of classification and localization simultaneously. It starts with positive and negative training samples obtained from the matched priors. The global objective is to start with positive samples as predictions and being trained to regress the predictions closer to the matched groundtruth boxes. At each step for each positive sample for each grid cell it generates the following output,
 A probability vector of length c + 1, where c is the number of classes and one background class that indicates no object
 A vector with 4 elements [x; y; width; height] representing the offset required to move the prior position to fit the groundtruth bounding box
After each training step, it keeps the priors for those the global loss function is reduced and readjust them to better match with the groundtruth boxes. The model converges when either the difference between the priors and groundtruth boxes are close to zero or the loss curve remains steady for a long time (i.e not decreasing anymore). In this post, I would like to explain three important concepts which anyone needs to understand in order to train their object detection model with SSD.

Loss Function
The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf) where localization loss is the mismatch between the predicted boundary box and the ground truth box and confidence or classification loss is the loss in assigning class labels to predicted boxes. Therefore, the equation for total objective loss is defined in Equation 1.
\[ L\left (x, c, l, g \right) = \frac{1}{N}\left (L_{conf}\left ( x, c \right) + \alpha L_{loc}\left (x, l, g \right)\right)\;\;\;(1) \]
Here, \( x \) is 1 if the prior is matched to the determined ground truth box, and 0 otherwise, \( N \) is the number of matched priors, \( l \) is predicted bounding box, \( g \) is groundtruth bounding box, \( c \) is class, \( L_{conf} \) is condence loss, \( L_{loc} \) is localization loss, and α is the weight for localization loss. Usually SmoothL1 loss is used for localization on \( l \) and \( g \) and Softmax loss is used for optimizing confidence loss over multiple class confidences \( c \). Details of the derivation of the loss function is described in the original paper of SSD.

L2Regularization
L2regularization is a method used for reducing model complexity and overfitting on training data. During training, few weights can get too large and thus influence the output prediction heavily by overpowering rest of the weights. It makes the model lose its generalization ability and overfit on training data. L2regularization penalizes large weights and makes sure the model learns smaller but consistent weights throughout all the layers instead of clustering around few activations. L2regularization is added to the loss function like following equation,
\[ L = \frac{1}{N}\sum_{i=1}^{N}L_{i} + \lambda \sum_{i}^{ }\sum_{j}^{ }W_{i,j}^{2}\;\;\;(2) \]
Here, \( N \) is the number of classes, \( L_{i} \) is the loss of class \( i \), \( W_{i,j}^{2} \) is the L2regularization function that calculates sum of squares of each weight from the weight matrix W and λ is regularization hyperparameter which controls the amount or strength of regularization to be added.

Batch Normalization
Batchnormalization is another important regularization method that accelerates the model training by allowing it to use higher learningrates. While training, large dataset is passed as batches due to memory constraint. The distribution of the batches usually differ from each other which significantly affects the behavior of the model. This change in input distribution is called Covariate shift. Neural networks change the weights of each layer at each iteration at each minibatch of input. Due to covariate shift, layer activations change quickly to adapt to its changing inputs. Since the activations of a previous layer are passed as output to the next layer, eventually all the layers are affected inconsistently for each batch. This results in longer time for model convergence as well as oscillating around a local minima forever. One way to avoid this case is to normalize activations of each layer to have zero mean and unit variance. It helps each layer to learn on a more stable distribution of inputs in all the batches which speedsup the model training. However, forcing activations of all layers to be fixed mean and variance can limit the underlying representation of input data. Therefore, instead of keeping mean and variance fixed, batch normalization allows the model to learn parameters γ and β that can convert the mean and variance to any value based on input data and model architecture. The equations for calculating γ and β are explained in the original paper.
Model Complexity
Majority of the literature report time complexity of deep learning models in terms of total time taken by the model for training and inference when run on specific hardware. This is different than traditional algorithms where timecomplexity is defined by the order of input size using bigoh notation. Standard deep learning models perform millions of matrix multiplication, convolution, product and summation which is computationally very expensive. However, most of these calculations can be performed in parallel because at each layer of the network thousands of identical artificial neurons perform the same computation. This uniformity can be leveraged by using Graphical Processing Unit (GPU) due to its parallel computing architecture, large number of computational unit and higher bandwidth to retrieve from memory. I have implemented SSD MobileNet during my master’s thesis, the paper can be found here. In my work, I have observed that training my SSD model in Nvidia GeForce GTX 1070i GPU has sped up total training time by 10 times than training on fivecore CPU. Although, the model complexity discussed here is for object detection model, it is very much applicable for any kind of neural network based model.
Apart from calculating running time on specific hardware, I have also made an approximation of the asymptotic complexity to get an idea of which segment of the model takes up the largest amount of time while training and inference. Forward pass of the base CNN is dominated by the time complexity of matrix multiplication. Number of multiplication to perform depends on size of input channel (or image resolution), number of convolutional layers, number of neurons in each layer, number of filters and size of each filter in each layer and the size of output feature map. Except for the input layer, size of input channel and output channel is determined by the number of neurons in each layer. In addition, at each layer there is an activation function which can be linear or quadratic based on the function. As ReLu activation function performs elementwise calculation, it runs in quadratic time on each neuron on each layer. All these are parameters or weights to be learned during training. Therefore, runtime of a convolutional model scales linearly with the total number of learnable weights. The total timecomplexity of convolution layers for forward pass is as followed,
\[ \begin{align} \begin{split}\label{eq:1} time_{forward} & = time_{convolution} + time_{activation}\\ & = O(\sum_{l=1}^{L} n_{l1}\cdot (f\cdot f)\cdot n_{l}\cdot (m_{l}\cdot m_{l})) + O(L\cdot n_{c})\\ & = O(weights) \end{split} \end{align}\;\;\;(3) \]
Here, \( l \) is the index of a convolutional layer, \( L \) is the total number of layers, \( n_{l1} \) is the number of input channels of the \( l_{th} \) layer, \( f \) is filter height and width, \( n_{l} \) is the number of filters in \( l_{th} \) layer, \( m_{l} \) is the size of output feature map and \( n_{c} \) is number of neurons in a layer which in practice varies for each layer. For training, there is also backpropagation cost which also runs to calculate weights for all the parameters using chainrule. Thus, \( time_{backprop} \) is also O(weights).
Apart from the above, there are additional computations like batch normalization, dropout, regression and classification using sigmoid function etc. but they take up only 510% of the total training time, as tested with Chromium Event Profiling Tool. Once the model is trained and all the weights are learned, inference just performs summation of the product of weights in linear time. Memory complexity of the model is linear to the number of weights to be stored, thus O(weights).
Model Evaluation
Accuracy of the trained models are measured for both classification and localization. The accuracy metric used in called Mean Average Precision or mAP. AP is calculated for each class from the area under precisionrecall curve. Precision measures the percentage of the model predictions that are made correct and recall measures the percentage of all the possible groundtruth boxes being detected.
\[ \begin{align} \begin{split}\label{eq:pre_recall} Precision = \frac{TruePositive}{TruePositive + FalsePositive} \\ Recall = \frac{TruePositive}{TruePositive + FalseNegative} \end{split} \end{align} \;\;\;(4)\]
Here, \( TruePositive \) is the number of predictions that has IoU greater than 0.5 with groundtruth boxes, \( FalsePositive \) is the number of predictions with IoU less than 0.5 with groundtruth boxes and \( FalseNegative \) is the number of groundtruth boxes that are not detected by the model.
Now predictions are sorted by their confidence score from highest to lowest. Then 11 different confidence thresholds called ranks are chosen such that the recall at those confidence values have 11 values ranged from 0 to 1 by 0.1 interval. The thresholds should be such that the Recall at those confidence values is 0, 0.1, 0.2, 0.3 till 0.9 and 1.0. Average Precision or AP is now computed as the average of maximum precision values at these chosen 11 recall values. Therefore, AP for class \( c \) is defined as by Equation 5 where \( P(r) \) is the precision value for one of the 11 recalls r,
\[ AP_{c} = \frac{1}{11}\sum_{r\epsilon \left \{ 0.0,…1.0 \right \}}^{R} max(P(r)) \;\;\;(5)\]
Then mAP is simply the average of APs over all the classes as shown in Equation 6, where \( C \) is the total number of classes and \( AP_{c} \) is \( AP \) for class \( c \).
\[ mAP = \frac{1}{C}\sum_{c}^{C}AP_{c}\;\;\;(6) \]
The greater the mAP, the better the model is in terms of accuracy.
I hope you have found this article useful. Please feel free to comment below about any questions, concerns or doubts you have. Also, your feedback on how to improve this blog and its contents will be highly appreciated. I would really appreciate you could cite this website should you like to reproduce or distribute this article in whole or part in any form.
You can learn of new articles and scripts published on this site by subscribing to this RSS feed.
Like!! Great article post.Really thank you! Really Cool.
Hi ปั้มไลค์, thanks for your kind words! Glad that it helped you 🙂