Wednesday, 2 September 2020

Not long ago, I started reading object tracking publications. I soon realized that this area of computer vision has a lot of basic concepts and terms that are different from image classification or object detection. So I had to dig a lot. This post is meant to save you the time you would have spent digging.

We are going to discuss a lot of topics here, so let's take an overview.
  • Main theme of object tracking
  • Main components (Object Detection and Re-IDentification)
  • Types and variations (Online vs. Offline, One Step vs Two Step, Anchor Based vs Anchor free)
  • Famous Object tracking datasets
  • State-of-the-art models
  • Object tracking metrics
So, I hope you understand that this is going to be a long journey. I may even end up dividing it into two parts. So grab a coffee and let's jump in...

Main theme of object tracking
Object tracking has been a longstanding goal in computer vision. The goal is to estimate the trajectories of multiple objects of interest in videos. Solving this task successfully can benefit many applications, such as action recognition, sports video analysis, elderly care, and human-computer interaction.
Unlike image classification (which is considered one of the solved problems of CV, which is why the focus is moving to other fields, like 3D image classification), object tracking is still an active research field of computer vision.
Just like we learn to walk before we run, the first goal in object tracking was to track a single object in a video (Single Object Tracking - SOT).




As time passed, we moved to more complicated tasks. As a consequence, today we are mostly dealing with tracking multiple objects like cars, persons etc. (Multiple Object Tracking - MOT).



Hopefully you can see that this is a considerably more difficult task than the previous one, as all persons and all cars look almost the same in footage shot from a distance.
Yet there are more problems. Objects often get occluded in the footage, like a person getting into or out of a car. Besides, a person or object often moves so fast that its position changes considerably between two consecutive frames. So the model has to start or close a trajectory in the middle of the video. We are trying, and also succeeding, to overcome these and many other obstacles, and to make tracking efficient in both computation and memory.

Main components
In machine learning, some problems are solvable in one shot, like image classification (the whole process can be completed with a single convolutional network). Some are not, like object detection (one network extracts features from the image, another locates the object, and yet another determines the best height and width of the box that bounds it). One-shot processes are faster, so researchers are looking for ways to run all these steps in a single shot.
The main process of object tracking is divided into two tasks: first, detecting all objects in each frame of a video, and second, identifying a specific detection in every frame of that video (actually only in those frames where that specific object is present).
Object Detection: This part is simply detecting the various objects in an image (a frame of the video) and drawing a bounding box around each of them. This task is often done in two or three steps. Object detection is, by itself, a huge and evolving field of computer vision. Some of the best models for this task are YOLO (v1...v5), Faster R-CNN, SSD and many others. But for now, we don't need to dive deeply here.

Re-ID: Now you must be thinking, isn't detection enough? Isn't our target detecting all objects through all frames? Sadly, no. We still have to assign each detection to a specific object (ID), so that we can track its movement and finally find the trajectory of that object (and similarly for every individual object in the video). That is the task of Re-IDentification: identifying the same objects again and again in every frame.
A lot of work has gone into improving it, yet we haven't got the best out of re-identification. A very popular approach is the Siamese network. Let's describe it briefly. Here we usually train the model to learn the similarity between two images. We feed two images to exactly the same convolutional model, and the model outputs a similarity score for them. It gives a higher similarity score if the two input images show the same object, and a low score otherwise. For efficient convergence, triplet loss or group loss is used as the loss function. (Let's discuss the loss functions some other day, but you can always google them.) Did you know that this is one of the most used algorithms for the famous face verification process that we often encounter on our phones? A Siamese network is also capable of learning from a very small training dataset, which is why it is considered a few-shot learning algorithm.
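To make the Siamese idea a bit more concrete, here is a minimal NumPy sketch. It is only an illustration under loud assumptions: the `embed` function is a toy random projection standing in for the shared convolutional network, the 8x8 "images" are random arrays, and the names `embed`, `similarity`, `triplet_loss` and the margin of 0.2 are all made up for this example. What it does show is the key point: both inputs pass through the same weights, and the triplet loss pushes the anchor closer to the positive than to the negative.

```python
import numpy as np

def embed(image, weights):
    """Toy stand-in for the shared network: flatten, project, L2-normalise."""
    vec = image.reshape(-1) @ weights
    return vec / np.linalg.norm(vec)

def similarity(img_a, img_b, weights):
    """Cosine similarity of two embeddings; the SAME weights serve both inputs."""
    return float(embed(img_a, weights) @ embed(img_b, weights))

def triplet_loss(anchor, positive, negative, weights, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    a, p, n = (embed(x, weights) for x in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)  # squared distance to the same object
    d_neg = np.sum((a - n) ** 2)  # squared distance to a different object
    return float(max(0.0, d_pos - d_neg + margin))

rng = np.random.default_rng(0)
weights = rng.normal(size=(8 * 8, 16))                  # hypothetical "trained" weights
person = rng.normal(size=(8, 8))                        # a detection crop
same_person = person + 0.05 * rng.normal(size=(8, 8))   # same object, next frame
other_person = rng.normal(size=(8, 8))                  # a different object

sim_same = similarity(person, same_person, weights)
sim_other = similarity(person, other_person, weights)
loss = triplet_loss(person, same_person, other_person, weights)
print(f"same object: {sim_same:.3f}, different object: {sim_other:.3f}, loss: {loss:.3f}")
```

In a real tracker the projection would be a trained convolutional network, and the similarity score would be used to decide which existing ID a new detection belongs to.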




Types and variations
Like any other creation in this mortal world, object tracking algorithms come in a number of variations. Some variations make tracking faster, some make it cheaper to compute, and some are just there. We will discuss two or three important ones.

Online vs. Offline: Offline trackers work on stored footage, so they can look at the whole video, including future frames, before deciding on a trajectory. Online trackers work on live streaming footage and can use only the current and past frames. Both methods have some advantages and some shortcomings.
Models that are suitable for live streaming video are usually very fast. Actually, they are bound to be fast, otherwise it is impossible to keep up with the video frame rate. Because of this speed, these models can be used for live object tracking. Some important online models are:








 

Saturday, 25 July 2020

EfficientNet: A new approach to Neural Network scaling

Recently I came across the paper EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. The authors propose an amazing approach for scaling (increasing the depth, width and resolution of) neural network models. Here I will discuss the approach and try to build the intuition behind it. For the mathematics and deeper insight, I recommend reading the original paper, which is linked below.
You may have heard of EfficientNet as an image classification algorithm. Then why am I referring to it as an approach, not an algorithm? Which one is it? Let's find out!
First things first. Why EfficientNet? EfficientNet models (or the approach) achieved new state-of-the-art accuracy on 5 of the 8 transfer learning datasets the authors tested, with 9.6 times fewer parameters on average.
In the paper, the authors first work out the relationship between the accuracy and the scaling (size) of a model. They find that performance increases along with an increase in width (the number of channels used in each layer), depth (how many layers the model has) and resolution (the size of the input image).
We can see that in every case the accuracy initially increases dramatically (along with the computational cost, of course), but after some time the curves almost flatten.
In each of these cases only one dimension is increased. When depth is increased, for example, everything else (width, resolution) is kept fixed. But the performance doesn't saturate so easily if we increase all of them simultaneously, and that's the first key observation in the paper.
Intuitively, this makes great sense too. For a higher-resolution image, a deeper neural network can extract more features. Similarly, a wider model can capture more fine-grained patterns from the larger number of pixels. From this we can sense that coordination and balance between the different scaling dimensions can help us achieve better performance.
Generally, people try to achieve this balance through arbitrary changes and manual tuning. But here the authors take a more methodical approach. And this approach, the compound scaling method, is the second key idea of the paper.
Without getting into the mathematical details (for which I recommend reading the original paper linked below), we can say that the authors tie depth, width and resolution to a single compound coefficient φ: depth is scaled by α^φ, width by β^φ and resolution by γ^φ. The constants α, β and γ are chosen so that α · β² · γ² ≈ 2, which means that every time φ goes up by one the computation cost roughly doubles, so the total cost grows by about 2^φ. And that is the compound scaling method.
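Here is a small sketch of the compound scaling arithmetic. The constants α = 1.2, β = 1.1 and γ = 1.15 are the ones the paper reports for the EfficientNet-B0 baseline; the function name `compound_scale` is made up for this post, and the relative-FLOPs estimate assumes that the cost of a ConvNet grows linearly with depth and quadratically with width and resolution. Treat it as a back-of-the-envelope illustration, not as the exact recipe behind the released B1 to B7 models.

```python
# Constants reported in the paper for the EfficientNet-B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # Assumption: cost grows linearly with depth, quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these constants α · β² · γ² comes out at about 1.92, so each step of φ roughly doubles the cost, as described above.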
By applying this concept and some nitty-gritty math, the authors first scaled up some existing state-of-the-art ConvNets, like MobileNets and ResNets. The scaled models showed great improvement over their original performance, as well as over their manually scaled versions.
Then they set out on a journey to develop their own optimized model. Using a recently developed neural architecture search method, they first created a baseline model, EfficientNet-B0. Then they scaled it up with the compound scaling method discussed above to get the other versions, from EfficientNet-B1 to EfficientNet-B7. And that's what did the magic!!


Here is their performance on the ImageNet dataset, compared with models of similar Top-1/Top-5 accuracy. (If you aren't sure what Top-1/Top-5 accuracy is, just take Top-1 accuracy as the regular household accuracy. Besides, a little Google/YouTube search doesn't hurt.)
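If you would rather see Top-k accuracy in code than search for it, here is a minimal NumPy sketch. The function name `top_k_accuracy` and the tiny score table are made up for illustration. ImageNet has 1000 classes and the paper reports Top-1 and Top-5; this toy example has only 3 classes, so it uses Top-1 and Top-2 instead.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]          # indices of the k best classes per row
    hits = (top_k == labels[:, None]).any(axis=1)       # is the true label among them?
    return float(hits.mean())

# Rows are samples, columns are class scores (made-up numbers).
scores = np.array([
    [0.1, 0.7, 0.2],   # true class 1: best guess is right
    [0.5, 0.3, 0.2],   # true class 1: only the second guess is right
    [0.2, 0.3, 0.5],   # true class 0: neither of the top two guesses is right
    [0.6, 0.3, 0.1],   # true class 0: best guess is right
])
labels = np.array([1, 1, 0, 0])

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"Top-1: {top1:.2f}, Top-2: {top2:.2f}")
```

So Top-1 only counts the single best guess, while Top-k forgives the model as long as the right answer is somewhere in its k best guesses.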
And that's not even the whole story; the real magic is in the rightmost column. If the term FLOPs is unknown to you, take it as the total number of addition or multiplication operations needed for the process (again, a little googling helps; you may search FLOPS vs FLOPs). It is a measure of the computation needed. And do you find something amazing??
Yep! So do I. The dramatic difference between the EfficientNet models and their comparable models. Think how computationally efficient these EfficientNet models really are. Are they really comparable within those groups??

So, that's almost the whole story of one of the most powerful and scalable image classification models of recent years. A lot is left out, like the use of EfficientNet in transfer learning, as well as the inference latency comparison with other models. But let's leave that for when you read the paper.
I hope you got the main idea behind this amazing approach. Did you know that our EfficientNet has a detective brother, EfficientDet!! But that's a story for a different day.
Thanks for reading this far. You can read the original paper at https://arxiv.org/pdf/1905.11946
This is actually my first blog as well as first paper review. So, if you have any suggestions for me, I'm always open to them. Stay safe!!!