NeurIPS 2020 Papers: Takeaways for a Deep Learning Engineer –  Computer Vision

As mentioned in part 1 (the most important thing :)), I went through all the titles of the NeurIPS 2020 papers (more than 1,900!), read the abstracts of 175 of them, and extracted insights relevant to a DL engineer from the following papers.

This is part 2. See part 1 here.

Rethinking Pre-training and Self-training

Using other datasets to do better on a target dataset is ubiquitous in deep learning practice. This can take the form of supervised pre-training (e.g., ImageNet classification), self-supervised pre-training (e.g., SimCLR on unlabeled data), or self-training.

(Self-training is a process in which an intermediate model (the teacher), trained on the target dataset, creates labels (called pseudo labels) for another dataset; the final model (the student) is then trained on both the target dataset and the pseudo-labeled dataset.)
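
To make the teacher/student loop concrete, here is a minimal sketch using a toy 1-D nearest-centroid classifier. The classifier, the data, and the function names (`fit_centroids`, `predict`, `self_train`) are all illustrative assumptions and not the paper's setup, which trains object detectors with heavy augmentation.

```python
import random

def fit_centroids(points, labels):
    """Fit a toy 1-D nearest-centroid classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, points):
    """Assign each point to the class with the closest centroid."""
    return [min(centroids, key=lambda y: abs(x - centroids[y])) for x in points]

def self_train(target_x, target_y, extra_x):
    # 1. Teacher: trained on the labeled target dataset only.
    teacher = fit_centroids(target_x, target_y)
    # 2. Pseudo labels: the teacher labels the other (unlabeled) dataset.
    pseudo_y = predict(teacher, extra_x)
    # 3. Student: trained on the target data plus the pseudo-labeled data.
    student = fit_centroids(target_x + extra_x, target_y + pseudo_y)
    return teacher, student

rng = random.Random(0)
target_x = [0.0, 1.0, 9.0, 10.0]          # small labeled target dataset
target_y = [0, 0, 1, 1]
extra_x = ([rng.gauss(0.5, 0.5) for _ in range(50)]
           + [rng.gauss(9.5, 0.5) for _ in range(50)])  # unlabeled extra data
teacher, student = self_train(target_x, target_y, extra_x)
print(predict(student, [0.2, 9.8]))
```

The extra cost mentioned in the takeaway shows up here as two training runs plus one labeling pass, instead of a single weight initialization.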

Building on previous work, the current work shows that the usefulness of ImageNet pre-training (starting from pre-trained weights rather than random ones) or self-supervised pre-training decreases as the size of the target dataset and the strength of the data augmentation increase. ImageNet pre-training didn’t help, and in some cases hurt, when training on the COCO dataset for object detection.

Self-training, however, helped in both the low-data and high-data regimes, and with both strong and weak data augmentation strategies. It helped when pre-training didn’t, and improved on pre-training when it did.

ImageNet pre-training vs Self-training when the strength of the data augmentation is changed. Image from the pdf of the current paper.
Descriptions for Labels in the images above and below. Image from the pdf of the current paper.
Random Init vs ImageNet pre-training vs Self-training. Self-training helps when pre-training doesn’t help and improves on it when it does help. Image from the pdf of the current paper.

Takeaway: When you want to leverage other datasets in training a model on a target dataset, use self-training rather than ImageNet pre-training. But keep in mind that self-training takes more resources than just initializing your model with ImageNet pre-trained weights.

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

Different object detection models employ different intermediate representations from which the bounding box predictions are made.

For example, RetinaNet uses a bounding box (anchor) representation: it computes features for each bounding box instance defined by the anchor boxes at each position of the feature grid. For a feature grid of size H x W, RetinaNet places 9 anchor boxes (with pre-specified scales and aspect ratios) at each position, giving 9 x H x W bounding box instances. These instances go through IoU thresholding, class prediction, offset regression, and NMS, among other steps, to produce the final set of bounding boxes for an image.
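
The 9 x H x W count is easy to check with a small sketch. The function name `make_anchors` and the default `stride` and `base_size` values are illustrative assumptions; the 3 scales x 3 aspect ratios per location follow the usual RetinaNet configuration.

```python
import math

def make_anchors(height, width, stride=8, base_size=32,
                 scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3)),
                 ratios=(0.5, 1.0, 2.0)):
    """Return a list of (x1, y1, x2, y2) anchors, 9 per feature-grid cell."""
    anchors = []
    for row in range(height):
        for col in range(width):
            # Center of this feature-grid cell in image coordinates.
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    area = (base_size * scale) ** 2
                    w = math.sqrt(area / ratio)  # ratio is height / width
                    h = w * ratio
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(height=4, width=5)
print(len(anchors))  # 9 * 4 * 5 = 180
```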

Different intermediate representations for different models. Image from the pdf of the current paper.

FCOS and CenterNet use a center point as the representation format and estimate bounding boxes by predicting offsets along the x and y dimensions from the center point. The remaining processing steps have much the same objective as in RetinaNet or any other object detection model.

CornerNet instead uses corner points (top left and bottom right) as the representation format and creates a bounding box from those corner points.

Different representations are prevalent in object detection because each representation is good at some specific thing compared to all others. Bounding box representation is better aligned with annotation formats of datasets and is better at classification. Center point representation is better for detecting small objects. Corner point representation is better at localization.

The current work aims to combine the strengths of all these representations. For a given object detection model, it improves the features of the primary representation (bounding boxes, in the case of RetinaNet) by also taking into account features from auxiliary representations (here, center points and corner points).

Illustration of working of BVR combining bounding box representations, center representations, and corner points representations. Image from the pdf of the current paper.

The authors propose a Transformer-based module. Given the feature vector of the primary representation at a location on the feature grid (the query), it computes attention weights against the feature vectors of the auxiliary representations at relevant locations and returns a weighted average of these auxiliary feature vectors.

The module, called Bridging Visual Representations (BVR), uses both the primary feature vector and the weighted average of the auxiliary feature vectors for classification and localization, thus combining the strengths and expressive power of the different representational choices.
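
The attention step described above can be sketched in a few lines. This is a plain scaled dot-product attention for a single query, under simplifying assumptions: no learned projections, no geometric term, and the attended feature is combined with the primary feature by element-wise addition. The function name `attend` and the toy vectors are hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax over the auxiliary locations (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the auxiliary feature vectors.
    dim = len(values[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(dim)]
    return weights, pooled

primary = [1.0, 0.0]                            # e.g. a bounding box feature
aux = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]     # e.g. center/corner features
weights, pooled = attend(primary, aux, aux)
# Assumption: combine primary and attended features by addition.
enhanced = [p + a for p, a in zip(primary, pooled)]
print(weights, enhanced)
```

Auxiliary features that look most like the query receive the largest weights, so the enhanced feature leans toward the center or corner evidence that agrees with the box.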

RelationNet++ outperforms every other method. Image from the pdf of the current paper.

Takeaway: This was the state-of-the-art model at the time of publication, and the approach makes sense. Any approach that combines the strengths of multiple solutions in a non-trivial way should stay valuable for a long time. Consider this method when you train your next object detection model. (Too many good things for object detection!)

Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning

Without a downstream task, it is hard to quantitatively evaluate image representations, i.e., to judge the clusters formed from those representations for semantic coherence and natural-language describability.

This work formulates these two tasks, the learnability and describability of clusters, as forced-prediction problems and evaluates humans as the predictors, avoiding the subjectivity that is a major problem with existing approaches. (Even coherent clusters sometimes can’t be described, and even when they are describable, different people might use different words and phrases.)

After seeing a few samples of a cluster, a human should be able to discriminate images of that cluster among images of other clusters. This means that clusters are separated in a human-interpretable way. The extent to which a human can do this is the metric for learnability.

Image from the pdf of the current paper.

After seeing the description of a cluster, a human should be able to discriminate images of that cluster among images of other clusters. This means the given cluster is describable. (Description is sampled randomly from a manually populated set of descriptions for that cluster). The extent to which a human can do this is the metric for describability.

Clusters and their descriptions from self-supervised model SeLa. Image from the pdf of the current paper.

The authors also created a model that generates descriptions for a cluster automatically, so that it can replace the manually written descriptions in the describability metric above.

Takeaway: If you have clusters of images with no labels, the extent to which you could discriminate other images as the same class or not, after seeing the images of a particular cluster, is a good metric to see whether your clusters are separated. The same goes for describability.

A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection

There are a lot of outstanding problems in object detection. Prominent among those addressed in this work are:

  • Class imbalance between foreground/background (positive/negative) bounding boxes. Focal loss in RetinaNet helps, but not enough.
  • Difficulty in tuning the hyperparameters of the loss function (Faster R-CNN has 9 of them to tune).
  • The discrepancy created by having separate localization and classification heads, e.g., the classification loss does not depend on the IoU or localization of the object.

And this is how they deal with it:

  • The ranking-based loss function for classification is more stable and learns without overfitting when compared to cross-entropy or a weighted cross-entropy variant such as focal loss.
Comparison of the number of hyperparameters in each loss function. Image from the pdf of the current paper.
  • The proposed loss function has one hyperparameter, and even that parameter doesn’t need tuning. (The results in the paper were obtained without any tuning and still outperformed the baselines.)
aLRP loss (the current paper) is stable and not overfitting compared to Cross-Entropy Loss and Focal Loss. Image from the pdf of the current paper.
  • As the proposed ranking-based loss function is not differentiable, the authors provide equations for the gradients of the loss with respect to the parameters of the localization head and the classification head. Here, the gradient update for the classification head depends on the outputs of both the classification head and the localization head (and vice versa), so an instance with low IoU with the ground truth is penalized even when the classification head confidently predicts the ground truth class. This makes the classification head work well where the localization head works well and vice versa, which helps the model get better at both precision and localization.
Gradients wrt parameter in classification head contain outputs of localization and vice versa. Image from the pdf of the current paper.
aLRP Loss beats other losses for Faster R-CNN. Image from the pdf of the current paper.
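
To illustrate how a ranking-based error can couple classification and localization, here is a simplified, hard-threshold sketch. It is not the paper's exact aLRP formulation or its gradient definition: the function name `ranking_loc_loss`, the threshold `tau`, and the toy inputs are assumptions made for illustration.

```python
def ranking_loc_loss(scores, is_positive, ious, tau=0.5):
    """Toy ranking-based error that mixes classification rank with IoU.

    For every positive box i, look at all boxes scored at or above it:
    negatives count as full errors, positives contribute a localization
    error (1 - IoU) / (1 - tau). The sum is normalized by the rank of i.
    """
    positives = [i for i, p in enumerate(is_positive) if p]
    if not positives:
        return 0.0
    total = 0.0
    for i in positives:
        above = [j for j in range(len(scores)) if scores[j] >= scores[i]]
        rank = len(above)  # includes box i itself
        error = 0.0
        for j in above:
            if is_positive[j]:
                error += (1.0 - ious[j]) / (1.0 - tau)  # localization error
            else:
                error += 1.0                            # false positive
        total += error / rank
    return total / len(positives)

# Same confident class score, different localization quality.
well_localized = ranking_loc_loss([0.9, 0.1], [True, False], [0.9, 0.0])
poorly_localized = ranking_loc_loss([0.9, 0.1], [True, False], [0.6, 0.0])
print(well_localized, poorly_localized)
```

Even though the positive box is ranked first in both cases, the poorly localized one receives a larger loss, which is the coupling between the two heads described above.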

Takeaway: Stable training and fewer hyperparameters to tune are highly desirable in practice. I can remember a lot of scenarios where results were not reproducible. This type of work is especially valuable for a deep learning engineer, and I recommend using it when training your next object detection model.

Disentangling Human Error from the Ground Truth in Segmentation of Medical Images

Labeling in the medical imaging domain is cost-intensive and has large inter-observer variability. A method that combines annotations from different annotators while modeling each annotator across images, so that we can train with only a few annotations per image, is desirable. This is that method.

Given an image with three ground truth masks labeled by three different annotators A1, A2, and A3, this work models the biases of each annotator, predicts three versions of the segmentation mask (one per annotator), and backpropagates the loss between these three predicted masks and the three ground truth masks.

Image from the pdf of the current paper.

Because these annotator-specific segmentation masks are produced by distorting the estimated true label, which is predicted first, with a confusion matrix for each annotator, we take the segmentation mask of the estimated true label as the model’s prediction at inference time.
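
A minimal sketch of the distortion step, assuming one hand-picked confusion matrix per annotator that is shared across all pixels. In the paper these matrices are estimated by a network; the function name `apply_confusion` and all numbers below are illustrative.

```python
def apply_confusion(true_probs, confusion):
    """Distort per-pixel true-label probabilities with one annotator's
    confusion matrix.

    true_probs: list of per-pixel probability vectors over C classes.
    confusion[t][o]: probability that the annotator writes class o when
    the true class is t.
    """
    n_classes = len(confusion)
    return [
        [sum(p[t] * confusion[t][o] for t in range(n_classes))
         for o in range(n_classes)]
        for p in true_probs
    ]

# Estimated true-label probabilities (background, foreground) for two pixels.
true_probs = [[0.9, 0.1], [0.2, 0.8]]
annotators = {
    "A1": [[1.0, 0.0], [0.0, 1.0]],  # perfectly reliable
    "A2": [[1.0, 0.0], [0.3, 0.7]],  # tends to under-segment
    "A3": [[0.8, 0.2], [0.0, 1.0]],  # tends to over-segment
}
predicted = {name: apply_confusion(true_probs, cm)
             for name, cm in annotators.items()}
# Training would compare predicted[name] with that annotator's mask;
# at inference only true_probs is used as the prediction.
print(predicted["A2"])
```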

Takeaway: If your application has high inter-observer variability and you have the bandwidth to get multiple annotations per image, this seems to be the go-to method right now for getting one ground truth out of many.

Variational Amodal Object Completion

Predicting segmentation maps for a complete object when it is occluded is called Amodal Object Completion.

This work presents Amodal-VAE, which encodes the partial mask into a latent vector and predicts a complete mask by decoding that latent vector. It does not require full-object segmentation annotations for training, which makes it attractive because previous works needed annotated complete segmentation masks.

Image from the pdf of the current paper.

To train without complete masks, they carefully train Amodal-VAE in three stages.

  • In stage I, a decoder P(y_complete | z) is pre-trained using only complete masks, thus learning a mapping from the latent vector space to the space of complete masks.
  • In stage II, occluded partial masks are synthetically generated from a complete mask by randomly overlaying other (foreground) objects on it, giving pairs of partial and complete masks. A VAE is then trained with the pre-trained, frozen decoder to learn an encoder P(z | y_partial).
  • Finally, in stage III, the encoder P(z | y_partial) is fine-tuned so that it can encode the more complex occlusions that occur in real-world datasets. The loss is computed only on the visible part of the object, with the decoder expected to predict P(y_vis | z). (The decoder is not trained in this stage, and the encoder trades some of its ability to produce latent vectors that decode to full masks for the ability to encode complex, real-world occlusions.)
Different tasks made possible by Amodal-VAE. Image from the pdf of the current paper.
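
The synthetic occlusion of stage II can be sketched as follows: paste a foreground object's mask at a random offset and keep only the pixels that remain visible. The function names (`occlude`, `make_training_pair`), the offset range, and the tiny masks are assumptions for illustration, not the paper's data pipeline.

```python
import random

def occlude(complete_mask, occluder_mask, dy, dx):
    """Return the visible part of complete_mask after overlaying
    occluder_mask shifted by (dy, dx)."""
    h, w = len(complete_mask), len(complete_mask[0])
    partial = [row[:] for row in complete_mask]  # do not modify the input
    for y, row in enumerate(occluder_mask):
        for x, value in enumerate(row):
            ty, tx = y + dy, x + dx
            if value and 0 <= ty < h and 0 <= tx < w:
                partial[ty][tx] = 0  # pixel hidden by the foreground object
    return partial

def make_training_pair(complete_mask, occluder_mask, rng):
    """Sample a random occluder position and return (partial, complete)."""
    h, w = len(complete_mask), len(complete_mask[0])
    dy = rng.randrange(-(h // 2), h // 2 + 1)
    dx = rng.randrange(-(w // 2), w // 2 + 1)
    return occlude(complete_mask, occluder_mask, dy, dx), complete_mask

complete = [[1] * 4 for _ in range(4)]   # toy 4x4 complete object mask
occluder = [[1, 1], [1, 1]]              # toy 2x2 foreground object
partial_fixed = occlude(complete, occluder, 0, 0)
partial, target = make_training_pair(complete, occluder, random.Random(0))
print(partial_fixed)
```

Each (partial, complete) pair is an input/target example for training the encoder against the frozen decoder.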

Takeaway: Practically, knowing the complete extent of occluded objects would help when tracking multiple people and would reduce the ID swaps that we see even in SOTA tracking models. It should also be interesting if you want to build smart Photoshop-style editing. More importantly, this is the kind of problem where the use cases are limited only by our creativity.

RandAugment: Practical Automated Data Augmentation with a Reduced Search Space

Automated data augmentation needs to find the probability of each transformation and the magnitude to be used for each of these transformations.

With so many possible values for the probability and magnitude of each transformation, the search space becomes intractable. The recent method AutoAugment used RL to find an optimal sequence of transformations and their magnitudes. More recent variants of AutoAugment use more efficient learning algorithms to find the optimal sequence of transformations faster.

Image from the pdf of the current paper.

Nonetheless, the number of training runs needed to find the optimal probabilities and magnitudes is still intractable in practice for large-scale models and datasets. So proxy tasks that are meant to be representative of the target task are set up, with smaller models and less data, among other tweaks. The optimal probabilities and magnitudes are found on the proxy tasks and then used for the target task.

But these proxy tasks are not actually representative of the complete target tasks. This work showed that the “optimal magnitude of augmentation depends on the size of the model and the training set.”

Now, to make this optimal policy search feasible, the current work proposes RandAugment, which is just a grid search over two parameters, with a search space ~30 orders of magnitude smaller. This is, for sure, one of the few simple-but-powerful, back-to-basics kinds of work you can find.

First, RandAugment picks transformations with uniform probability, because the authors observed that optimal policies from AutoAugment make the dataset visually diverse rather than favoring a particular set of transformations (i.e., assigning different probabilities to different transformations).

Second, RandAugment uses the same magnitude for all transformations, because the authors observed that optimal policies from an AutoAugment variant had similar magnitudes for all transformations.

After these adjustments, automated data augmentation becomes a simple hyperparameter tuning task that can be done with a grid search, and the whole algorithm can comfortably be written in 3 lines.

3-line code for RandAugment. Image from the pdf of the current paper.
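
Since the figure is an image, here is a text sketch in the same spirit, using only the Python standard library. The transform names are illustrative placeholders; a real pipeline would map each name to an actual image operation that takes the magnitude as an argument.

```python
import random

# Illustrative transform names; each would map to a real image operation.
TRANSFORMS = ["Identity", "AutoContrast", "Equalize", "Rotate", "Solarize",
              "Color", "Posterize", "Contrast", "Brightness", "Sharpness",
              "ShearX", "ShearY", "TranslateX", "TranslateY"]

def randaugment(n, m, rng=random):
    """Pick n transformations uniformly at random, all sharing magnitude m."""
    sampled_ops = rng.choices(TRANSFORMS, k=n)
    return [(op, m) for op in sampled_ops]

policy = randaugment(n=2, m=9, rng=random.Random(0))
print(policy)
```

The only things left to tune are `n` and `m`, which is why a plain grid search over a handful of values for each is enough.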

Takeaway: Automated data augmentation evolved to a point that it is feasible to use in our ‘everyday’ models. If you have resources to do hyperparameter tuning, tune these two parameters (N and M for number of transformations and their global magnitude) as well and get state-of-the-art results.

Learning Loss for Test-Time Augmentation

Let’s assume you want to test your model on a rotated image, but the images in your training set are never rotated and rotation was not used as data augmentation during training. The best we can do is to undo the rotation at test time so that the image is no longer rotated. With 10 or so commonly used and naturally occurring transformations, this kind of mismatch can happen without you knowing.

So, what is the solution? During training, fit a separate network that predicts the loss the model would incur for each transformation if it were applied to the image.

At test time, use this loss predictor to apply only the transformations with the lowest predicted loss.
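
The selection step can be sketched as below. The predicted losses are passed in as a plain dictionary standing in for the output of the loss prediction network, and the toy `model`, the transform functions, and the choice to average the top-k outputs are all assumptions for illustration.

```python
def select_transforms(predicted_losses, k=2):
    """Return the k transformation names with the lowest predicted loss."""
    ranked = sorted(predicted_losses, key=predicted_losses.get)
    return ranked[:k]

def tta_predict(image, model, transforms, predicted_losses, k=2):
    """Average class probabilities over the k selected transformations."""
    chosen = select_transforms(predicted_losses, k)
    outputs = [model(transforms[name](image)) for name in chosen]
    n_classes = len(outputs[0])
    return [sum(o[c] for o in outputs) / len(outputs)
            for c in range(n_classes)]

# Toy stand-ins: a 1-D "image", three transformations, and a fake classifier.
def identity(img):
    return img

def flip(img):
    return img[::-1]

def negate(img):
    return [-v for v in img]

def model(img):
    return [0.8, 0.2] if img[0] > 0 else [0.3, 0.7]

transforms = {"identity": identity, "flip": flip, "negate": negate}
# Hypothetical output of the loss predictor for this particular image.
predicted_losses = {"identity": 0.9, "flip": 0.2, "negate": 0.4}

chosen = select_transforms(predicted_losses, k=2)
probs = tta_predict([-1.0, 2.0], model, transforms, predicted_losses, k=2)
print(chosen, probs)
```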

Proposed test-time augmentation (b) using a loss prediction model during inference. Image from the pdf of the current paper.

Takeaway: Didn’t train your model with necessary data augmentations? Want the best possible results on the test set? Use the above test-time augmentation.

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.