Deep-Learning Based Object Detection in Crowded Scenes

Object detection in crowded scenes is challenging. When objects gather, they tend to overlap heavily with each other, leading to occlusions. Occlusion caused by objects of the same class is called intra-class occlusion, also referred to as crowd occlusion. Object detectors need to determine the locations of different objects in the crowd and accurately delineate their boundaries. Many cases are quite challenging even for human annotators.

In autonomous driving, there are several scenarios where we have to deal with object detection in crowded scenes: vehicle detection in parking lots or on city streets, and pedestrian detection at intersections.

Example crowded scenes for pedestrian and vehicle detection (source: CrowdDet and VG-NMS)

The challenges of crowd scenes

Crowd occlusion is challenging for object detection for several reasons.

  • When objects overlap heavily with each other, the semantic features of different instances interweave, making it difficult for detectors to discriminate instance boundaries.
  • Even if detectors successfully differentiate and detect instances, the detections may still be suppressed by non-maximum suppression (NMS). Vanilla greedy NMS (and even its improved versions, soft-NMS and matrix-NMS) follows one implicit hypothesis: detection boxes with high overlap correspond to the same object and thus need to be grouped and reduced to one box. This assumption works reasonably well in general, but it no longer holds in a crowded scene, where objects heavily occlude and overlap with each other by definition.
  • Many datasets containing or specializing in crowd detection (CrowdHuman, CityPersons, WiderPerson, etc.) and real-world applications (autonomous driving) require amodal object detection. That means the detector predicts a box covering the entire object, even if it is not fully visible in the image, for example, due to partial occlusion by other objects. This is actually how humans perceive the environment. It further complicates object detection, as the overlap between amodal bounding boxes is usually much higher than that between visible bounding boxes.

Visible bounding box vs Amodal bounding box (source: VG-NMS)

The Dilemma of NMS

The most critical challenge of crowd detection is perhaps NMS. As we see below, almost all existing work in crowd object detection works around or directly on NMS.

Although most parts of modern object detectors are end-to-end trainable, NMS remains one of the last hand-crafted components. NMS greedily selects the bounding box with the highest score and suppresses those that have a high overlap with it. The overlap is measured by comparing the Intersection-over-Union (IoU) to a predefined threshold, usually ranging from 0.3 to 0.5. NMS is quite sensitive to this threshold: a higher threshold means less suppression power and may bring more false positives (FP), while a lower threshold means more aggressive suppression and may lead to missed detections.
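
To make the threshold sensitivity concrete, here is a minimal pure-Python sketch of greedy NMS. The box format (x1, y1, x2, y2), the function names, and the toy boxes are assumptions made for illustration and are not taken from any particular detector.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Toy scene: boxes 0 and 1 are two different, heavily overlapping people
# (IoU is roughly 0.54); box 2 is an isolated person.
boxes = [(0, 0, 10, 20), (3, 0, 13, 20), (40, 0, 50, 20)]
scores = [0.9, 0.8, 0.7]

print(greedy_nms(boxes, scores, iou_threshold=0.5))  # [0, 2]: person 1 is lost
print(greedy_nms(boxes, scores, iou_threshold=0.6))  # [0, 1, 2]
```

In this toy scene, moving the threshold from 0.5 to 0.6 is the difference between missing a person and keeping all three, which is the dilemma in miniature.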

The blue box shows the missing object, while the red ones highlight false positives (source: Adaptive NMS)

Modified loss for tighter boxes

To reduce the sensitivity of detection results to the NMS threshold, some studies propose new losses that encourage tighter predictions. These losses add penalties that produce more compact bounding boxes, so the results become less sensitive to NMS. Implicitly, they penalize bounding boxes that land in the middle between two pedestrians, addressing one of the issues of crowd object detection.

RepLoss (Repulsion Loss: Detecting Pedestrians in a Crowd, CVPR 2018) proposes a novel bbox regression loss specifically designed for crowd scenes. This loss not only pushes each proposal toward its designated target but also keeps it away from other surrounding objects.

The RepGT loss penalizes overlap with non-target ground truth (GT) objects. The RepBox loss encourages the IoU between two predicted boxes with different designated targets to be small. This means that predicted boxes with different regression targets are less likely to be merged into one after NMS.
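
A rough sketch of the RepGT idea follows. It assumes the formulation as I understand it from the RepLoss paper: the overlap is measured as intersection over the ground truth area (IoG), and the penalty is a log term that becomes linear above a threshold sigma. The function names, the raw (x1, y1, x2, y2) box format, and the toy boxes are illustrative assumptions.

```python
import math


def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(a, b):
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def iog(pred, gt):
    """Intersection over the ground truth area, bounded in [0, 1]."""
    return intersection(pred, gt) / area(gt) if area(gt) > 0 else 0.0


def smooth_ln(x, sigma=0.5):
    """Log penalty that turns linear above sigma (assumes 0 <= sigma < 1)."""
    if x <= sigma:
        return -math.log(1.0 - x)
    return (x - sigma) / (1.0 - sigma) - math.log(1.0 - sigma)


def rep_gt_loss(preds, target_ids, gts, sigma=0.5):
    """Average penalty for overlapping the most-overlapped non-target GT."""
    total = 0.0
    for pred, target in zip(preds, target_ids):
        others = [g for j, g in enumerate(gts) if j != target]
        if not others:
            continue
        # Repel from the non-target GT that the prediction overlaps the most.
        repulsion_gt = max(others, key=lambda g: iou(pred, g))
        total += smooth_ln(iog(pred, repulsion_gt), sigma)
    return total / max(1, len(preds))


gts = [(0, 0, 10, 20), (8, 0, 18, 20)]   # two neighboring pedestrians
tight = (0, 0, 10, 20)                   # prediction sitting on its target, GT 0
drifted = (4, 0, 14, 20)                 # prediction drifting toward GT 1

print(rep_gt_loss([tight], [0], gts))
print(rep_gt_loss([drifted], [0], gts))  # larger: the box leans on its neighbor
```

The drifted box is penalized more than the tight one even though both are assigned to the same target, which is the repulsion effect the loss is after.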

AggLoss (Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd, ECCV 2018) proposes a new loss term that forces proposals to be located compactly around the designated ground truth object. Concretely, it enforces a smooth L1 (SL1) loss between the average prediction of the anchors and the corresponding GT.

Comparison of RepLoss and AggLoss (Diagram made by the author of this blog post)

Note: I am actually a little bit puzzled by the formulation of AggLoss in the original paper. To me, it would be more reasonable to propose a consistency loss penalizing different predictions from different anchors matched to the same ground truth, as shown in the diagram above.
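
The concern in the note above can be seen in a small sketch. The first function averages the predictions matched to a ground truth box and applies a smooth L1 loss to that average, following the description of AggLoss given here. The second is one possible reading of the consistency loss suggested in the note, penalizing each prediction's deviation from the group average. Both work on raw box coordinates for simplicity, which is an assumption; function names and numbers are hypothetical.

```python
def smooth_l1(x):
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5


def mean_box(preds):
    return [sum(p[k] for p in preds) / len(preds) for k in range(4)]


def compactness_loss(preds, gt):
    """Smooth L1 between the average of the matched predictions and the GT."""
    return sum(smooth_l1(m - g) for m, g in zip(mean_box(preds), gt))


def consistency_loss(preds):
    """Average smooth L1 deviation of each prediction from the group mean."""
    center = mean_box(preds)
    per_pred = [sum(smooth_l1(p[k] - center[k]) for k in range(4)) for p in preds]
    return sum(per_pred) / len(preds)


gt = (0, 0, 10, 20)
# Two predictions matched to the same GT, shifted in opposite directions.
preds = [(-2, 0, 8, 20), (2, 0, 12, 20)]

print(compactness_loss(preds, gt))  # 0.0: the errors cancel in the average
print(consistency_loss(preds))      # 3.0: the disagreement is still penalized
```

Neither prediction is tight, yet the averaged term is zero because the two errors cancel, while a term that looks at each prediction individually still reports the spread.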

Both RepLoss and AggLoss encourage tighter bboxes by modifying the loss function. However, sometimes even tighter detection results will not help in a highly crowded scenario, where NMS sets an upper limit on the performance of an object detector. For example, in the CrowdHuman dataset, nearly 10% of the ground truth instances will be missed if standard NMS is applied with an IoU threshold of 0.5 (source). In other words, even a perfect detector (100% recall and precision with perfectly tight bounding boxes) will still fail to detect all the instances after NMS.

Occlusion-aware NMS

To achieve better performance of object detection in a crowded scene, the bottleneck of NMS needs to be addressed in a more principled way. Many papers strive to redesign NMS to handle occlusion cases more appropriately while not degrading performance for normal scenarios.

Adaptive NMS (Refining Pedestrian Detection in a Crowd, CVPR 2019 oral) notes that the dilemma of NMS is caused by the forced selection of a single threshold. The adaptive NMS proposed by the paper applies a dynamic suppression strategy where the threshold rises as instances gather and occlude each other and decays when instances appear separately. It predicts the object density score (or crowdedness) online with a separate subnet and uses it as an adaptive threshold for NMS. For objects in a high object density area, it uses the dynamic threshold max(fixed_threshold, crowdedness) to perform NMS. This adaptively raises the threshold in crowded regions with a high crowdedness score. That said, crowdedness estimation is a challenging task, and there are often inconsistencies between the ground truth density and the IoU of the predicted bounding boxes.
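
The dynamic threshold rule can be sketched as a small change to greedy NMS. This is only the hard-suppression variant as described in the paragraph above; the density values would come from the density subnet and are made up here, and all names are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def adaptive_nms(boxes, scores, densities, fixed_threshold=0.5):
    """Greedy NMS whose threshold follows the density of the kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Crowded regions (high density) get a more lenient threshold.
        threshold = max(fixed_threshold, densities[best])
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep


boxes = [(0, 0, 10, 20), (3, 0, 13, 20), (40, 0, 50, 20)]
scores = [0.9, 0.8, 0.7]

# Hypothetical density predictions: the first two people stand in a crowd.
crowded = [0.6, 0.6, 0.0]
sparse = [0.0, 0.0, 0.0]

print(adaptive_nms(boxes, scores, crowded))  # [0, 1, 2]
print(adaptive_nms(boxes, scores, sparse))   # [0, 2], same as plain greedy NMS
```

When every density is below the fixed threshold, the procedure reduces to plain greedy NMS, so behavior in sparse scenes is unchanged.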

A comparison between greedy NMS, soft NMS and adaptive NMS (source: Adaptive NMS)

Double Anchor (Double Anchor R-CNN for Human Detection in a Crowd, arXiv 2019) is developed to capture body and head parts in pairs. This paper specifically addresses human detection, and the intuition behind it is simple: compared with the human body, the head usually has a smaller scale, less overlap, and a better view in real-world images, and thus is more robust to pose variations and crowd occlusions. The network is based on the Faster R-CNN framework and predicts a head box and a body box, each with a confidence score. Then a joint NMS method uses a weighted score from both the head bbox score and the body bbox score, and boxes with a lower score will be suppressed if either the body overlap or the head overlap exceeds a certain threshold.

Note: In my opinion, it may be a better idea to use head overlap only to perform the NMS, given the intuition of the paper that head parts are less prone to occlusion. Unfortunately the Double Anchor paper did not provide ablation studies on this.

The architecture of Double Anchor (source: Double Anchor)

The intuition of Double Anchor is great but the notion of a head box and a body box is only limited to the context of pedestrian detection. Almost by definition, visible parts of an object suffer much less from occlusion. Can we make the method of Double Anchor more general by redefining the body box and head box as the amodal full box (enclosing the occluded extent) and the visible box?

R2-NMS (CVPR 2020) and VG-NMS (NeurIPS 2019 workshop) did exactly that. These two roughly contemporary studies both predict the full bbox and the visible region, and use the visible region for NMS. R2-NMS focuses on crowded pedestrian detection and uses a Faster-RCNN-like two-stage object detection framework, while VG-NMS focuses more on crowded vehicle detection in parking lots or urban scenes and uses an SSD-like single-stage object detection framework.
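
The shared idea of these two methods can be sketched as greedy NMS that measures overlap on the visible boxes while reporting the full boxes. This is a simplified illustration and not the exact procedure of either paper; the function names and the toy boxes are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def visible_guided_nms(full_boxes, visible_boxes, scores, iou_threshold=0.5):
    """Suppress by visible-box IoU; return indices into the full boxes."""
    assert len(full_boxes) == len(visible_boxes) == len(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [
            i for i in order
            if iou(visible_boxes[best], visible_boxes[i]) <= iou_threshold
        ]
    return keep


# Detection 0: person in front, fully visible.
# Detection 1: person behind, only a strip on the right is visible.
# Detection 2: a duplicate of detection 0.
full_boxes = [(0, 0, 10, 20), (3, 0, 13, 20), (0.5, 0, 10.5, 20)]
visible_boxes = [(0, 0, 10, 20), (10, 0, 13, 20), (0.5, 0, 10.5, 20)]
scores = [0.9, 0.8, 0.6]

kept = visible_guided_nms(full_boxes, visible_boxes, scores)
print([full_boxes[i] for i in kept])
```

The occluded person survives because the two visible regions do not overlap, while the true duplicate is still removed because its visible region coincides with that of the front person.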

Schematic diagram of R2-NMS and VG-NMS (source: R2-NMS and VG-NMS)

CrowdDet (CVPR 2020 oral) predicts multiple detections per anchor for crowd detection. The boxes predicted from the same anchor are expected to infer the same set of instances, rather than distinguishing individual instances as in the single-prediction paradigm of most object detectors. A modified set NMS largely follows the normal NMS procedure but skips suppression for predictions coming from the same anchor.
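
A minimal sketch of the set NMS rule described above: it is greedy NMS with one extra check that exempts boxes predicted from the same anchor (or proposal) as the kept box. The `anchor_ids` argument and the toy data are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def set_nms(boxes, scores, anchor_ids, iou_threshold=0.5):
    """Greedy NMS that never lets a box suppress its same-anchor siblings."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [
            i for i in order
            if anchor_ids[i] == anchor_ids[best]
            or iou(boxes[best], boxes[i]) <= iou_threshold
        ]
    return keep


# Boxes 0 and 1 come from the same anchor and describe two different people.
# Box 2 comes from another anchor and duplicates box 0.
boxes = [(0, 0, 10, 20), (3, 0, 13, 20), (0.5, 0, 10.5, 20)]
scores = [0.9, 0.8, 0.6]
anchor_ids = [7, 7, 8]

print(set_nms(boxes, scores, anchor_ids))  # [0, 1]
```

With distinct anchor ids for every box, the function behaves exactly like plain greedy NMS and would keep only box 0 in this example.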

As each anchor now predicts a set of object instances without any particular order, the loss needs to be modified to measure the distance between two sets. The EMD (earth mover’s distance) loss evaluates all permutations of matching between predictions and ground truths and selects the one with the smallest loss. It also adds dummy boxes whose class label is regarded as background and masks out their regression loss. These ideas actually closely resemble many of those in the paradigm-shifting DETR paper, which I will later write a summary about.
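
The permutation matching can be sketched as follows. The per-pair loss here (a log loss on the score plus an L1 distance on raw coordinates) is a stand-in chosen for brevity and is not the loss used in the CrowdDet paper; the names are hypothetical, and the sketch assumes there are at most as many ground truth boxes as predictions.

```python
import math
from itertools import permutations

BACKGROUND = None  # dummy target used to pad the ground truth set


def pair_loss(pred, target, eps=1e-7):
    """Stand-in loss for matching one (box, score) prediction to one target."""
    box, score = pred
    if target is BACKGROUND:
        # Dummy box: classification term only, regression is masked out.
        return -math.log(max(eps, 1.0 - score))
    classification = -math.log(max(eps, score))
    regression = sum(abs(p - t) for p, t in zip(box, target))
    return classification + regression


def emd_loss(preds, targets):
    """Smallest total loss over all one-to-one matchings of preds to targets."""
    padded = list(targets) + [BACKGROUND] * (len(preds) - len(targets))
    return min(
        sum(pair_loss(p, t) for p, t in zip(preds, perm))
        for perm in permutations(padded)
    )


# One anchor predicting a set of two (box, score) pairs.
preds = [((0, 0, 10, 20), 0.9), ((3, 0, 13, 20), 0.8)]
gts = [(3, 0, 13, 20), (0, 0, 10, 20)]

print(emd_loss(preds, gts))        # the order of the ground truths is irrelevant
print(emd_loss(preds, gts[::-1]))  # same value
print(emd_loss(preds, gts[:1]))    # one real target, one background dummy
```

Enumerating permutations is only practical for very small sets, such as the two predictions per anchor in this sketch.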

Single prediction vs set prediction paradigms (source: CrowdDet)

Note: The above contrived pathological scenario illustrates the fundamental limitation of the single-prediction paradigm. In heavily crowded scenes, it is intrinsically difficult to predict a single instance from a single anchor, as the proposals share very similar features. Moreover, after vanilla NMS, it is very likely that only one prediction survives.

Although the above contrived case is very unlikely to happen in autonomous driving and many other real-world applications, it is indeed a corner case that modern object detectors are not designed to handle, be they one-stage or two-stage, anchor-based or anchor-free. Generally speaking, object detection tells us where an object is and how big it is. However, this case has both center location collision (which cannot be handled via a center heatmap) and size collision (which cannot be handled via multi-scale feature maps).

Alternatives to NMS

Now that we know NMS is a necessary evil of object detectors, why not get rid of it? There has indeed been a recent wave of anchor-free and NMS-free object detectors, and among the most representative works are CenterNet (arXiv 2019) for general object detection and CSP (CVPR 2019), dedicated to pedestrian detection.

Although anchor-free methods can eliminate IoU-based NMS in the traditional sense, a local maximum still has to be selected in the predicted center heatmap via 2D max-pooling. In essence, this can also be seen as NMS, but instead of depending on IoU, it is based on center distance.
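
To see why this is still a form of NMS, here is a pure-Python sketch of peak selection on a center heatmap. A cell is kept only if it equals the maximum of its local window, which is what comparing a heatmap with its max-pooled version amounts to. The 3x3 window, the score threshold, and the toy heatmap are assumptions for illustration.

```python
def heatmap_peaks(heatmap, score_threshold=0.3, kernel=3):
    """Return (row, col, score) for cells that are local maxima."""
    height, width = len(heatmap), len(heatmap[0])
    radius = kernel // 2
    peaks = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < score_threshold:
                continue
            # Maximum over the kernel-sized window, clipped at the borders.
            window_max = max(
                heatmap[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            if value == window_max:
                peaks.append((y, x, value))
    return peaks


# Two object centers in adjacent cells (0.9 and 0.8) and one isolated center.
heatmap = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.8, 0.1, 0.6],
    [0.1, 0.3, 0.2, 0.1, 0.1],
    [0.0, 0.1, 0.1, 0.0, 0.0],
]

print(heatmap_peaks(heatmap))  # [(1, 1, 0.9), (1, 4, 0.6)]
```

The 0.8 response next to the 0.9 peak is discarded even though it may belong to a second object, so objects whose centers fall into neighboring cells can still suppress each other.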

The extremely simple pipeline for pedestrian detection of the anchor-free CSP (source: CSP)

Confluence provides an alternative to NMS and seems to work very well for crowded scenes. Confluence embraces the tendency of dense object detectors to return numerous bounding boxes around the ground truth (more on this in a future blog post) and uses the heavy cluster of bounding boxes as an indicator of the presence of an object. Both greedy NMS and Confluence first identify representative boxes and then suppress the neighboring ones according to a certain threshold.

  • Retention: Greedy NMS sorts candidate boxes by classification confidence score, while Confluence sorts them by confluence score. The confluence score is based on Manhattan distance and characterizes how closely one box agrees (is confluent) with its neighbors.
  • Removal of duplicates: Greedy NMS uses IoU to remove duplicate boxes with respect to the retained box. Confluence uses normalized Manhattan distance as the measure.
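
Since the paper leaves some details open, the following is only one plausible way to compute a normalized Manhattan distance between two boxes, not the authors' implementation. It assumes that the corner coordinates of the two boxes are min-max scaled together before the corner-to-corner distances are summed, which makes the measure independent of object size. All names are hypothetical.

```python
def normalized_manhattan(a, b):
    """Sum of corner distances after jointly min-max scaling both boxes.

    Boxes are (x1, y1, x2, y2). The joint scaling is an assumption made
    for this sketch.
    """
    xs = (a[0], a[2], b[0], b[2])
    ys = (a[1], a[3], b[1], b[3])
    x_span = (max(xs) - min(xs)) or 1.0  # avoid division by zero
    y_span = (max(ys) - min(ys)) or 1.0
    return (
        abs(a[0] - b[0]) / x_span
        + abs(a[2] - b[2]) / x_span
        + abs(a[1] - b[1]) / y_span
        + abs(a[3] - b[3]) / y_span
    )


box = (0, 0, 10, 20)
near_duplicate = (1, 0, 11, 20)
far_away = (40, 0, 50, 20)

print(normalized_manhattan(box, near_duplicate))  # small: likely the same object
print(normalized_manhattan(box, far_away))        # large: a different object

# Scaling both boxes by the same factor leaves the distance unchanged.
big_box = tuple(10 * v for v in box)
big_near_duplicate = tuple(10 * v for v in near_duplicate)
print(normalized_manhattan(big_box, big_near_duplicate))
```

A small distance to many neighbors would mark a box as confluent, and boxes within some distance of the retained box would be treated as its duplicates.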
Confluence finds clusters of bounding boxes and finds the most confluent box as the final prediction

NMS’s reliance on the maximum confidence score causes it to return suboptimal bounding boxes (the best-localized box may not have the highest classification score), and its reliance on a fixed threshold for duplicate removal often leads to suppression of true positives, especially in crowded scenes.

Note: The paper is highly interesting but unfortunately not very clearly written. We will have to wait until the authors open-source their code to see more implementation details.

Densely packed scenes without occlusion

In all the above studies, we assume crowded scenes with occlusions. A related but slightly different field is object detection in densely packed scenes, such as in a shelf display. In such retail scenes, many objects appear similar or identical and are often positioned in close proximity, but without too much occlusion. General object detectors would also fail miserably here. SKU110K (CVPR 2019) proposed an EM-merger algorithm to replace NMS to filter, merge and split overlapping detection clusters to resolve a single detection per object.

Comparison between RetinaNet and the EM-merger algorithm (source: SKU110K)

Perhaps this is hindsight, but it appears to me that object detection in crowded scenes without occlusion can be quite readily solved by a keypoint-based anchor-free method such as CenterNet. Similar to crowd occlusion cases, densely packed scenes also require tighter bounding boxes; otherwise NMS will wrongly suppress true positives and spare false positives. So methods that encourage a tighter prediction box, such as RepLoss and AggLoss, should also help.

Takeaways


  • NMS is an important building block of modern object detectors. NMS rests on the fundamental assumption that, among highly overlapping predictions, all but one need to be suppressed. However, crowded scenes by definition challenge this assumption.
  • Almost all existing work in crowd object detection works around or directly on NMS. RepLoss and AggLoss encourage tighter bounding box prediction, which alleviates the sensitivity to the NMS threshold.
  • Adaptive NMS dynamically predicts the NMS threshold to be used in inference. Joint NMS (Double Anchor), R2-NMS, and VG-NMS predict two boxes per instance and use the box less prone to occlusion during NMS. Set NMS (CrowdDet) fundamentally addresses corner cases with location and scale collision that most modern object detectors are not designed to handle. The idea of predicting a set of bounding boxes is similar to that of DETR and could be the future of object detection beyond the current paradigm of dense prediction.
  • In practice, the method used by R2-NMS and VG-NMS seems to be the most practical way to handle occlusion and should cover most crowd occlusion cases in reality. Visualization of detection results before NMS also appears to be a powerful debugging tool.


This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.
