FCOS: fully convolutional one-stage object detection

Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully Convolutional One-Stage Object Detection,” in ICCV, pp. 9627–9636, 2019.

FCOS

  • FCOS: a fully convolutional one-stage object detector
  • Anchor box free & Proposal free
    • Avoids the complicated computation related to anchor boxes, such as calculating overlap during training.
    • Avoids all hyper-parameters related to anchor boxes, which are very sensitive to the final results.
    • Non-maximum suppression (NMS) is the only post-processing step.
  • Simple and Flexible

Main contributions

  • Unification: Detection is now unified with many other FCN-solvable tasks such as semantic segmentation, making it easier to re-use ideas from those tasks.

    Anchor hyper-parameters vary across detection tasks, so anchor-based detectors do not fit well into the neat fully convolutional networks used in tasks such as semantic segmentation.

  • Simplicity & speed: Proposal free and anchor free. Avoids the complicated computation related to anchor boxes.

  • Extensibility 1: Can also be used as the RPN in a two-stage detector (better performance than the standard RPN).

  • Extensibility 2: Can be extended to solve other vision tasks with minimal modification.

Detector categories by anchor usage

Anchor-based detectors (e.g. Faster R-CNN)

  • Need many hyper-parameters ⇒ detection deviates from the neat fully convolutional architectures used in other dense prediction tasks such as semantic segmentation.

Anchor-free detectors (e.g. YOLO, CornerNet, etc)

  • YOLOv1: predicts bounding boxes near the center of objects.

    ⇒ (-) Low recall. YOLOv2 went back to anchors to fix this.

  • FCOS:

    • Predicts the bounding boxes using all points in a ground truth bounding box.
    • Suppresses the low-quality detected bounding boxes using center-ness branch
  • CornerNet: detects a pair of corners of a bounding box and groups them to form the final detected bounding box.

    ⇒ (-) Requires much more complicated post-processing to group the pairs of corners belonging to the same instance.

  • DenseBox-based anchor-free detectors: unsuitable for generic object detection because of the difficulty of handling overlapping bounding boxes and a relatively low recall.

Key concepts

Per-pixel prediction

  • A location $(x,y)$ on feature map $F_i$ corresponds to $(\lfloor \frac{s}{2}\rfloor+xs, \lfloor \frac{s}{2}\rfloor+ys)$ on the input image, where $s$ is the total stride.

  • For every feature-map location $(x,y)$, the target bounding box is regressed directly.

    This is analogous to the per-pixel prediction done in semantic segmentation.

  • Target for classification:

    \[c^* = \begin{cases} \text{the class label} & \text{if $(x,y)$ falls into any ground-truth}\\ 0 \text{ (background)} & \text{otherwise} \end{cases}\]

    Target for regression:

    \[\begin{align} l^*=x - x_0^{(i)}, & \ \ t^* = y - y_0^{(i)},\\ r^*=x_1^{(i)} -x, & \ \ b^* = y_1^{(i)} - y, \end{align}\label{target_for_regression}\]

    where $(x_0^{(i)}, y_0^{(i)})$ and $(x_1^{(i)}, y_1^{(i)})$ denote the left-top and right-bottom corners of the ground-truth box associated with $(x,y)$.
    • The targets are always positive, so $\exp{(x)}$ is employed to map any real number to $(0,∞)$ on top of the regression branch.
  • Advantages:

    • FCOS can leverage as many foreground samples as possible to train the regressor.
  • Train:

    • Loss function:

      \[L(\{\boldsymbol{p}_{x,y}\}, \{\boldsymbol{t}_{x,y}\}) = \frac{1}{N_\text{pos}}\sum_{x,y}{L_\text{cls}(\boldsymbol{p}_{x,y}, c^*_{x,y})} + \frac{\lambda}{N_\text{pos}}\sum_{x,y}{\boldsymbol{1}_{\{c^*_{x,y}>0\}} L_\text{reg}(\boldsymbol{t}_{x,y}, \boldsymbol{t}^*_{x,y}) },\]

      where $L_\text{cls}$ is the focal loss, $L_\text{reg}$ is the IoU loss, and $\lambda$ (set to $1$ in the paper) balances the two terms.

  • Inference:

    For an input image, 1) forward it through the network and 2) obtain the classification score $\boldsymbol{p}_{x,y}$ and the regression prediction $\boldsymbol{t}_{x,y}$ for each location on the feature map $F_i$.

    Locations with $p_{x,y}\geq 0.05$ are chosen as positive samples, and $\text{Eq}.~(\ref{target_for_regression})$ is inverted to obtain the predicted bounding boxes (a sketch of this encoding/decoding follows this list).
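
To make the per-pixel targets and the inference-time inversion concrete, here is a minimal NumPy sketch. The helper names (feature_locations, encode_targets, decode_boxes) and the toy ground-truth box are made up for illustration; they are not the paper's or mmdetection's API.

import numpy as np

def feature_locations(h, w, stride):
    # Map every feature-map cell (x, y) back to image coordinates
    # (floor(s/2) + x*s, floor(s/2) + y*s).
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([stride // 2 + xs * stride,
                     stride // 2 + ys * stride], axis=-1).reshape(-1, 2)

def encode_targets(locations, gt_box):
    # Regression targets (l*, t*, r*, b*) of each location w.r.t. one GT box.
    x, y = locations[:, 0], locations[:, 1]
    x0, y0, x1, y1 = gt_box
    targets = np.stack([x - x0, y - y0, x1 - x, y1 - y], axis=-1)
    # A location is a positive sample only if it falls inside the GT box,
    # i.e. all four distances are positive.
    is_pos = targets.min(axis=-1) > 0
    return targets, is_pos

def decode_boxes(locations, ltrb):
    # Invert the regression targets to recover predicted boxes at inference.
    x, y = locations[:, 0], locations[:, 1]
    l, t, r, b = ltrb[:, 0], ltrb[:, 1], ltrb[:, 2], ltrb[:, 3]
    return np.stack([x - l, y - t, x + r, y + b], axis=-1)

# Toy example: the stride-8 level (P3) of a 64x64 image and one GT box.
locs = feature_locations(8, 8, stride=8)
targets, is_pos = encode_targets(locs, gt_box=(8, 8, 40, 48))
boxes = decode_boxes(locs[is_pos], targets[is_pos])  # reproduces the GT box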

Multi-level prediction

  • Why:

    1. The large stride of the final feature map leads to a relatively low best possible recall (BPR).

      Best possible recall (BPR): the upper bound of the recall a detector can achieve; precision is not considered.

      Recall: the fraction of actual positives that are correctly detected. Predicting everything as positive increases both FP and TP, so recall can go up.

      • Low BPR $\approx$ low recall: if most locations are predicted as background, recall can drop.

      • Small objects are not represented on feature maps with a large stride, which is a cause of low recall.

    2. Overlaps between ground-truth boxes cause intractable ambiguity, i.e. which bounding box should a location in the overlap region regress?

  • How:

    • As in FPN, five levels of feature maps $\{P_3, P_4, P_5, P_6, P_7\}$ are used.

      • $P_3$, $P_4$ and $P_5$: obtained from the backbone feature maps $C_3$, $C_4$ and $C_5$ through a $1\times 1$ convolutional layer followed by top-down connections.

      • $P_6$ and $P_7$: obtained by applying a stride-2 convolutional layer on $P_5$ and $P_6$, respectively.

      ⇒ As a result, $P_3$–$P_7$ have strides 8, 16, 32, 64, and 128, respectively.

    • The range of bounding-box regression is limited for each level. For the $i$-th feature level, a location is kept as a positive sample only if

      \[m_{i-1}\leq \max{(l^*, t^*, r^*, b^*)}\leq m_i,\]

      where $m_2, …, m_7 = 0, 64, 128, 256, 512, ∞$. Locations whose targets fall outside this range are treated as negative samples.

      ⇒ As a result, most overlapping objects are assigned to feature levels of different sizes.

      • If ground-truth boxes still overlap on the same feature level, the box with the smaller area is chosen as the regression target.
    • As in FPN, the heads are shared across feature levels, which keeps the model parameter-efficient and also improves detection performance.

      • However, different feature levels regress different size ranges, e.g. $[0, 64]$ for $P_3$ and $[64, 128]$ for $P_4$. Therefore, $\exp{(s_ix)}$ is used instead of $\exp{(x)}$, where $s_i$ is a trainable scalar that automatically adjusts the base of the exponential for each feature level $P_i$ (a slight performance gain). A sketch of the level assignment and the per-level scale follows this list.
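
A rough NumPy sketch of how the size of a location's regression targets determines its FPN level, and of the per-level learnable scale $s_i$; assign_level and Scale are hypothetical helper names, not the actual implementation.

import numpy as np

# m_2, ..., m_7 = 0, 64, 128, 256, 512, inf  ->  regression ranges for P3..P7
LEVEL_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, np.inf)]

def assign_level(ltrb_targets):
    # Return the FPN level index (0 -> P3, ..., 4 -> P7) responsible for each
    # location, or -1 if the location is a negative sample on every level.
    m = ltrb_targets.max(axis=-1)          # max(l*, t*, r*, b*)
    level = np.full(m.shape, -1, dtype=int)
    for i, (lo, hi) in enumerate(LEVEL_RANGES):
        level[(m >= lo) & (m <= hi)] = i   # m_{i-1} <= max(...) <= m_i
    return level

class Scale:
    # Per-level scalar s_i; the regression output becomes exp(s_i * x).
    def __init__(self, init=1.0):
        self.s = init                      # a learnable parameter in the real model
    def __call__(self, x):
        return np.exp(self.s * x)

targets = np.array([[10., 20., 30., 25.],     # max = 30  -> P3
                    [100., 90., 60., 110.]])  # max = 110 -> P4
print(assign_level(targets))                  # [0 1]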

Center-ness

  • Why:

    • Many low-quality predicted bounding boxes are produced by locations far away from the center of an object.
  • How:

    • To suppress these low-quality detected boxes without introducing any hyper-parameters, a center-ness branch is added:

      • Center-ness: the normalized distance from a location to the center of the object it is responsible for; it takes values between $0$ and $1$.

        \[\text{centerness}^*=\sqrt{ \frac{\min{(l^*,r^*)}}{\max{(l^*,r^*)}} \times \frac{\min{(t^*,b^*)}}{\max{(t^*,b^*)}} }.\]
        The square root slows down the decay of center-ness.
    • Train: learned with the binary cross-entropy (BCE) loss.

    • Test: the predicted center-ness is multiplied by the corresponding classification score to compute the final score. ⇒ This down-weights the classification scores of locations judged to be far from an object center, so these low-scoring predictions are filtered out by the final NMS, reducing the number of low-quality boxes. A sketch of the center-ness target and this re-scoring follows.
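
A minimal NumPy sketch of the center-ness target and the test-time re-scoring; centerness_target is an illustrative helper, and at test time the real model multiplies the predicted center-ness from the branch, for which the ground-truth formula stands in here.

import numpy as np

def centerness_target(ltrb):
    # centerness* = sqrt( min(l,r)/max(l,r) * min(t,b)/max(t,b) ), in [0, 1]
    l, t, r, b = ltrb[..., 0], ltrb[..., 1], ltrb[..., 2], ltrb[..., 3]
    lr = np.minimum(l, r) / np.maximum(l, r)
    tb = np.minimum(t, b) / np.maximum(t, b)
    return np.sqrt(lr * tb)

# Final test-time score = classification score * center-ness, so boxes
# predicted far from object centers get low scores and are removed by NMS.
cls_score = np.array([0.80, 0.75])
ctr = centerness_target(np.array([[16., 16., 16., 16.],    # perfectly centered
                                  [ 2., 30., 30.,  2.]]))  # near a corner
final_score = cls_score * ctr
print(final_score)  # the off-center prediction ends up with a much lower score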

Other details

  • Instead of one multi-class classifier, $C$ binary classifiers are used (see the sketch below).
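
A tiny sketch of what this means in practice, assuming the RetinaNet-style sigmoid formulation that FCOS follows; the names are illustrative.

import numpy as np

def class_scores(logits):
    # logits: (num_locations, C) -> independent per-class probabilities.
    # Each class gets its own binary (sigmoid) classifier; no softmax across classes.
    return 1.0 / (1.0 + np.exp(-logits))

probs = class_scores(np.array([[2.0, -1.0, 0.5]]))  # C = 3 toy classes
print(probs)  # every entry is in [0, 1]; rows do not sum to 1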

Experiments

  1. Multi-level prediction
    • BPR with vs. without FPN ⇒ FPN improves BPR
    • Ratio of ambiguous samples with vs. without FPN ⇒ FPN reduces the number of ambiguous samples
  2. Center-ness
    • With vs. without center-ness, and with vs. without a separate center-ness branch ⇒ performance improves when the center-ness branch is used
  3. Anchor-based vs. Anchor-free
    • Under the same settings, FCOS beats the anchor-based detector (RetinaNet).
  4. Comparison with other SOTA
    • With the same backbone as other SOTA detectors, FCOS still shows improved performance

Algorithm

class FCOS:
    def forward_train(self, img, img_metas, gt_bboxes, gt_labels, gt_bboxes_ignore=None):
        x = self.extract_feat(img)  # backbone + FPN -> (P3, ..., P7)
        losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes, gt_labels, gt_bboxes_ignore)
        return losses

class AnchorFreeHead:
    def forward_single(self, x):
        # two parallel towers on top of each FPN feature map
        cls_feat, reg_feat = x, x

        for cls_layer in self.cls_convs:  # [Conv2d, GN, ReLU] * 4
            cls_feat = cls_layer(cls_feat)
        cls_score = self.conv_cls(cls_feat)  # Conv2d(256, 80, (3, 3))

        for reg_layer in self.reg_convs:  # [Conv2d, GN, ReLU] * 4
            reg_feat = reg_layer(reg_feat)
        bbox_pred = self.conv_reg(reg_feat)  # Conv2d(256, 4, (3, 3))

        return cls_score, bbox_pred, cls_feat, reg_feat

class FCOSHead(AnchorFreeHead):
    def forward_single(self, x, scale, stride):
        cls_score, bbox_pred, cls_feat, reg_feat = super().forward_single(x)

        centerness = self.conv_centerness(cls_feat)  # Conv2d(256, 1, (3, 3))

        bbox_pred = scale(bbox_pred).float()  # per-level trainable scalar s_i
        bbox_pred = bbox_pred.exp()           # exp(s_i * x): map to (0, inf)

        return cls_score, bbox_pred, centerness

    def forward_train(self, x, img_metas, gt_bboxes, gt_labels, ...):
        outs = self(x)  # runs forward_single on every FPN level
        loss_inputs = outs + (gt_bboxes, gt_labels, img_metas)
        losses = self.loss(*loss_inputs, ...)
        return losses

    def loss(self, cls_scores, bbox_preds, centernesses, gt_bboxes, gt_labels, img_metas, ...):
        # ...
        # self.loss_cls = focal_loss
        loss_cls = self.loss_cls(flatten_cls_scores, flatten_labels, avg_factor=num_pos)

        # ...
        # self.loss_bbox = iou_loss, weighted by the center-ness targets
        loss_bbox = self.loss_bbox(
            pos_decoded_bbox_preds, pos_decoded_target_preds,
            weight=pos_centerness_targets, avg_factor=centerness_denorm
        )
        # self.loss_centerness = binary cross_entropy_loss
        loss_centerness = self.loss_centerness(
            pos_centerness, pos_centerness_targets, avg_factor=num_pos
        )
        return dict(loss_cls=loss_cls, loss_bbox=loss_bbox, loss_centerness=loss_centerness)