FCOS: fully convolutional one-stage object detection
Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully Convolutional One-Stage Object Detection,” in ICCV, pp. 9627-9636, 2019.
FCOS
- FCOS: a fully convolutional one-stage object detector
- Anchor box free & Proposal free
- Avoids the complicated computation related to anchor boxes, such as computing overlaps during training.
- Avoids all hyper-parameters related to anchor boxes, to which the final results are very sensitive.
- The only post-processing step is non-maximum suppression (NMS).
- Simple and Flexible
Main contributions
- Unification: detection is now unified with many other FCN-solvable tasks such as semantic segmentation, making it easier to re-use ideas from those tasks.
- Anchor hyper-parameters differ across detection tasks, so anchor-based detectors do not fit the neat fully convolutional networks used in tasks such as semantic segmentation.
- Simplicity & speed: proposal free and anchor free; avoids the complicated computation related to anchor boxes.
- Extensibility 1: can also be used as the RPN in a two-stage detector (better performance than the standard RPN).
- Extensibility 2: can be extended to solve other vision tasks with minimal modification.
Related works
Detector categories by anchor usage
Anchor-based detectors (e.g. Faster R-CNN)
- Need many hyper-parameters ⇒ detection deviates from the neat fully convolutional network architecture used in other dense prediction tasks such as semantic segmentation.
Anchor-free detectors (e.g. YOLO, CornerNet, etc)
- YOLOv1: predicts bounding boxes only at points near the center of objects.
⇒ (-) Low recall. YOLOv2 re-introduced anchors to address this.
- FCOS:
- Predicts the bounding boxes using all points in a ground-truth bounding box.
- Suppresses the low-quality detected bounding boxes using the center-ness branch.
- CornerNet: detects a pair of corners of a bounding box and groups them to form the final detected bounding box.
⇒ (-) Requires much more complicated post-processing to group the pairs of corners belonging to the same instance.
- DenseBox-based anchor-free detectors: unsuitable for generic object detection due to the difficulty in handling overlapping bounding boxes and relatively low recall.
Key concepts
Per-pixel prediction
- A location $(x,y)$ on feature map $F_i$ corresponds to $(\lfloor \frac{s}{2}\rfloor+xs, \lfloor \frac{s}{2}\rfloor+ys)$ on the input image, where $s$ is the total stride.
- For every feature-map location $(x,y)$, the target bounding box is regressed directly, similar to what is done in segmentation.
- Target for classification:
\[c^* = \begin{cases} \text{the class label} & \text{if $(x,y)$ falls into any ground-truth box}\\ 0 \text{ (background)} & \text{otherwise} \end{cases}\]
- Target for regression:
\[\begin{align} l^*=x - x_0^{(i)}, & \ \ t^* = y - y_0^{(i)},\\ r^*=x_1^{(i)} -x, & \ \ b^* = y_1^{(i)} - y. \end{align}\label{target_for_regression}\]
- The targets are always positive, so $\exp{(x)}$ is applied on top of the regression branch to map any real number to $(0, \infty)$.
- Advantage:
- FCOS can leverage as many foreground samples as possible to train the regressor.
- Train:
- Loss function:
\[L(\{\boldsymbol{p}_{x,y}\}, \{\boldsymbol{t}_{x,y}\}) = \frac{1}{N_\text{pos}}\sum_{x,y}{L_\text{cls}(\boldsymbol{p}_{x,y}, c^*_{x,y})} + \frac{\lambda}{N_\text{pos}}\sum_{x,y}{\boldsymbol{1}_{\{c^*_{x,y}>0\}} L_\text{reg}(\boldsymbol{t}_{x,y}, \boldsymbol{t}^*_{x,y}) },\]
where $L_\text{cls}$ is the focal loss and $L_\text{reg}$ is the IoU loss.
- Inference:
- For an input image, 1) forward it through the network, 2) obtain the classification score $\boldsymbol{p}_{x,y}$ and the regression prediction $\boldsymbol{t}_{x,y}$ for each location on the feature map $F_i$.
- Locations with $p_{x,y}\geq 0.05$ are chosen as positive samples, and $\text{Eq}.~(\ref{target_for_regression})$ is inverted to obtain the predicted bounding boxes (a sketch of the target assignment and box decoding follows below).
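The per-pixel assignment and box decoding above can be summarized in a few lines of PyTorch. This is only a minimal sketch: the helper names `feature_locations`, `regression_targets`, and `decode_boxes` are my own, it handles a single ground-truth box per call, and it omits class-label assignment and score thresholding.

```python
import torch

def feature_locations(h, w, stride):
    # map each feature-map cell (x, y) to image coords (floor(s/2) + x*s, floor(s/2) + y*s)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.reshape(-1) * stride + stride // 2
    ys = ys.reshape(-1) * stride + stride // 2
    return torch.stack([xs, ys], dim=1).float()    # (h*w, 2) image-space centers

def regression_targets(locations, gt_box):
    # (l*, t*, r*, b*) of every location w.r.t. one ground-truth box (x0, y0, x1, y1)
    x, y = locations[:, 0], locations[:, 1]
    x0, y0, x1, y1 = gt_box
    l, t = x - x0, y - y0
    r, b = x1 - x, y1 - y
    ltrb = torch.stack([l, t, r, b], dim=1)         # (h*w, 4)
    inside = ltrb.min(dim=1).values > 0             # a location is positive only inside the box
    return ltrb, inside

def decode_boxes(locations, ltrb):
    # invert the regression targets: (x0, y0, x1, y1) = (x - l, y - t, x + r, y + b)
    x, y = locations[:, 0], locations[:, 1]
    l, t, r, b = ltrb.unbind(dim=1)
    return torch.stack([x - l, y - t, x + r, y + b], dim=1)

# example usage for the P3 map of a hypothetical 800x1216 input and one made-up box
locs = feature_locations(100, 152, stride=8)
ltrb, pos = regression_targets(locs, (32.0, 48.0, 200.0, 160.0))
boxes = decode_boxes(locs[pos], ltrb[pos])
```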
Multi-level prediction
- Why:
- The large stride of the final feature map leads to a relatively low best possible recall (BPR).
- Best possible recall (BPR): the maximum achievable recall, i.e., a recall value that ignores precision.
- Recall: the fraction of actual positives that are predicted correctly. If everything is predicted as positive, FP and TP both grow, so recall can increase.
- Low BPR $\approx$ low recall: if everything is predicted as background, recall can drop.
- Small objects are not represented on feature maps with a large stride, which lowers recall.
- Overlaps in ground-truth boxes can cause intractable ambiguity, i.e., which bounding box should a location in the overlap regress?
- How:
- As in FPN, five levels of feature maps $\{P_3, P_4, P_5, P_6, P_7\}$ are used.
- $P_3$, $P_4$, and $P_5$: obtained from the backbone feature maps $C_3$, $C_4$, and $C_5$ via a $1\times 1$ convolutional layer followed by top-down connections.
- $P_6$ and $P_7$: obtained by applying stride-2 convolutional layers starting from $P_5$.
⇒ As a result, $P_3$ through $P_7$ have strides 8, 16, 32, 64, and 128, respectively.
- The range of bounding box regression is limited at each level (see the sketch at the end of this subsection). For the $i$-th feature level, the allowed distance range is:
\[m_{i-1}\leq \max{(l^*, t^*, r^*, b^*)}\leq m_i,\]
where $m_2, \ldots, m_7 = 0, 64, 128, 256, 512, \infty$. Locations outside this range are treated as negative samples.
⇒ With this constraint, most overlapping objects end up on feature levels of different sizes.
- If two ground-truth boxes still overlap at the same feature level, the ground-truth box with the smaller area is chosen as the target.
- As in FPN, the heads are shared across feature levels, which is parameter-efficient and also improves detection performance.
- However, different feature levels regress different size ranges, e.g., $[0, 64]$ for $P_3$ and $[64, 128]$ for $P_4$. Therefore, instead of $\exp{(x)}$, $\exp{(s_ix)}$ is used, where $s_i$ is a trainable scalar, to automatically adjust the base of the exponential function for each feature level $P_i$ (a slight performance improvement).
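A rough sketch of the level-assignment rule and the trainable per-level scale $s_i$. The names `REGRESS_RANGES`, `level_mask`, and `Scale` are my own, not the official code, and the boundary handling (inclusive vs. exclusive) is simplified.

```python
import torch
import torch.nn as nn

# m_2, ..., m_7 = 0, 64, 128, 256, 512, inf  =>  one (lo, hi) range per level P3..P7
REGRESS_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, float("inf"))]

def level_mask(ltrb, level):
    # True where max(l*, t*, r*, b*) falls inside the range of this level
    lo, hi = REGRESS_RANGES[level]
    max_dist = ltrb.max(dim=1).values
    return (max_dist >= lo) & (max_dist <= hi)

class Scale(nn.Module):
    # trainable scalar s_i, so each level predicts exp(s_i * x) instead of exp(x)
    def __init__(self, init_value=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return x * self.scale
```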
Center-ness
- Why:
- Many low-quality predicted bounding boxes are produced at locations far away from the center of an object.
- How:
- A center-ness branch is added to suppress these low-quality detected boxes without introducing any hyper-parameters:
- Center-ness: the normalized distance from a location to the center of the object it belongs to, ranging from 0 to 1.
\[\text{centerness}^*=\sqrt{ \frac{\min{(l^*,r^*)}}{\max{(l^*,r^*)}} \times \frac{\min{(t^*,b^*)}}{\max{(t^*,b^*)}} }.\]
The square root slows down the decay of the center-ness.
- Train: trained with the binary cross-entropy (BCE) loss.
- Test: the predicted center-ness is multiplied by the corresponding classification score to compute the final score (see the sketch after this list). ⇒ This down-weights the classification scores of locations far from an object center, so such low-score predictions are filtered out by the final NMS, reducing the number of low-quality boxes.
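A small sketch of the center-ness target and of how it modulates the classification score before NMS. The function names are illustrative, the input is assumed to contain only positive locations, and NMS itself is omitted.

```python
import torch

def centerness_target(ltrb):
    # sqrt( min(l,r)/max(l,r) * min(t,b)/max(t,b) ) per positive location, in [0, 1]
    l, t, r, b = ltrb.unbind(dim=1)
    lr = torch.minimum(l, r) / torch.maximum(l, r)
    tb = torch.minimum(t, b) / torch.maximum(t, b)
    return torch.sqrt(lr * tb)

def final_scores(cls_scores, centerness):
    # down-weight off-center predictions before NMS
    # cls_scores: (num_locations, C), centerness: (num_locations,)
    return cls_scores * centerness.unsqueeze(-1)
```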
Other details
- $C$ binary classifiers are used instead of a multi-class classifier (sketch below).
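A tiny sketch of what "$C$ binary classifiers" means in practice: the classification branch outputs $C$ logits per location and each class is scored with an independent sigmoid (no softmax), so a per-class focal/BCE loss can be applied. The shapes below are illustrative assumptions.

```python
import torch

num_classes = 80                                     # e.g. COCO
cls_logits = torch.randn(2, num_classes, 100, 152)   # (batch, C, H, W) from the cls branch
cls_scores = cls_logits.sigmoid()                    # independent per-class probability, no softmax
```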
Experiments
- Multi-level prediction
- BPR with vs. without FPN ⇒ FPN improves BPR.
- Ratio of ambiguous samples with vs. without FPN ⇒ FPN reduces ambiguous samples.
- Center-ness
- With vs. without using center-ness, and with vs. without a separate center-ness branch ⇒ performance improves when the center-ness branch is used.
- Anchor-based vs. Anchor-free
- Beats the anchor-based detector (RetinaNet) under the same settings.
- Comparison with other SOTA
- Shows improved performance even with the same backbone as other SOTA detectors.
Algorithm
class FCOS:
    def forward_train(self, img, img_metas, gt_bboxes, gt_labels, gt_bboxes_ignore=None):
        x = self.extract_feat(img)
        losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes, gt_labels, gt_bboxes_ignore)
        return losses

class AnchorFreeHead:
    def forward_single(self, x):
        cls_feat, reg_feat = x, x
        for cls_layer in self.cls_convs:  # [Conv2d, gn, relu] * 4
            cls_feat = cls_layer(cls_feat)
        cls_score = self.conv_cls(cls_feat)  # Conv2d(256, 80, (3, 3))
        for reg_layer in self.reg_convs:  # [Conv2d, gn, relu] * 4
            reg_feat = reg_layer(reg_feat)
        bbox_pred = self.conv_reg(reg_feat)  # Conv2d(256, 4, (3, 3))
        return cls_score, bbox_pred, cls_feat, reg_feat

class FCOSHead(AnchorFreeHead):
    def forward(self, x, scale, stride):
        cls_score, bbox_pred, cls_feat, reg_feat = super().forward_single(x)
        centerness = self.conv_centerness(cls_feat)  # Conv2d(256, 1, (3, 3))
        bbox_pred = scale(bbox_pred).float()
        bbox_pred = bbox_pred.exp()
        return cls_score, bbox_pred, centerness

    def forward_train(self, x, img_metas, gt_bboxes, gt_labels, ...):
        outs = self(x)
        loss_inputs = outs + (gt_bboxes, gt_labels, img_metas)
        losses = self.loss(*loss_inputs, ...)
        return losses

    def loss(self, cls_scores, bbox_preds, centernesses, gt_bboxes, gt_labels, img_metas, ...):
        # ...
        # self.loss_cls = focal_loss
        loss_cls = self.loss_cls(flatten_cls_scores, flatten_labels, avg_factor=num_pos)
        # ...
        # self.loss_bbox = iou_loss
        loss_bbox = self.loss_bbox(
            pos_decoded_bbox_preds, pos_decoded_target_preds,
            weight=pos_centerness_targets, avg_factor=centerness_denorm
        )
        # self.loss_centerness = cross_entropy_loss
        loss_centerness = self.loss_centerness(
            pos_centerness, pos_centerness_targets, avg_factor=num_pos
        )
        return dict(loss_cls=loss_cls, loss_bbox=loss_bbox, loss_centerness=loss_centerness)