Multi-class Token Transformer for Weakly Supervised Semantic Segmentation

Lian Xu¹
Wanli Ouyang²
Mohammed Bennamoun¹
Farid Boussaid¹
Dan Xu³

¹The University of Western Australia, ²The University of Sydney, ³Hong Kong University of Science and Technology
In CVPR 2022

[Paper]
[GitHub]


Figure 1. (a) In previous vision transformers, only one class token (red square) is used to aggregate information from the patch tokens (blue squares). The patch attentions learned for this single class token yield a class-agnostic localization map. (b) In contrast, the proposed MCTformer uses multiple class tokens to learn interactions between class tokens and patch tokens. The learned class-to-patch attentions of the different class tokens produce class-specific object localization maps.
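The mechanism in Figure 1(b) amounts to slicing the transformer's attention tensor. Below is a minimal PyTorch sketch (our own illustration, not the released code; the function name and shapes are assumptions): with C class tokens prepended to N patch tokens, the class-to-patch block of the attention matrix yields one localization map per class.

```python
import torch

def class_specific_maps(attn, num_classes, grid_hw):
    """attn: (B, heads, C+N, C+N) attention weights from a transformer layer.
    num_classes: number of class tokens C. grid_hw: (H, W) patch grid, N = H*W."""
    attn = attn.mean(dim=1)                             # average over heads -> (B, C+N, C+N)
    cls_to_patch = attn[:, :num_classes, num_classes:]  # class-to-patch block -> (B, C, N)
    H, W = grid_hw
    return cls_to_patch.reshape(attn.shape[0], num_classes, H, W)  # one map per class token
```

In practice, maps of this kind are typically aggregated across several of the later transformer layers and min-max normalized per class before being used as localization seeds.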

Abstract

This paper proposes a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS). Inspired by the fact that the attended regions of the single class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate whether the transformer model can also capture class-specific attention for more discriminative object localization by learning multiple class tokens. To this end, we propose the Multi-class Token Transformer, termed MCTformer, which uses multiple class tokens to learn interactions between the class tokens and the patch tokens. The proposed MCTformer successfully produces class-discriminative object localization maps from the class-to-patch attentions of the different class tokens. We also propose a patch-level pairwise affinity, extracted from the patch-to-patch transformer attention, to further refine the localization maps. Moreover, the proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets. These results underline the importance of the class token for WSSS.
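The patch-level pairwise affinity mentioned above can be read off the same attention tensor. A hedged sketch under our own naming (the exact extraction and normalization in the paper may differ): the patch-to-patch block acts as an affinity matrix that propagates each class-specific map across mutually similar patches.

```python
import torch

def refine_with_affinity(maps, attn, num_classes):
    """maps: (B, C, H, W) class-specific maps; attn: (B, heads, C+N, C+N), N = H*W."""
    B, C, H, W = maps.shape
    aff = attn.mean(dim=1)[:, num_classes:, num_classes:]       # patch-to-patch block (B, N, N)
    aff = aff / aff.sum(dim=-1, keepdim=True).clamp(min=1e-6)   # row-normalize as transition weights
    flat = maps.flatten(2)                                      # (B, C, N)
    refined = torch.bmm(flat, aff.transpose(1, 2))              # patch i <- sum_j aff[i, j] * map[j]
    return refined.reshape(B, C, H, W)
```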


Approach





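The abstract notes that the learned class-token attentions fully complement CAM. As a rough illustration of that combination (a simplified sketch, not the exact MCTformer-V2 formulation; fuse_with_cam and all shapes are our assumptions), one can compute a CAM from the patch tokens and fuse it with the class-to-patch attention maps by element-wise multiplication:

```python
import torch
import torch.nn.functional as F

def fuse_with_cam(attn_maps, patch_tokens, classifier_weight):
    """attn_maps: (B, C, H, W) class-to-patch attention maps.
    patch_tokens: (B, N, D), N = H*W; classifier_weight: (C, D) classification head."""
    B, C, H, W = attn_maps.shape
    cam = torch.einsum('bnd,cd->bcn', patch_tokens, classifier_weight)  # patch-token CAM
    cam = F.relu(cam).reshape(B, C, H, W)

    def minmax(x):  # per-class min-max normalization
        x = x - x.amin(dim=(2, 3), keepdim=True)
        return x / x.amax(dim=(2, 3), keepdim=True).clamp(min=1e-5)

    return minmax(cam) * minmax(attn_maps)  # element-wise fusion of the two cues
```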
Try our code

 Recommended version: [PyTorch]




Paper




Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, Dan Xu
Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
CVPR, 2022 (Paper)

[bibtex]



Experiments

Here we compare MCTformer with state-of-the-art WSSS methods. Please see the paper for more details.

Table 1. Segmentation performance comparison of WSSS methods in terms of mIoU (%) on the PASCAL VOC 2012 val and test sets using different segmentation backbones. Sup.: supervision. I: image-level ground-truth labels. S: off-the-shelf saliency maps.


Table 2. Segmentation performance comparison of WSSS methods in terms of mIoU (%) on the MS COCO validation set.
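Both tables report the mean Intersection-over-Union (mIoU). For reference, here is a minimal sketch of the standard computation (a hypothetical helper, not from this repository), which accumulates a confusion matrix over the dataset and averages the per-class IoUs:

```python
import numpy as np

def mean_iou(preds, gts, num_classes, ignore_index=255):
    """preds, gts: iterables of (H, W) integer label maps with labels in [0, num_classes)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        valid = g != ignore_index                      # skip void/ignored pixels
        conf += np.bincount(num_classes * g[valid] + p[valid],
                            minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    return float(np.mean(inter / np.maximum(union, 1)))  # guard against empty classes
```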



Figure 2. Visualization of the class-specific localization maps generated by MCTformer on the PASCAL VOC train set.

Testing on video data from DAVIS 2017


Acknowledgements

This research was supported in part by Australian Research Council Grants DP210101682, DP210102674, and DP200103223, Australian Medical Research Future Fund Grant MRFAI000085, the CRC-P Smart Material Recovery Facility (SMRF) - Curby Soft Plastics project, the Early Career Scheme of the Research Grants Council (RGC) of the Hong Kong SAR under Grant No. 26202321, and HKUST Startup Fund No. R9253.