PPFormer: Patch Prototype Transformer for Semantic Segmentation
Abstract
Since the introduction of vision Transformers into computer vision, many tasks, including semantic segmentation, have undergone radical changes. Although the Transformer strengthens the correlations among local features of image objects in the hidden space through its attention mechanism, it remains difficult for a segmentation head to predict masks from dense embeddings of multi-category, multi-local features. We present the patch prototype vision Transformer (PPFormer), a Transformer architecture for semantic segmentation based on knowledge-embedded patch prototypes. 1) A hierarchical Transformer encoder generates multi-scale, multi-layered patch features, using seamless patch projection to capture information from multi-scale patches and feature-clustered self-attention with implicit positional encoding to enhance the interplay of multi-layered visual information. 2) PPFormer employs a non-parametric prototype decoder that extracts region observations representing significant object parts via non-learnable patch prototypes, and then computes the similarity between patch prototypes and pixel embeddings. The proposed contrastive patch prototype alignment module, which updates the prototype bank with new patch prototypes, effectively maintains class boundaries among prototypes. To suit different application scenarios, we scale the architecture into three variants: PPFormer-S, PPFormer-M, and PPFormer-L. Experimental results demonstrate that PPFormer outperforms fully convolutional network (FCN)- and attention-based semantic segmentation models on the PASCAL VOC 2012, ADE20K, and Cityscapes datasets.
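To make the non-parametric decoding step concrete, the following is a minimal sketch of prototype-based mask prediction, assuming cosine similarity between L2-normalized pixel embeddings and a non-learnable per-class prototype bank; the function name, tensor shapes, and class/prototype counts are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prototype_mask_prediction(pixel_emb, prototype_bank):
    """Assign each pixel to the class of its most similar patch prototype.

    pixel_emb:       (B, C, H, W) pixel embeddings from the encoder
    prototype_bank:  (K, P, C) non-learnable prototypes, K classes x P prototypes each
    returns:         (B, K, H, W) per-class similarity scores and (B, H, W) hard labels
    """
    B, C, H, W = pixel_emb.shape
    K, P, _ = prototype_bank.shape

    # Cosine similarity = dot product of L2-normalized vectors.
    emb = F.normalize(pixel_emb.flatten(2), dim=1)                 # (B, C, H*W)
    protos = F.normalize(prototype_bank.reshape(K * P, C), dim=1)  # (K*P, C)

    sim = torch.einsum('nc,bcm->bnm', protos, emb)                 # (B, K*P, H*W)
    sim = sim.view(B, K, P, H * W).max(dim=2).values               # best prototype per class
    scores = sim.view(B, K, H, W)
    return scores, scores.argmax(dim=1)


# Example usage with hypothetical sizes (21 classes, 10 prototypes per class).
scores, labels = prototype_mask_prediction(
    torch.randn(2, 256, 64, 64),   # pixel embeddings
    torch.randn(21, 10, 256),      # prototype bank
)
```

Because the prototypes are non-learnable, prediction reduces to a nearest-prototype lookup in the embedding space; maintaining the prototype bank (e.g., via the alignment module described above) is what keeps the class boundaries sharp.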