An effective CNN and Transformer complementary network for medical image segmentation
TL;DR Summary
The CTC-Net, a complementary network for medical image segmentation, combines the CNN's local features with the Transformer's long-range dependencies. It utilizes dual encoders and cross-domain fusion to enhance feature representation, outperforming existing models in organ and cardiac segmentation.
Abstract
The Transformer network was originally proposed for natural language processing. Due to its powerful representation ability for long-range dependency, it has been extended for vision tasks in recent years. To fully utilize the advantages of Transformers and Convolutional Neural Networks (CNNs), we propose a CNN and Transformer Complementary Network (CTC-Net) for medical image segmentation. We first design two encoders by Swin Transformers and Residual CNNs to produce complementary features in Transformer and CNN domains, respectively. Then we cross-wisely concatenate these complementary features to propose a Cross-domain Fusion Block (CFB) for effectively blending them. In addition, we compute the correlation between features from the CNN and Transformer domains, and apply channel attention to the self-attention features by Transformers for capturing dual attention information. We incorporate cross-domain fusion, feature correlation and dual attention together to propose a Feature Complementary Module (FCM) for improving the representation ability of features. Finally, we design a Swin Transformer decoder to further improve the representation ability of long-range dependencies, and propose to use skip connections between the Transformer decoded features and the complementary features for extracting spatial details, contextual semantics and long-range information. Skip connections are performed in different levels for enhancing multi-scale invariance. Experimental results show that our CTC-Net significantly surpasses the state-of-the-art image segmentation models based on CNNs, Transformers, and even Transformer and CNN combined models designed for medical image segmentation. It achieves superior performance on different medical applications, including multi-organ segmentation and cardiac segmentation.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is the development of An effective CNN and Transformer complementary network for medical image segmentation.
1.2. Authors
The authors are:
- Feiniu Yuan (Hefei University of Technology, University of Science and Technology of China, Singapore Bioimaging Consortium, Shanghai Normal University)
- Zhengxiao Zhang (Shanghai Normal University)
- Zhijun Fang (Donghua University)
Their research backgrounds primarily include deep learning, image segmentation, pattern recognition, and 3D modeling, with a focus on medical image processing.
1.3. Journal/Conference
The paper was published in a journal. The specific journal name is not explicitly stated in the provided text, but it is indicated to be published by Elsevier Ltd. The quality and influence of Elsevier journals are generally high in academic research, suggesting a peer-reviewed publication.
1.4. Publication Year
The paper was published on November 30, 2022.
1.5. Abstract
The paper proposes a CNN and Transformer Complementary Network (CTC-Net) for medical image segmentation, aiming to leverage the strengths of both Convolutional Neural Networks (CNNs) for local contextual information and Transformers for long-range dependencies. The CTC-Net features two encoders: one based on Swin Transformers for Transformer domain features and another on Residual CNNs for CNN domain features, ensuring complementary feature production. A Cross-domain Fusion Block (CFB) is designed to blend these features effectively, and a Feature Complementary Module (FCM) incorporates cross-domain correlation and dual attention to enhance feature representation. Finally, a Swin Transformer decoder with multi-level skip connections is used to improve long-range dependency modeling and extract spatial details and contextual semantics. Experimental results demonstrate that CTC-Net significantly outperforms state-of-the-art CNN-based, Transformer-based, and CNN-Transformer combined models across various medical applications, including multi-organ and cardiac segmentation.
1.6. Original Source Link
The original source link is: /files/papers/6929949b4241c84d8510f9f3/paper.pdf.
This indicates the paper is officially published and accessible via the provided link.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is accurate medical image segmentation, which is crucial for computer-aided clinical diagnosis and treatment planning. Medical images, often reflecting internal body structures, require precise pixel-level classification to locate specific organs or lesions.
Existing research faces several challenges:
- CNNs' limitation in long-range dependency: While Convolutional Neural Networks (CNNs) excel at extracting local contextual information due to their strong inductive biases (locality and translation invariance), their receptive fields are inherently limited. This makes modeling long-range dependencies across an image difficult, which is crucial for segmenting large or irregularly shaped organs.
- Transformers' limitation in local details: Transformers, initially developed for natural language processing, utilize self-attention mechanisms to effectively capture global and long-range dependencies. However, they lack the strong inductive biases of CNNs for locality and translation invariance, making them less effective at extracting fine-grained spatial details and local contextual features.
- Gaps in prior hybrid models: While some prior works attempted to combine CNNs and Transformers, they often failed to fully exploit the advantages of both, such as neglecting to introduce Transformers at multiple feature scales or lacking robust cross-domain feature fusion mechanisms.

The paper's entry point or innovative idea is the belief that Transformers and CNNs are naturally complementary. By combining their strengths, it is possible to overcome their individual weaknesses and create a more robust model for medical image segmentation. This is addressed by designing a network that explicitly produces and fuses complementary features from both domains.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Dual Encoding Paths: Designing two distinct encoders, a Residual CNN encoder (ResNet34) for spatial and contextual features and a Swin Transformer encoder for long-range dependencies, to produce mutually complementary features.
- Effective Feature Complementary Module (FCM): Proposing an FCM that cross-wisely fuses features from the CNN and Transformer domains. This module incorporates a Cross-domain Fusion Block (CFB) for blending, a Correlation Enhancement Block (CEB) for modeling cross-domain correlation, and a Channel Attention Block (CAB) for dual attention on Transformer features.
- Multi-scale Transformer Decoder with Skip Connections: Introducing a Swin Transformer decoder to further improve the representation of long-range dependencies, enhanced by multi-scale skip connections with the complementary features from the FCM to restore spatial details and contextual semantics.
- Novel Network Architecture (CTC-Net): Integrating these components into a novel CNN and Transformer Complementary Network (CTC-Net) specifically designed for medical image segmentation.

The key conclusions and findings are:

- CTC-Net significantly surpasses state-of-the-art image segmentation models, including pure CNN-based, pure Transformer-based, and existing CNN-Transformer combined models.
- It achieves superior performance on diverse medical applications, such as multi-organ segmentation (Synapse dataset) and cardiac segmentation (ACDC dataset).
- The FCM and the dual-encoder architecture are critical for the network's high performance, as validated by ablation studies. The explicit fusion of complementary features (local details from CNNs and global context from Transformers) is shown to be highly effective.

These findings solve the problem of achieving accurate segmentation by robustly capturing both local and global image information, leading to better delineation of various organs, including those with complex shapes or small sizes (e.g., the pancreas).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following foundational concepts:
- Medical Image Segmentation: This is the process of partitioning a medical image into multiple segments (sets of pixels) to identify and delineate specific structures or organs (e.g., liver, heart, tumor). It is a pixel-level classification task where each pixel is assigned to a specific category. Accurate segmentation is vital for diagnosis, surgical planning, and treatment assessment.
- Convolutional Neural Networks (CNNs): A class of deep neural networks primarily used for analyzing visual imagery.
  - Convolutional Layer: The core building block of a CNN. It applies a learnable kernel (or filter) that slides over the input image (or feature map), performing dot products to extract local features. This process leverages spatial locality and parameter sharing.
  - Inductive Biases: CNNs have strong inductive biases:
    - Locality: Features are extracted from local regions of the input.
    - Translation Invariance: The network can recognize a pattern regardless of where it appears in the image, due to shared kernels.
  - Pooling Layer (Down-sampling): Reduces the spatial dimensions (width and height) of the feature maps, which helps in reducing computational complexity, controlling overfitting, and making the network more robust to small variations in feature positions. Common types include max pooling and average pooling.
  - Receptive Field: The region in the input image that a particular neuron in a CNN layer "sees" or is influenced by. In deeper layers, neurons have larger receptive fields, allowing them to capture more abstract and global information, but it is still spatially limited.
  - U-Net Architecture: A widely used CNN architecture for biomedical image segmentation, characterized by its U-shaped structure. It consists of a contracting path (encoder) to capture context and a symmetric expanding path (decoder) that enables precise localization. Skip connections directly pass information from the encoder to the decoder at corresponding levels, helping to recover spatial information lost during down-sampling.
- Transformers (in Vision): Originally developed for natural language processing, Transformers have been adapted for computer vision.
  - Self-Attention Mechanism: The core component of a Transformer. It allows the model to weigh the importance of different parts of the input sequence (or image patches) when processing a particular element. For an input sequence of vectors (or tokens), self-attention computes three matrices: Query ($Q$), Key ($K$), and Value ($V$) (a minimal code sketch of this operation is given after this list). $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
    - $QK^T$ calculates the attention scores (or compatibility scores) between each query and all keys.
    - $\sqrt{d_k}$ is a scaling factor (square root of the dimension of the keys) used to prevent the dot product from becoming too large, which could push the softmax function into regions with very small gradients.
    - softmax normalizes the scores to produce attention weights, indicating how much each input element contributes to the output.
    - The attention weights are then multiplied by the Value matrix $V$ to obtain the weighted sum, which forms the output of the self-attention layer.
  - Long-range Dependencies: Transformers excel at capturing these because the self-attention mechanism computes relationships between all pairs of input elements, regardless of their spatial distance. This contrasts with CNNs, where interactions are primarily local.
  - Vision Transformer (ViT): A seminal work that first applied the standard Transformer directly to image classification. It divides an image into fixed-size patches, treats each patch as a token, and feeds these tokens into a standard Transformer encoder.
  - Swin Transformer: A hierarchical Vision Transformer that introduces shifted windows to achieve greater efficiency and better capture local features, mimicking some inductive biases of CNNs. It performs self-attention within local, non-overlapping windows, and then shifts these windows in successive layers to allow for cross-window connections, thereby building global context. It also uses patch merging for hierarchical feature representation.
- Channel Attention: A mechanism, often used in CNNs, that allows the network to selectively focus on relevant feature channels. It typically involves aggregating spatial information (e.g., via Global Average Pooling and Global Max Pooling), passing it through a small Multi-Layer Perceptron (MLP), and then using a sigmoid activation to generate channel-wise weights that are multiplied back with the original feature map. This emphasizes important channels and suppresses less relevant ones.
- Skip Connections: Direct connections that bypass one or more layers in a neural network. They help in mitigating the vanishing gradient problem, enabling deeper networks, and preserving fine-grained information that might otherwise be lost through successive transformations. In U-Net, they connect encoder features to decoder features.
- Global Average Pooling (GAP): A pooling operation that calculates the average of each feature map across all its spatial dimensions, resulting in a single value per feature map. This effectively summarizes the global context of each channel.
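As referenced above, the following is a minimal PyTorch sketch of scaled dot-product self-attention (single head, no learned projections); the tensor shapes and names are illustrative rather than taken from the paper:

```python
# Minimal sketch of scaled dot-product self-attention, Attention(Q, K, V).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (batch, num_tokens, dim)."""
    d_k = q.size(-1)
    # Compatibility scores between every query and every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)
    # Normalize scores into attention weights that sum to 1 per query.
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values forms the attended output.
    return weights @ v

tokens = torch.randn(2, 196, 48)   # e.g. 14x14 patches with 48-dim embeddings
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)                   # torch.Size([2, 196, 48])
```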
3.2. Previous Works
The paper contextualizes its contribution by discussing prior work in three categories:
- CNN based methods:
  - Fully Convolutional Network (FCN) [2]: Pioneering work that adapted CNNs for semantic segmentation, enabling end-to-end pixel-wise classification. It replaced fully connected layers with convolutional layers to allow arbitrary input sizes and produced dense predictions.
  - U-Net [3]: A highly influential CNN architecture for biomedical image segmentation, characterized by its U-shaped encoder-decoder structure and extensive skip connections. It balances capturing context (encoder) and precise localization (decoder).
  - ResNet [17], DenseNet [19], HRNet [20]: Deeper and more efficient CNN backbones that improved feature extraction, often used within U-Net-like structures (e.g., Res-UNet [22]).
  - Attention U-Net [25]: Introduced attention gates into U-Net to focus on relevant regions in the input, improving performance by emphasizing salient features.
- Transformer based methods:
  - Transformer [29]: The original Transformer model, initially for machine translation, revolutionized NLP with its self-attention mechanism. It achieved excellent performance without convolutions.
  - Vision Transformer (ViT) [9]: The first successful application of standard Transformers to vision tasks. It treated image patches as tokens, demonstrating Transformers' potential as backbones for vision problems. ViT None and ViT CUP (referring to ViT with CNN Up-sampling and Patch embedding from TransUnet) are variations mentioned as baselines.
  - Swin Transformer [11]: Addressed ViT's quadratic computational complexity by introducing shifted windows for local attention and patch merging for hierarchical feature representation, making Transformers more efficient and adaptable for dense prediction tasks like segmentation.
- CNN and Transformer combined methods:
  - TransUnet [40]: One of the first to combine CNNs and Transformers for medical image segmentation. It used a CNN encoder, a Transformer block for modeling long-range dependencies, and a CNN decoder. However, the paper notes its limitation in fully exploiting Transformers across all feature scales.
  - TransFuse [41]: Proposed a dual-branch architecture with CNN and Transformer encoders, followed by CNN and Transformer decoders, and a feature fusion module. The paper criticizes it for not fully utilizing Transformers for decoding and fusing low-level spatial details.
  - Other combinations: Methods that insert self-attention blocks into CNN models [34,35] or use CNNs for initial feature extraction before Transformers [36].
3.3. Technological Evolution
The evolution of semantic segmentation models for medical images has generally progressed through these stages:
- Early CNNs: Beginning with FCNs, CNNs demonstrated the power of end-to-end learning for pixel-level prediction.
- U-Net Era: The U-Net architecture became the de-facto standard for medical image segmentation due to its effectiveness with limited data, skip connections for detail preservation, and symmetric encoder-decoder design. Many variants (e.g., Res-UNet, Attention U-Net) emerged, focusing on deeper architectures, attention mechanisms, or improved skip connections.
- Transformer's Entry into Vision: Inspired by Transformers' success in NLP, ViT showed their potential for image recognition by treating images as sequences of patches.
- Efficient Transformers for Dense Prediction: Swin Transformer and similar hierarchical Transformers made Transformer architectures more suitable for dense prediction tasks by addressing computational complexity and incorporating hierarchical features.
- Hybrid CNN-Transformer Models: Recognizing the complementary strengths and weaknesses, the current trend is to combine CNNs (for local details and inductive biases) and Transformers (for global context and long-range dependencies) into hybrid architectures (e.g., TransUnet, TransFuse). This paper fits into this evolution by proposing a more sophisticated and explicitly complementary fusion strategy.
3.4. Differentiation Analysis
Compared to the main methods in related work, especially TransUnet and TransFuse (which are also hybrid models), CTC-Net introduces several core differences and innovations:
- Explicit Complementary Dual Encoders: Unlike TransUnet, which uses a CNN encoder followed by a Transformer block, or TransFuse, which has parallel CNN and Transformer encoders, CTC-Net explicitly emphasizes that its ResNet34 (CNN) encoder and Swin Transformer encoder are designed to produce mutually complementary features: spatial/contextual details from the CNN and long-range dependencies from the Transformer.
- Sophisticated Feature Complementary Module (FCM): This is the key differentiator. CTC-Net goes beyond simple concatenation or basic cross-attention (as in TransUnet or some TransFuse parts) by proposing a multi-faceted FCM that includes:
  - Cross-domain Fusion Block (CFB): This block performs a "cross-wise" concatenation and processing with Swin Transformer Blocks, allowing each domain's features to be enhanced by the global context of the other.
  - Correlation Enhancement Block (CEB): Explicitly models cross-domain correlation using point-wise multiplication, acting as a specialized attention mechanism to highlight mutually salient features.
  - Channel Attention Block (CAB): Applies channel attention specifically to Transformer features, creating a dual attention (channel and self-attention) to improve feature robustness.
- Pure Swin Transformer Decoder with Multi-scale Skip Connections: CTC-Net utilizes a Swin Transformer decoder, ensuring that long-range dependency modeling is carried through the decoding process, which TransUnet (using a CNN decoder) and TransFuse (using a CNN and a Transformer decoder, but with less emphasis on multi-scale fusing of low-level details by the Transformer path) do not fully achieve. The multi-scale skip connections from the FCM (which itself has blended features) to the Transformer decoder are specifically designed to restore both spatial details and long-range information effectively.
- Asymmetric Decoder Design: The ablation study shows that an asymmetric design with one Swin Transformer decoder outperforms a symmetric variant with two decoders (one CNN and one Transformer), suggesting that effective fusion of complementary features in the encoder path, coupled with a powerful Transformer decoder, is more crucial than independent decoding paths.

In essence, CTC-Net's innovation lies in its highly integrated and explicitly designed feature complementarity, fusion, and attention mechanisms within a hybrid CNN-Transformer architecture, which enables a more robust capture of both local and global information.
4. Methodology
4.1. Principles
The core idea behind the CNN and Transformer Complementary Network (CTC-Net) is based on the principle that Convolutional Neural Networks (CNNs) and Transformers possess complementary strengths and weaknesses, especially for image segmentation.
- CNNs are inherently good at extracting local features, contextual information, and maintaining spatial details due to their strong inductive biases of locality and translation invariance. However, their limited receptive fields make them less effective at modeling long-range dependencies.
- Transformers, leveraging self-attention mechanisms, excel at capturing global context and long-range dependencies by establishing relationships between distant parts of an image. Conversely, they typically lack the strong inductive biases of CNNs for local feature extraction and spatial detail preservation.

The theoretical basis is that by combining these two paradigms, CTC-Net can simultaneously benefit from the local precision of CNNs and the global understanding of Transformers. The intuition is that features extracted by a CNN encoder and a Transformer encoder will be different yet mutually enriching. Therefore, an effective mechanism to fuse these cross-domain features will lead to a more powerful and robust feature representation, ultimately improving segmentation accuracy, especially for complex medical images where both fine details and overall shape/context are critical. The proposed Feature Complementary Module (FCM) and the Swin Transformer decoder are designed to achieve this synergistic integration.
4.2. Core Methodology In-depth (Layer by Layer)
The CTC-Net architecture is composed of four main branches: a CNN encoder, a Transformer encoder, a Feature Complementary Module (FCM), and a Transformer decoder.
The following figure (Figure 2 from the original paper) shows the overall architecture of the CTC-Net:
Figure 2 (schematic): the overall framework of the CNN and Transformer Complementary Network (CTC-Net), showing the input, the CNN encoder, the stacked Swin Transformer Blocks, the Feature Complementary Modules at multiple levels, the skip connections, and the final output layer. Sub-figures: (a) CNN encoder, (b) Feature Complementary Module, (c) Transformer encoder, (d) Transformer decoder, (e) Swin Transformer Block (STB).
4.2.1. Overall Architecture Flow
- An input RGB image of size $H \times W \times 3$ is fed into both the CNN encoder and the Transformer encoder in parallel.
- Both encoders produce multi-level feature maps. The CNN encoder (Figure 2a) generates features primarily focused on spatial details and contextual semantics. The Transformer encoder (Figure 2c) generates features primarily focused on long-range dependencies.
- At corresponding levels, the feature maps from the CNN encoder and the Transformer encoder are fed into the Feature Complementary Module (FCM) (Figure 2b). The FCM processes and fuses these cross-domain features, producing enhanced complementary feature maps.
- The Transformer decoder (Figure 2d) takes the final feature map from the Transformer encoder (at the deepest level) as its initial input.
- During the up-sampling process in the Transformer decoder, multi-level skip connections are established. Specifically, the enhanced complementary features from the FCM at each level are fed into the corresponding Swin Transformer Blocks within the decoder, where they are fused with the up-sampled features from the deeper layers of the decoder. This fusion helps in restoring lost spatial details, contextual semantics, and long-range information.
- Finally, the Transformer decoder up-samples the features to the original input image size and generates the segmentation mask of size $H \times W \times K$, where $K$ is the number of categories.
4.2.2. The Transformer Encoder
The Transformer encoder (Figure 2c) is constructed by stacking Swin Transformer Blocks (STB) (Figure 2e) and patch merging operations, following the Swin Transformer architecture [11]. The goal is to capture long-range dependencies efficiently.

- Swin Transformer Block (STB): Each STB (Figure 2e) consists of two successive sub-blocks:
  - The first sub-block contains Layer Normalization (LN), Window based Multi-head Self Attention (W-MSA), a Multi-Layer Perceptron (MLP), and residual additions.
  - The second sub-block is similar but replaces W-MSA with Shifted Window based MSA (SW-MSA).

  W-MSA computes self-attention within non-overlapping local windows, reducing computational complexity from quadratic to linear with respect to image size. SW-MSA shifts the windows between successive blocks, allowing information to flow across different windows and thereby building global interaction while maintaining efficiency.
- Patch Merging: This operation down-samples feature maps hierarchically. It merges adjacent ($2 \times 2$) patches into a single larger patch by concatenating their features along the channel dimension. For example, if a feature map has a resolution of $h \times w$ and $C$ channels, after patch merging its resolution becomes $\frac{h}{2} \times \frac{w}{2}$ and its channel dimension becomes $4C$ (see the code sketch at the end of this subsection). This is analogous to pooling in CNNs for aggregating contextual features and creating multi-scale representations.
- Four Levels of Transformer Encoder:
  - Level 1: Starts with a patch embedding layer (which converts image patches into tokens of feature dimension $C$) followed by two Swin Transformer Blocks.
  - Levels 2, 3, 4: Each level begins with a patch merging operation to down-sample the feature map, followed by Swin Transformer Blocks to extract long-range dependencies at that scale.

Let the input RGB image have a size of $H \times W \times 3$. The outputs of the Transformer encoder at the four levels are denoted by $T_1, T_2, T_3, T_4$. Their sizes are:

- $T_1$: $\frac{H}{4} \times \frac{W}{4} \times C$
- $T_2$: $\frac{H}{8} \times \frac{W}{8} \times 2C$
- $T_3$: $\frac{H}{16} \times \frac{W}{16} \times 4C$
- $T_4$: $\frac{H}{32} \times \frac{W}{32} \times 8C$

The feature dimension for each token is $C = 48$, assuming a $4 \times 4$ patch from the RGB image forms a token.
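The patch-merging operation described above can be sketched in a few lines of PyTorch. This illustrates only the 2×2 token-grouping step that halves the resolution and quadruples the channels; the linear projection used in Swin Transformer is omitted, and the names are illustrative:

```python
# Patch merging: group every 2x2 neighbourhood of tokens along the channel axis.
import torch

def patch_merging(x):
    """x: (B, H, W, C) -> (B, H/2, W/2, 4C); H and W are assumed to be even."""
    x0 = x[:, 0::2, 0::2, :]   # top-left token of each 2x2 group
    x1 = x[:, 1::2, 0::2, :]   # bottom-left
    x2 = x[:, 0::2, 1::2, :]   # top-right
    x3 = x[:, 1::2, 1::2, :]   # bottom-right
    return torch.cat([x0, x1, x2, x3], dim=-1)

x = torch.randn(1, 56, 56, 48)          # Level-1 tokens for a 224x224 input
print(patch_merging(x).shape)           # torch.Size([1, 28, 28, 192])
```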
4.2.3. The CNN Encoder
The CNN encoder (Figure 2a) is built using four encoding blocks of ResNet34 [17] to extract contextual features and maintain spatial details. Each ResNet34 block performs a down-sampling operation by a rate of 2.

- To ensure consistency with the Transformer encoder's feature map sizes, the Conv1x and Conv2x blocks of ResNet34 are used to down-sample features twice in Level 1.
- The channel dimension for Level 1 is set to 48, matching the Transformer encoder.

The CNN encoder produces three feature maps: $C_1, C_2, C_3$.

- Level 1: Conv1x and Conv2x process the input to generate $C_1$ with a size of $\frac{H}{4} \times \frac{W}{4} \times 48$.
- Level 2: The Conv3x block processes $C_1$ to generate $C_2$ with a size of $\frac{H}{8} \times \frac{W}{8} \times 96$.
- Level 3: The Conv4x block processes $C_2$ to obtain $C_3$ with a size of $\frac{H}{16} \times \frac{W}{16} \times 192$.

These three feature maps contain abundant spatial details and contextual semantics, complementing the long-range dependencies captured by the Transformer encoder.
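The grouping of ResNet34 stages into three encoder levels can be sketched as follows. This is a hedged illustration using torchvision's stage names; the projection needed to match the 48/96/192 channel dimensions of the Transformer branch (and the ImageNet pre-training used by the paper) is omitted:

```python
# Sketch: grouping ResNet34 stages into three multi-scale encoder levels.
import torch
import torch.nn as nn
from torchvision.models import resnet34

backbone = resnet34(pretrained=False)   # the paper initializes from ImageNet weights
level1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                       backbone.maxpool, backbone.layer1)   # 1/4 resolution
level2 = backbone.layer2                                     # 1/8 resolution
level3 = backbone.layer3                                     # 1/16 resolution

x = torch.randn(1, 3, 224, 224)
c1 = level1(x); c2 = level2(c1); c3 = level3(c2)
print(c1.shape, c2.shape, c3.shape)
# torch.Size([1, 64, 56, 56]) torch.Size([1, 128, 28, 28]) torch.Size([1, 256, 14, 14])
```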
4.2.4. Feature Complementary Module (FCM)
The Feature Complementary Module (FCM) (Figure 2b and Figure 3) is designed to obtain mutually complementary information by effectively fusing features from the Transformer encoder ($T_k$) and the CNN encoder ($C_k$). It consists of four blocks: Cross-domain Fusion Block (CFB), Correlation Enhancement Block (CEB), Channel Attention Block (CAB), and Feature Fusion Block (FFB).
The following figure (Figure 3 from the original paper) details the Feature Complementary Module (FCM):
Figure 3 (schematic): the Feature Complementary Module, comprising the Cross-domain Fusion Block (CFB), the Correlation Enhancement Block (CEB), and the Channel Attention Block (CAB); operations such as Global Average Pooling (GAP) and the Hadamard product are used to strengthen feature fusion and representation. In the figure, one set of colored inputs denotes features from the CNN encoder and the other denotes the output features from the Transformer encoder.
Let the 2D Transformer feature map $T_k$ have the size $hw \times d$ (a sequence of $hw$ tokens of dimension $d$), and the 3D CNN feature map $C_k$ have the size $h \times w \times d$.
4.2.4.1. Cross-domain Fusion Block (CFB)
The CFB is responsible for cross-wisely fusing and enhancing features from the Transformer and CNN domains.

- It first applies Global Average Pooling (GAP) on both $T_k$ and $C_k$ to generate two feature vectors of size $1 \times d$.
- The Transformer input $T_k$ is concatenated with the globally pooled feature vector of the CNN input along the first (token) axis. This creates a larger 2D feature map of size $(hw + 1) \times d$, which is fed into a Swin Transformer Block (STB) for fusion, producing a fused map that is reshaped into its 3D version $F_k^t$ of size $h \times w \times d$.
- Symmetrically, the CNN input $C_k$ is concatenated with the pooled feature vector of the Transformer input. This is processed by another Swin Transformer Block, then reshaped into its 3D version $F_k^c$.
- Finally, the two cross-domain fused 3D feature maps ($F_k^t$ and $F_k^c$) are concatenated and processed by a convolution to generate the final cross-domain fusion feature map $F_k^{cfb}$ with size $h \times w \times d$.

The processing for the CFB is formulated as follows (notation as defined below):

$ F_k^t = \mathrm{Reshape}\big(\mathrm{STB}(\mathrm{Cat}(T_k, \mathrm{GAP}(C_k)))\big), \quad F_k^c = \mathrm{Reshape}\big(\mathrm{STB}(\mathrm{Cat}(C_k, \mathrm{GAP}(T_k)))\big), \quad F_k^{cfb} = \mathrm{Conv}\big(\mathrm{Cat}(F_k^t, F_k^c)\big) $

Where:
- $\mathrm{GAP}(\cdot)$: Global Average Pooling operation.
- $\mathrm{Cat}(\cdot)$: Concatenation operation.
- $\mathrm{STB}(\cdot)$: Swin Transformer Block.
- $\mathrm{Reshape}(\cdot)$: Reshaping operation (2D token sequence to 3D feature map).
- $\mathrm{Conv}(\cdot)$: Convolution operation.
- $C_k$: Input feature map from the CNN encoder at level $k$.
- $T_k$: Input feature map from the Transformer encoder at level $k$.
- $F_k^t$, $F_k^c$: Cross-domain fused feature maps (reshaped to 3D) after STB processing.
- $F_k^{cfb}$: Final cross-domain fusion feature map from the CFB at level $k$.
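A simplified, runnable PyTorch sketch of this cross-wise fusion idea follows. A standard nn.TransformerEncoderLayer stands in for the Swin Transformer Block, the auxiliary pooled token is simply dropped before reshaping, and all module and tensor names are illustrative assumptions rather than the authors' code:

```python
# Sketch of the Cross-domain Fusion Block: each domain is augmented with the
# other domain's globally pooled descriptor, refined, reshaped, and fused.
import torch
import torch.nn as nn

class CrossDomainFusionBlock(nn.Module):
    def __init__(self, dim, heads=3):
        super().__init__()
        self.stb_t = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.stb_c = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, t, c):
        """t: Transformer tokens (B, N, D); c: CNN map (B, D, H, W), N = H*W."""
        b, d, h, w = c.shape
        c_tokens = c.flatten(2).transpose(1, 2)        # (B, N, D)
        gap_c = c_tokens.mean(dim=1, keepdim=True)     # (B, 1, D) pooled CNN descriptor
        gap_t = t.mean(dim=1, keepdim=True)            # (B, 1, D) pooled Transformer descriptor
        # Cross-wise concatenation followed by a Transformer block in each branch.
        ft = self.stb_t(torch.cat([t, gap_c], dim=1))[:, :h * w]        # drop pooled token
        fc = self.stb_c(torch.cat([c_tokens, gap_t], dim=1))[:, :h * w]
        ft = ft.transpose(1, 2).reshape(b, d, h, w)    # back to 3D maps
        fc = fc.transpose(1, 2).reshape(b, d, h, w)
        return self.fuse(torch.cat([ft, fc], dim=1))   # (B, D, H, W)

cfb = CrossDomainFusionBlock(dim=48)
out = cfb(torch.randn(2, 56 * 56, 48), torch.randn(2, 48, 56, 56))
print(out.shape)   # torch.Size([2, 48, 56, 56])
```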
4.2.4.2. Correlation Enhancement Block (CEB)
The CEB models the cross-domain correlation between features from the Transformer ($T_k$) and CNN ($C_k$) encoders.

- The 2D Transformer feature map $T_k$ is reshaped to its 3D version, denoted as $\hat{T}_k$.
- Then, $\hat{T}_k$ is point-wisely multiplied by $C_k$ to produce a cross-domain correlation feature map $F_k^{ceb}$ with size $h \times w \times d$. This point-wise (Hadamard) multiplication acts as a special attention mechanism, enhancing features that are salient in both domains and suppressing less important ones.
4.2.4.3. Channel Attention Block (CAB)
The CAB further enhances attention features. While the Swin Transformer Block already includes a self-attention mechanism for long-range dependency, the CAB applies a channel attention mechanism [42] (commonly used in CNNs) to the Transformer features. This effectively creates a mixture of channel attention and self-attention, resulting in a dual attention feature map $F_k^{cab}$ with size $h \times w \times d$. A minimal code sketch of channel attention follows below.
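As a concrete illustration of channel attention in the spirit of [42] (global average and max pooling feeding a shared MLP and a sigmoid gate), here is a minimal PyTorch sketch; the reduction ratio and module name are assumptions, not the paper's implementation:

```python
# Channel attention: squeeze spatial information, weight channels, re-scale input.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))      # squeeze spatial dims by averaging
        mx = self.mlp(x.amax(dim=(2, 3)))       # and by max pooling
        weights = torch.sigmoid(avg + mx)       # per-channel importance in [0, 1]
        return x * weights[:, :, None, None]    # re-weight the feature channels

cab = ChannelAttention(192)
print(cab(torch.randn(2, 192, 14, 14)).shape)   # torch.Size([2, 192, 14, 14])
```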
4.2.4.4. Feature Fusion Block (FFB)
The FFB combines the outputs of the CFB, CEB, and CAB.

- The cross-domain feature map $F_k^{cfb}$, the correlation feature map $F_k^{ceb}$, and the dual attention feature map $F_k^{cab}$ are concatenated to obtain $F_k^{cat}$ with size $h \times w \times 3d$.
- This concatenated map is then processed using residual connections and a CBR block to generate the final output feature map $F_k$ for the FCM, with size $h \times w \times d$.

The processing for the FFB is formulated as follows (the exact placement of the residual connection is reconstructed from the description above):

$ F_k^{cat} = \mathrm{Cat}(F_k^{cfb}, F_k^{ceb}, F_k^{cab}), \quad F_k = \mathrm{CBR}(F_k^{cat}) + \mathrm{Conv}(F_k^{cat}) $

Where:
- $\mathrm{Cat}(\cdot)$: Concatenation operation.
- $F_k^{cfb}$: Cross-domain fusion feature map from the CFB.
- $F_k^{ceb}$: Correlation feature map from the CEB.
- $F_k^{cab}$: Dual attention feature map from the CAB.
- $F_k^{cat}$: Concatenated feature map.
- $\mathrm{Conv}(\cdot)$: Convolution operation.
- $\mathrm{CBR}(\cdot)$: A block consisting of convolutions (Conv), batch normalization (BN), and rectified linear unit (ReLU) activation. It is used to fuse the concatenated features and reduce the number of parameters.
- $F_k$: Final output feature map from the FCM at level $k$.
4.2.5. The Transformer Decoder
The Transformer decoder (Figure 2d) is designed to progressively recover feature maps and further improve the representation of long-range dependencies, using Swin Transformer Blocks and patch expanding operations.

- Patch Expanding: This operation is the inverse of patch merging. It up-samples feature maps by rearranging pixel features, typically followed by a convolutional layer to adjust channel dimensions. For example, if a feature map has resolution $h \times w$ and $4C$ channels, patch expanding reshapes it to $2h \times 2w$ with $C$ channels.
- Four Levels of Transformer Decoder:
  - Level 4 (deepest): Only uses a patch expanding operation to up-sample features at a rate of 2. It takes the output $T_4$ from the Transformer encoder as its input.
  - Levels 3 and 2: At each of these levels, two Swin Transformer Blocks are first used to fuse (1) the cross-domain enhanced feature map ($F_k$) from the corresponding Feature Complementary Module (via skip connection) and (2) the up-sampled features ($U_{k+1}$) from the adjacent deeper level of the decoder. After fusion, patch expanding is applied to up-sample the fused feature map.
  - Level 1 (shallowest): Also uses two Swin Transformer Blocks for feature fusion and extraction of long-range dependencies, incorporating the feature map from the FCM via skip connection.
  - Final Output Layer: A final patch expanding block with a rate of 4 is used to recover the spatial size of the 2D feature map to $H \times W$. A convolution adjusts its channel number to $K$ (the category number), and a reshaping operation converts the 2D map into a 3D feature map, which is the final segmentation output of CTC-Net.

The data processing in the Transformer decoder can be briefly formulated as follows (notation as defined below):

$ D_k = \mathrm{STB}\big(\mathrm{STB}(\mathrm{Cat}(U_{k+1}, F_k))\big), \quad U_k = \mathrm{PE}(D_k) $

Where:
- $k$: The level index of the decoder (e.g., $k = 3$ for the third decoding level).
- $\mathrm{STB}(\cdot)$: Swin Transformer Block.
- $U_{k+1}$: Up-sampled features from the adjacent deeper level of the decoder.
- $F_k$: Cross-domain enhanced feature map from the Feature Complementary Module (FCM) at level $k$ (via skip connection).
- $D_k$: Fused and processed features at level $k$ after passing through two STBs.
- $\mathrm{PE}(\cdot)$: Patch Expanding block.
- $U_k$: Up-sampled features for the next shallower level ($k-1$).

These skip connections are crucial for providing the Transformer decoder with multi-scale complementary information (spatial details and long-range context) from the FCM, allowing it to effectively restore high-resolution segmentation masks.
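The following runnable sketch illustrates one decoder level along these lines: the FCM skip feature and the up-sampled feature are merged, refined by two Transformer blocks (standard encoder layers stand in for Swin blocks), and then up-sampled by a pixel-shuffle-style patch expanding. The linear merging layer and all names are assumptions for illustration, not the authors' implementation:

```python
# Sketch of one Transformer-decoder level: fuse skip + upsampled, refine, expand.
import torch
import torch.nn as nn

def patch_expanding(x):
    """Inverse of patch merging: (B, H, W, 4C) -> (B, 2H, 2W, C)."""
    b, h, w, c4 = x.shape
    c = c4 // 4
    x = x.reshape(b, h, w, 2, 2, c)     # split channels into a 2x2 spatial group
    x = x.permute(0, 1, 3, 2, 4, 5)     # interleave the group with spatial axes
    return x.reshape(b, 2 * h, 2 * w, c)

class DecoderLevel(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.reduce = nn.Linear(2 * dim, dim)   # merge skip and up-sampled tokens
        self.stbs = nn.Sequential(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
        )

    def forward(self, up, skip):                # both: (B, H, W, D)
        b, h, w, d = up.shape
        x = self.reduce(torch.cat([up, skip], dim=-1)).reshape(b, h * w, d)
        x = self.stbs(x).reshape(b, h, w, d)
        return patch_expanding(x)               # (B, 2H, 2W, D/4)

level = DecoderLevel(dim=192, heads=12)
out = level(torch.randn(1, 14, 14, 192), torch.randn(1, 14, 14, 192))
print(out.shape)   # torch.Size([1, 28, 28, 48])
```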
5. Experimental Setup
5.1. Datasets
The authors evaluated CTC-Net on two widely used medical image datasets:

- Synapse dataset (Synapse):
  - Source & Characteristics: This dataset consists of 30 CT (Computed Tomography) scans of abdominal organs. CT scans use X-rays to create cross-sectional images of the body, providing detailed anatomical information.
  - Task: Multi-organ segmentation.
  - Split: 18 cases for the training set and 12 cases for the test set, following the split used in TransUnet [40].
  - Data Structure: The dataset is composed of 2211 2D slices extracted from the 3D volumes.
  - Categories: Segmentation is performed on 8 categories: aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach.
  - Why Chosen: It is a standard benchmark for multi-organ segmentation in CT images, presenting challenges such as varying organ sizes, complex boundaries, and potential deformations (e.g., the pancreas).
- Automatic Cardiac Diagnosis Challenge (ACDC) dataset:
  - Source & Characteristics: This dataset contains MRI (Magnetic Resonance Imaging) images from 100 different patients. MRI uses strong magnetic fields and radio waves to create detailed images of organs and soft tissues, often providing excellent contrast for cardiac structures.
  - Task: Cardiac segmentation, specifically for automated cardiac diagnosis.
  - Split: 70 samples for training, 10 samples for validation, and 20 samples for testing.
  - Categories: Segmentation of 3 cardiac structures: left ventricle (LV), right ventricle (RV), and myocardium (MYO).
  - Why Chosen: It is a challenging dataset for cardiac segmentation due to the dynamic nature of the heart, different imaging planes, and inter-patient variability. It also tests the model's generalization ability across different image modalities (MRI vs. CT).
Data Sample Example:
As the paper describes, Synapse involves abdominal CT scans where slices depict organs like the liver, kidneys, and pancreas. An example CT slice would be a grayscale 2D image showing varying tissue densities, with organs appearing as distinct regions of different intensities. The corresponding ground truth would be a pixel-wise mask, where each pixel is labeled with its organ category (e.g., pixels belonging to the liver are labeled 'liver', and background pixels are labeled 'background'). Similarly, for ACDC, an MRI slice would show the heart in cross-section, with different cardiac chambers and the myocardium visible, and the ground truth would be masks delineating LV, RV, and MYO.
These datasets are effective for validating the method's performance because they represent common and challenging tasks in medical image segmentation, cover different anatomical regions (abdomen and heart), and involve different imaging modalities (CT and MRI).
5.2. Evaluation Metrics
The paper uses two common evaluation metrics for image segmentation: Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD).
5.2.1. Dice Similarity Coefficient (DSC)
- Conceptual Definition: The Dice Similarity Coefficient is a statistical measure of spatial overlap between a predicted segmentation and its corresponding ground truth. It quantifies how similar two sets are, and in segmentation, it indicates the degree of overlap between the predicted mask and the true mask. A higher DSC value (closer to 1) means better overlap and thus more accurate segmentation.
- Mathematical Formula: $ \mathrm{DSC} = \frac{2 |P \cap G|}{|P| + |G|} $
- Symbol Explanation:
  - $P$: The set of pixels belonging to the predicted segmentation.
  - $G$: The set of pixels belonging to the ground truth segmentation.
  - $|P \cap G|$: The number of common pixels (intersection) between the predicted and ground truth segmentations.
  - $|P|$: The total number of pixels in the predicted segmentation.
  - $|G|$: The total number of pixels in the ground truth segmentation.
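A small NumPy sketch of the DSC computation for a single class, assuming binary prediction and ground-truth masks (names and the epsilon guard are illustrative):

```python
# Dice Similarity Coefficient for one class from binary masks.
import numpy as np

def dice_coefficient(pred, gt, eps=1e-7):
    """pred, gt: boolean arrays of the same shape (True = pixel belongs to the class)."""
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True   # 4-pixel prediction
gt   = np.zeros((4, 4), dtype=bool); gt[1:3, 1:4] = True     # 6-pixel ground truth
print(dice_coefficient(pred, gt))   # 2*4 / (4+6) = 0.8 (up to the tiny eps)
```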
5.2.2. Hausdorff Distance (HD)
- Conceptual Definition: The Hausdorff Distance measures the maximum distance from any point in one set to the nearest point in the other set. In the context of segmentation, it quantifies the dissimilarity of two boundary contours. A lower HD value (closer to 0) indicates that the boundaries of the predicted segmentation and the ground truth are very close, implying high-quality segmentation boundaries. It is sensitive to outliers and small boundary errors.
- Mathematical Formula: $ HD(P, G) = \max[D(P, G), D(G, P)], \quad D(P, G) = \max_{p \in P} \min_{g \in G} \|p - g\| $
- Symbol Explanation:
  - $HD(P, G)$: The Hausdorff Distance between the predicted segmentation $P$ and the ground truth $G$.
  - $D(P, G)$: The directed Hausdorff Distance from set $P$ to set $G$, i.e., the maximum distance from any point in $P$ to its closest point in $G$.
  - $D(G, P)$: The directed Hausdorff Distance from set $G$ to set $P$.
  - $\max_{p \in P}$: The maximum operation over all points in the set $P$.
  - $\min_{g \in G}$: The minimum operation over all points in the set $G$.
  - $p$: A coordinate vector of a pixel in the predicted segmentation set $P$.
  - $g$: A coordinate vector of a pixel in the ground truth segmentation set $G$.
  - $\|p - g\|$: The Euclidean ($L_2$) distance between pixel $p$ and pixel $g$.
  - $P$: The set of coordinate points representing the boundary of the segmentation prediction.
  - $G$: The set of coordinate points representing the boundary of the ground truth.
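A NumPy sketch of the symmetric Hausdorff distance following the definition above; boundary extraction from masks is omitted and the point sets are assumed to be given as coordinate arrays:

```python
# Symmetric Hausdorff distance between two point sets P and G.
import numpy as np

def directed_hausdorff(P, G):
    """max over p in P of the distance to the closest g in G."""
    # Pairwise Euclidean distances, shape (len(P), len(G)).
    d = np.linalg.norm(P[:, None, :] - G[None, :, :], axis=-1)
    return d.min(axis=1).max()

def hausdorff_distance(P, G):
    return max(directed_hausdorff(P, G), directed_hausdorff(G, P))

P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
G = np.array([[0.0, 0.0], [2.0, 0.0]])
print(hausdorff_distance(P, G))   # 1.0
```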
5.3. Baselines
The paper compares CTC-Net against a comprehensive set of state-of-the-art segmentation models, including pure CNN-based, pure Transformer-based, and CNN-Transformer combined models, all designed for medical image segmentation. These baselines are representative of the leading approaches in the field.
Pure CNN-based models:
- TransClaw U-Net [43]
- R50 U-Net [3] (U-Net with ResNet50 backbone)
- U-Net [3]
- DARR [44]
- VNet [45]
- ENet [46]
- Att-UNet [25]
- R50-DeeplabV3+ [47] (DeeplabV3+ with ResNet50 backbone)
- ContextNet [48]
- FSSNet [49]
- R50 Att-Unet [34] (Attention U-Net with ResNet50 backbone)
- DABNet [50]
- EDANet [51]
- FPENet [52]
- FastSCNN [53]
- CGNET [54]
Pure Transformer-based models:
- VIT None [9] (Vision Transformer without any CNN components)
- SwinUNet [6] (U-Net-like pure Transformer)
CNN and Transformer combined models:
- VIT CUP [9] (Vision Transformer with CNN Up-sampling and Patch embedding, likely referring to components from TransUnet)
- R50 VIT CUP [9] (ResNet50 backbone with VIT CUP components)
- TransUNet [40] (CNN encoder + Transformer block + CNN decoder)
5.4. Implementation Details
- Software & Hardware: Implemented using Python 3.8 and PyTorch 1.7.1 on an Intel i9 PC with an Nvidia GTX 3090 GPU (24 GB memory).
- Initialization:
  - Transformer encoder and decoder: Initialized with Swin Transformer weights pre-trained on ImageNet.
  - CNN encoder: Initialized with pre-trained ResNet34 weights.
- Training Parameters:
  - Batch size: 24
  - Maximum iteration number: 13,950
  - Optimizer: SGD (Stochastic Gradient Descent)
  - Basic learning rate (base_lr): 0.01
  - Momentum: 0.99
  - Weight decay: 3e-5
- Learning Rate Schedule: The learning rate (lr) decays over iterations: $ lr = \mathrm{base\_lr} \cdot \left( 1 - \frac{\mathrm{iter\_num}}{\mathrm{max\_iterations}} \right)^{0.9} $ Where:
  - base_lr: The initial (basic) learning rate.
  - iter_num: The current iteration index.
  - max_iterations: The total number of training iterations.
  - 0.9: A power factor determining the decay curve.
- Loss Function: A weighted sum of the cross-entropy loss ($\ell_{ce}$) and the Dice loss ($\ell_{dice}$): $ L = (1 - \alpha) \ell_{ce} + \alpha \ell_{dice} $ Where:
  - $L$: The overall loss.
  - $\ell_{ce}$: The cross-entropy loss, commonly used for classification tasks, measuring the dissimilarity between predicted probabilities and true labels.
  - $\ell_{dice}$: The Dice loss, directly optimized for the Dice Similarity Coefficient, which is beneficial for segmentation tasks, especially with imbalanced classes.
  - $\alpha$: An importance weight empirically set to 0.6, indicating that the Dice loss is weighted more heavily than the cross-entropy loss. (A runnable sketch of this schedule and loss is given at the end of this subsection.)
- Post-processing: Median filtering is applied to the segmentation results to produce smoother output masks. This is motivated by the naturally smooth surfaces of human organs and aims to reduce noise in the predictions.
Network Configuration (from Table 1): The following are the results from Table 1 of the original paper:
| PARAMETERS | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Input size | 224 × 224 | | | |
| Resolution | 56 × 56 | 28 × 28 | 14 × 14 | 7 × 7 |
| Depth_encoder | 2 | 2 | 18 | 2 |
| Depth_decoder | 1 | 2 | 2 | 2 |
| Num_heads | 3 | 6 | 12 | 24 |
| Num_heads_FCM | 3 | 6 | 12 | N/A |

- Input size: 224 × 224 pixels.
- Resolution: Feature map resolutions at the different levels (e.g., Level 1 has 56 × 56).
- Depth_encoder: Number of Swin Transformer Blocks in each Transformer encoder level (e.g., Level 3 has 18 blocks).
- Depth_decoder: Number of Swin Transformer Blocks in each Transformer decoder level (e.g., Level 1 has 1 block).
- Num_heads: Number of attention heads in the Transformer encoder and decoder (Multi-head Self-Attention allows the model to jointly attend to information from different representation subspaces).
- Num_heads_FCM: Number of attention heads in the Feature Complementary Module (FCM).
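To make the training recipe concrete, the following is a runnable sketch of the polynomial learning-rate decay and the weighted cross-entropy + Dice objective described above. The soft-Dice implementation and the class count (8 organs plus background) are assumptions, not the authors' code:

```python
# Sketch of the training recipe: poly learning-rate decay and CE + Dice loss (alpha = 0.6).
import torch
import torch.nn.functional as F

base_lr, max_iterations, alpha = 0.01, 13950, 0.6

def poly_lr(iter_num):
    """Polynomial decay: base_lr * (1 - iter/max_iter)^0.9."""
    return base_lr * (1 - iter_num / max_iterations) ** 0.9

def soft_dice_loss(logits, target, num_classes, eps=1e-5):
    """logits: (B, K, H, W); target: (B, H, W) with class indices."""
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1 - (2 * inter / (denom + eps)).mean()

logits = torch.randn(2, 9, 224, 224)               # 9 classes assumed (8 organs + background)
target = torch.randint(0, 9, (2, 224, 224))
loss = (1 - alpha) * F.cross_entropy(logits, target) \
       + alpha * soft_dice_loss(logits, target, num_classes=9)
print(poly_lr(0), poly_lr(10000), loss.item())
```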
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that CTC-Net consistently achieves superior performance compared to state-of-the-art segmentation models across different medical imaging applications and metrics.
6.1.1. Results on Synapse Dataset
The following are the results from Table 2 of the original paper:
| METHODS | Average | Aorta | Gallbladder | Kidney(L) | Kidney(R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| TransClaw U-Net [43] | 78.09 | 85.87 | 61.38 | 84.83 | 79.36 | 94.28 | 57.65 | 87.74 | 73.55 |
| R50 U-Net [3] | 74.68 | 87.74 | 63.66 | 80.60 | 78.19 | 93.74 | 56.90 | 85.87 | 74.16 |
| U-Net [3] | 76.85 | 89.07 | 69.72 | 77.77 | 68.60 | 93.43 | 53.98 | 86.67 | 75.58 |
| DARR [44] | 69.77 | 74.74 | 53.77 | 72.31 | 73.24 | 94.08 | 54.18 | 89.90 | 45.96 |
| VNet [45] | 68.81 | 75.34 | 51.87 | 77.10 | 80.75 | 87.84 | 40.04 | 80.56 | 56.98 |
| ENet [46] | 77.63 | 85.13 | 64.91 | 81.10 | 77.26 | 93.37 | 57.83 | 87.03 | 74.41 |
| Att-UNet[25] | 77.77 | 89.55 | 68.88 | 77.98 | 71.11 | 93.57 | 58.04 | 87.30 | 75.75 |
| R50-DeeplabV3+[47] | 75.73 | 86.18 | 60.42 | 81.18 | 75.27 | 92.86 | 51.06 | 88.69 | 70.19 |
| ContextNet [48] | 71.17 | 79.92 | 51.17 | 77.58 | 72.04 | 91.74 | 43.78 | 86.65 | 66.51 |
| FSSNet [49] | 74.59 | 82.87 | 64.06 | 78.03 | 69.63 | 92.52 | 53.10 | 85.65 | 70.86 |
| R50 Att-Unet [34] | 75.57 | 55.92 | 63.91 | 79.20 | 72.71 | 93.56 | 49.37 | 87.19 | 74.95 |
| DABNet [50] | 74.91 | 85.01 | 56.89 | 77.84 | 72.45 | 93.05 | 54.39 | 88.23 | 71.45 |
| EDANet [51] | 75.43 | 84.35 | 62.31 | 76.16 | 71.65 | 93.20 | 53.19 | 85.47 | 77.12 |
| FPENet [52] | 68.67 | 78.98 | 56.35 | 74.54 | 64.36 | 90.86 | 40.60 | 78.30 | 65.35 |
| FastSCNN [53] | 70.53 | 77.79 | 55.96 | 73.61 | 67.38 | 91.68 | 44.54 | 84.51 | 68.76 |
| VIT None [9] | 61.50 | 44.38 | 39.59 | 67.46 | 62.94 | 89.21 | 43.14 | 75.45 | 68.78 |
| VIT CUP [9] | 67.86 | 70.19 | 45.10 | 74.70 | 67.40 | 91.32 | 42.00 | 81.75 | 70.44 |
| R50 VIT CUP [9] | 71.29 | 73.73 | 55.13 | 65.32 | 75.80 | 72.20 | 91.51 | 45.99 | 81.99 |
| CGNET [54] | 75.08 | 83.48 | 63.16 | 77.91 | 77.02 | 91.92 | 57.37 | 85.47 | 77.12 |
| TransUNet [40] | 77.48 | 87.23 | 55.86 | 81.87 | 72.39 | 93.78 | 59.73 | 85.08 | 72.39 |
| CTC-Net(Ours) | 78.41 | 86.46 | 63.53 | 83.71 | 80.79 | 94.08 | 59.73 | 86.87 | 72.39 |
- Overall Performance: CTC-Net achieves the highest average DSC of 78.41% on the challenging Synapse dataset, outperforming all 20 compared methods. This includes strong CNN-based models like TransClaw U-Net (78.09%) and Att-UNet (77.77%), and leading CNN-Transformer hybrids like TransUNet (77.48%).
- Organ-specific Performance:
  - Pancreas: Historically a difficult organ to segment due to large deformations and blurred boundaries, CTC-Net achieves the best DSC of 59.73% (tied with TransUNet), showcasing its ability to handle complex and variable structures by effectively combining local details and global interactions.
  - Kidney(R): CTC-Net achieves the highest DSC of 80.79%.
  - Kidney(L): CTC-Net achieves the second highest DSC of 83.71%.
  - CTC-Net outperforms other methods on at least half of the eight categories, demonstrating its overall robustness.
- Improvement over TransUNet: While the average DSC improvement over TransUNet is about 0.9%, this is significant given the highly optimized nature of existing state-of-the-art models.

The following are the results from Table 3 of the original paper:

| METHODS | HD ↓ |
|---|---|
| R50 U-Net [3] | 36.87 |
| U-Net [3] | 39.70 |
| Att-UNet [25] | 36.02 |
| R50 Att-Unet [34] | 36.97 |
| R50 VIT CUP [9] | 32.87 |
| TransUNet [40] | 31.69 |
| CTC-Net (Ours) | 22.52 |

- Hausdorff Distance (HD): CTC-Net significantly reduces the Hausdorff Distance to 22.52, a substantial improvement over TransUNet (31.69) and other strong baselines. A smaller HD indicates much more accurate segmentation boundaries, suggesting that the precise fusion of local CNN details and global Transformer context in CTC-Net leads to sharper and more faithful delineations of organ contours. This improvement is almost 30% relative to TransUNet, highlighting the model's ability to produce robust feature representations for both large and irregularly shaped organs.

The following figure (Figure 4 from the original paper) shows the visual comparison of different methods on Synapse datasets, further supporting the quantitative results:

Figure 4 (qualitative comparison): the first row shows the ground truth and the segmentation results of each algorithm, including CTC-Net, Att-Unet, U-Net, and TransUNet. Colors denote organs: blue for the aorta, green for the gallbladder, red for the left kidney, cyan for the right kidney, pink for the liver, yellow for the pancreas, white for the spleen, and gray for the stomach.
Fig. 4. The visualized comparison of different methods on Synapse datasets.
6.1.2. Results on ACDC Dataset
The following are the results from Table 4 of the original paper:
| METHODS | Average | RV | MYO | LV |
|---|---|---|---|---|
| R50 U-Net [3] | 87.55 | 87.10 | 80.63 | 94.92 |
| R50 Att-Unet [34] | 86.75 | 87.58 | 79.20 | 93.47 |
| VIT CUP [9] | 81.45 | 81.46 | 70.71 | 92.18 |
| R50 VIT CUP [9] | 87.57 | 86.07 | 81.88 | 94.75 |
| TransUNet [40] | 89.71 | 88.86 | 84.54 | 95.73 |
| SwinUNet[6] | 90.00 | 88.55 | 85.62 | 95.83 |
| CTC-Net(Ours) | 90.77 | 90.09 | 85.52 | 96.72 |
- Generalization and Robustness: On the ACDC dataset, which uses the MRI modality for cardiac segmentation, CTC-Net again achieves the highest average DSC of 90.77%. This demonstrates the model's strong generalization ability and robustness across different image modalities and body parts.
- Cardiac Structure Performance:
  - RV (Right Ventricle) and LV (Left Ventricle): CTC-Net obtains the highest DSC for both RV (90.09%) and LV (96.72%).
  - MYO (Myocardium): CTC-Net achieves the second highest DSC of 85.52%, very close to the best performance (SwinUNet at 85.62%).
- These results further validate that CTC-Net outperforms state-of-the-art methods in medical image segmentation, consistently achieving high accuracy for different applications.
6.2. Ablation Studies / Parameter Analysis
Ablation studies were conducted on the Synapse dataset to validate the rationality of CTC-Net and the effectiveness of its individual modules.
6.2.1. Evaluation of FCM
The following are the results from Table 5 of the original paper:
| Variants | Average | Aorta | Gallbladder | Kidney(L) | Kidney(R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| concat+conv | 75.52 | 85.58 | 60.46 | 78.86 | 73.88 | 93.23 | 51.24 | 86.68 | 74.25 |
| cross attention | 72.65 | 83.56 | 55.10 | 81.67 | 68.66 | 92.22 | 44.09 | 87.17 | 68.76 |
| Dual CAB | 72.70 | 82.78 | 53.78 | 76.86 | 69.08 | 91.79 | 51.68 | 85.15 | 70.48 |
| without CAB | 76.87 | 85.36 | 62.60 | 79.87 | 77.66 | 93.19 | 54.96 | 88.59 | 72.77 |
| without CFB | 75.83 | 85.91 | 61.11 | 85.76 | 79.57 | 93.51 | 48.17 | 86.67 | 65.99 |
| without CEB | 75.13 | 83.46 | 60.38 | 82.40 | 73.27 | 92.61 | 53.49 | 85.62 | 69.84 |
| CTC-Net (ours) | 78.41 | 86.46 | 63.53 | 83.71 | 80.79 | 93.78 | 59.73 | 86.87 | 72.39 |
- concat+conv (Average DSC: 75.52%): This variant uses a simple concatenation of CNN and Transformer features followed by a convolution. Its significantly lower performance compared to the full CTC-Net (78.41%) highlights the importance of the sophisticated fusion mechanisms within the FCM.
- cross attention (Average DSC: 72.65%): Replacing the FCM with a Transformer decoder performing cross-attention (query from one encoder, key/value from the other, and vice versa) results in a much lower DSC. This indicates that the FCM's specific multi-faceted design (CFB, CEB, CAB) for blending complementary features is more effective than a generic cross-attention approach.
- Dual CAB (Average DSC: 72.70%): Applying channel attention to both the CNN and Transformer branches, surprisingly, yields worse results than CTC-Net. This suggests that channel attention might be redundant or even detrimental when applied indiscriminately, or that the specific mixture of attention types in CTC-Net (channel attention only on the Transformer self-attention features) is optimized.
- without CAB (Average DSC: 76.87%): Removing the Channel Attention Block from the FCM leads to a drop of 1.54% in average DSC. This confirms that the CAB plays a crucial role in improving feature robustness by emphasizing important channel information, particularly for the Transformer features.
- without CFB (Average DSC: 75.83%): Deleting the Cross-domain Fusion Block results in a significant drop of 2.58% in average DSC. This underscores the CFB's importance in effectively blending features from the two distinct domains in a cross-wise manner, which is critical for leveraging their complementarity.
- without CEB (Average DSC: 75.13%): Removing the Correlation Enhancement Block leads to a 3.28% drop in average DSC. This demonstrates that explicitly modeling the cross-domain correlation between CNN and Transformer features is vital for enhancing mutually salient information and improving accuracy.

Overall, these ablation studies strongly confirm that the Feature Complementary Module (FCM) and its individual components (CFB, CEB, CAB) are indispensable for the high performance of CTC-Net. Each block contributes uniquely to fusing, enhancing, and refining the complementary features, leading to a robust representation.
6.2.2. Evaluation of Encoders
The following are the results from Table 6 of the original paper:
| Variants | Average | Aorta | Gallbladder | Kidney(L) | Kidney(R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| CTC-Net without CNNs | 76.38 | 83.54 | 63.93 | 80.73 | 76.98 | 93.27 | 55.71 | 84.54 | 72.32 |
| CTC-Net (ours) | 78.41 | 86.46 | 63.53 | 83.71 | 80.79 | 93.78 | 59.73 | 86.87 | 72.39 |
- CTC-Net without CNNs (Average DSC: 76.38%): This variant effectively transforms CTC-Net into a pure Transformer architecture (encoder and decoder composed of Swin Transformer Blocks). The significant drop in average DSC (2.03%) compared to the full CTC-Net (78.41%) demonstrates the critical importance of the CNN encoder. Even though the Transformer encoder is the "major branch" for long-range dependencies, the CNN encoder is crucial for providing complementary contextual features and spatial details, which the pure Transformer architecture struggles to capture as effectively. The results on specific organs further show that the full CTC-Net generally performs better across categories (6 out of 8).
6.2.3. Evaluation of Decoders
The following are the results from Table 7 of the original paper:
| Variants | Average | Aorta | Gallbladder | Kidney(L) | Kidney(R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| CTC-Net with two decoders | 69.68 | 73.81 | 56.85 | 73.71 | 66.74 | 89.55 | 47.44 | 83.02 | 66.35 |
| CTC-Net with cross attention | 76.73 | 85.46 | 60.36 | 83.91 | 77.41 | 93.23 | 52.96 | 86.35 | 74.39 |
| CTC-Net | 78.41 | 86.46 | 63.53 | 83.71 | 80.79 | 93.78 | 59.73 | 86.87 | 72.39 |
- CTC-Net with two decoders (Average DSC: 69.68%): This variant adds a traditional CNN decoder parallel to the Transformer decoder. Surprisingly, this symmetric design performs significantly worse than the asymmetric CTC-Net with only one Transformer decoder. The authors attribute this to two reasons:
  - Increased network parameters leading to potential overfitting.
  - Lack of adequate information interchange between the two independently recovering decoders. This highlights that simply adding more components without careful fusion can be detrimental.
- CTC-Net with cross attention (Average DSC: 76.73%): This variant replaces the Swin Transformer Blocks in the decoder's skip connection fusion with a cross-attention mechanism (query from the skip connection, key/value from the up-sampled features). While performing better than the "two decoders" variant, it still falls short of CTC-Net. This indicates that CTC-Net's approach of using Swin Transformer Blocks for fusion in the decoder, combined with the enriched features from the FCM, is more effective than a generic cross-attention mechanism in that specific architectural position. The Swin Transformer Blocks in the decoder likely offer a better balance of local and global context integration for up-sampling.

These decoder ablation studies validate the efficiency and effectiveness of CTC-Net's asymmetric design, emphasizing that the quality of feature fusion in the FCM and the power of a single, well-integrated Transformer decoder are more crucial than simply adding more decoding paths.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces CTC-Net (CNN and Transformer Complementary Network), a novel deep learning architecture for medical image segmentation. The core idea is to harness the complementary strengths of CNNs for local detail extraction and Transformers for long-range dependency modeling. CTC-Net achieves this through:
- Dual Encoders: Parallel ResNet34 (CNN) and Swin Transformer encoders produce distinct yet complementary features.
- Feature Complementary Module (FCM): This innovative module, comprising the Cross-domain Fusion Block (CFB), Correlation Enhancement Block (CEB), and Channel Attention Block (CAB), is designed for sophisticated, cross-wise fusion, correlation, and dual attention on these features.
- Swin Transformer Decoder: A powerful Swin Transformer decoder, augmented with multi-level skip connections from the FCM, effectively recovers spatial details and long-range information to generate the final segmentation mask.

Experimental results on the Synapse (multi-organ CT segmentation) and ACDC (cardiac MRI segmentation) datasets demonstrate CTC-Net's superior performance, consistently outperforming state-of-the-art CNN-based, Transformer-based, and existing CNN-Transformer hybrid models. Notably, CTC-Net shows significant improvements in Hausdorff Distance, indicating more accurate boundary delineation.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation of their method:
- Boundary Detail Extraction: Despite achieving pleasing segmentation results, CTC-Net's limitation lies in the extraction of boundary details. The main reason speculated is that both the CNN and Transformer encoders start recovering feature maps from down-sampled inputs, which inherently results in some loss of detailed spatial information at the outset. Even with skip connections and sophisticated fusion, this initial information loss can impact the precision of fine boundaries.

Based on this limitation, the authors propose future work:

- Novel networks without downsampling: They plan to explore new network architectures that avoid aggressive down-sampling of feature maps, aiming to maintain high resolutions and abundant spatial details throughout the network to improve boundary accuracy.
7.3. Personal Insights & Critique
This paper provides a rigorous and effective approach to combining the strengths of CNNs and Transformers for medical image segmentation. The explicit focus on complementary features and the detailed design of the Feature Complementary Module (FCM) are highly insightful. Instead of merely stacking CNN and Transformer blocks, the FCM's multi-component design to perform cross-domain fusion, correlation, and dual attention represents a thoughtful strategy for true feature synergy.
One particularly interesting finding is the success of the asymmetric decoder design. The ablation study showing that "CTC-Net with two decoders" performs worse than the single Transformer decoder CTC-Net challenges the intuitive notion of symmetric encoder-decoder architectures often seen in U-Net variants. This suggests that a single, powerful Transformer decoder, effectively fed with pre-fused complementary features from the FCM via skip connections, is more efficient and robust than trying to recover features independently through separate CNN and Transformer decoding paths. This could be due to the Transformer decoder's inherent ability to handle global context, which might be more critical for up-sampling effectively once robustly fused features are provided.
The paper's identified limitation regarding boundary details, stemming from initial downsampling, is a common challenge in many deep learning segmentation models. While Swin Transformer uses patch embedding (which itself is a form of downsampling), and CNNs typically start with convolutions and pooling, the initial reduction to 1/4 of the input resolution means a significant portion of fine-grained information is already lost. Future work exploring networks that process images at higher resolutions throughout, or employing more sophisticated upsampling methods earlier in the pipeline, would indeed be valuable. This could involve techniques like progressive resizing, super-resolution modules, or attention mechanisms specifically tuned for boundary refinement.
The methods and conclusions from this paper could potentially be transferred or applied to other domains requiring precise object delineation where both local details and global context are important, such as:
- Remote Sensing Image Segmentation: Identifying buildings, roads, or land cover in satellite imagery where objects can be small or span large areas.
- Industrial Defect Detection: Locating tiny defects on large surfaces, requiring both fine local analysis and awareness of overall patterns.
- Autonomous Driving: Segmenting complex urban scenes with diverse objects, requiring recognition of both individual vehicles/pedestrians and the overall scene layout.
A potential area for improvement or further investigation could be a more detailed analysis of the FCM's computational overhead and latency, especially given its multiple sub-blocks and Swin Transformer Blocks within the fusion path. While performance is excellent, practical deployment often requires consideration of inference speed. Additionally, exploring how the weighting factor $\alpha$ in the loss function impacts segmentation quality, particularly for boundary-sensitive tasks, could provide further insights into optimizing specific aspects of medical image segmentation.