
An effective CNN and Transformer complementary network for medical image segmentation

Published:11/30/2022
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The CTC-Net, a complementary network for medical image segmentation, combines the CNN's local features with the Transformer's long-range dependencies. It uses dual encoders and cross-domain fusion to enhance feature representation, outperforming existing models on organ and cardiac segmentation tasks.

Abstract

The Transformer network was originally proposed for natural language processing. Due to its powerful representation ability for long-range dependency, it has been extended for vision tasks in recent years. To fully utilize the advantages of Transformers and Convolutional Neural Networks (CNNs), we propose a CNN and Transformer Complementary Network (CTC-Net) for medical image segmentation. We first design two encoders by Swin Transformers and Residual CNNs to produce complementary features in Transformer and CNN domains, respectively. Then we cross-wisely concatenate these complementary features to propose a Cross-domain Fusion Block (CFB) for effectively blending them. In addition, we compute the correlation between features from the CNN and Transformer domains, and apply channel attention to the self-attention features by Transformers for capturing dual attention information. We incorporate cross-domain fusion, feature correlation and dual attention together to propose a Feature Complementary Module (FCM) for improving the representation ability of features. Finally, we design a Swin Transformer decoder to further improve the representation ability of long-range dependencies, and propose to use skip connections between the Transformer decoded features and the complementary features for extracting spatial details, contextual semantics and long-range information. Skip connections are performed in different levels for enhancing multi-scale invariance. Experimental results show that our CTC-Net significantly surpasses the state-of-the-art image segmentation models based on CNNs, Transformers, and even Transformer and CNN combined models designed for medical image segmentation. It achieves superior performance on different medical applications, including multi-organ segmentation and cardiac segmentation.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is the development of An effective CNN and Transformer complementary network for medical image segmentation.

1.2. Authors

The authors are:

  • Feiniu Yuan (Hefei University of Technology, University of Science and Technology of China, Singapore Bioimaging Consortium, Shanghai Normal University)

  • Zhengxiao Zhang (Shanghai Normal University)

  • Zhijun Fang (Donghua University)

    Their research backgrounds primarily include deep learning, image segmentation, pattern recognition, and 3D modeling, with a focus on medical image processing.

1.3. Journal/Conference

The paper was published in a journal. The specific journal name is not explicitly stated in the provided text, but it is indicated to be published by Elsevier Ltd. The quality and influence of Elsevier journals are generally high in academic research, suggesting a peer-reviewed publication.

1.4. Publication Year

The paper was published on November 30, 2022.

1.5. Abstract

The paper proposes a CNN and Transformer Complementary Network (CTC-Net) for medical image segmentation, aiming to leverage the strengths of both Convolutional Neural Networks (CNNs) for local contextual information and Transformers for long-range dependencies. The CTC-Net features two encoders: one based on Swin Transformers for Transformer domain features and another on Residual CNNs for CNN domain features, ensuring complementary feature production. A Cross-domain Fusion Block (CFB) is designed to blend these features effectively, and a Feature Complementary Module (FCM) incorporates cross-domain correlation and dual attention to enhance feature representation. Finally, a Swin Transformer decoder with multi-level skip connections is used to improve long-range dependency modeling and extract spatial details and contextual semantics. Experimental results demonstrate that CTC-Net significantly outperforms state-of-the-art CNN-based, Transformer-based, and CNN-Transformer combined models across various medical applications, including multi-organ and cardiac segmentation.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is accurate medical image segmentation, which is crucial for computer-aided clinical diagnosis and treatment planning. Medical images, often reflecting internal body structures, require precise pixel-level classification to locate specific organs or lesions.

Existing research faces several challenges:

  • CNNs' limitation in long-range dependency: While Convolutional Neural Networks (CNNs) excel at extracting local contextual information due to their strong inductive biases (locality and translation invariance), their receptive fields are inherently limited. This makes modeling long-range dependencies across an image difficult, which is crucial for segmenting large or irregularly shaped organs.

  • Transformers' limitation in local details: Transformers, initially developed for natural language processing, utilize self-attention mechanisms to effectively capture global and long-range dependencies. However, they lack the strong inductive biases of CNNs for locality and translation invariance, making them less effective at extracting fine-grained spatial details and local contextual features.

  • Gaps in prior hybrid models: While some prior works attempted to combine CNNs and Transformers, they often failed to fully exploit the advantages of both, such as neglecting to introduce Transformers at multiple feature scales or lacking robust cross-domain feature fusion mechanisms.

    The paper's entry point or innovative idea is the belief that Transformers and CNNs are naturally complementary. By combining their strengths, it's possible to overcome their individual weaknesses and create a more robust model for medical image segmentation. This is addressed by designing a network that explicitly produces and fuses complementary features from both domains.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Dual Encoding Paths: Designing two distinct encoders—a Residual CNN encoder (ResNet34) for spatial and contextual features and a Swin Transformer encoder for long-range dependencies—to produce mutually complementary features.

  2. Effective Feature Complementary Module (FCM): Proposing an FCM that cross-wisely fuses features from CNN and Transformer domains. This module incorporates a Cross-domain Fusion Block (CFB) for blending, a Correlation Enhancement Block (CEB) for modeling cross-domain correlation, and a Channel Attention Block (CAB) for dual attention on Transformer features.

  3. Multi-scale Transformer Decoder with Skip Connections: Introducing a Swin Transformer decoder to further improve the representation of long-range dependencies, enhanced by multi-scale skip connections with the complementary features from the FCM to restore spatial details and contextual semantics.

  4. Novel Network Architecture (CTC-Net): Integrating these components into a novel CNN and Transformer Complementary Network (CTC-Net) specifically designed for medical image segmentation.

    The key conclusions and findings are:

  • CTC-Net significantly surpasses state-of-the-art image segmentation models, including pure CNN-based, pure Transformer-based, and existing CNN-Transformer combined models.

  • It achieves superior performance on diverse medical applications, such as multi-organ segmentation (Synapse dataset) and cardiac segmentation (ACDC dataset).

  • The FCM and the dual encoder architecture are critical for the network's high performance, as validated by ablation studies. The explicit fusion of complementary features (local details from CNNs and global context from Transformers) is shown to be highly effective.

    These findings solve the problem of achieving accurate segmentation by robustly capturing both local and global image information, leading to better delineation of various organs, including those with complex shapes or small sizes (e.g., pancreas).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following foundational concepts:

  • Medical Image Segmentation: This is the process of partitioning a medical image into multiple segments (sets of pixels) to identify and delineate specific structures or organs (e.g., liver, heart, tumor). It's a pixel-level classification task where each pixel is assigned to a specific category. Accurate segmentation is vital for diagnosis, surgical planning, and treatment assessment.

  • Convolutional Neural Networks (CNNs): A class of deep neural networks primarily used for analyzing visual imagery.

    • Convolutional Layer: The core building block of a CNN. It applies a learnable kernel (or filter) that slides over the input image (or feature map), performing dot products to extract local features. This process leverages spatial locality and parameter sharing.
    • Inductive Biases: CNNs have strong inductive biases:
      • Locality: Features are extracted from local regions of the input.
      • Translation Invariance: The network can recognize a pattern regardless of where it appears in the image, due to shared kernels.
    • Pooling Layer (Down-sampling): Reduces the spatial dimensions (width and height) of the feature maps, which helps in reducing computational complexity, controlling overfitting, and making the network more robust to small variations in feature positions. Common types include max pooling and average pooling.
    • Receptive Field: The region in the input image that a particular neuron in a CNN layer "sees" or is influenced by. In deeper layers, neurons have larger receptive fields, allowing them to capture more abstract and global information, but it's still spatially limited.
    • U-Net Architecture: A widely used CNN architecture for biomedical image segmentation, characterized by its U-shaped structure. It consists of a contracting path (encoder) to capture context and a symmetric expanding path (decoder) that enables precise localization. Skip connections directly pass information from the encoder to the decoder at corresponding levels, helping to recover spatial information lost during down-sampling.
  • Transformers (in Vision): Originally for natural language processing, Transformers have been adapted for computer vision.

    • Self-Attention Mechanism: The core component of a Transformer. It allows the model to weigh the importance of different parts of the input sequence (or image patches) when processing a particular element. For an input sequence of vectors (or tokens), self-attention computes three matrices: Query ($Q$), Key ($K$), and Value ($V$): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ (a minimal sketch is given after this list). Where:
      • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
      • $QK^T$ calculates the attention scores (or compatibility scores) between each query and all keys.
      • $\sqrt{d_k}$ is a scaling factor (the square root of the key dimension) used to prevent the dot products from becoming too large, which could push the softmax function into regions with very small gradients.
      • $\mathrm{softmax}$ normalizes the scores to produce attention weights, indicating how much each input element contributes to the output.
      • The attention weights are then multiplied by the Value matrix $V$ to obtain the weighted sum, which forms the output of the self-attention layer.
    • Long-range Dependencies: Transformers excel at capturing these because the self-attention mechanism computes relationships between all pairs of input elements, regardless of their spatial distance. This contrasts with CNNs where interactions are primarily local.
    • Vision Transformer (ViT): A seminal work that first applied the standard Transformer directly to image classification. It divides an image into fixed-size patches, treats each patch as a token, and feeds these tokens into a standard Transformer encoder.
    • Swin Transformer: A hierarchical Vision Transformer that introduces shifted windows to achieve greater efficiency and better capture local features, mimicking some inductive biases of CNNs. It performs self-attention within local, non-overlapping windows, and then shifts these windows in successive layers to allow for cross-window connections, thereby building global context. It also uses patch merging for hierarchical feature representation.
  • Channel Attention: A mechanism, often used in CNNs, that allows the network to selectively focus on relevant feature channels. It typically involves aggregating spatial information (e.g., via Global Average Pooling and Global Max Pooling), passing it through a small Multi-Layer Perceptron (MLP), and then using a sigmoid activation to generate channel-wise weights that are multiplied back with the original feature map. This emphasizes important channels and suppresses less relevant ones.

  • Skip Connections: Direct connections that bypass one or more layers in a neural network. They help in mitigating the vanishing gradient problem, enabling deeper networks, and preserving fine-grained information that might otherwise be lost through successive transformations. In U-Net, they connect encoder features to decoder features.

  • Global Average Pooling (GAP): A pooling operation that calculates the average of each feature map across all its spatial dimensions, resulting in a single value per feature map. This effectively summarizes the global context of each channel.
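
To make the attention formula above concrete, here is a minimal PyTorch sketch of single-head scaled dot-product self-attention. It illustrates the equation only; it is not the paper's implementation, which uses windowed multi-head attention inside Swin Transformer Blocks.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal single-head attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (B, N, N) compatibility scores
    weights = F.softmax(scores, dim=-1)             # attention weights per query
    return weights @ v                              # weighted sum of the values

# Toy usage: a batch of 2 sequences, 16 tokens, 48-dimensional embeddings.
x = torch.randn(2, 16, 48)
out = scaled_dot_product_attention(x, x, x)         # self-attention: Q = K = V = x
print(out.shape)  # torch.Size([2, 16, 48])
```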

3.2. Previous Works

The paper contextualizes its contribution by discussing prior work in three categories:

  • CNN based methods:

    • Fully Convolutional Network (FCN) [2]: Pioneering work that adapted CNNs for semantic segmentation, enabling end-to-end pixel-wise classification. It replaced fully connected layers with convolutional layers to allow arbitrary input sizes and produced dense predictions.
    • U-Net [3]: A highly influential CNN architecture for biomedical image segmentation, characterized by its U-shaped encoder-decoder structure and extensive skip connections. It balances capturing context (encoder) and precise localization (decoder).
    • ResNet [17], DenseNet [19], HRNet [20]: Deeper and more efficient CNN backbones that improved feature extraction, often used within U-Net like structures (e.g., Res-UNet [22]).
    • Attention U-Net [25]: Introduced attention gates into U-Net to focus on relevant regions in the input, improving performance by emphasizing salient features.
  • Transformer based methods:

    • Transformer [29]: The original Transformer model, initially for machine translation, revolutionized NLP with its self-attention mechanism. It achieved excellent performance without convolutions.
    • Vision Transformer (ViT) [9]: The first successful application of standard Transformers to vision tasks. It treated image patches as tokens, demonstrating Transformers' potential as backbones for vision problems. ViT None and ViT CUP (referring to ViT with CNN Up-sampling and Patch embedding from TransUnet) are variations mentioned as baselines.
    • Swin Transformer [11]: Addressed ViT's quadratic computational complexity by introducing shifted windows for local attention and patch merging for hierarchical feature representation, making Transformers more efficient and adaptable for dense prediction tasks like segmentation.
  • CNN and Transformer combined methods:

    • TransUnet [40]: One of the first to combine CNNs and Transformers for medical image segmentation. It used a CNN encoder, a Transformer block for modeling long-range dependencies, and a CNN decoder. However, the paper notes its limitation in fully exploiting Transformers across all feature scales.
    • TransFuse [41]: Proposed a dual-branch architecture with CNN and Transformer encoders, followed by CNN and Transformer decoders, and a feature fusion module. The paper criticizes it for not fully utilizing Transformers for decoding and fusing low-level spatial details.
    • Other combinations: Methods that insert self-attention blocks into CNN models [34,35] or use CNNs for initial feature extraction before Transformers [36].

3.3. Technological Evolution

The evolution of semantic segmentation models for medical images has generally progressed through these stages:

  1. Early CNNs: Beginning with FCNs, CNNs demonstrated the power of end-to-end learning for pixel-level prediction.
  2. U-Net Era: The U-Net architecture became the de-facto standard for medical image segmentation due to its effectiveness with limited data, skip connections for detail preservation, and symmetric encoder-decoder design. Many variants (e.g., Res-UNet, Attention U-Net) emerged, focusing on deeper architectures, attention mechanisms, or improved skip connections.
  3. Transformer's Entry into Vision: Inspired by Transformers' success in NLP, ViT showed their potential for image recognition by treating images as sequences of patches.
  4. Efficient Transformers for Dense Prediction: Swin Transformer and similar hierarchical Transformers made Transformer architectures more suitable for dense prediction tasks by addressing computational complexity and incorporating hierarchical features.
  5. Hybrid CNN-Transformer Models: Recognizing the complementary strengths and weaknesses, the current trend is to combine CNNs (for local details and inductive biases) and Transformers (for global context and long-range dependencies) into hybrid architectures (e.g., TransUnet, TransFuse). This paper fits into this evolution by proposing a more sophisticated and explicitly complementary fusion strategy.

3.4. Differentiation Analysis

Compared to the main methods in related work, especially TransUnet and TransFuse (which are also hybrid models), CTC-Net introduces several core differences and innovations:

  • Explicit Complementary Dual Encoders: Unlike TransUnet which uses a CNN encoder followed by a Transformer block, or TransFuse which has parallel CNN and Transformer encoders, CTC-Net explicitly emphasizes that its ResNet34 (CNN) encoder and Swin Transformer encoder are designed to produce mutually complementary features—spatial/contextual details from CNN and long-range dependencies from Transformer.

  • Sophisticated Feature Complementary Module (FCM): This is the key differentiator. CTC-Net goes beyond simple concatenation or basic cross-attention (as in TransUnet or some TransFuse parts) by proposing a multi-faceted FCM that includes:

    • Cross-domain Fusion Block (CFB): This block performs a "cross-wise" concatenation and processing with Swin Transformer Blocks, allowing each domain's features to be enhanced by the global context of the other.
    • Correlation Enhancement Block (CEB): Explicitly models cross-domain correlation using point-wise multiplication, acting as a specialized attention mechanism to highlight mutually salient features.
    • Channel Attention Block (CAB): Applies channel attention specifically to Transformer features, creating a dual attention (channel and self-attention) to improve feature robustness.
  • Pure Swin Transformer Decoder with Multi-scale Skip Connections: CTC-Net utilizes a Swin Transformer decoder, ensuring that long-range dependency modeling is carried through the decoding process, which TransUnet (using a CNN decoder) and TransFuse (using a CNN and a Transformer decoder, but with less emphasis on multi-scale fusing of low-level details by Transformer path) do not fully achieve. The multi-scale skip connections from the FCM (which itself has blended features) to the Transformer decoder are specifically designed to restore both spatial details and long-range information effectively.

  • Asymmetric Decoder Design: The ablation study shows that an asymmetric design with one Swin Transformer decoder outperforms a symmetric variant with two decoders (one CNN and one Transformer), suggesting that effective fusion of complementary features in the encoder path, coupled with a powerful Transformer decoder, is more crucial than independent decoding paths.

    In essence, CTC-Net's innovation lies in its highly integrated and explicitly designed feature complementarity, fusion, and attention mechanisms within a hybrid CNN-Transformer architecture, which enables a more robust capture of both local and global information.

4. Methodology

4.1. Principles

The core idea behind the CNN and Transformer Complementary Network (CTC-Net) is based on the principle that Convolutional Neural Networks (CNNs) and Transformers possess complementary strengths and weaknesses, especially for image segmentation.

  • CNNs are inherently good at extracting local features, contextual information, and maintaining spatial details due to their strong inductive biases of locality and translation invariance. However, their limited receptive fields make them less effective at modeling long-range dependencies.

  • Transformers, leveraging self-attention mechanisms, excel at capturing global context and long-range dependencies by establishing relationships between distant parts of an image. Conversely, they typically lack the strong inductive biases of CNNs for local feature extraction and spatial detail preservation.

    The theoretical basis is that by combining these two paradigms, CTC-Net can simultaneously benefit from the local precision of CNNs and the global understanding of Transformers. The intuition is that features extracted by a CNN encoder and a Transformer encoder will be different yet mutually enriching. Therefore, an effective mechanism to fuse these cross-domain features will lead to a more powerful and robust feature representation, ultimately improving segmentation accuracy, especially for complex medical images where both fine details and overall shape/context are critical. The proposed Feature Complementary Module (FCM) and the Swin Transformer decoder are designed to achieve this synergistic integration.

4.2. Core Methodology In-depth (Layer by Layer)

The CTC-Net architecture is composed of four main branches: a CNN encoder, a Transformer encoder, a Feature Complementary Module (FCM), and a Transformer decoder.

The following figure (Figure 2 from the original paper) shows the overall architecture of the CTC-Net:

Fig. 2 (from the original paper). Overall architecture of CTC-Net: (a) CNN encoder, (b) Feature Complementary Module, (c) Transformer encoder, (d) Transformer decoder, (e) Swin Transformer Block (STB). The figure shows the input layer, multiple Swin Transformer blocks, the CNN encoder, and per-level feature fusion and propagation through skip connections to the final output layer, illustrating how the network integrates the strengths of CNNs and Transformers for medical image segmentation.

4.2.1. Overall Architecture Flow

  1. An input RGB image of size $H \times W \times 3$ is fed into both the CNN encoder and the Transformer encoder in parallel.
  2. Both encoders produce multi-level feature maps. The CNN encoder (Figure 2a) generates features primarily focused on spatial details and contextual semantics. The Transformer encoder (Figure 2c) generates features primarily focused on long-range dependencies.
  3. At corresponding levels, the feature maps from the CNN encoder and the Transformer encoder are fed into the Feature Complementary Module (FCM) (Figure 2b). The FCM processes and fuses these cross-domain features, producing enhanced complementary feature maps.
  4. The Transformer decoder (Figure 2d) takes the final feature map from the Transformer encoder (at the deepest level) as its initial input.
  5. During the up-sampling process in the Transformer decoder, multi-level skip connections are established. Specifically, the enhanced complementary features from the FCM at each level are fed into the corresponding Swin Transformer Blocks within the decoder, where they are fused with the up-sampled features from the deeper layers of the decoder. This fusion helps in restoring lost spatial details, contextual semantics, and long-range information.
  6. Finally, the Transformer decoder up-samples the features to the original input image size and generates the segmentation mask of size $H \times W \times N$, where $N$ is the number of categories. A structural sketch of this flow is given below.
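
The skeleton below restates steps 1-6 as code-shaped pseudocode. All callables (cnn_encoder, transformer_encoder, fcms, transformer_decoder) and the dummy tensor shapes are placeholders wired together only to show the data flow; this is not the authors' implementation.

```python
import torch

def ctc_net_forward(x, cnn_encoder, transformer_encoder, fcms, transformer_decoder):
    """Data flow of CTC-Net following steps 1-6 above; every callable is a placeholder."""
    f1, f2, f3 = cnn_encoder(x)                   # step 2: CNN features (spatial details, context)
    g1, g2, g3, g4 = transformer_encoder(x)       # step 2: Transformer features (long-range deps)
    m1 = fcms[0](g1, f1)                          # step 3: per-level complementary fusion (FCM)
    m2 = fcms[1](g2, f2)
    m3 = fcms[2](g3, f3)
    return transformer_decoder(g4, [m1, m2, m3])  # steps 4-6: decode with multi-level skips

# Toy wiring with stand-in callables and dummy tensor shapes, just to show the plumbing.
shapes = lambda *ss: [torch.zeros(s) for s in ss]
cnn = lambda x: shapes((1, 48, 56, 56), (1, 96, 28, 28), (1, 192, 14, 14))
trans = lambda x: shapes((1, 48, 56, 56), (1, 96, 28, 28), (1, 192, 14, 14), (1, 384, 7, 7))
fcm = lambda g, f: g                                 # placeholder FCM
dec = lambda g4, skips: torch.zeros(1, 9, 224, 224)  # assumes N = 9 classes (8 organs + background)
out = ctc_net_forward(torch.zeros(1, 3, 224, 224), cnn, trans, [fcm, fcm, fcm], dec)
print(out.shape)  # torch.Size([1, 9, 224, 224])
```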

4.2.2. The Transformer Encoder

The Transformer encoder (Figure 2c) is constructed by stacking Swin Transformer Blocks (STB) (Figure 2e) and patch merging operations, following the Swin Transformer architecture [11]. The goal is to capture long-range dependencies efficiently.

  • Swin Transformer Block (STB): Each STB (Figure 2e) consists of two successive sub-blocks:

    1. The first sub-block contains Layer Normalization (LN), Window based Multi-head Self Attention (W-MSA), Multi-Layer Perceptron (MLP), and residual additions.
    2. The second sub-block is similar but replaces W-MSA with Shifted Window based MSA (SW-MSA).
    • W-MSA computes self-attention within non-overlapping local windows, reducing computational complexity from quadratic to linear with respect to image size.
    • SW-MSA shifts the windows between successive blocks, allowing information to flow across different windows and thereby building global interaction while maintaining efficiency.
  • Patch Merging: This operation down-samples feature maps hierarchically. It merges adjacent $2 \times 2$ patches into a single larger patch by concatenating their features along the channel dimension. For example, if a feature map has a resolution of $h \times w$ and $C$ channels, after patch merging its resolution becomes $(h/2) \times (w/2)$ and its channel dimension becomes 4C. This is analogous to pooling in CNNs for aggregating contextual features and creating multi-scale representations (see the sketch at the end of this subsection).

  • Four Levels of Transformer Encoder:

    • Level 1: Starts with a patch embedding layer (which converts $4 \times 4$ image patches into tokens of feature dimension $C$) followed by two Swin Transformer Blocks.

    • Levels 2, 3, 4: Each level begins with a patch merging operation to down-sample the feature map, followed by two Swin Transformer Blocks to extract long-range dependencies at that scale.

      Let the input RGB image $\pmb{x}$ have a size of $H \times W \times 3$. The outputs of the Transformer encoder at the four levels are denoted by $\mathbf{g}_1, \mathbf{g}_2, \mathbf{g}_3$, and $\mathbf{g}_4$. Their sizes are:

  • $\mathbf{g}_1$: $(H/4 \times W/4) \times C$

  • $\mathbf{g}_2$: $(H/8 \times W/8) \times 2C$

  • $\mathbf{g}_3$: $(H/16 \times W/16) \times 4C$

  • $\mathbf{g}_4$: $(H/32 \times W/32) \times 8C$

    The feature dimension $C$ for each token is $4 \times 4 \times 3 = 48$, assuming a $4 \times 4$ patch from the RGB image forms a token.
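
To illustrate the patch merging operation and the per-level token shapes listed above, here is a minimal PyTorch sketch of Swin-style patch merging. The linear reduction from 4C to 2C channels follows the standard Swin Transformer design (consistent with the C, 2C, 4C, 8C progression above); the concrete tensor sizes in the usage example assume a 224 × 224 input and C = 48 as stated in the text.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging: 2x2 neighbors are concatenated (C -> 4C),
    then linearly projected to 2C, halving the spatial resolution."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, h, w):
        # x: (B, h*w, C) token sequence
        B, L, C = x.shape
        x = x.view(B, h, w, C)
        # gather the four pixels of every 2x2 neighborhood along the channel axis
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, h/2, w/2, 4C)
        x = x.view(B, (h // 2) * (w // 2), 4 * C)
        return self.reduction(self.norm(x))                          # (B, h*w/4, 2C)

# Level-1 tokens for a 224x224 input: (H/4 * W/4) x C with C = 48 (values from the text above).
tokens = torch.randn(1, 56 * 56, 48)
merged = PatchMerging(48)(tokens, 56, 56)
print(merged.shape)  # torch.Size([1, 784, 96])  ->  (H/8 * W/8) x 2C
```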

4.2.3. The CNN Encoder

The CNN encoder (Figure 2a) is built using four encoding blocks of ResNet34 [17] to extract contextual features and maintain spatial details. Each ResNet34 block performs a down-sampling operation by a rate of 2.

  • To ensure consistency with the Transformer encoder's feature map sizes, Conv1x and Conv2x blocks of ResNet34 are used to down-sample features twice in Level 1.

  • The channel dimension $C$ for Level 1 is set to 48, matching the Transformer encoder.

    The CNN encoder produces three feature maps: $\mathbf{f}_1, \mathbf{f}_2$, and $\mathbf{f}_3$.

  • Level 1: Conv1x and Conv2x process the input to generate $\mathbf{f}_1$ with a size of $H/4 \times W/4 \times C$.

  • Level 2: The Conv3x block processes $\mathbf{f}_1$ to generate $\mathbf{f}_2$ with a size of $H/8 \times W/8 \times 2C$.

  • Level 3: The Conv4x block processes $\mathbf{f}_2$ to obtain $\mathbf{f}_3$ with a size of $H/16 \times W/16 \times 4C$.

    These three feature maps contain abundant spatial details and contextual semantics, complementing the long-range dependencies captured by the Transformer encoder.
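
As a rough illustration of the CNN branch described above, the sketch below pulls the Conv1x-Conv4x stages out of a torchvision ResNet34 and returns three feature maps at 1/4, 1/8 and 1/16 resolution. The 1 × 1 projections used to match the channel widths C, 2C, 4C (C = 48) are an assumption for illustration; the paper does not spell out how the ResNet34 channel counts (64, 128, 256) are aligned with the Transformer branch.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class CNNEncoder(nn.Module):
    """Sketch of the ResNet34 encoder branch: three feature maps at 1/4, 1/8 and 1/16
    resolution. The 1x1 channel projections are an assumption, not from the paper."""
    def __init__(self, C=48):
        super().__init__()
        r = resnet34()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)  # Conv1x+Conv2x -> H/4
        self.layer2, self.layer3 = r.layer2, r.layer3                           # Conv3x -> H/8, Conv4x -> H/16
        self.proj1 = nn.Conv2d(64, C, 1)
        self.proj2 = nn.Conv2d(128, 2 * C, 1)
        self.proj3 = nn.Conv2d(256, 4 * C, 1)

    def forward(self, x):
        c1 = self.stem(x)                 # H/4 x W/4
        c2 = self.layer2(c1)              # H/8 x W/8
        c3 = self.layer3(c2)              # H/16 x W/16
        return self.proj1(c1), self.proj2(c2), self.proj3(c3)

f1, f2, f3 = CNNEncoder()(torch.randn(1, 3, 224, 224))
print(f1.shape, f2.shape, f3.shape)
# torch.Size([1, 48, 56, 56]) torch.Size([1, 96, 28, 28]) torch.Size([1, 192, 14, 14])
```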

4.2.4. Feature Complementary Module (FCM)

The Feature Complementary Module (FCM) (Figure 2b and Figure 3) is designed to obtain mutually complementary information by effectively fusing features from the Transformer encoder ($\pmb{g}_i$) and the CNN encoder ($\pmb{f}_i$). It consists of four blocks: Cross-domain Fusion Block (CFB), Correlation Enhancement Block (CEB), Channel Attention Block (CAB), and Feature Fusion Block (FFB).

The following figure (Figure 3 from the original paper) details the Feature Complementary Module (FCM):

Fig. 3 (from the original paper). Structure of the Feature Complementary Module (FCM), comprising the Cross-domain Fusion Block (CFB), the Correlation Enhancement Block (CEB), and the Channel Attention Block (CAB); Global Average Pooling (GAP) and the Hadamard product are used to strengthen feature fusion and representation. In the original figure, one set of colored arrows denotes the output features from the CNN encoder and the other denotes the output features from the Transformer encoder.

Let the 2D Transformer feature map $\pmb{g}_i$ have the size $(h \times w) \times c$, and the 3D CNN feature map $\pmb{f}_i$ have the size $h \times w \times c$.

4.2.4.1. Cross-domain Fusion Block (CFB)

The CFB is responsible for cross-wisely fusing and enhancing features from the Transformer and CNN domains.

  1. It first applies Global Average Pooling (GAP) on both $\pmb{g}_i$ and $\pmb{f}_i$ to generate two feature vectors of size $(1 \times 1) \times c$.

  2. The Transformer input $\pmb{g}_i$ is concatenated with the globally pooled feature vector of the CNN input $\pmb{f}_i$ along the first axis. This creates a larger 2D feature map $g_i^1$ of size $(h \times w + 1) \times c$. This $g_i^1$ is then fed into a Swin Transformer Block (STB) for fusion, producing $g_i^2$, which is reshaped into its 3D version $g_i^3$ of size $h \times w \times c$.

  3. Symmetrically, the CNN input $\pmb{f}_i$ is concatenated with the pooled feature vector of the Transformer input $\pmb{g}_i$ to produce $f_i^1$. This $f_i^1$ is processed by another Swin Transformer Block to generate $f_i^2$, then reshaped into its 3D version $f_i^3$.

  4. Finally, the two cross-domain fused 3D feature maps ($g_i^3$ and $f_i^3$) are concatenated and processed by a $1 \times 1$ convolution to generate the final cross-domain fusion feature map $\pmb{s}_i$ with size $h \times w \times c$.

    The processing for CFB is formulated as follows: $ \begin{array}{rl} & g_i^1 = \mathrm{cat}(\mathrm{GAP}(f_i), g_i), \\ & f_i^1 = \mathrm{cat}(\mathrm{GAP}(g_i), f_i), \\ & g_i^3 = \mathrm{reshape}\big(\mathrm{STB}\big(g_i^1\big)\big), \\ & f_i^3 = \mathrm{reshape}\big(\mathrm{STB}\big(f_i^1\big)\big), \\ & s_i = \mathrm{conv}\big(\mathrm{cat}\big(g_i^3, f_i^3\big)\big), \end{array} $ Where:

  • $\mathrm{GAP}$: Global Average Pooling operation.
  • $\mathrm{cat}$: Concatenation operation.
  • $\mathrm{STB}$: Swin Transformer Block.
  • $\mathrm{reshape}$: Reshaping operation.
  • $\mathrm{conv}$: $1 \times 1$ convolution operation.
  • $f_i$: Input feature map from the CNN encoder at level $i$.
  • $g_i$: Input feature map from the Transformer encoder at level $i$.
  • $g_i^1, f_i^1$: Concatenated feature maps before processing by the STB.
  • $g_i^3, f_i^3$: Reshaped 3D feature maps after STB processing.
  • $s_i$: Final cross-domain fusion feature map from the CFB at level $i$.
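
The following sketch mirrors the five CFB equations above in PyTorch. A plain multi-head attention block stands in for the Swin Transformer Block, and dropping the appended pooled token before reshaping back to h × w is an assumption; module and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class SimpleTokenMixer(nn.Module):
    """Stand-in for a Swin Transformer Block (STB): plain multi-head self-attention + MLP.
    The real CFB uses windowed/shifted-window attention; this is only a placeholder."""
    def __init__(self, dim, heads=3):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y)[0]
        return x + self.mlp(self.norm2(x))

class CrossDomainFusionBlock(nn.Module):
    """Sketch of the CFB: each branch is concatenated with the other branch's globally
    pooled vector, mixed by a token block, reshaped to 3D, and fused by a 1x1 conv."""
    def __init__(self, dim, heads=3):
        super().__init__()
        self.stb_g, self.stb_f = SimpleTokenMixer(dim, heads), SimpleTokenMixer(dim, heads)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, g, f):
        # g: (B, h*w, c) Transformer tokens; f: (B, c, h, w) CNN feature map
        B, c, h, w = f.shape
        gap_f = f.mean(dim=(2, 3)).unsqueeze(1)                     # (B, 1, c) GAP of CNN features
        gap_g = g.mean(dim=1, keepdim=True)                         # (B, 1, c) GAP of Transformer tokens
        f_tok = f.flatten(2).transpose(1, 2)                        # (B, h*w, c)

        g1 = torch.cat([gap_f, g], dim=1)                           # (B, h*w+1, c)
        f1 = torch.cat([gap_g, f_tok], dim=1)
        # assumption: the extra pooled token is dropped before reshaping back to h x w
        g3 = self.stb_g(g1)[:, 1:].transpose(1, 2).reshape(B, c, h, w)
        f3 = self.stb_f(f1)[:, 1:].transpose(1, 2).reshape(B, c, h, w)
        return self.fuse(torch.cat([g3, f3], dim=1))                # s_i: (B, c, h, w)

s1 = CrossDomainFusionBlock(48)(torch.randn(1, 56 * 56, 48), torch.randn(1, 48, 56, 56))
print(s1.shape)  # torch.Size([1, 48, 56, 56])
```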

4.2.4.2. Correlation Enhancement Block (CEB)

The CEB models the cross-domain correlation between features from the Transformer ($\pmb{g}_i$) and CNN ($\pmb{f}_i$) encoders.

  1. The 2D Transformer feature map $\pmb{g}_i$ is reshaped to its 3D version, denoted as $g_i^0$.
  2. Then, $g_i^0$ is point-wisely multiplied by $\pmb{f}_i$ to produce a cross-domain correlation feature map $\pmb{e}_i$ with size $h \times w \times c$. This point-wise multiplication acts as a special attention mechanism, enhancing features that are salient in both domains and suppressing less important ones.

4.2.4.3. Channel Attention Block (CAB)

The CAB further enhances attention features. While the Swin Transformer Block already includes a self-attention mechanism for long-range dependency, the CAB applies a channel attention mechanism [42] (commonly used in CNNs) to the Transformer features. This effectively creates a mixture of channel attention and self-attention, resulting in a dual attention feature map $\pmb{a}_i$ with size $h \times w \times c$.
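
A compact sketch of the CEB product and a CAB-style channel attention is given below. The channel attention follows the GAP + GMP + shared MLP + sigmoid recipe summarized in Section 3.1 for [42]; the reduction ratio and the exact placement of the sigmoid are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Channel attention in the spirit of [42]: GAP and GMP summaries pass through a
    shared MLP and a sigmoid to produce per-channel weights (reduction ratio assumed)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):                               # x: (B, c, h, w) Transformer features in 3D form
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)              # dual attention: channel weights on self-attention features

# CEB: reshape the Transformer tokens to 3D and take the Hadamard product with the CNN features.
g = torch.randn(1, 56 * 56, 48)                         # Transformer tokens (B, h*w, c)
f = torch.randn(1, 48, 56, 56)                          # CNN feature map (B, c, h, w)
g0 = g.transpose(1, 2).reshape(1, 48, 56, 56)           # 3D version of the Transformer features
e = g0 * f                                              # correlation map e_i
a = ChannelAttentionBlock(48)(g0)                       # dual attention map a_i
print(e.shape, a.shape)
```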

4.2.4.4. Feature Fusion Block (FFB)

The FFB combines the outputs of the CFB, CEB, and CAB.

  1. The cross-domain feature map $\pmb{s}_i$, the correlation feature map $\pmb{e}_i$, and the dual attention feature map $\pmb{a}_i$ are concatenated to obtain $m_i^1$ with size $h \times w \times 3c$.

  2. This concatenated map is then processed using residual connections and a CBR block to generate the final output feature map $\pmb{m}_i$ of the FCM, with size $h \times w \times c$.

    The processing for FFB is formulated as: $ \begin{array}{l} m_i^1 = \mathrm{cat}(s_i, e_i, a_i), \\ m_i = \mathrm{reshape}\big(\mathrm{conv}\big(m_i^1\big) + \mathrm{CBR}\big(m_i^1\big)\big), \end{array} $ Where:

  • $\mathrm{cat}$: Concatenation operation.
  • $s_i$: Cross-domain fusion feature map from the CFB.
  • $e_i$: Correlation feature map from the CEB.
  • $a_i$: Dual attention feature map from the CAB.
  • $m_i^1$: Concatenated feature map.
  • $\mathrm{conv}$: Convolution operation.
  • $\mathrm{CBR}$: A block consisting of convolution (Conv), batch normalization (BN), and rectified linear unit (ReLU) activation. It is used to fuse the concatenated features and reduce the number of parameters.
  • $m_i$: Final output feature map of the FCM at level $i$.
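
A minimal sketch of the FFB equations is shown below: the three maps are concatenated to 3c channels, and a 1 × 1 convolution shortcut is added to a Conv-BN-ReLU (CBR) path to return to c channels. The CBR kernel size is an assumption.

```python
import torch
import torch.nn as nn

class FeatureFusionBlock(nn.Module):
    """Sketch of the FFB: concatenate the CFB, CEB and CAB outputs (3c channels), then
    add a 1x1-conv shortcut to a Conv-BN-ReLU (CBR) path to get back to c channels.
    The 3x3 kernel of the CBR convolution is an assumption."""
    def __init__(self, channels):
        super().__init__()
        self.shortcut = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.cbr = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, s, e, a):
        m1 = torch.cat([s, e, a], dim=1)         # (B, 3c, h, w)
        return self.shortcut(m1) + self.cbr(m1)  # residual fusion -> m_i: (B, c, h, w)

m = FeatureFusionBlock(48)(torch.randn(1, 48, 56, 56),
                           torch.randn(1, 48, 56, 56),
                           torch.randn(1, 48, 56, 56))
print(m.shape)  # torch.Size([1, 48, 56, 56])
```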

4.2.5. The Transformer Decoder

The Transformer decoder (Figure 2d) is designed to progressively recover feature maps and further improve the representation of long-range dependencies, using Swin Transformer Blocks and patch expanding operations.

  • Patch Expanding: This operation is the inverse of patch merging. It up-samples feature maps by rearranging pixel features, typically followed by a convolutional layer to adjust channel dimensions. For example, if a feature map has resolution $(h/2) \times (w/2)$ and 4C channels, patch expanding reshapes it to $h \times w$ with $C$ channels.

  • Four Levels of Transformer Decoder:

    • Level 4 (deepest): Only uses a patch expanding operation to up-sample features at a rate of 2. It takes the output $\pmb{g}_4$ from the Transformer encoder as its input.

    • Levels 3 and 2: At each of these levels, two Swin Transformer Blocks are first used to fuse (1) the cross-domain enhanced feature map ($\pmb{m}_k$) from the corresponding Feature Complementary Module (via skip connection) and (2) the up-sampled features ($u_k$) from the adjacent deeper level of the decoder. After fusion, patch expanding is applied to up-sample the fused feature map.

    • Level 1 (shallowest): Also uses two Swin Transformer Blocks for feature fusion and extraction of long-range dependencies, incorporating the $\pmb{m}_1$ feature map from the FCM via skip connection.

    • Final Output Layer: A final patch expanding block with a rate of 4 is used to recover the spatial size of the 2D feature map to $H \times W$. A $1 \times 1$ convolution adjusts its channel number to $N$ (the category number), and a reshaping operation converts the 2D map into a 3D feature map, which is the final segmentation output of CTC-Net.

      The data processing in the Transformer decoder can be briefly formulated as follows: $ \begin{array}{rl} & \nu_k = \mathrm{STB}(\mathrm{STB}(u_k, m_k)), \\ & u_{k-1} = \mathrm{PE}(\nu_k), \end{array} $ Where:

  • $k$: The level index of the decoder (e.g., $k = 3$ for the third decoding level).

  • $\mathrm{STB}$: Swin Transformer Block.

  • $u_k$: Up-sampled features from the adjacent deeper level of the decoder.

  • $m_k$: Cross-domain enhanced feature map from the Feature Complementary Module (FCM) at level $k$ (via skip connection).

  • $\nu_k$: Fused and processed features at level $k$ after passing through two STBs.

  • $\mathrm{PE}$: Patch Expanding block.

  • $u_{k-1}$: Up-sampled features for the next shallower level $(k-1)$.

    These skip connections are crucial for providing the Transformer decoder with multi-scale complementary information (spatial details and long-range context) from the FCM, allowing it to effectively restore high-resolution segmentation masks.
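
The sketch below illustrates one decoder level following the two equations above. A generic Transformer encoder layer stands in for the Swin Transformer Blocks, the channel-wise concatenation of u_k and m_k before the first block is an assumption (the text does not specify how the two inputs are combined), and patch expanding is realized as a pixel shuffle that turns 4C channels into C channels at twice the resolution, as described earlier; any following convolution to re-adjust channels is omitted.

```python
import torch
import torch.nn as nn

class DecoderLevel(nn.Module):
    """One decoder level (sketch): fuse the up-sampled feature u_k with the FCM skip
    feature m_k, mix with two (stand-in) Transformer blocks, then patch-expand."""
    def __init__(self, channels, heads=3):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)  # assumption: concat + project
        # A generic Transformer encoder layer replaces a Swin Transformer Block;
        # the same layer is applied twice for brevity (the real model uses two STBs).
        self.mix = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                              dim_feedforward=4 * channels,
                                              batch_first=True)
        self.expand = nn.PixelShuffle(2)   # patch expanding: 4C -> C while doubling h and w

    def forward(self, u, m):
        B, c, h, w = u.shape
        x = self.merge(torch.cat([u, m], dim=1))             # (B, c, h, w)
        tokens = x.flatten(2).transpose(1, 2)                # (B, h*w, c)
        tokens = self.mix(self.mix(tokens))                  # nu_k = STB(STB(u_k, m_k))
        v = tokens.transpose(1, 2).reshape(B, c, h, w)
        return self.expand(v)                                # u_{k-1} = PE(nu_k): (B, c/4, 2h, 2w)

u3 = torch.randn(1, 192, 14, 14)   # up-sampled decoder feature at level 3 (4C = 192, hypothetical)
m3 = torch.randn(1, 192, 14, 14)   # FCM skip feature at the same level
print(DecoderLevel(192, heads=12)(u3, m3).shape)  # torch.Size([1, 48, 28, 28])
```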

5. Experimental Setup

5.1. Datasets

The authors evaluated CTC-Net on two widely used medical image datasets:

  • Synapse dataset (Synapse):

    • Source & Characteristics: This dataset consists of 30 CT (Computed Tomography) scans of abdominal organs. CT scans use X-rays to create cross-sectional images of the body, providing detailed anatomical information.
    • Task: Multi-organ segmentation.
    • Split: 18 cases for the training set and 12 cases for the test set, following the split used in TransUnet [40].
    • Data Structure: The dataset is composed of 2211 2D slices extracted from the 3D volumes.
    • Categories: Segmentation is performed on 8 categories: aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach.
    • Why Chosen: It is a standard benchmark for multi-organ segmentation in CT images, presenting challenges such as varying organ sizes, complex boundaries, and potential deformations (e.g., pancreas).
  • Automatic Cardiac Diagnosis Challenge (ACDC) dataset:

    • Source & Characteristics: This dataset contains MRI (Magnetic Resonance Imaging) images from 100 different patients. MRI uses strong magnetic fields and radio waves to create detailed images of organs and soft tissues, often providing excellent contrast for cardiac structures.
    • Task: Cardiac segmentation, specifically for automated cardiac diagnosis.
    • Split: 70 samples for training, 10 samples for validation, and 20 samples for testing.
    • Categories: Segmentation of 3 cardiac structures: left ventricle (LV), right ventricle (RV), and myocardium (MYO).
    • Why Chosen: It's a challenging dataset for cardiac segmentation due to the dynamic nature of the heart, different imaging planes, and inter-patient variability. It also tests the model's generalization ability across different image modalities (MRI vs. CT).

Data Sample Example: As the paper describes, Synapse involves abdominal CT scans where slices depict organs like the liver, kidneys, and pancreas. An example CT slice would be a grayscale 2D image showing varying tissue densities, with organs appearing as distinct regions of different intensities. The corresponding ground truth would be a pixel-wise mask, where each pixel is labeled with its organ category (e.g., pixels belonging to the liver are labeled 'liver', and background pixels are labeled 'background'). Similarly, for ACDC, an MRI slice would show the heart in cross-section, with different cardiac chambers and the myocardium visible, and the ground truth would be masks delineating LV, RV, and MYO.

These datasets are effective for validating the method's performance because they represent common and challenging tasks in medical image segmentation, cover different anatomical regions (abdomen and heart), and involve different imaging modalities (CT and MRI).

5.2. Evaluation Metrics

The paper uses two common evaluation metrics for image segmentation: Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD).

5.2.1. Dice Similarity Coefficient (DSC)

  • Conceptual Definition: The Dice Similarity Coefficient is a statistical measure of spatial overlap between a predicted segmentation and its corresponding ground truth. It quantifies how similar two sets are, and in segmentation, it indicates the degree of overlap between the predicted mask and the true mask. A higher DSC value (closer to 1) means better overlap and thus more accurate segmentation.
  • Mathematical Formula: $ \mathrm{DSC} = \frac{2 |P \cap G|}{|P| + |G|} $
  • Symbol Explanation:
    • $P$: The set of pixels belonging to the predicted segmentation.
    • $G$: The set of pixels belonging to the ground truth segmentation.
    • $|P \cap G|$: The number of common pixels (intersection) between the predicted and ground truth segmentations.
    • $|P|$: The total number of pixels in the predicted segmentation.
    • $|G|$: The total number of pixels in the ground truth segmentation.
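
A minimal NumPy sketch of the DSC computation on binary masks, matching the formula above. The small epsilon added to the denominator is an implementation convenience, not part of the paper.

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """Dice Similarity Coefficient between two binary masks (1 = organ, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

# Toy example: two overlapping square masks.
pred = np.zeros((64, 64), dtype=np.uint8); pred[10:40, 10:40] = 1
gt = np.zeros((64, 64), dtype=np.uint8);   gt[15:45, 15:45] = 1
print(round(dice_coefficient(pred, gt), 4))  # ~0.6944 (625 shared pixels out of 900 + 900)
```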

5.2.2. Hausdorff Distance (HD)

  • Conceptual Definition: The Hausdorff Distance measures the maximum distance between any point in one set to the nearest point in the other set. In the context of segmentation, it quantifies the dissimilarity of two boundary contours. A lower HD value (closer to 0) indicates that the boundaries of the predicted segmentation and the ground truth are very close, implying high quality segmentation boundaries. It is sensitive to outliers and small boundary errors.
  • Mathematical Formula: $ \begin{array}{l} \mathrm{HD}(P, G) = \max[\mathrm{D}(P, G), \mathrm{D}(G, P)], \\ \mathrm{D}(P, G) = \operatorname*{max}_{p \in P} \operatorname*{min}_{g \in G} \| p - g \|, \end{array} $
  • Symbol Explanation:
    • $\mathrm{HD}(P, G)$: The Hausdorff Distance between the predicted segmentation $P$ and the ground truth $G$.
    • $\mathrm{D}(P, G)$: The directed Hausdorff Distance from set $P$ to set $G$. It is the maximum distance from any point in $P$ to its closest point in $G$.
    • $\mathrm{D}(G, P)$: The directed Hausdorff Distance from set $G$ to set $P$.
    • $\operatorname*{max}_{p \in P}$: The maximum operation over all points $p$ in the set $P$.
    • $\operatorname*{min}_{g \in G}$: The minimum operation over all points $g$ in the set $G$.
    • $p$: A coordinate vector of a pixel in the predicted segmentation set $P$.
    • $g$: A coordinate vector of a pixel in the ground truth segmentation set $G$.
    • $\| p - g \|$: The $l_2$ norm (Euclidean distance) between pixel $p$ and pixel $g$.
    • $P$: Here, the set of coordinate points representing the boundary of the segmentation prediction.
    • $G$: Here, the set of coordinate points representing the boundary of the ground truth.
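
A minimal NumPy sketch of the symmetric Hausdorff distance over two sets of boundary points, matching the formulas above. It is O(n·m) in memory and is meant only to make the definition concrete; practical evaluations often use optimized or percentile (e.g., HD95) variants.

```python
import numpy as np

def directed_hd(P, G):
    """Directed Hausdorff distance: max over p in P of the distance to its nearest g in G."""
    # P, G: (n, 2) and (m, 2) arrays of boundary pixel coordinates
    d = np.linalg.norm(P[:, None, :] - G[None, :, :], axis=-1)  # (n, m) pairwise L2 distances
    return d.min(axis=1).max()

def hausdorff_distance(P, G):
    """Symmetric Hausdorff distance: HD(P, G) = max(D(P, G), D(G, P))."""
    return max(directed_hd(P, G), directed_hd(G, P))

# Toy example with two small point sets standing in for boundary pixels.
P = np.array([[0, 0], [0, 10], [10, 0]], dtype=float)
G = np.array([[1, 0], [0, 12], [10, 3]], dtype=float)
print(hausdorff_distance(P, G))  # 3.0
```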

5.3. Baselines

The paper compares CTC-Net against a comprehensive set of state-of-the-art segmentation models, including pure CNN-based, pure Transformer-based, and CNN-Transformer combined models, all designed for medical image segmentation. These baselines are representative of the leading approaches in the field.

Pure CNN-based models:

  • TransClaw U-Net [43]
  • R50 U-Net [3] (U-Net with ResNet50 backbone)
  • U-Net [3]
  • DARR [44]
  • VNet [45]
  • ENet [46]
  • Att-UNet [25]
  • R50-DeeplabV3+ [47] (DeeplabV3+ with ResNet50 backbone)
  • ContextNet [48]
  • FSSNet [49]
  • R50 Att-Unet [34] (Attention U-Net with ResNet50 backbone)
  • DABNet [50]
  • EDANet [51]
  • FPENet [52]
  • FastSCNN [53]
  • CGNET [54]

Pure Transformer-based models:

  • VIT None [9] (Vision Transformer without any CNN components)
  • SwinUNet [6] (U-Net like pure Transformer)

CNN and Transformer combined models:

  • VIT CUP [9] (Vision Transformer with CNN Up-sampling and Patch embedding, likely referring to components from TransUnet)
  • R50 VIT CUP [9] (ResNet50 backbone with VIT CUP components)
  • TransUNet [40] (CNN encoder + Transformer block + CNN decoder)

5.4. Implementation Details

  • Software & Hardware: Implemented using Python 3.8 and PyTorch 1.7.1 on an Intel i9 PC with an Nvidia RTX 3090 GPU (24 GB memory).

  • Initialization:

    • Transformer encoder and decoder: Initialized with pre-trained Swin Transformer weights on ImageNet.
    • CNN encoder: Initialized with pre-trained ResNet34 weights.
  • Training Parameters:

    • Batch size: 24
    • Maximum iteration number: 13,950
    • Optimizer: SGD (Stochastic Gradient Descent)
    • Basic learning rate (base_lr): 0.01
    • Momentum: 0.99
    • Weight decay: 3e-5
  • Learning Rate Schedule: The learning rate (lr) decays over iterations: $ lr = \mathrm{base\_lr} \cdot \left( 1 - \frac{\mathrm{iter\_num}}{\mathrm{max\_iterations}} \right)^{0.9} $ Where:

    • $\mathrm{base\_lr}$: The initial (basic) learning rate.
    • $\mathrm{iter\_num}$: The current iteration index.
    • $\mathrm{max\_iterations}$: The total number of training iterations.
    • 0.9: A power factor determining the decay curve. (A short sketch of this schedule and the loss function below is given at the end of this subsection.)
  • Loss Function: A weighted sum of cross-entropy loss ($\ell_{ce}$) and Dice loss ($\ell_{dice}$): $ L = (1 - \alpha) \ell_{ce} + \alpha \ell_{dice} $ Where:

    • $L$: The overall loss.
    • $\ell_{ce}$: The cross-entropy loss, commonly used for classification tasks, measuring the dissimilarity between predicted probabilities and true labels.
    • $\ell_{dice}$: The Dice loss, directly optimized for the Dice Similarity Coefficient, which is beneficial for segmentation tasks, especially with imbalanced classes.
    • $\alpha$: An importance weight empirically set to 0.6, indicating that the Dice loss is weighted more heavily than the cross-entropy loss.
  • Post-processing: Median filtering is applied to the segmentation results to produce smoother output masks. This is motivated by the naturally smooth surfaces of human organs and aims to reduce noise in the predictions.

  • Network Configuration (from Table 1): The following are the results from Table 1 of the original paper:

    PARAMETERS Level 1 Level 2 Level 3 Level 4
    Input size 224 × 224
    Resolution 56 × 56 28 × 28 14 × 14 7 × 7
    Depth_encoder 2 2 18 2
    Depth_decoder 1 2 2 2
    Num_heads 3 6 12 24
    Num_heads_FCM 3 6 12 N/A
    • Input size: $224 \times 224$ pixels.
    • Resolution: Feature map resolutions at different levels (e.g., Level 1 has $56 \times 56$).
    • Depth_encoder: Number of Swin Transformer Blocks in each Transformer encoder level (e.g., Level 3 has 18 blocks).
    • Depth_decoder: Number of Swin Transformer Blocks in each Transformer decoder level (e.g., Level 1 has 1 block).
    • Num_heads: Number of attention heads in Transformer encoder and decoder (Multi-head Self-Attention allows the model to jointly attend to information from different representation subspaces).
    • Num_heads_FCM: Number of attention heads in the Feature Complementary Module (FCM).
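
To make the training recipe concrete, the sketch below implements the polynomial learning-rate decay and the weighted cross-entropy + Dice objective with α = 0.6 described above. The soft Dice formulation and its smoothing constant are assumptions; the paper only states that a Dice loss is used.

```python
import torch
import torch.nn.functional as F

def poly_lr(base_lr, iter_num, max_iterations, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - iter_num / max_iterations) ** 0.9."""
    return base_lr * (1.0 - iter_num / max_iterations) ** power

def dice_loss(logits, target, num_classes, eps=1e-5):
    """Soft multi-class Dice loss (the smoothing constant eps is an assumption)."""
    probs = F.softmax(logits, dim=1)                                    # (B, N, H, W)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def total_loss(logits, target, num_classes, alpha=0.6):
    """L = (1 - alpha) * CE + alpha * Dice, with alpha = 0.6 as in the paper."""
    ce = F.cross_entropy(logits, target)
    return (1.0 - alpha) * ce + alpha * dice_loss(logits, target, num_classes)

# Toy usage on random predictions for a 9-class problem (8 organs + background, assumed).
logits = torch.randn(2, 9, 224, 224)
target = torch.randint(0, 9, (2, 224, 224))
print(total_loss(logits, target, num_classes=9).item(),
      poly_lr(0.01, iter_num=1000, max_iterations=13950))
```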

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that CTC-Net consistently achieves superior performance compared to state-of-the-art segmentation models across different medical imaging applications and metrics.

6.1.1. Results on Synapse Dataset

The following are the results from Table 2 of the original paper:

METHODS Average Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
TransClaw U-Net [43] 78.09 85.87 61.38 84.83 79.36 94.28 57.65 87.74 73.55
R50 U-Net [3] 74.68 87.74 63.66 80.60 78.19 93.74 56.90 85.87 74.16
U-Net [3] 76.85 89.07 69.72 77.77 68.60 93.43 53.98 86.67 75.58
DARR [44] 69.77 74.74 53.77 72.31 73.24 94.08 54.18 89.90 45.96
VNet [45] 68.81 75.34 51.87 77.10 80.75 87.84 40.04 80.56 56.98
ENet [46] 77.63 85.13 64.91 81.10 77.26 93.37 57.83 87.03 74.41
Att-UNet[25] 77.77 89.55 68.88 77.98 71.11 93.57 58.04 87.30 75.75
R50-DeeplabV3+[47] 75.73 86.18 60.42 81.18 75.27 92.86 51.06 88.69 70.19
ContextNet [48] 71.17 79.92 51.17 77.58 72.04 91.74 43.78 86.65 66.51
FSSNet [49] 74.59 82.87 64.06 78.03 69.63 92.52 53.10 85.65 70.86
R50 Att-Unet [34] 75.57 55.92 63.91 79.20 72.71 93.56 49.37 87.19 74.95
DABNet [50] 74.91 85.01 56.89 77.84 72.45 93.05 54.39 88.23 71.45
EDANet [51] 75.43 84.35 62.31 76.16 71.65 93.20 53.19 85.47 77.12
FPENet [52] 68.67 78.98 56.35 74.54 64.36 90.86 40.60 78.30 65.35
FastSCNN [53] 70.53 77.79 55.96 73.61 67.38 91.68 44.54 84.51 68.76
VIT None [9] 61.50 44.38 39.59 67.46 62.94 89.21 43.14 75.45 68.78
VIT CUP [9] 67.86 70.19 45.10 74.70 67.40 91.32 42.00 81.75 70.44
R50 VIT CUP [9] 71.29 73.73 55.13 65.32 75.80 72.20 91.51 45.99 81.99
CGNET [54] 75.08 83.48 63.16 77.91 77.02 91.92 57.37 85.47 77.12
TransUNet [40] 77.48 87.23 55.86 81.87 72.39 93.78 59.73 85.08 72.39
CTC-Net(Ours) 78.41 86.46 63.53 83.71 80.79 94.08 59.73 86.87 72.39
  • Overall Performance: CTC-Net achieves the highest average DSC of 78.41% on the challenging Synapse dataset, outperforming all 20 compared methods. This includes strong CNN-based models such as TransClaw U-Net (78.09%) and Att-UNet (77.77%), and leading CNN-Transformer hybrids such as TransUNet (77.48%).

  • Organ-specific Performance:

    • Pancreas: Historically a difficult organ to segment due to large deformations and blurred boundaries, CTC-Net achieves the best DSC of 59.73% (tied with TransUNet), showcasing its ability to handle complex and variable structures by effectively combining local details and global interactions.
    • Kidney(R): CTC-Net achieves the highest DSC of 80.79%.
    • Kidney(L): CTC-Net achieves the second highest DSC of 83.71%.
    • CTC-Net outperforms other methods on at least half of the eight categories, demonstrating its overall robustness.
  • Improvement over TransUNet: While the average DSC improvement over TransUNet is about 1%, this is significant given the highly optimized nature of existing state-of-the-art models.

    The following are the results from Table 3 of the original paper:

    METHODS HD↓
    R50 U-Net [3] 36.87
    U-Net [3] 39.70
    Att-UNet[25] 36.02
    R50 Att-Unet [34] 36.97
    R50 VIT CUP [9] 32.87
    TransUNet [40] 31.69
    CTC-Net(Ours) 22.52
  • Hausdorff Distance (HD): CTC-Net significantly reduces the Hausdorff Distance to 22.52, a substantial improvement over TransUNet (31.69) and other strong baselines. A smaller HD indicates much more accurate segmentation boundaries, suggesting that the precise fusion of local CNN details and global Transformer context in CTC-Net leads to sharper and more faithful delineations of organ contours. This reduction of nearly 10 in HD relative to TransUNet highlights the model's ability to produce robust feature representations for both large and irregularly shaped organs.

    The following figure (Figure 4 from the original paper) shows the visual comparison of different methods on Synapse datasets, further supporting the quantitative results:

    Fig. 4 (from the original paper). Visual comparison of different methods on the Synapse dataset. The figure shows the ground truth and the segmentation results of CTC-Net, Att-UNet, U-Net and TransUNet. Colors indicate organs: blue for aorta, green for gallbladder, red for left kidney, cyan for right kidney, pink for liver, yellow for pancreas, white for spleen, and gray for stomach.

6.1.2. Results on ACDC Dataset

The following are the results from Table 4 of the original paper:

METHODS Average RV MYO LV
R50 U-Net [3] 87.55 87.10 80.63 94.92
R50 Att-Unet [34] 86.75 87.58 79.20 93.47
VIT CUP [9] 81.45 81.46 70.71 92.18
R50 VIT CUP [9] 87.57 86.07 81.88 94.75
TransUNet [40] 89.71 88.86 84.54 95.73
SwinUNet[6] 90.00 88.55 85.62 95.83
CTC-Net(Ours) 90.77 90.09 85.52 96.72
  • Generalization and Robustness: On the ACDC dataset, which uses the MRI modality for cardiac segmentation, CTC-Net again achieves the highest average DSC of 90.77%. This demonstrates the model's strong generalization ability and robustness across different image modalities and body parts.
  • Cardiac Structure Performance:
    • RV (Right Ventricle) and LV (Left Ventricle): CTC-Net obtains the highest DSC for both RV (90.09%) and LV (96.72%).
    • MYO (Myocardium): CTC-Net achieves the second highest DSC of 85.52%, very close to the best performance (SwinUNet at 85.62%).
  • These results further validate that CTC-Net outperforms state-of-the-art methods in medical image segmentation, consistently achieving high accuracy for different applications.

6.2. Ablation Studies / Parameter Analysis

Ablation studies were conducted on the Synapse dataset to validate the rationality of CTC-Net and the effectiveness of its individual modules.

6.2.1. Evaluation of FCM

The following are the results from Table 5 of the original paper:

Variants Average Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
concat+conv 75.52 85.58 60.46 78.86 73.88 93.23 51.24 86.68 74.25
cross attention 72.65 83.56 55.10 81.67 68.66 92.22 44.09 87.17 68.76
Dual CAB 72.70 82.78 53.78 76.86 69.08 91.79 51.68 85.15 70.48
without CAB 76.87 85.36 62.60 79.87 77.66 93.19 54.96 88.59 72.77
without CFB 75.83 85.91 61.11 85.76 79.57 93.51 48.17 86.67 65.99
without CEB 75.13 83.46 60.38 82.40 73.27 92.61 53.49 85.62 69.84
CTC-Net (ours) 78.41 86.46 63.53 83.71 80.79 93.78 59.73 86.87 72.39
  • concat+conv (Average DSC: 75.52%): This variant uses a simple concatenation of CNN and Transformer features followed by a $1 \times 1$ convolution. Its significantly lower performance compared to the full CTC-Net (78.41%) highlights the importance of the sophisticated fusion mechanisms within the FCM.

  • cross attention (Average DSC: 72.65%): Replacing the FCM with a Transformer decoder performing cross-attention (query from one encoder, key/value from the other, and vice versa) results in a much lower DSC. This indicates that the FCM's specific multi-faceted design (CFB, CEB, CAB) for blending complementary features is more effective than a generic cross-attention approach.

  • Dual CAB (Average DSC: 72.70%): Applying channel attention to both the CNN and Transformer branches, surprisingly, yields worse results than CTC-Net. This suggests that channel attention may be redundant or even detrimental when applied indiscriminately, and that the specific mixture of attention types in CTC-Net (channel attention applied only to the Transformer's self-attention features) is better optimized.

  • without CAB (Average DSC: 76.87%): Removing the Channel Attention Block from the FCM leads to a drop of 1.54% in average DSC. This confirms that the CAB plays a crucial role in improving feature robustness by emphasizing important channel information, particularly for the Transformer features.

  • without CFB (Average DSC: 75.83%): Deleting the Cross-domain Fusion Block results in a significant drop of 2.58% in average DSC. This underscores the CFB's importance in effectively blending features from the two distinct domains in a cross-wise manner, which is critical for leveraging their complementarity.

  • without CEB (Average DSC: 75.13%): Removing the Correlation Enhancement Block leads to a 3.28% drop in average DSC. This demonstrates that explicitly modeling the cross-domain correlation between CNN and Transformer features is vital for enhancing mutually salient information and improving accuracy.

    Overall, these ablation studies strongly confirm that the Feature Complementary Module (FCM) and its individual components (CFB, CEB, CAB) are indispensable for the high performance of CTC-Net. Each block contributes uniquely to fusing, enhancing, and refining the complementary features, leading to a robust representation.

6.2.2. Evaluation of Encoders

The following are the results from Table 6 of the original paper:

Variants Average Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
CTC-Net without CNNs 76.38 83.54 63.93 80.73 76.98 93.27 55.71 84.54 72.32
CTC-Net (ours) 78.41 86.46 63.53 83.71 80.79 93.78 59.73 86.87 72.39
  • CTC-Net without CNNs (Average DSC: 76.38%): This variant effectively transforms CTC-Net into a pure Transformer architecture (encoder and decoder composed of Swin Transformer Blocks). The significant drop in average DSC (2.03%) compared to the full CTC-Net (78.41%) demonstrates the critical importance of the CNN encoder. Even though the Transformer encoder is the "major branch" for long-range dependencies, the CNN encoder is crucial for providing complementary contextual features and spatial details, which the pure Transformer architecture struggles to capture as effectively. The results on specific organs further show that the full CTC-Net generally performs better across categories (6 out of 8).

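As an illustration of the CNN branch's role, the sketch below extracts multi-scale feature maps from a torchvision ResNet-34 of the kind that could be paired with same-scale Transformer features; the wrapper class and its naming are assumptions rather than the authors' implementation, and the Transformer branch is omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class ResNet34Encoder(nn.Module):
    """CNN branch: returns feature maps at 1/4, 1/8, 1/16 and 1/32 of the input resolution."""
    def __init__(self):
        super().__init__()
        net = resnet34(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # down to 1/4
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x: torch.Tensor):
        feats, x = [], self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # channels: 64, 128, 256, 512
        return feats

cnn_feats = ResNet34Encoder()(torch.randn(1, 3, 224, 224))
print([f.shape for f in cnn_feats])  # spatial sizes 56x56, 28x28, 14x14, 7x7
```
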
6.2.3. Evaluation of Decoders

The following are the results from Table 7 of the original paper:

| Variants | Average | Aorta | Gallbladder | Kidney(L) | Kidney(R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| CTC-Net with two decoders | 69.68 | 73.81 | 56.85 | 73.71 | 66.74 | 89.55 | 47.44 | 83.02 | 66.35 |
| CTC-Net with cross attention | 76.73 | 85.46 | 60.36 | 83.91 | 77.41 | 93.23 | 52.96 | 86.35 | 74.39 |
| CTC-Net | 78.41 | 86.46 | 63.53 | 83.71 | 80.79 | 93.78 | 59.73 | 86.87 | 72.39 |
  • CTC-Net with two decoders (Average DSC: 69.68%): This variant adds a traditional CNN decoder parallel to the Transformer decoder. Surprisingly, this symmetric design performs significantly worse than the asymmetric CTC-Net with only one Transformer decoder. The authors attribute this to two reasons:

    1. Increased network parameters leading to potential overfitting.
    2. Lack of adequate information interchange between the two independently recovering decoders. This highlights that simply adding more components without careful fusion can be detrimental.
  • CTC-Net with cross attention (Average DSC: 76.73%): This variant replaces the Swin Transformer Blocks in the decoder's skip-connection fusion with a cross-attention mechanism (query from the skip connection, key/value from the up-sampled features); a hedged sketch of this kind of fusion follows this list. While performing better than the two-decoders variant, it still falls short of the full CTC-Net. This indicates that CTC-Net's approach of using Swin Transformer Blocks for fusion in the decoder, combined with the enriched features from the FCM, is more effective than a generic cross-attention mechanism at this point in the architecture. The Swin Transformer Blocks in the decoder likely offer a better balance of local and global context integration during up-sampling.

    These decoder ablation studies validate the efficiency and effectiveness of CTC-Net's asymmetric design, emphasizing that the quality of feature fusion in the FCM and the power of a single, well-integrated Transformer decoder are more crucial than simply adding more decoding paths.
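
For reference, here is a hedged sketch of the kind of cross-attention fusion the ablated decoder variant describes (query from the skip connection, key/value from the up-sampled decoder features), built on PyTorch's nn.MultiheadAttention; the class name and tensor shapes are illustrative assumptions, not the authors' Swin-block fusion.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Skip features attend to up-sampled decoder features (query = skip, key/value = upsampled)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, skip: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
        # skip, up: (B, N, C) token sequences at the same spatial resolution.
        fused, _ = self.attn(query=skip, key=up, value=up)
        return self.norm(skip + fused)   # residual connection around the attention

# Usage with 56x56 = 3136 tokens of width 96.
skip, up = torch.randn(2, 3136, 96), torch.randn(2, 3136, 96)
out = CrossAttentionFusion(96)(skip, up)
```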

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces CTC-Net (CNN and Transformer Complementary Network), a novel deep learning architecture for medical image segmentation. The core idea is to harness the complementary strengths of CNNs for local detail extraction and Transformers for long-range dependency modeling. CTC-Net achieves this through:

  1. Dual Encoders: Parallel ResNet34 (CNN) and Swin Transformer encoders produce distinct yet complementary features.

  2. Feature Complementary Module (FCM): This innovative module, comprising Cross-domain Fusion Block (CFB), Correlation Enhancement Block (CEB), and Channel Attention Block (CAB), is designed for sophisticated, cross-wise fusion, correlation, and dual attention on these features.

  3. Swin Transformer Decoder: A powerful Swin Transformer decoder, augmented with multi-level skip connections from the FCM, effectively recovers spatial details and long-range information to generate the final segmentation mask.

    Experimental results on the Synapse (multi-organ CT segmentation) and ACDC (cardiac MRI segmentation) datasets demonstrate CTC-Net's superior performance, consistently outperforming state-of-the-art CNN-based, Transformer-based, and existing CNN-Transformer hybrid models. Notably, CTC-Net shows significant improvements in Hausdorff Distance, indicating more accurate boundary delineation.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation of their method:

  • Boundary Detail Extraction: Despite achieving pleasing segmentation results, CTC-Net's main limitation lies in the extraction of boundary details. The speculated reason is that both the CNN and Transformer encoders operate on feature maps that are already down-sampled by $4\times$ relative to the input, which inherently discards some detailed spatial information at the outset. Even with skip connections and sophisticated fusion, this initial information loss can limit the precision of fine boundaries.

    Based on this limitation, the authors propose future work:

  • Novel networks without downsampling: They plan to explore new network architectures that avoid aggressive downsampling of feature maps, aiming to maintain high resolutions and abundant spatial details throughout the network to improve boundary accuracy.

7.3. Personal Insights & Critique

This paper provides a rigorous and effective approach to combining the strengths of CNNs and Transformers for medical image segmentation. The explicit focus on complementary features and the detailed design of the Feature Complementary Module (FCM) are highly insightful. Instead of merely stacking CNN and Transformer blocks, the FCM's multi-component design to perform cross-domain fusion, correlation, and dual attention represents a thoughtful strategy for true feature synergy.

One particularly interesting finding is the success of the asymmetric decoder design. The ablation study showing that "CTC-Net with two decoders" performs worse than the single Transformer decoder CTC-Net challenges the intuitive notion of symmetric encoder-decoder architectures often seen in U-Net variants. This suggests that a single, powerful Transformer decoder, effectively fed with pre-fused complementary features from the FCM via skip connections, is more efficient and robust than trying to recover features independently through separate CNN and Transformer decoding paths. This could be due to the Transformer decoder's inherent ability to handle global context, which might be more critical for up-sampling effectively once robustly fused features are provided.

The paper's identified limitation regarding boundary details, stemming from initial downsampling, is a common challenge in many deep learning segmentation models. Since Swin Transformer uses patch embedding (itself a form of downsampling) and CNN stems typically combine strided convolutions with pooling, the initial reduction to $H/4 \times W/4$ resolution means a significant portion of fine-grained information is already lost; the short sketch below shows where this $4\times$ reduction happens. Future work exploring networks that process images at higher resolutions throughout, or that employ more sophisticated upsampling earlier in the pipeline, would indeed be valuable. This could involve techniques such as progressive resizing, super-resolution modules, or attention mechanisms specifically tuned for boundary refinement.
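
As a concrete reminder of where that initial resolution loss comes from, the following sketch mimics a Swin-style patch embedding with a stride-4 convolution (a common way such embeddings are implemented; the embedding dimension of 96 is an assumption).

```python
import torch
import torch.nn as nn

# Swin-style patch embedding: non-overlapping 4x4 patches via a stride-4 convolution.
patch_embed = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)   # full-resolution input
tokens = patch_embed(x)           # (1, 96, 56, 56): spatial size is already H/4 x W/4
print(tokens.shape)
```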

The methods and conclusions from this paper could potentially be transferred or applied to other domains requiring precise object delineation where both local details and global context are important, such as:

  • Remote Sensing Image Segmentation: Identifying buildings, roads, or land cover in satellite imagery where objects can be small or span large areas.

  • Industrial Defect Detection: Locating tiny defects on large surfaces, requiring both fine local analysis and awareness of overall patterns.

  • Autonomous Driving: Segmenting complex urban scenes with diverse objects, requiring recognition of both individual vehicles/pedestrians and overall scene layout.

    A potential area for improvement or further investigation could be a more detailed analysis of the FCM's computational overhead and latency, especially given its multiple sub-blocks and Swin Transformer Blocks within the fusion path. While performance is excellent, practical deployment often requires consideration of inference speed. Additionally, exploring how the weighting factor $\alpha$ in the loss function impacts segmentation quality, particularly for boundary-sensitive tasks, could provide further insights into optimizing specific aspects of medical image segmentation.
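
On that last point, here is a hedged sketch of the kind of weighted objective implied, assuming (as is common in medical segmentation, though not necessarily the paper's exact formulation) that $\alpha$ balances a cross-entropy term against a soft Dice term.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  alpha: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Weighted sum of cross-entropy and soft Dice loss (illustrative form).

    logits: (B, K, H, W) raw class scores; target: (B, H, W) integer labels.
    """
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()
    return alpha * ce + (1.0 - alpha) * dice
```

Sweeping $\alpha$ on a validation split would make the trade-off between pixel-wise accuracy and region overlap explicit, including its effect on boundary-sensitive metrics such as the Hausdorff distance.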
