Techniques and Challenges of Image Segmentation: A Review


TL;DR Summary

This paper reviews advancements in image segmentation, categorizing techniques into classic, collaborative, and deep learning-based semantic segmentation. It highlights challenges in feature extraction and model design while analyzing key algorithms, their applicability, and future development trends.

Abstract

Image segmentation, which has become a research hotspot in the field of image processing and computer vision, refers to the process of dividing an image into meaningful and non-overlapping regions, and it is an essential step in natural scene understanding. Despite decades of effort and many achievements, there are still challenges in feature extraction and model design. In this paper, we review the advancement in image segmentation methods systematically. According to the segmentation principles and image data characteristics, three important stages of image segmentation are mainly reviewed, which are classic segmentation, collaborative segmentation, and semantic segmentation based on deep learning. We elaborate on the main algorithms and key techniques in each stage, compare, and summarize the advantages and defects of different segmentation models, and discuss their applicability. Finally, we analyze the main challenges and development trends of image segmentation techniques.


1. Bibliographic Information

1.1. Title

Techniques and Challenges of Image Segmentation: A Review

1.2. Authors

Ying Yu, Chunping Wang, Qiang Fu, Renke Kou, Fuyu Huang, Boxiong Yang, Tingting Yang, and Mingliang Gao. Their affiliations span multiple institutions:

  • Department of Electronic and Optical Engineering, Army Engineering University of PLA, Shijiazhuang 050003, China
  • School of Information and Intelligent Engineering, University of Sanya, Sanya 572022, China
  • School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo 255000, China

1.3. Journal/Conference

The paper was published in Electronics, a peer-reviewed open-access journal from MDPI that covers the science and technology of electronics, electrical engineering, and communications. It is a recognized venue for computer vision and image processing research, although, as with MDPI journals generally, its impact factor varies.

1.4. Publication Year

2023

1.5. Abstract

Image segmentation, a crucial step in natural scene understanding, involves dividing an image into meaningful, non-overlapping regions and is a prominent research area in image processing and computer vision. Despite significant advancements over decades, challenges persist in feature extraction and model design. This paper offers a systematic review of the progress in image segmentation methods. It categorizes the evolution into three main stages based on segmentation principles and image data characteristics: classic segmentation, collaborative segmentation (or co-segmentation), and semantic segmentation based on deep learning. For each stage, the authors detail the primary algorithms and key techniques, compare and summarize their advantages and disadvantages, and discuss their applicability. The review concludes with an analysis of current challenges and future development trends in image segmentation techniques.

The officially published version is available at https://doi.org/10.3390/electronics12051199.

2. Executive Summary

2.1. Background & Motivation

Image segmentation is a fundamental problem in computer vision and image processing, serving as a prerequisite for pattern recognition and image understanding. It involves partitioning an image into several meaningful and non-overlapping regions, often with the goal of extracting regions of interest (ROIs). The importance of image segmentation is underscored by its wide applications in fields such as autonomous vehicles, intelligent medical technology, image search engines, industrial inspection, and augmented reality.

Despite continuous research since the 1970s, image segmentation remains a challenging ill-posed problem due to two main difficulties:

  1. Defining "meaningful regions": Human visual perception and comprehension are diverse and subjective, leading to ambiguity in what constitutes a "meaningful object."

  2. Effectively representing objects: Digital images are composed of pixels, which carry low-level local features (color, texture). Obtaining global information (e.g., shape, position) from these local attributes is difficult.

    Furthermore, traditional feature extraction methods based on manual or heuristic rules struggle to meet the complexity of modern image segmentation demands, especially with the increasing detail and diversity (scale, posture) of images from advanced acquisition equipment. This necessitates models with higher generalization ability. While previous reviews have covered semantic segmentation methods [5,7] or evaluation metrics [8], there was a perceived gap in systematically summarizing the evolution of image segmentation algorithms from its early stages to the present day.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Systematic Review of Evolution: It provides a comprehensive and chronological review of image segmentation methods, categorizing them into three distinct stages: classic segmentation, co-segmentation, and semantic segmentation based on deep learning. This evolutionary perspective helps in understanding the progression of techniques.

  • Detailed Algorithm Elaboration: For each stage, the paper elaborates on the main algorithms and key techniques, explaining their working mechanisms and enumerating influential examples.

  • Comparative Analysis: It compares and summarizes the advantages, defects, and applicability of different segmentation models, offering insights into their strengths and weaknesses.

  • Identification of Challenges and Trends: The paper analyzes the current main challenges in image segmentation, particularly for deep learning-based methods (e.g., limited annotations, class imbalance, overfitting, long training times, gradient vanishing, computational complexity, and explicability), and discusses future development trends, highlighting the shift from Convolutional Neural Networks (CNNs) to Transformers.

  • Focus on Deep Learning Architectures: It systematically introduces essential techniques of semantic segmentation based on deep neural networks, including encoder-decoder architectures, skip connections, dilated convolution, multiscale feature extraction, and attention mechanisms.

    The key conclusions are that image segmentation has evolved from coarse-grained to fine-grained analysis, from manual feature extraction to adaptive learning, and from single-image-oriented to big-data-based common feature segmentation. Deep neural networks, especially Transformers, are identified as the leading future direction, despite challenges related to data requirements, computational cost, and model interpretability.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Image Segmentation: The overarching goal of dividing a digital image into multiple segments (sets of pixels). The goal is to simplify and/or change the representation of an image into something more meaningful and easier to analyze. Each segment should be homogeneous in some characteristic (e.g., color, intensity, texture) and distinct from adjacent segments.
  • Computer Vision: A field of artificial intelligence that trains computers to "see" and interpret visual information from the world, much like humans do. Image segmentation is a core task within computer vision.
  • Image Processing: The manipulation of digital images using algorithms, often to improve their quality, extract information, or prepare them for further analysis. Segmentation is a form of image processing.
  • Pattern Recognition: A field within machine learning that focuses on the automatic recognition of patterns and regularities in data. Image segmentation provides meaningful regions that can then be used for pattern recognition (e.g., recognizing an object within a segmented region).
  • Pixels: The smallest individual element in a digital image. Each pixel contains a value representing color and/or intensity at a specific location.
  • Superpixels: Clusters of pixels that share similar visual properties (e.g., color, texture, intensity) and are spatially proximate. Superpixels are often used as a preprocessing step in segmentation to reduce computational complexity by grouping redundant pixels into perceptually meaningful atomic regions.
  • Grayscale Images: Images composed of shades of gray, typically ranging from black (0 intensity) to white (255 intensity). Each pixel has a single intensity value.
  • Color Images: Images that contain color information, typically represented using three channels (e.g., Red, Green, Blue in an RGB image). Each pixel has three intensity values, one for each channel.
  • Neural Networks (NNs): A class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected nodes (neurons) organized in layers, which process data through computations. Deep learning refers to neural networks with many layers.
  • Fully Connected Layers (FC layers): In traditional neural networks, fully connected layers connect every neuron in one layer to every neuron in the next layer. These layers are common in the output stages of image classification networks but require fixed-size inputs.
  • Convolutional Neural Networks (CNNs): A specialized type of neural network particularly effective for processing grid-like data such as images. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input images.

3.2. Previous Works

The paper organizes previous works into classic segmentation, co-segmentation, and semantic segmentation based on deep learning.

3.2.1. Classic Segmentation Methods

These methods primarily focus on analyzing single images, often relying on low-level features and requiring human intervention or prior knowledge.

  • Edge Detection: Identifies boundaries where image intensity or color changes sharply.
    • Differential Operators (Sobel, Canny, Laplacian, Roberts, Kirsch): These operators calculate the gradient of image intensity. For example, the Sobel operator computes an approximation of the gradient of the image intensity function to highlight regions of high spatial frequency, which often correspond to edges. The Canny operator is known for its robustness to noise and ability to produce thin, continuous edges.
    • Active Contours (Snakes): These methods define a deformable curve (contour) that iteratively moves towards object boundaries by minimizing an energy function. The energy function typically includes internal energy (to maintain smoothness and continuity of the curve) and external energy (to pull the curve towards image features like gradients).
    • Graph Cuts: Models the image as a graph where pixels are nodes and edge weights represent similarity. Segmentation is achieved by finding a minimum cut in the graph, separating foreground and background. It's often used interactively.
  • Region Division: Groups pixels into regions based on similarity.
    • Thresholding: Segments an image by setting a threshold value. Pixels below the threshold are assigned to one region (e.g., background), and those above to another (e.g., foreground). Otsu's method is a common technique for automatically determining an optimal threshold.
    • Region Growing: Starts from a seed pixel and expands the region by adding neighboring pixels that satisfy a predefined similarity criterion (e.g., similar intensity or color).
    • Watershed Algorithm: Treats the image as a topographic landscape where intensity values represent altitudes. It simulates water rising from local minima, forming "dams" at watershed lines, which represent the segmentation boundaries.
  • Graph Theory: Represents images as graphs to leverage graph algorithms for segmentation.
    • Markov Random Fields (MRF): A probabilistic graphical model used to model the dependencies between pixels in an image. Each pixel is a node, and its label (foreground/background) depends on its neighbors. Segmentation involves finding the most probable labeling.
    • Minimum Spanning Tree (MST): Used for region merging where pixels or superpixels are nodes, and edge weights represent dissimilarity. Merging criteria can be applied to edges of the MST.
  • Clustering: Groups pixels in a feature space (e.g., color, intensity, location) into clusters, where each cluster corresponds to a segment.
  • K-means Clustering: An iterative algorithm that partitions data points into $K$ clusters. It assigns each data point to the cluster with the nearest mean (centroid) and then updates the centroids.
    • Mean-shift: A non-parametric clustering algorithm that finds modes (peaks) in the data density. It iteratively shifts each data point towards the mean of data points in its neighborhood.
    • Simple Linear Iterative Clustering (SLIC): An algorithm that efficiently generates superpixels by applying K-means clustering in a 5D feature space (L, a, b color values, and x, y pixel coordinates).

3.2.2. Co-Segmentation Methods

These methods extend classic segmentation by finding common foreground objects across a set of images, often without explicit annotation, falling under semi-supervised or weakly supervised paradigms. This leverages shared information between images to improve robustness.

  • MRF-Based Co-Segmentation [24,25,26,27,28,29,30]: Extends MRF segmentation by adding a co-segmentation term ($E_g$) to the energy function, penalizing inconsistency of foreground features across multiple images.
  • Random Walks-Based Co-Segmentation [33,34,35]: Extends the random walks model to multiple images, often by incorporating global terms or supervoxels for 3D data.
  • Active Contours-Based Co-Segmentation [38,39,40]: Adapts active contour models by constructing energy functions that consider foreground consistency across images and background inconsistency within each image.
  • Clustering-Based Co-Segmentation [41,42,43]: Applies spectral clustering or discriminative clustering to group superpixels or local regions, then propagates segmentation results across image sets.
  • Graph Theory-Based Co-Segmentation [44,45]: Builds digraphs where nodes represent local regions (e.g., superpixels), and edges represent similarity and saliency, converting co-segmentation to a shortest path problem.
  • Thermal Diffusion-Based Co-Segmentation [46,47]: Utilizes concepts from anisotropic diffusion and temperature maximization to achieve multi-category co-segmentation by maximizing segmentation confidence.
  • Object-Based Co-Segmentation [48,49,50]: Measures similarity between candidate foreground objects or uses object detection to identify common objects across images.

3.2.3. Semantic Segmentation Based on Deep Learning

This stage leverages deep neural networks to learn complex feature representations and perform end-to-end segmentation, achieving state-of-the-art performance, especially with large annotated datasets.

  • Patch Classification [53]: An early approach where small image patches were fed into a neural network (typically a CNN) for classification, and the central pixel of the patch was labeled according to the patch's class. This was computationally expensive and ignored context.
  • Fully Convolutional Networks (FCN) [54]: Pioneered end-to-end semantic segmentation by replacing the fully connected layers of traditional CNNs with convolutional layers, allowing for arbitrary input image sizes and producing a spatial output map. This laid the foundation for modern deep learning-based segmentation.
  • U-Net [64]: An encoder-decoder architecture with crucial skip connections that concatenate feature maps from the encoder to corresponding layers in the decoder, preserving fine-grained spatial information often lost during downsampling. It was originally designed for medical image segmentation.
  • SegNet [61]: Another encoder-decoder architecture that transfers max-pooling indices from the encoder to the decoder during upsampling, allowing for more precise boundary recovery.
  • DeepLab Series [65,66,67,68]: A family of models that introduced atrous convolution (also known as dilated convolution) to enlarge the receptive field without increasing computational cost or losing resolution. They also incorporated Atrous Spatial Pyramid Pooling (ASPP) for multiscale feature extraction and Conditional Random Fields (CRFs) for boundary refinement (in earlier versions).
  • PSPNet (Pyramid Scene Parsing Network) [72]: Employs a Pyramid Pooling Module (PPM) to aggregate context information from different regions and at various scales, capturing both local and global features.
  • Attention Mechanisms [78,79,81,83,85,86,87]: Integrate attention to allow models to selectively focus on relevant parts of the input, improving long-range dependency modeling and feature representation. Examples include RNN and LSTM based approaches, Attention U-Net, and self-attention modules.
  • Transformers (ViT, Swin Transformer) [92,93]: Originally from Natural Language Processing (NLP), Transformers are entirely based on self-attention mechanisms. Vision Transformer (ViT) applies Transformers to image recognition by treating image patches as sequences. Swin Transformer introduced a shifted windowing scheme to build hierarchical feature maps and efficiently compute self-attention locally while allowing for cross-window connections.

3.3. Technological Evolution

The technological evolution of image segmentation can be broadly summarized by three transitions:

  1. From low-level features to high-level semantics: Early methods focused on basic pixel properties like intensity and color, while modern deep learning approaches learn hierarchical, abstract features that capture semantic meaning.
  2. From single-image processing to multi-image context: Classic segmentation treated images in isolation. Co-segmentation recognized the value of common information across image sets. Deep learning implicitly learns generalizable features from vast datasets.
  3. From manual/heuristic design to adaptive learning: Classic segmentation often required manually crafted features or rule-based algorithms. Deep learning models, especially CNNs and Transformers, adaptively learn features and segmentation rules directly from data.
  4. From fixed-size inputs to arbitrary-size inputs (FCN): Traditional CNNs for classification had fully connected layers requiring fixed input sizes. FCNs made end-to-end semantic segmentation possible for images of any dimension.
  5. From CNN-centric to Transformer-centric (Current): While CNNs dominated for years, Transformers are now gaining significant traction due to their ability to model long-range dependencies effectively, potentially leading to new breakthroughs.

3.4. Differentiation Analysis

This paper differentiates itself from previous reviews (e.g., [5,7,8]) by:

  • Evolutionary Perspective: It explicitly structures the review around the historical development of image segmentation, from its classic roots through co-segmentation to modern deep learning, providing a coherent narrative of how the field has progressed.
  • Comprehensive Coverage: It aims to cover the full spectrum of image segmentation, not just semantic segmentation or specific aspects like evaluation metrics.
  • Systematic Categorization: The paper categorizes methods based on underlying principles and data characteristics (classic, collaborative, deep learning-based semantic), offering a clear framework for understanding the diverse techniques.
  • Focus on Key Techniques within Deep Learning: Beyond just listing models, it breaks down the core technical innovations within deep learning architectures (e.g., encoder-decoder, skip connections, dilated convolution, multiscale feature extraction, attention mechanisms), which is crucial for a beginner-friendly deep dive.

4. Methodology

The paper systematically reviews image segmentation methods across three historical stages: classic segmentation, co-segmentation, and semantic segmentation based on deep learning. Each stage represents a significant shift in the underlying principles, feature extraction techniques, and the type of image data processed.

4.1. Classic Segmentation Methods

Classic segmentation algorithms were primarily developed for grayscale images and focused on identifying regions based on gray-level similarity (for region division) or gray-level discontinuity (for edge detection). For color images, they often involved an initial step of segmenting into superpixels based on pixel similarity, followed by merging.

4.1.1. Edge Detection

Edge detection aims to locate points where gray-level changes sharply, indicating boundaries between different regions. These are also known as parallel boundary techniques.

  • Differential Operators: These operators approximate the derivative of the gray-level function to identify sharp changes.

    • Sobel Operator: Computes the gradient magnitude using a pair of $3 \times 3$ convolution kernels, one for horizontal (SobelX) and one for vertical (SobelY) intensity gradients. An example is shown in Figure 2.

    • Kirsch Operator: Uses 8 directional convolution kernels to detect edges in various orientations.

    • Roberts Cross Operator: A simple $2 \times 2$ operator for detecting edges, typically applied diagonally.

    • Canny Operator: A multi-stage algorithm involving Gaussian smoothing (for noise reduction), gradient magnitude and direction calculation, non-maximum suppression (to thin edges), and hysteresis thresholding (to connect edges). It is noted for its strong denoising ability and good continuity and fineness of edges.

    • Laplacian Operator: A second-order derivative operator that detects zero-crossings in the second derivative, often used to find locations of rapid intensity change.

    • Limitations: Differential operators are sensitive to noise and may produce discontinuous boundaries, especially in high-detail regions, so pre-smoothing is often required. The Canny operator is more complex but performs better. A runnable sketch of these operators appears at the end of this subsection.

      The following figure (Figure 2 from the original paper) shows the edge detection results of different differential operators.

      Figure 2. Edge detection results of different differential operators: (a) Original, (b) SobelX, (c) SobelY, (d) Sobel, (e) Kirsch, (f) Roberts, (g) Canny, and (h) Laplacian.

  • Serial Boundary Techniques: These methods connect individual edge points to form closed boundaries.

    • Graph-searching algorithms: Represent edge points as nodes in a graph and search for minimum-cost paths that form closed boundaries. These are computationally intensive.
    • Dynamic Programming (DP): Utilizes heuristic rules to reduce the computational cost of graph searching.
  • Active Contours (Snakes): This method approximates object contours by evolving an initial closed curve (initial contours) towards local image features (e.g., strong gradients). It minimizes an energy function that typically balances internal energy (for curve smoothness) and external energy (for alignment with image features).

    • Limitations: Sensitive to the initial contour's placement and prone to local minima, making it difficult to converge to concave boundaries. Lankton and Tannenbaum [9] proposed a framework using local segmentation energy to address some of these issues.
  • Graph Cuts: An interactive segmentation method where pixels are represented as nodes in a graph. Source nodes are marked as foreground, and sink nodes as background. Edge weights between nodes represent pixel similarity or fit to foreground/background. Segmentation is achieved by finding a min-cut (a set of edges whose removal disconnects source from sink with minimum total weight), which corresponds to minimizing an energy function. It's an NP-hard problem, requiring efficient approximation algorithms like swap or expansion algorithms. Freedman [10] combined graph cuts with shape prior knowledge to improve accuracy.
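To make the differential operators above concrete, here is a minimal OpenCV sketch (not from the paper; the file path and thresholds are illustrative) producing Sobel, Canny, and Laplacian edge maps like those in Figure 2:

```python
import cv2
import numpy as np

# Illustrative input; any grayscale image works.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Sobel: first-order gradients along x and y, combined into a magnitude map.
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)  # horizontal gradient (SobelX)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)  # vertical gradient (SobelY)
sobel = cv2.convertScaleAbs(np.sqrt(gx ** 2 + gy ** 2))

# Canny: Gaussian smoothing, gradient computation, non-maximum suppression,
# then hysteresis thresholding (100/200 are illustrative hysteresis bounds).
canny = cv2.Canny(cv2.GaussianBlur(img, (5, 5), 1.4), 100, 200)

# Laplacian: second-order derivative; zero-crossings mark sharp intensity change.
laplacian = cv2.convertScaleAbs(cv2.Laplacian(img, cv2.CV_64F, ksize=3))
```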

4.1.2. Region Division

Region division strategies group pixels into homogeneous regions.

  • Thresholding: A parallel region division algorithm that segments an image based on gray-level values. An optimal grayscale threshold is determined, often by analyzing the gray histogram to maximize the discriminability between categories (e.g., using zeroth-order or first-order cumulant moments).
  • Serial Region Techniques: These involve multiple sequential steps for region segmentation.
    • Region Growing: Starts with seed pixels and iteratively adds neighboring pixels that share similar features (e.g., gray value) until no more pixels can be merged.
    • Region Merging: Similar to region growing, but typically starts with many small regions (e.g., superpixels) and merges adjacent regions if their similarity (e.g., difference in average gray value) is below a threshold. It can handle noise and occlusion but has high computational cost and difficulties in defining stopping rules.
  • Watershed Algorithm: Based on topographic concepts. It divides an image by simulating water rising from local minima, forming "dams" at the watershed lines (segmentation boundaries). It provides closed contours and is efficient but prone to false segmentation in complex images (over-segmentation). This can be mitigated by using a Gaussian mixture model (GMM). It's effective for medical images with overlapping cells.
  • Superpixels: Superpixels are small, irregular regions composed of pixels with similar properties (e.g., brightness, color, texture, position). They serve as a preprocessing step to reduce image complexity by operating on perceptual regions rather than individual pixels. Methods include clustering and graph theory.
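The thresholding and watershed steps above combine naturally in the standard OpenCV marker-based watershed pipeline, one common way to curb the over-segmentation noted above. A minimal sketch follows (the file path, kernel size, and the 0.5 distance cutoff are illustrative choices, not values from the paper):

```python
import cv2
import numpy as np

img = cv2.imread("cells.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method picks the threshold that maximizes between-class variance
# of the gray histogram, so no manual threshold is needed.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Distance-transform peaks give sure-foreground seeds; dilation gives sure
# background; the band in between is left for the watershed to decide.
dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = np.uint8(sure_fg)
sure_bg = cv2.dilate(binary, np.ones((3, 3), np.uint8), iterations=3)
unknown = cv2.subtract(sure_bg, sure_fg)

_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1            # reserve label 0 for the unknown region
markers[unknown == 255] = 0
markers = cv2.watershed(cv2.cvtColor(img, cv2.COLOR_GRAY2BGR), markers)
# Pixels labeled -1 lie on the watershed lines (segment boundaries).
```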

4.1.3. Graph Theory

Image segmentation based on graph theory maps an image to a weighted graph.

  • Pixels or regions become vertices, and the similarity between them becomes edge weights.
  • Segmentation is then framed as a problem of dividing vertices in the graph to obtain optimal segmentation (e.g., min-cut).
  • Graph-based Region Merging: Uses metrics to achieve optimal global grouping. Felzenszwalb et al. [11] used a minimum spanning tree (MST) to merge pixels after image representation as a graph.
  • Markov Random Fields (MRF): Introduces probabilistic graphical models (PGMs) to represent the randomness of lower-level features. An undirected graph is built where each vertex is a feature and each edge represents a relationship. The Markov property states that a feature at any point is only related to its adjacent features. Segmentation seeks the most likely labeling of pixels under this probabilistic model.
  • Spectral Graph Partitioning: Leordeanu et al. [12] used spectral graph partitioning to find correspondence between feature sets, building an adjacency matrix $M$ and using its principal eigenvectors to recover correct assignments.
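As a concrete example of graph-based region merging, the Felzenszwalb-Huttenlocher algorithm [11] is available in scikit-image. The sketch below (parameter values are illustrative) merges pixels along a minimum-spanning-tree ordering:

```python
from skimage import io
from skimage.segmentation import felzenszwalb

img = io.imread("input.png")  # illustrative path; an RGB image

# Pixels are graph nodes and edge weights are color differences; regions
# are merged along the MST when the connecting edge is cheaper than the
# internal variation of either region (controlled by `scale`).
labels = felzenszwalb(img, scale=100, sigma=0.8, min_size=50)
print(labels.max() + 1, "regions")
```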

4.1.4. Clustering Method

Clustering algorithms group pixels or superpixels into segments based on feature similarity.

  • K-means Clustering: An iterative algorithm that partitions data into $K$ clusters.

    • Algorithm:
      1. Initialize $K$ points as cluster centers.
      2. Calculate the distance between each pixel and the $K$ cluster centers, assigning each pixel to the cluster with the minimum distance.
      3. Recalculate the cluster centers by averaging the pixels in each cluster (moving the center to the centroid).
      4. Repeat steps 2 and 3 until the algorithm converges (cluster assignments no longer change significantly).
    • Advantages: Noise robustness, quick convergence.
    • Limitations: Not suitable for non-adjacent regions, converges only to local optimum, sensitive to initial center selection.
  • Mean-shift [13]: A density estimation-based clustering algorithm that models the image feature space as a probability density function and finds modes (high-density areas).

  • Fuzzy C-means [14]: An extension of K-means that assigns pixels to clusters with a degree of membership, allowing pixels to belong to multiple clusters simultaneously, and integrates spatial information.

  • Spectral Clustering: A graph theory-based clustering method that divides a weighted graph into subgraphs with low coupling and high cohesion.

  • Simple Linear Iterative Clustering (SLIC) [15]: Uses K-means in a 5D space (L, a, b color channels and x, y coordinates) to efficiently generate superpixels. The results are shown in Figure 3.

  • Linear Spectral Clustering (LSC) [16]: A superpixel segmentation algorithm that maps pixel coordinates and values into a high-dimensional space using a kernel function, then weights points to obtain an optimal solution aligned with K-means and normalized cut objectives.

    The following figure (Figure 3 from the original paper) shows SLIC segmentation results (number of superpixels: 10, 20, 50, and 100).

    Figure 3. SLIC segmentation results (number of superpixels: 10, 20, 50, and 100). As the number of superpixels increases, finer image detail is captured.
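Figure 3 can be reproduced with the scikit-image implementation of SLIC; a minimal sketch (the compactness value is an illustrative default) follows:

```python
from skimage import io
from skimage.segmentation import slic, mark_boundaries

img = io.imread("input.png")  # illustrative path; an RGB image

# SLIC runs localized K-means in the 5D (L, a, b, x, y) space;
# compactness trades color similarity against spatial proximity.
for n in (10, 20, 50, 100):   # the superpixel counts used in Figure 3
    labels = slic(img, n_segments=n, compactness=10)
    overlay = mark_boundaries(img, labels)  # visualize superpixel borders
```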

4.1.5. Random Walks

Random walks is a graph theory-based segmentation algorithm that assigns labels to pixels based on the probability of a random walker reaching pre-marked foreground or background seed points.

  • Grady et al. [20] transformed segmentation into a discrete Dirichlet problem. The image is converted into a connected, weighted undirected graph, and foreground and background are marked with seed points. For each unmarked pixel, the algorithm calculates the probability that a random walk starting from that pixel reaches a foreground seed or a background seed first, and the pixel is assigned the category with the highest probability.
  • Yang et al. [21] proposed a constrained random walks algorithm where user input (foreground/background scribbles, hard/soft boundary constraints) guides the segmentation.
  • Lai et al. [22] extended random walks to 3D mesh images, defining edge weights by dihedral angles between adjacent faces. Zhang et al. [23] improved this with a fast geodesic curvature flow (FGCF) algorithm, reducing vertices and smoothing contours.
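scikit-image ships an implementation of Grady's random-walks formulation; the sketch below (seed placement and the beta edge-weighting parameter are illustrative) labels each unmarked pixel with the seed class it is most likely to reach first:

```python
import numpy as np
from skimage import io
from skimage.segmentation import random_walker

img = io.imread("input.png", as_gray=True)  # illustrative path

# Seed labels: 0 = unlabeled, 1 = background, 2 = foreground.
seeds = np.zeros(img.shape, dtype=np.uint8)
seeds[5:10, 5:10] = 1      # illustrative background scribble
seeds[40:45, 40:45] = 2    # illustrative foreground scribble

# Each unlabeled pixel receives the label whose seeds a random walker,
# biased by image gradients (beta), reaches with the highest probability.
labels = random_walker(img, seeds, beta=130, mode='bf')
```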

4.2. Co-Segmentation Methods

Co-segmentation (or collaborative segmentation), introduced by Rother et al. [24] in 2006, aims to extract common foreground regions from a set of images without human intervention, leveraging prior knowledge shared across the images. This approach is semi-supervised or weakly supervised.

The general extended model for co-segmentation is expressed by an energy function $E$: $ E = E_s + E_g $ Where:

  • $E_s$: The energy function for seed image segmentation. This term describes the difference between the foreground and background within a single image and the smoothness of its segmentation; it is essentially the energy function from classic segmentation methods.
  • $E_g$: The energy function for co-segmentation. This term describes the similarity between the foregrounds across the entire set of images, enforcing consistency. The goal is to minimize $E$ to achieve good co-segmentation.

4.2.1. MRF-Based Co-Segmentation

Rother et al. [24] extended the MRF segmentation by utilizing prior knowledge to solve ill-posed problems in multiple image segmentation. They first segment a seed image and assume foreground objects across the image set are similar. The MRF segmentation energy $E_s^{MRF}$ is typically composed of a unary potential and a pairwise potential: $ E_s^{MRF} = E_u^{MRF} + E_p^{MRF} $ Where:

  • $E_u^{MRF} = \sum_{x_i} E_u(x_i)$: The unary potential measures the property of a pixel itself, representing the probability of pixel $i$ belonging to class $x_i$ given its feature $y_i$.
  • $E_p^{MRF} = \sum_{x_i, x_j \in \Psi} E_p(x_i, x_j)$: The pairwise potential measures the relationship between adjacent pixels, representing the probability that two adjacent pixels $i$ and $j$ belong to the same category. The co-segmentation term $E_g$ penalizes inconsistencies in foreground color histograms across the image set.
  • Optimization: Subsequent research focused on optimizing global constraints. Vicente et al. [25] used a multiscale decomposition for an extended Boykov-Jolly model. Rubio et al. [28] introduced high-order graph matching into MRF for global terms. Chang et al. [29] proposed a universal significance measure to add foreground positional information. Yu et al. [30] combined a co-saliency model with a Gaussian mixture model (GMM) for dissimilarity between foreground objects, minimized iteratively via graph cuts.
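As a rough illustration of this energy (not the authors' implementation), the co-segmentation term of [24] can be sketched in numpy as the L1 distance between normalized foreground color histograms (the "L1 norm" co-information of Table 1), assuming 8-bit RGB images and binary masks:

```python
import numpy as np

def foreground_histogram(image, mask, bins=16):
    """Color histogram of the pixels currently labeled as foreground."""
    fg = image[mask.astype(bool)]             # (N, 3) foreground pixels
    hist, _ = np.histogramdd(fg, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist / max(hist.sum(), 1)          # normalize to a distribution

def co_segmentation_term(img1, mask1, img2, mask2):
    """E_g: L1 distance between the two foreground histograms.

    Minimizing E_s + E_g over candidate labelings (e.g., by graph cuts)
    favors segmentations whose foregrounds look alike across images.
    """
    h1 = foreground_histogram(img1, mask1)
    h2 = foreground_histogram(img2, mask2)
    return np.abs(h1 - h2).sum()
```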

4.2.2. Co-Segmentation Based on Random Walks

  • Collins et al. [33] extended random walks to co-segmentation, optimizing with quasiconvexity and providing a CUDA library for sparse feature operations.
  • Fabijanska et al. [34] proposed an optimized random walks for 3D voxel image segmentation using supervoxels to save time and memory.
  • Dong et al. [35] introduced a subMarkov random walks (subRW) algorithm with prior label knowledge, effective for slender objects.

4.2.3. Co-Segmentation Based on Active Contours

  • Meng et al. [38] extended active contours by constructing an energy function based on foreground consistency between images and background inconsistency within images, solved by level set methods.
  • Zhang et al. [39] proposed a deformable co-segmentation algorithm for brain MRI segmentation, transforming brain anatomy priors into constraints and minimizing energy via level set.
  • Zhang et al. [40] introduced image saliency into active contours and used a level set optimization based on superpixels, hierarchical computing, and convergence judgment.
  • Limitations: The unidirectional movement of active contours limits flexibility, making it difficult to segment objects with weak edges.

4.2.4. Clustering-Based Co-Segmentation

This is an extension of single-image clustering segmentation.

  • Joulin et al. [41] used spectral clustering for single-image segmentation based on local spatial information, then discriminative clustering to propagate results across an image set.

  • Kim et al. [42] divided images into superpixels, represented their relevance with a weighted graph and affinity matrix, and used spectral clustering for co-segmentation. This hierarchical graph clustering is illustrated in Figure 5.

  • Joulin et al. [43] used spectral clustering based on feature positions and color vectors for local information, then expectation maximization (EM) to minimize a classification discriminant function for multi-object, multi-class co-segmentation.

    The following figure (Figure 5 from the original paper) shows an illustration of hierarchical graph clustering constructed between two images.

    Figure 5. An illustration of hierarchical graph clustering constructed between two images, from the original images through weighted-graph construction (with weight matrix $W$ and constraints $C$) to the final segmentations. Figure from [42].

4.2.5. Co-Segmentation Based on Graph Theory

  • Meng et al. [44] constructed a digraph where nodes were local regions (from object detection, not pixels or superpixels). Directed edges represented local region similarity and saliency maps. Co-segmentation became a shortest path problem, solved by dynamic programming (DP). The framework is shown in Figure 6.

  • Meng et al. [45] proposed a co-saliency model for pairwise-constrained images, extracting dual-constrained saliency maps (single-image and multiple-image saliency) via pairwise-constrained graph matching, solved by DP.

    The following figure (Figure 6 from the original paper) shows the framework of the co-segmentation based on the shortest path algorithm.

    Figure 6. Framework of the co-segmentation based on the shortest path algorithm, covering local region generation, graph construction, and the shortest-path search, together with the original images, saliency maps, and final outputs. Figure from [44].

4.2.6. Co-Segmentation Based on Thermal Diffusion

  • This method maximizes system temperature by changing heat source locations to achieve optimal segmentation. Anisotropic diffusion is often used for noise reduction while preserving edges.
  • Kim et al. [46] proposed CoSand, using temperature maximization modeling on anisotropic diffusion to achieve large-scale multi-category co-segmentation by maximizing segmentation confidence.
  • Kim et al. [47] achieved multi-foreground co-segmentation by iteratively performing scene modeling (local feature extraction with spatial pyramid matching, linear SVM for matching, GMM for classification) and region labeling.

4.2.7. Object-Based Co-Segmentation

  • Alexe et al. [48] quantified the possibility of an image window containing objects of any category, using Bayesian theory to find high-scoring windows as feature calibration.

  • Vicente et al. [49] measured similarity between foreground objects, extracting top-scoring features from multiple candidate classes.

  • Meng et al. [50] proposed a multi-group image co-segmentation framework using MRF and a dense mapping model, solved by EM, to achieve multi-foreground recognition by generating accurate prior knowledge.

    Table 1 of the original paper, which compares the main co-segmentation methods by foreground feature, co-information strategy, and optimization technique, is reproduced and analyzed in Section 6.1.

4.3. Semantic Segmentation Based on Deep Learning

With the increased complexity of images and computational power, deep learning methods have become dominant. Early approaches like patch classification [53] were limited, but Fully Convolutional Networks (FCNs) [54] revolutionized the field by enabling end-to-end semantic segmentation for arbitrary image sizes.

The following figure (Figure 7 from the original paper) shows the Fully Convolutional Networks architecture.

Figure 7. Fully convolutional networks architecture: input, encoder, up-sampling, and output, with skip connections integrating predictions from different layers.

4.3.1. Encoder-Decoder Architecture

This architecture is fundamental to many modern semantic segmentation networks, building upon FCNs.

  • Encoder Stage: Typically composed of convolutional and pooling operations.
    • Convolution Operation: Involves sliding a convolutional kernel (a small matrix of weights) over the image, performing element-wise multiplication with the pixels in the receptive field, summing the results, and applying an activation function (e.g., ReLU) to produce a feature map. This extracts hierarchical features.
    • Pooling Operation: Downsamples feature maps to reduce spatial dimensions and computation, while retaining important information. Max-pooling selects the maximum value in a pooling window.
    • Backbone Networks: Pre-trained image classification CNNs (e.g., VGG [57], Inception [58,59], ResNet [60]) are commonly used as encoders to extract high-dimensional semantic features.
  • Decoder Stage: Aims to generate a semantic segmentation mask by mapping the high-dimensional features back to the original image size. This involves up-sampling.
    • Interpolation: A simple up-sampling method that inserts new pixels between existing ones using strategies like bilinear or bicubic interpolation. It does not require learned parameters.
    • Deconvolution (Transposed Convolution): This operation reverses the spatial effect of convolution, expanding the resolution of the feature map through learned parameters. FCNs use deconvolution for up-sampling.
    • Unpooling: Used in networks like SegNet [61]. During the max-pooling operation in the encoder, the indices (locations) of the maximum values within each pooling window are recorded. Unpooling uses these indices to place the max-pooled values back into their original positions in the decoder, setting other positions to zero, thus preserving boundary information.
    • Dense Up-sampling Convolution (DUC) [62]: Converts label mapping in a feature map into smaller label mappings with multiple channels, achieved directly by convolutions without extra interpolation.
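A minimal PyTorch sketch of the encoder-decoder idea (layer sizes are illustrative, not taken from any reviewed model): convolution and pooling shrink the feature map, and learned transposed convolutions restore input resolution for per-pixel class scores.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Minimal FCN-style encoder-decoder for n_classes-way segmentation."""
    def __init__(self, n_classes=21):
        super().__init__()
        # Encoder: convolution + ReLU + max-pooling halves resolution twice.
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Decoder: transposed convolutions learn the 4x up-sampling back to
        # input resolution, ending in a per-pixel class score map.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

scores = TinyEncoderDecoder()(torch.randn(1, 3, 128, 128))
print(scores.shape)  # (1, 21, 128, 128); any side divisible by 4 works
```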

4.3.2. Skip Connections

Skip connections (or shortcut connections) address the degradation problem in deep networks (where performance decreases with depth) and improve pixel positioning by providing direct pathways for information flow.

  • ResNet [60] and DenseNet [63]: Introduced skip connections to allow gradients to flow more easily through the network, helping to train very deep models.

  • U-Net [64]: Proposed a novel long skip connection architecture, as shown in Figure 8. It concatenates feature maps from the encoder (which contain fine-grained spatial details) to corresponding decoder layers (which contain high-level semantic information). This fusion helps the decoder to produce precise segmentations with accurate boundaries. U-Net was initially designed for biomedical image segmentation and is widely adopted in medical image analysis.

    The following figure (Figure 8 from the original paper) shows the U-Net architecture.

    Figure 8. U-Net architecture. Figure from [64].
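The essence of U-Net's long skip connection is channel-wise concatenation of same-resolution encoder and decoder features. A minimal PyTorch sketch of one decoder step (channel counts are illustrative):

```python
import torch
import torch.nn as nn

# Up-sample the deep feature map, then concatenate the same-resolution
# encoder feature map along channels before convolving, so fine spatial
# detail re-enters the decoding path.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
conv = nn.Conv2d(64 + 64, 64, kernel_size=3, padding=1)

enc_feat = torch.randn(1, 64, 64, 64)     # encoder output (skip connection)
dec_feat = torch.randn(1, 128, 32, 32)    # deeper decoder input

x = up(dec_feat)                           # (1, 64, 64, 64)
x = conv(torch.cat([x, enc_feat], dim=1))  # fuse skip + decoder features
```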

4.3.3. Dilated Convolution

Dilated convolution (also known as atrous convolution) expands the receptive field of convolutional kernels without increasing the number of parameters or losing spatial resolution. This is achieved by inserting "holes" or gaps between kernel elements.

  • DeepLab V1 [65]: Replaced max-pooling layers with dilated convolution to maintain high resolution of feature maps and address the loss of transfer invariance. It also used fully connected Conditional Random Fields (CRFs) for post-processing to refine segmentation boundaries and capture multi-scale context.
  • Multi-Scale Context Aggregation [69]: Yu et al. [69] used dilated convolution to aggregate multiscale context information with a context module applying 3x3 convolutional kernels with varying dilation factors (e.g., [1, 1, 2, 4, 8, 16, 1]).
  • Dilated Residual Network (DRN) [70]: Based on ResNet, DRN removed downsampling in the later convolutional groups (G4 and G5) and instead applied dilated convolutions with rates $r = 2$ and $r = 4$, respectively, to maintain spatial resolution.
  • Hybrid Dilated Convolution (HDC) [62]: Proposed to address the gridding problem (when dilated convolutions with the same rate cause a sparse, checkerboard-like sampling of information). HDC uses different dilation rates for consecutive layers, ensuring that the final receptive field completely covers a square region without holes.
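In PyTorch, dilation is a single argument of nn.Conv2d. The sketch below shows that rates 1, 2, and 5 (an illustrative HDC-style rate schedule) enlarge the receptive field while the parameter count and output resolution stay fixed:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# A 3x3 kernel with dilation d has an effective extent of 2d+1 per layer,
# yet still only 9 weights; padding = dilation keeps resolution unchanged.
for d in (1, 2, 5):   # distinct consecutive rates, in the spirit of HDC
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d)
    print(d, conv(x).shape)   # (1, 64, 32, 32) at every rate
```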

4.3.4. Multiscale Feature Extraction

Capturing multiscale features and context information is crucial for segmenting objects of varying sizes and understanding scene layout.

  • Spatial Pyramid Pooling (SPP) [71]: Introduced to overcome the fixed-size input requirement of CNNs by pooling features from multiple spatial bins, producing a fixed-length output regardless of input size. It proved effective for semantic segmentation and object detection.

  • Pyramid Scene Parsing Network (PSPNet) [72]: Utilized a Pyramid Pooling Module (PPM) to extract and aggregate features at different scales, combining local and global context information. As shown in Figure 9, PPM takes the final feature map from a backbone network (e.g., ResNet), applies pooling at various scales, convolutions to reduce dimensionality, and then up-samples and concatenates these features. The number of pyramid layers and their sizes are adaptable.

    The following figure (Figure 9 from the original paper) shows the PSPNet with the pyramid pooling module.

    Figure 9. The PSPNet with the pyramid pooling module. Figure from [72].

  • DeepLab V2 [66]: Introduced Atrous Spatial Pyramid Pooling (ASPP) to capture multiscale features by applying parallel dilated convolutions with different dilation rates, as shown in Figure 10 and sketched in code at the end of this subsection.

  • DeepLab V3 [67]: Further refined ASPP by applying both cascade modules and parallel modules of dilated convolution, grouping parallel convolutions, and adding a 1x1 convolution layer and batch normalization within ASPP. It significantly improved performance without DenseCRF post-processing.

  • DeepLab V3+ [68]: A new encoder-decoder structure using DeepLab V3 as the encoder and Xception as the backbone. It adopted dilated depth-wise separable convolutions for efficient feature extraction and batch normalization to refine segmentation boundaries.

    The following figure (Figure 10 from the original paper) shows the Atrous spatial pyramid pooling module.

    Figure 10. Atrous spatial pyramid pooling module: parallel 3x3 dilated convolutions with rates 6, 12, 18, and 24 applied to the input feature map. Figure from [66].

  • Feature Pyramid Network (FPN) [74]: Similar to U-Net's skip connections, FPN constructs a feature pyramid with both high resolution (from shallow layers) and strong semantics (from deep layers), beneficial for object detection with varied object sizes.

  • Adaptive Pyramid Context Network (APCNet) [75]: Uses multiple Adaptive Context Modules (ACMs) to build multiscale contextual feature representations. Each ACM uses global image representation to estimate local affinity weights for subregions and calculates optimal context vectors.

  • Enhanced Feature Pyramid Network (EFPN) [76]: Combines a Semantic Enhancement Module (SEM), Edge Extraction Module (EEM), and Context Aggregation Module (CAM) in the decoder, and a Global Fusion Module (GFM) in the encoder to improve robustness of multi-level feature fusion and capture deep semantic information.

  • FPANet (Feature Pyramid Aggregation Network) [77]: A real-time encoder-decoder model. The encoder uses ResNet and ASPP. The decoder uses a Semantic Bidirectional Feature Pyramid Network (SeBiFPN) with a lightweight feature pyramid fusion module (FPFM) to fuse semantic and spatial information across different levels.
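A minimal sketch of ASPP, using the rates 6, 12, 18, and 24 shown in Figure 10 (channel counts are illustrative; concatenation followed by a 1x1 projection is one common fusion choice):

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Sketch of atrous spatial pyramid pooling: parallel dilated branches."""
    def __init__(self, in_ch=256, out_ch=64, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # 1x1 fusion

    def forward(self, x):
        # Each branch sees the same feature map at a different receptive
        # field; concatenation mixes the multiscale context.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

y = MiniASPP()(torch.randn(1, 256, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```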

4.3.5. Attention Mechanisms

Attention mechanisms, originally from Natural Language Processing (NLP), allow models to selectively focus on relevant parts of the input, improving the modeling of dependencies between regions (especially long-distance ones) and channels.

  • RNN-based Attention [78,79]: Recurrent Neural Networks (RNNs) can model short-term dependencies. The ReSeg network [79] (based on ReNet [80]) uses RNNs sweeping horizontally and vertically across the image to capture global context. The ReSeg architecture is shown in Figure 11.

  • LSTM-based Attention [81,82]: Long Short-Term Memory (LSTM) networks extend RNNs with memory cells to model long-distance dependencies. Byeon et al. [81] used LSTM for pixel-for-pixel segmentation. Liang et al. [82] proposed graph LSTM to enhance global context visual features.

    The following figure (Figure 11 from the original paper) shows the ReSeg architecture.

    Figure 11. The ReSeg architecture: the input image passes through stacked convolutional and recurrent layers, producing the final segmentation output. Figure from [79].

  • Attention U-Net [83]: Introduced attention gates (AGs) into the U-Net architecture, as shown in Figure 12. Before concatenating encoder features with decoder features, AG modules supervise encoder features using decoder features, adaptively readjusting output features. AGs generate a gated signal to suppress irrelevant background regions and highlight salient features.

  • Attention UW-Net [84]: Improves U-Net with dense skip connections and modified attention gates for medical chest X-ray images, enhancing attention to salient regions and suppressing background.

    The following figure (Figure 12 from the original paper) shows the attention U-Net architecture.

    Figure 12. The attention U-Net architecture: convolution, up-sampling, max-pooling, and attention gates, with skip connections preserving feature information. Figure from [83].

  • Self-Attention Mechanisms: Used in encoder networks to model correlations between different pixels or channels within a single feature map. It computes a weighted sum of pairwise affinities across all positions.

    • Influential achievements include PSANet [85], DANet [86], APCNet [75], CARAFE [87], and CARAFE++ [88].
  • Transformer [89]: A deep neural network based solely on self-attention, entirely dispensing with convolutions and recurrence.

    • Vision Transformer (ViT) [92]: Applied Transformers to image recognition. It divides images into fixed-size patches, flattens them into a sequence vector, and inputs them into a Transformer encoder (composed of multi-head attention layers and multi-layer perceptrons (MLPs)). The ViT model is shown in Figure 13.

    • Swin Transformer [93]: Achieved impressive performance in image semantic segmentation and instance segmentation. It introduced a shifted windowing approach, calculating self-attention within local windows and using cyclic-shifting window partitions to introduce cross-window connections between neighboring non-overlapping windows. The Swin Transformer architecture is shown in Figure 14.

      The following figure (Figure 13 from the original paper) shows the ViT model.

      Figure 13. The ViT model: input patches are linearly projected and fed to a Transformer encoder built from multi-head attention and MLP blocks. Figure from [92].

The following figure (Figure 14 from the original paper) shows the architecture of the Swin Transformer.

Figure 14. The architecture of the Swin Transformer: patch partition, Swin Transformer blocks, and patch merging stages. Figure from [93].
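The two Transformer ingredients described above, patch tokenization and multi-head self-attention, can be sketched in a few lines of PyTorch (ViT-Base-like sizes are assumed: 16x16 patches, 768-dimensional embeddings, 12 heads):

```python
import torch
import torch.nn as nn

# ViT-style tokenization: split the image into 16x16 patches and embed each
# patch as a vector, so a standard Transformer encoder can attend over them.
img = torch.randn(1, 3, 224, 224)
to_tokens = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # patch embedding
tokens = to_tokens(img).flatten(2).transpose(1, 2)        # (1, 196, 768)

# Multi-head self-attention: every patch attends to every other patch,
# directly modeling the long-range dependencies discussed above.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape)  # torch.Size([1, 196, 768])
```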

5. Experimental Setup

As a review paper, this document does not present new experimental results but summarizes the experimental setups and outcomes of the reviewed works, particularly in the semantic segmentation section.

5.1. Datasets

The paper mentions and lists various datasets used by the reviewed deep learning models in Table 2. These datasets cover different domains and characteristics:

  • PASCAL VOC (Visual Object Classes) [54,65,66,67,68,70,72,75,76,85,86]: A widely used dataset for object detection, segmentation, classification, and action recognition. It contains natural images with annotations for 20 foreground object classes and one background class.

  • NYUDv2 [54]: An indoor scene dataset providing RGB-D (color and depth) images.

  • PhC-U373 [64]: A dataset for phase contrast microscopy images of glioblastoma-astrocytoma cells, used in medical image segmentation.

  • DIC-HeLa [64]: A dataset for differential interference contrast microscopy images of HeLa cells, also for medical image segmentation.

  • CamVid [61,64,73,79,77]: A road scene understanding dataset with pixel-level semantic labels for driving scenarios, crucial for autonomous vehicles.

  • SUN RGBD [61]: A dataset for 3D scene understanding in indoor environments, providing RGB-D images and semantic segmentation labels.

  • Cityscapes [61,62,66,67,70,72,77,86]: A large-scale dataset for urban street scenes, providing semantic, instance, and panoptic annotations for understanding driving environments.

  • ADE20K [75,85,87,88]: A comprehensive dataset for scene parsing, containing diverse images with detailed pixel-level annotations for various object and stuff categories.

  • PASCAL Context [75,86]: Extends PASCAL VOC by providing pixel-level labels for the entire image, including both objects and "stuff" (e.g., grass, sky).

  • COCO Stuff [86]: Similar to PASCAL Context, provides pixel-level annotations for both objects and "stuff" categories from the COCO dataset.

  • TCIA Pancreas CT-82 [83]: A medical imaging dataset of CT scans for pancreas segmentation.

  • NIH Chest X-ray [84]: A medical imaging dataset of chest X-ray images, often used for tasks like lung segmentation or pathology detection.

    These datasets are chosen to validate methods across different segmentation tasks (general natural scenes, medical, autonomous driving, indoor 3D) and data modalities (RGB, RGB-D, microscopy, CT, X-ray). They are widely accepted benchmarks in their respective domains for evaluating image segmentation performance.

5.2. Evaluation Metrics

The primary evaluation metric used in the context of semantic segmentation results presented in Table 2 is Mean Intersection over Union (mIoU).

  • Conceptual Definition: Intersection over Union (IoU), also known as the Jaccard Index, is a standard metric used to evaluate the accuracy of an object detector or segmenter on a particular dataset. It quantifies the overlap between the predicted segmentation mask and the ground truth mask. Mean IoU (mIoU) extends this by calculating the IoU for each class present in the dataset and then averaging these IoU values over all classes. It provides a robust measure of segmentation quality, balancing the detection of objects and the accuracy of their boundaries. A higher mIoU indicates better segmentation performance.
  • Mathematical Formula: The IoU for a single class $C$ is calculated as: $ \mathrm{IoU}_C = \frac{|P_C \cap G_C|}{|P_C \cup G_C|} $ Where:

    • $P_C$: the set of pixels predicted as belonging to class $C$.

    • $G_C$: the set of pixels actually belonging to class $C$ (ground truth).

    • $|\cdot|$: the cardinality (number of pixels) of a set.

    • $\cap$: the intersection of two sets (pixels correctly predicted for class $C$).

    • $\cup$: the union of two sets (all pixels that are either predicted as class $C$ or actually belong to class $C$).

      The mean IoU (mIoU) is then obtained by averaging the IoU values over all $N_c$ classes: $ \mathrm{mIoU} = \frac{1}{N_c} \sum_{C=1}^{N_c} \mathrm{IoU}_C $ Where:

    • $N_c$: the total number of classes.

    • $\mathrm{IoU}_C$: the Intersection over Union for class $C$.
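
To make the computation concrete, here is a minimal NumPy sketch (the function name `mean_iou` and the toy arrays are ours, not from the paper; benchmark toolkits typically accumulate a confusion matrix over the whole dataset rather than looping per class as done here):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Compute mIoU from two integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c)
        gt_c = (gt == c)
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both masks: skip it (a common convention)
        intersection = np.logical_and(pred_c, gt_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with two classes:
pred = np.array([[0, 0, 1],
                 [1, 1, 1]])
gt   = np.array([[0, 0, 1],
                 [0, 1, 1]])
print(mean_iou(pred, gt, 2))  # IoU_0 = 2/3, IoU_1 = 3/4 -> mIoU ~ 0.708
```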

5.3. Baselines

The paper, being a review, compares the surveyed methods against one another and, implicitly, against their predecessor architectures. FCN is the foundational baseline for subsequent deep learning models, while U-Net, SegNet, and the DeepLab series serve as baselines for more recent CNN-based architectures. The co-segmentation comparison likewise spans distinct algorithmic paradigms (MRF-based, random walks, active contours, clustering, graph theory, thermal diffusion, and object-based). ResNet and VGG recur as backbone networks, so many of the comparisons are effectively between different segmentation heads and modules built on shared feature extractors; a minimal example of such a backbone-plus-head design is sketched below.
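
To illustrate what "backbone" means in these comparisons, here is a hedged PyTorch sketch (assuming torchvision is installed; `MinimalSegNet` is our illustrative name, not an architecture from the review) of the simplest FCN-style design: a ResNet-50 feature extractor followed by a 1x1 classifier and bilinear upsampling:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MinimalSegNet(nn.Module):
    """ResNet-50 backbone + 1x1 classifier + bilinear upsampling:
    the bare-bones FCN-style design the reviewed models build upon."""
    def __init__(self, num_classes=21):  # 21 = PASCAL VOC's 20 objects + background
        super().__init__()
        backbone = resnet50(weights=None)  # random init; load weights in practice
        # Keep everything up to the final 2048-channel feature map (stride 32).
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.classifier = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.classifier(self.features(x))  # coarse per-class scores
        return nn.functional.interpolate(           # recover input resolution
            logits, size=(h, w), mode="bilinear", align_corners=False)

out = MinimalSegNet()(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 21, 224, 224])
```

Most of the models in Table 2 can be read as increasingly sophisticated replacements for the head and the upsampling path around such a backbone.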

6. Results & Analysis

6.1. Core Results Analysis

The paper presents two main tables summarizing the characteristics and performance of the reviewed co-segmentation and semantic segmentation methods.

The following are the results from Table 1 of the original paper, comparing and analyzing main co-segmentation methods. This table details the features, co-information strategies, and optimization techniques used across different co-segmentation paradigms.

| Methods | Ref. | Foreground Feature | Co-Information | Optimization |
| --- | --- | --- | --- | --- |
| MRF-Based Co-Segmentation | [24] | color histogram | L1 norm | graph cuts |
|  | [26] | color histogram | L2 norm | quadratic pseudo-Boolean |
|  | [27] | color and texture histograms | reward model | maximum flow |
|  | [25] | color histogram | Boykov-Jolly model | dual decomposition |
|  | [46] | color and SIFT features | region matching | graph cuts |
|  | [29] | SIFT feature | K-means + L1,2 | graph cuts |
|  | [48] | SIFT feature | Gaussian mixture model (GMM) constraint | graph cuts |
| Co-Segmentation Based on Random Walks | [33] | color and texture histograms | improved random walk global term | gradient projection and conjugate gradient (GPCG) |
|  | [34] | intensity and gray difference | improved random walk global term | graph size reduction |
|  | [35] | label prior from user scribbles | GMMs | minimize the average reaching probability |
| Co-Segmentation Based on Active Contours | [38] | color histogram | reward model | level set function |
|  | [39] | co-registered atlas and statistical features | k-means | level set function |
|  | [40] | saliency information | improved Chan-Vese (C-V) model | level set function |
| Clustering-Based Co-Segmentation | [41] | SIFT, Gabor filter, color histogram | Chi-square distance | low-rank |
|  | [43] | color and location information | discriminant clustering | expectation maximization (EM) |
|  | [42] | pyramid of LAB colors, HOG textures, SURF feature histograms | hierarchical clustering | normalized cut criterion |
| Co-Segmentation Based on Graph Theory | [44] | color histogram | digraphs built according to region similarity and saliency | shortest path |
|  | [45] | color and shape information | global terms built on digraphs and saliency | shortest path |
|  | [46] | LAB-space color and texture information | Gaussian consistency | sub-modularity optimization |
| Co-Segmentation Based on Thermal Diffusion | [47] | color and texture histograms | GMM & SPM (spatial pyramid matching) | dynamic programming |
| Object-Based Co-Segmentation | [48] | multi-scale saliency, color contrast, edge density, and superpixels straddling | Bayesian framework | maximizing the posterior probability |
|  | [49] | 33 types of features | random forest classifier | A-star search algorithm |

Analysis of Table 1 (Co-Segmentation):

  • Feature Evolution: Initially, MRF-Based methods often relied on basic color histograms. Later MRF and other co-segmentation methods incorporated richer features like texture histograms, SIFT features, Gabor filters, LAB colors, HOG textures, and SURF features. This shows a clear trend towards more robust and descriptive foreground features. Saliency information and multi-scale saliency also emerged as important features.

  • Co-Information Strategies: Various strategies were developed to model co-information (similarity across images). These include L1/L2 norms, reward models, the Boykov-Jolly model, region matching, K-means for feature grouping, GMM constraints, improved random walk global terms, Chi-square distance, discriminant clustering, hierarchical clustering, digraphs based on similarity/saliency, Gaussian consistency, and Bayesian frameworks. This diversity highlights the challenge of effectively capturing shared object characteristics across different images; the simplest of these terms, the L1 histogram match, is sketched after this list.

  • Optimization Techniques: Graph cuts is a dominant optimization method, especially for MRF-based approaches. Level set functions are key for active contour-based methods. Other techniques include quadratic pseudo-Boolean, maximum flow, dual decomposition, gradient projection and conjugate gradient (GPCG), graph size reduction, minimizing average reaching probability, low-rank approximation, expectation maximization (EM), normalized cut criterion, shortest path algorithms, sub-modularity optimization, dynamic programming, and A-star search. The choice of optimization often depends on the mathematical formulation of the energy function or objective function.
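
To ground the simplest of these co-information strategies, here is a hedged NumPy sketch of an L1 histogram-matching term in the spirit of the MRF-based method of [24] (the helper names are ours; a real implementation embeds this penalty inside each image's MRF energy and re-optimizes the labeling with graph cuts):

```python
import numpy as np

def color_histogram(fg_pixels, bins=8):
    """Normalized joint RGB histogram of foreground pixels (N x 3, values 0-255)."""
    idx = (fg_pixels // (256 // bins)).astype(int)        # per-channel bin index
    flat = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
    hist = np.bincount(flat, minlength=bins ** 3).astype(float)
    return hist / max(hist.sum(), 1.0)

def l1_co_information(fg_a, fg_b):
    """L1 distance between the two foreground histograms. Adding this penalty
    to each image's segmentation energy rewards labelings whose extracted
    foregrounds have similar appearance."""
    return np.abs(color_histogram(fg_a) - color_histogram(fg_b)).sum()

# Two hypothetical foreground pixel sets (e.g., gathered under current masks):
fg_a = np.random.randint(0, 256, size=(500, 3))
fg_b = np.random.randint(0, 256, size=(400, 3))
print(l1_co_information(fg_a, fg_b))
```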

    The following are the results from Table 2 of the original paper, comparing and analyzing semantic segmentation methods based on deep learning. This table focuses on the publication year, backbone networks, datasets, mIoU performance, and major contributions of influential deep learning models.

| Algorithms | Pub. Year | Backbone | Datasets | mIoU (%) | Major Contributions |
| --- | --- | --- | --- | --- | --- |
| FCN [54] | 2015 | VGG-16 | PASCAL VOC 2011 / NYUDv2 | 62.7 / 34.0 | The forerunner of end-to-end semantic segmentation |
| U-Net [64] | 2015 | VGG-16 | PhC-U373 / DIC-HeLa | 92.03 / 77.56 | Encoder-decoder structure, skip connections |
| SegNet [61] | 2016 | VGG-16 | CamVid / SUN RGBD | 60.4 / 28.27 | Transferred the max-pooling indices to the decoder |
| DeepLab V1 [65] | 2016 | VGG-16 | PASCAL VOC 2012 | 71.6 | Atrous convolution, fully connected CRFs |
| MSCA [88] | 2016 | VGG-16 | PASCAL VOC 2012 | 75.3 | Dilated convolutions, multi-scale context aggregation, front-end context module |
| LRR [73] | 2016 | ResNet/VGG-16 | PASCAL VOC 2011 / Cityscapes / CamVid | 77.5 / 69.7 / 91.6 | Reconstruction up-sampling module, Laplacian pyramid refinement |
| ReSeg [79] | 2016 | VGG-16 & ReNet | Oxford Flowers / CamVid | 93.7 / 58.8 | Extension of ReNet to semantic segmentation |
| DRN [70] | 2017 | ResNet-101 | Cityscapes | 70.9 | Modified Conv4/5 of ResNet, dilated convolution |
| PSPNet [72] | 2017 | ResNet50 | PASCAL VOC 2012 / Cityscapes | 85.4 / 80.2 | Spatial pyramid pooling (SPP) |
| DeepLab V2 [66] | 2017 | VGG-16/ResNet-101 | PASCAL VOC 2012 / Cityscapes | 79.7 / 70.4 | Atrous spatial pyramid pooling (ASPP), fully connected CRFs |
| DeepLab V3 [67] | 2017 | ResNet-101 | PASCAL VOC 2012 / Cityscapes | 86.9 / 81.3 | Cascaded or parallel ASPP modules |
| DeepLab V3+ [68] | 2018 | Xception | PASCAL VOC 2012 / Cityscapes | 89.0 / 82.1 | A new encoder-decoder structure with DeepLab V3 as the encoder |
| DUC-HDC [62] | 2018 | ResNet-101/ResNet-152 | PASCAL VOC 2012 / Cityscapes | 83.1 / 80.1 | HDC (hybrid dilated convolution), proposed to solve the gridding caused by dilated convolutions |
| Attention U-Net [83] | 2018 | VGG-16 with AGs | TCIA Pancreas CT-82 / CT-150 | n/a | A novel self-attention gating (AGs) filter, skip connections |
| PSANet [85] | 2018 | ResNet-101 | ADE20K / PASCAL VOC 2012 / Cityscapes | 81.51 / 85.7 / 81.4 | Point-wise spatial attention maps from two parallel branches, bi-directional information propagation model |
| APCNet [75] | 2019 | ResNet-101 | PASCAL VOC 2012 / PASCAL Context / ADE20K | 84.2 / 54.7 / 45.38 | Multi-scale, global-guided local affinity (GLA), adaptive context modules (ACMs) |
| DANet [86] | 2019 | ResNet-101 | Cityscapes / PASCAL VOC 2012 / PASCAL Context / COCO Stuff | 81.5 / 82.6 / 52.6 / 39.7 | Dual attention: position attention module and channel attention module |
| CARAFE [87] | 2019 | ResNet-50 | ADE20K | 42.23 | Pyramid pooling module (PPM), feature pyramid network (FPN), multi-level feature fusion (FUSE) |
| EFPN [76] | 2021 | VGG-16 | PASCAL VOC 2012 / Cityscapes / PASCAL Context | 86.4 / 82.3 / 53.9 | PPM, multi-scale feature fusion module with a parallel branch |
| CARAFE++ [88] | 2021 | ResNet-101 | ADE20K | 43.94 | PPM, FPN, FUSE, adaptive kernels generated on-the-fly |
| Swin Transformer [93] | 2021 | Swin-L | ADE20K | 53.5 | A novel shifted windowing scheme; a general backbone network for computer vision |
| Attention UW-Net [84] | 2022 | ResNet50 | NIH Chest X-ray | n/a | Skip connections, an intermediate layer that combines the feature maps of the fourth-layer encoder with those of the last-layer encoder, attention mechanism |
| FPANet [77] | 2022 | ResNet18 | Cityscapes / CamVid | 75.9 / 74.7 | Bilateral directional FPN, lightweight ASPP, feature pyramid fusion module (FPFM), border refinement module (BRM) |

Analysis of Table 2 (Semantic Segmentation):

  • Performance Trend: There is a clear upward trend in mIoU over time, reflecting continuous improvement in semantic segmentation accuracy. The early FCN reaches 62.7% mIoU on PASCAL VOC 2011, whereas the later DeepLab V3+ reaches 89.0% on PASCAL VOC 2012. PSPNet and the DeepLab series consistently perform well on challenging datasets such as Cityscapes.
  • Backbone Evolution: VGG-16 was a common backbone in earlier models (FCN, U-Net, SegNet, DeepLabv1). As deep learning progressed, ResNet (e.g., ResNet50, ResNet-101, ResNet-152) became the dominant backbone due to its ability to train deeper networks and mitigate vanishing gradients with skip connections. More specialized backbones like Xception (DeepLab V3+) and custom Transformer-based backbones (Swin-L for Swin Transformer) represent the cutting edge.
  • Architectural Innovations:
    • Encoder-Decoder & Skip Connections: U-Net (2015) highlighted the importance of encoder-decoder structures and skip connections for precise segmentation, especially in medical imaging.
    • Dilated Convolution & Multi-scale Context: The DeepLab series (starting 2016) relies on atrous convolution and ASPP to capture multi-scale context without sacrificing resolution; PSPNet (2017) introduced PPM with similar goals, and DUC-HDC specifically tackled the gridding artifacts of dilated convolutions. A minimal ASPP sketch follows this list.
    • Attention Mechanisms: Attention U-Net (2018) and PSANet/DANet (2018-2019) demonstrated the effectiveness of attention mechanisms in enhancing feature representation and modeling dependencies.
    • Transformers: The Swin Transformer (2021) marked a significant shift, showing that Transformer-based architectures can serve as general backbones for computer vision, achieving strong performance in segmentation tasks.
  • Dataset Specialization: While PASCAL VOC and Cityscapes remain popular benchmarks for general and street scene segmentation, U-Net and Attention U-Net show high performance on specialized medical imaging datasets like PhC-U373, DIC-HeLa, TCIA Pancreas CT-82, and NIH Chest X-ray, demonstrating the adaptability of these architectures.
  • Real-time Considerations: The entry for FPANet (2022) highlights the ongoing focus on real-time semantic segmentation, emphasizing lightweight architectures and efficient modules like lightweight ASPP and FPFM.
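
As a concrete illustration of the dilated-convolution idea discussed above, here is a hedged PyTorch sketch of an ASPP-style module (simplified relative to the published ASPP, which also uses batch normalization and an image-level pooling branch; the class name `SimpleASPP` is ours):

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates, concatenated
    and fused by a 1x1 convolution. Dilation enlarges the receptive field
    without reducing spatial resolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees the same feature map at a different effective scale.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

feats = torch.randn(1, 256, 32, 32)        # backbone feature map
print(SimpleASPP(256, 128)(feats).shape)   # torch.Size([1, 128, 32, 32])
```

Because the padding equals the dilation rate for a 3x3 kernel, every branch preserves spatial resolution while sampling context at a different scale, which is exactly the property the DeepLab series exploits.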

6.2. Ablation Studies / Parameter Analysis

As a comprehensive review paper, this document synthesizes findings from numerous primary research articles. Therefore, it does not conduct its own ablation studies or parameter analyses. Instead, it implicitly reports on the results of such studies from the original papers by detailing the "Major Contributions" and observed performance (mIoU) improvements attributed to specific architectural components (e.g., skip connections in U-Net, ASPP in DeepLab, attention gates in Attention U-Net, shifted windowing scheme in Swin Transformer). The improvements in mIoU shown in Table 2 for successive versions of models (e.g., DeepLab V1 to DeepLab V3+) are indirect evidence of component effectiveness and hyper-parameter optimization performed by the original authors.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper provides a valuable, systematic review of image segmentation techniques, tracing their evolution from classic segmentation through co-segmentation to semantic segmentation based on deep learning. It highlights a clear developmental trajectory from coarse-grained to fine-grained analysis, manual feature extraction to adaptive learning, and single-image-oriented methods to approaches that leverage big data for common feature learning. The paper thoroughly elaborates on the main algorithms and key techniques at each stage, offering a comprehensive overview of influential models and their respective advantages and limitations. The Fully Convolutional Network (FCN) is identified as the foundational breakthrough for deep learning in segmentation, leading to a proliferation of CNN-based architectures that are now transitioning into Transformer-based models.

7.2. Limitations & Future Work

The authors identify several key challenges and future research directions for image segmentation techniques:

  1. Complexity of Segmentation Tasks: Semantic segmentation, instance segmentation, and panoptic segmentation remain active research hotspots. Panoptic segmentation is particularly challenging because it must simultaneously recognize countable object instances and uncountable stuff regions within a single workflow, requiring networks robust enough to handle both large inter-category and small intra-category differences.
  2. 3D Data Segmentation: With the rise of 3D acquisition equipment (e.g., LiDAR cameras), RGB-depth, 3D-point clouds, voxels, and mesh segmentation are gaining importance for applications like face recognition, autonomous vehicles, VR/AR, and architectural modeling. However, the representation and processing of inherently unstructured, redundant, disordered, and unevenly distributed 3D data pose significant challenges.
  3. Data Scarcity and Annotation Limitations: Many fields suffer from a lack of large, fine-grained annotated datasets, hindering the training of supervised deep learning algorithms. Future work needs to explore semi-supervised, unsupervised, transfer learning, and few-shot image semantic segmentation approaches, which can learn effectively from limited labeled samples. Reinforcement learning is also noted as a possible, though less explored, solution.
  4. Computational Efficiency and Real-time Performance: Deep learning networks demand substantial computing resources during training and inference, yet real-time operation (e.g., >25 fps for video processing) is a hard requirement for many applications. Balancing model accuracy against real-time performance remains a significant challenge despite progress on lightweight networks; a rough way to measure throughput is sketched after this list.
  5. Explicability of Deep Learning: The "black box" nature of deep learning models limits their robustness, reliability, and performance optimization in critical downstream tasks. Improving model interpretability is a crucial long-term goal.
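
As a rough illustration of the real-time criterion in point 4, the sketch below times repeated forward passes (the helper `measure_fps` and the toy stand-in model are ours, and the methodology is simplified; rigorous benchmarks fix the input pipeline, batch size, precision, and hardware):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 512, 1024), device="cpu", runs=50):
    """Rough single-stream throughput estimate; real-time video processing
    is usually taken to require more than 25 fps."""
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(10):                      # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()             # flush queued GPU kernels
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)

# Example with a trivial stand-in (replace with a real segmentation network):
toy = torch.nn.Conv2d(3, 19, kernel_size=3, padding=1)
print(f"{measure_fps(toy, input_size=(1, 3, 256, 512)):.1f} fps")
```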

7.3. Personal Insights & Critique

This review paper provides an excellent, structured overview of image segmentation, which is highly beneficial for beginners in the field. The chronological categorization into classic, co-segmentation, and deep learning-based methods effectively illustrates the progression of research and the increasing complexity of techniques. The detailed breakdown of deep learning architectures, including encoder-decoder, skip connections, dilated convolution, multiscale feature extraction, and attention mechanisms, is particularly helpful for understanding the building blocks of modern segmentation models. The authors' discussion of challenges and future trends is insightful, accurately reflecting the current research landscape, especially the pivotal shift towards Transformers.

One area for potential improvement, understandable given the breadth of a review, is that some classic segmentation algorithms are described only conceptually. For instance, active contour energy functions or graph cut energy minimization could have been presented with their core equations, even in simplified form, to give a deeper technical sense of their mechanics; a standard example of the latter is given below. For a beginner-friendly overview, however, the current level of detail is appropriate.
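
As an illustration of the kind of formulation that could have been included (this is the standard binary MRF energy from the graph-cut literature, not an equation reproduced from the review), the segmentation labeling $L$ is obtained by minimizing $ E(L) = \sum_{p \in \mathcal{P}} D_p(L_p) + \lambda \sum_{(p,q) \in \mathcal{N}} V_{p,q}(L_p, L_q) $, where $L_p \in \{0, 1\}$ is the foreground/background label of pixel $p$, the data term $D_p$ scores how well the label fits an appearance model (e.g., the negative log-likelihood under a color histogram or GMM), the smoothness term $V_{p,q}$ penalizes label disagreement between neighboring pixels in $\mathcal{N}$, and $\lambda$ balances the two. The co-segmentation methods in Table 1 extend exactly this energy with a cross-image co-information term, such as the histogram distance sketched earlier.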

The paper's strong emphasis on the evolution from CNNs to Transformers is timely and relevant, as Transformers are indeed reshaping the computer vision landscape. The identified limitations, such as the challenges of 3D data and the need for unsupervised/semi-supervised learning due to data scarcity, point to crucial directions for future innovation. The problem of deep learning explicability is also a critical, cross-disciplinary challenge that segmentation research, like many other AI fields, must confront to achieve broader trust and deployment in sensitive applications such as medical imaging. Overall, this paper serves as a valuable resource for navigating the diverse and rapidly evolving field of image segmentation.
