Techniques and Challenges of Image Segmentation: A Review
TL;DR Summary
This paper reviews advancements in image segmentation, categorizing techniques into classic, collaborative, and deep learning-based semantic segmentation. It highlights challenges in feature extraction and model design while analyzing key algorithms, their applicability, and future development trends.
Abstract
Image segmentation, which has become a research hotspot in the field of image processing and computer vision, refers to the process of dividing an image into meaningful and non-overlapping regions, and it is an essential step in natural scene understanding. Despite decades of effort and many achievements, there are still challenges in feature extraction and model design. In this paper, we review the advancement in image segmentation methods systematically. According to the segmentation principles and image data characteristics, three important stages of image segmentation are mainly reviewed, which are classic segmentation, collaborative segmentation, and semantic segmentation based on deep learning. We elaborate on the main algorithms and key techniques in each stage, compare, and summarize the advantages and defects of different segmentation models, and discuss their applicability. Finally, we analyze the main challenges and development trends of image segmentation techniques.
In-depth Reading
1. Bibliographic Information
1.1. Title
Techniques and Challenges of Image Segmentation: A Review
1.2. Authors
Ying Yu, Chunping Wang, Qiang Fu, Renke Kou, Fuyu Huang, Boxiong Yang, Tingting Yang, and Mingliang Gao. Their affiliations span multiple institutions:
- Department of Electronic and Optical Engineering, Army Engineering University of PLA, Shijiazhuang 050003, China
- School of Information and Intelligent Engineering, University of Sanya, Sanya 572022, China
- School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo 255000, China
1.3. Journal/Conference
The paper was published in Electronics, a peer-reviewed open-access journal from MDPI. Electronics covers a broad range of topics related to the science and technology of electronics, electrical engineering, and communications, and it is an established venue for computer vision and image processing work.
1.4. Publication Year
2023
1.5. Abstract
Image segmentation, a crucial step in natural scene understanding, involves dividing an image into meaningful, non-overlapping regions and is a prominent research area in image processing and computer vision. Despite significant advancements over decades, challenges persist in feature extraction and model design. This paper offers a systematic review of the progress in image segmentation methods. It categorizes the evolution into three main stages based on segmentation principles and image data characteristics: classic segmentation, collaborative segmentation (or co-segmentation), and semantic segmentation based on deep learning. For each stage, the authors detail the primary algorithms and key techniques, compare and summarize their advantages and disadvantages, and discuss their applicability. The review concludes with an analysis of current challenges and future development trends in image segmentation techniques.
1.6. Original Source Link
/files/papers/69299a334015f90af7cc618f/paper.pdf (This link points to a PDF hosted on a local file system; the officially published version is available via the DOI: https://doi.org/10.3390/electronics12051199.)
2. Executive Summary
2.1. Background & Motivation
Image segmentation is a fundamental problem in computer vision and image processing, serving as a prerequisite for pattern recognition and image understanding. It involves partitioning an image into several meaningful and non-overlapping regions, often with the goal of extracting regions of interest (ROIs). The importance of image segmentation is underscored by its wide applications in fields such as autonomous vehicles, intelligent medical technology, image search engines, industrial inspection, and augmented reality.
Despite continuous research since the 1970s, image segmentation remains a challenging, ill-posed problem due to two main difficulties:
- Defining "meaningful regions": Human visual perception and comprehension are diverse and subjective, leading to ambiguity in what constitutes a "meaningful object."
- Effectively representing objects: Digital images are composed of pixels, which carry low-level local features (color, texture). Obtaining global information (e.g., shape, position) from these local attributes is difficult.

Furthermore, traditional feature extraction methods based on manual or heuristic rules struggle to meet the complexity of modern image segmentation demands, especially given the increasing detail and diversity (scale, posture) of images from advanced acquisition equipment. This necessitates models with higher generalization ability. While previous reviews have covered semantic segmentation methods [5,7] or evaluation metrics [8], there was a perceived gap in systematically summarizing the evolution of image segmentation algorithms from the field's early stages to the present day.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Systematic Review of Evolution: It provides a comprehensive and chronological review of image segmentation methods, categorizing them into three distinct stages: classic segmentation, co-segmentation, and semantic segmentation based on deep learning. This evolutionary perspective helps in understanding the progression of techniques.
- Detailed Algorithm Elaboration: For each stage, the paper elaborates on the main algorithms and key techniques, explaining their working mechanisms and enumerating influential examples.
- Comparative Analysis: It compares and summarizes the advantages, defects, and applicability of different segmentation models, offering insights into their strengths and weaknesses.
- Identification of Challenges and Trends: The paper analyzes the current main challenges in image segmentation, particularly for deep learning-based methods (e.g., limited annotations, class imbalance, overfitting, long training times, gradient vanishing, computational complexity, and explicability), and discusses future development trends, highlighting the shift from Convolutional Neural Networks (CNNs) to Transformers.
- Focus on Deep Learning Architectures: It systematically introduces essential techniques of semantic segmentation based on deep neural networks, including encoder-decoder architectures, skip connections, dilated convolution, multiscale feature extraction, and attention mechanisms.

The key conclusions are that image segmentation has evolved from coarse-grained to fine-grained analysis, from manual feature extraction to adaptive learning, and from single-image-oriented to big-data-based common feature segmentation. Deep neural networks, especially Transformers, are identified as the leading future direction, despite challenges related to data requirements, computational cost, and model interpretability.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Image Segmentation: The overarching goal of dividing a digital image into multiple segments (sets of pixels). The aim is to simplify and/or change the representation of an image into something more meaningful and easier to analyze. Each segment should be homogeneous in some characteristic (e.g., color, intensity, texture) and distinct from adjacent segments.
- Computer Vision: A field of artificial intelligence that trains computers to "see" and interpret visual information from the world, much like humans do. Image segmentation is a core task within computer vision.
- Image Processing: The manipulation of digital images using algorithms, often to improve their quality, extract information, or prepare them for further analysis. Segmentation is a form of image processing.
- Pattern Recognition: A field within machine learning that focuses on the automatic recognition of patterns and regularities in data. Image segmentation provides meaningful regions that can then be used for pattern recognition (e.g., recognizing an object within a segmented region).
- Pixels: The smallest individual elements in a digital image. Each pixel contains a value representing color and/or intensity at a specific location.
- Superpixels: Clusters of pixels that share similar visual properties (e.g., color, texture, intensity) and are spatially proximate. Superpixels are often used as a preprocessing step in segmentation to reduce computational complexity by grouping redundant pixels into perceptually meaningful atomic regions.
- Grayscale Images: Images composed of shades of gray, typically ranging from black (0 intensity) to white (255 intensity). Each pixel has a single intensity value.
- Color Images: Images that contain color information, typically represented using three channels (e.g., Red, Green, Blue in an RGB image). Each pixel has three intensity values, one for each channel.
- Neural Networks (NNs): A class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected nodes (neurons) organized in layers, which process data through computations. Deep learning refers to neural networks with many layers.
- Fully Connected Layers (FC layers): In traditional neural networks, fully connected layers connect every neuron in one layer to every neuron in the next layer. These layers are common in the output stages of image classification networks but require fixed-size inputs.
- Convolutional Neural Networks (CNNs): A specialized type of neural network particularly effective for processing grid-like data such as images. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input images.
3.2. Previous Works
The paper organizes previous works into classic segmentation, co-segmentation, and semantic segmentation based on deep learning.
3.2.1. Classic Segmentation Methods
These methods primarily focus on analyzing single images, often relying on low-level features and requiring human intervention or prior knowledge.
- Edge Detection: Identifies boundaries where image intensity or color changes sharply.
  - Differential Operators (Sobel, Canny, Laplacian, Roberts, Kirsch): These operators calculate the gradient of image intensity. For example, the Sobel operator computes an approximation of the gradient of the image intensity function to highlight regions of high spatial frequency, which often correspond to edges. The Canny operator is known for its robustness to noise and its ability to produce thin, continuous edges.
  - Active Contours (Snakes): These methods define a deformable curve (contour) that iteratively moves towards object boundaries by minimizing an energy function. The energy function typically includes internal energy (to maintain smoothness and continuity of the curve) and external energy (to pull the curve towards image features like gradients).
  - Graph Cuts: Models the image as a graph where pixels are nodes and edge weights represent similarity. Segmentation is achieved by finding a minimum cut in the graph, separating foreground from background. It is often used interactively.
- Region Division: Groups pixels into regions based on similarity.
  - Thresholding: Segments an image by setting a threshold value. Pixels below the threshold are assigned to one region (e.g., background), and those above to another (e.g., foreground). Otsu's method is a common technique for automatically determining an optimal threshold.
  - Region Growing: Starts from a seed pixel and expands the region by adding neighboring pixels that satisfy a predefined similarity criterion (e.g., similar intensity or color).
  - Watershed Algorithm: Treats the image as a topographic landscape where intensity values represent altitudes. It simulates water rising from local minima, forming "dams" at watershed lines, which represent the segmentation boundaries.
- Graph Theory: Represents images as graphs to leverage graph algorithms for segmentation.
  - Markov Random Fields (MRF): A probabilistic graphical model used to model the dependencies between pixels in an image. Each pixel is a node, and its label (foreground/background) depends on its neighbors. Segmentation involves finding the most probable labeling.
  - Minimum Spanning Tree (MST): Used for region merging, where pixels or superpixels are nodes and edge weights represent dissimilarity. Merging criteria can be applied to edges of the MST.
- Clustering: Groups pixels in a feature space (e.g., color, intensity, location) into clusters, where each cluster corresponds to a segment.
  - K-means Clustering: An iterative algorithm that partitions data points into clusters. It assigns each data point to the cluster with the nearest mean (centroid) and then updates the centroids.
  - Mean-shift: A non-parametric clustering algorithm that finds modes (peaks) in the data density. It iteratively shifts each data point towards the mean of the data points in its neighborhood.
  - Simple Linear Iterative Clustering (SLIC): An algorithm that efficiently generates superpixels by applying K-means clustering in a 5D feature space (L, a, b color values and x, y pixel coordinates).
3.2.2. Co-Segmentation Methods
These methods extend classic segmentation by finding common foreground objects across a set of images, often without explicit annotation, falling under semi-supervised or weakly supervised paradigms. This leverages shared information between images to improve robustness.
- MRF-Based Co-Segmentation [24,25,26,27,28,29,30]: Extends MRF segmentation by adding a co-segmentation term ($E_g$, defined in Section 4.2) to the energy function, penalizing inconsistency of foreground features across multiple images.
- Random Walks-Based Co-Segmentation [33,34,35]: Extends the random walks model to multiple images, often by incorporating global terms or supervoxels for 3D data.
- Active Contours-Based Co-Segmentation [38,39,40]: Adapts active contour models by constructing energy functions that consider foreground consistency across images and background inconsistency within each image.
- Clustering-Based Co-Segmentation [41,42,43]: Applies spectral clustering or discriminative clustering to group superpixels or local regions, then propagates segmentation results across image sets.
- Graph Theory-Based Co-Segmentation [44,45]: Builds digraphs whose nodes represent local regions (e.g., superpixels) and whose edges represent similarity and saliency, converting co-segmentation into a shortest path problem.
- Thermal Diffusion-Based Co-Segmentation [46,47]: Utilizes concepts from anisotropic diffusion and temperature maximization to achieve multi-category co-segmentation by maximizing segmentation confidence.
- Object-Based Co-Segmentation [48,49,50]: Measures similarity between candidate foreground objects or uses object detection to identify common objects across images.
3.2.3. Semantic Segmentation Based on Deep Learning
This stage leverages deep neural networks to learn complex feature representations and perform end-to-end segmentation, achieving state-of-the-art performance, especially with large annotated datasets.
- Patch Classification [53]: An early approach where small image patches were fed into a neural network (typically a CNN) for classification, and the central pixel of each patch was labeled according to the patch's class. This was computationally expensive and ignored context.
- Fully Convolutional Networks (FCN) [54]: Pioneered end-to-end semantic segmentation by replacing the fully connected layers of traditional CNNs with convolutional layers, allowing arbitrary input image sizes and producing a spatial output map. This laid the foundation for modern deep learning-based segmentation.
- U-Net [64]: An encoder-decoder architecture with crucial skip connections that concatenate feature maps from the encoder to corresponding layers in the decoder, preserving fine-grained spatial information often lost during downsampling. It was originally designed for medical image segmentation.
- SegNet [61]: Another encoder-decoder architecture, which transfers max-pooling indices from the encoder to the decoder during upsampling, allowing more precise boundary recovery.
- DeepLab Series [65,66,67,68]: A family of models that introduced atrous convolution (also known as dilated convolution) to enlarge the receptive field without increasing computational cost or losing resolution. They also incorporated Atrous Spatial Pyramid Pooling (ASPP) for multiscale feature extraction and Conditional Random Fields (CRFs) for boundary refinement (in earlier versions).
- PSPNet (Pyramid Scene Parsing Network) [72]: Employs a Pyramid Pooling Module (PPM) to aggregate context information from different regions and at various scales, capturing both local and global features.
- Attention Mechanisms [78,79,81,83,85,86,87]: Integrate attention to allow models to selectively focus on relevant parts of the input, improving long-range dependency modeling and feature representation. Examples include RNN- and LSTM-based approaches, Attention U-Net, and self-attention modules.
- Transformers (ViT, Swin Transformer) [92,93]: Originally from Natural Language Processing (NLP), Transformers are entirely based on self-attention mechanisms. Vision Transformer (ViT) applies Transformers to image recognition by treating image patches as sequences. Swin Transformer introduced a shifted windowing scheme to build hierarchical feature maps and efficiently compute self-attention locally while allowing cross-window connections.
3.3. Technological Evolution
The technological evolution of image segmentation can be broadly summarized by three transitions:
- From low-level features to high-level semantics: Early methods focused on basic pixel properties like intensity and color, while modern deep learning approaches learn hierarchical, abstract features that capture semantic meaning.
- From single-image processing to multi-image context: Classic segmentation treated images in isolation. Co-segmentation recognized the value of common information across image sets. Deep learning implicitly learns generalizable features from vast datasets.
- From manual/heuristic design to adaptive learning: Classic segmentation often required manually crafted features or rule-based algorithms. Deep learning models, especially CNNs and Transformers, adaptively learn features and segmentation rules directly from data.
- From fixed-size inputs to arbitrary-size inputs (FCN): Traditional CNNs for classification had fully connected layers requiring fixed input sizes. FCNs made end-to-end semantic segmentation possible for images of any dimension.
- From CNN-centric to Transformer-centric (current): While CNNs dominated for years, Transformers are now gaining significant traction due to their ability to model long-range dependencies effectively, potentially leading to new breakthroughs.
3.4. Differentiation Analysis
This paper differentiates itself from previous reviews (e.g., [5,7,8]) by:
- Evolutionary Perspective: It explicitly structures the review around the historical development of image segmentation, from its classic roots through co-segmentation to modern deep learning, providing a coherent narrative of how the field has progressed.
- Comprehensive Coverage: It aims to cover the full spectrum of image segmentation, not just semantic segmentation or specific aspects like evaluation metrics.
- Systematic Categorization: The paper categorizes methods based on underlying principles and data characteristics (classic, collaborative, deep learning-based semantic), offering a clear framework for understanding the diverse techniques.
- Focus on Key Techniques within Deep Learning: Beyond just listing models, it breaks down the core technical innovations within deep learning architectures (e.g., encoder-decoder, skip connections, dilated convolution, multiscale feature extraction, attention mechanisms), which is crucial for a beginner-friendly deep dive.
4. Methodology
The paper systematically reviews image segmentation methods across three historical stages: classic segmentation, co-segmentation, and semantic segmentation based on deep learning. Each stage represents a significant shift in the underlying principles, feature extraction techniques, and the type of image data processed.
4.1. Classic Segmentation Methods
Classic segmentation algorithms were primarily developed for grayscale images and focused on identifying regions based on gray-level similarity (for region division) or gray-level discontinuity (for edge detection). For color images, they often involved an initial step of segmenting into superpixels based on pixel similarity, followed by merging.
4.1.1. Edge Detection
Edge detection aims to locate points where gray-level changes sharply, indicating boundaries between different regions. These are also known as parallel boundary techniques.
- Differential Operators: These operators approximate the derivative of the gray-level function to identify sharp changes.
  - Sobel Operator: Computes the gradient magnitude using a pair of convolution kernels, one for horizontal (SobelX) and one for vertical (SobelY) intensity gradients. An example is shown in Figure 2.
  - Kirsch Operator: Uses 8 directional convolution kernels to detect edges in various orientations.
  - Roberts Cross Operator: A simple operator for detecting edges, typically applied diagonally.
  - Canny Operator: A multi-stage algorithm involving Gaussian smoothing (for noise reduction), gradient magnitude and direction calculation, non-maximum suppression (to thin edges), and hysteresis thresholding (to connect edges). It is noted for its strong denoising ability and the good continuity and fineness of its edges.
  - Laplacian Operator: A second-order derivative operator that detects zero-crossings in the second derivative, often used to find locations of rapid intensity change.
  - Limitations: Differential operators are sensitive to noise and may produce discontinuous boundaries, especially in high-detail regions. Pre-smoothing is often required. The Canny operator is more complex but performs better.

The following figure (Figure 2 from the original paper) shows the edge detection results of different differential operators: (a) the original image; (b) through (h) the results of SobelX, SobelY, Sobel, Kirsch, Roberts, Canny, and Laplacian, respectively.
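To make the comparison in Figure 2 concrete, below is a minimal sketch of these operators using OpenCV. The input file name, kernel sizes, and Canny thresholds are illustrative assumptions, not settings from the reviewed paper.

```python
# Sketch of the differential operators discussed above, using OpenCV.
import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
img = cv2.GaussianBlur(img, (5, 5), 0)  # pre-smoothing to suppress noise

# Sobel: first-order gradients along x and y, combined into a magnitude map.
sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # SobelX
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # SobelY
sobel = np.uint8(np.clip(np.hypot(sobel_x, sobel_y), 0, 255))

# Laplacian: second-order operator; edges appear at zero-crossings.
laplacian = cv2.Laplacian(img, cv2.CV_64F, ksize=3)

# Canny: smoothing + gradient + non-maximum suppression + hysteresis.
canny = cv2.Canny(img, threshold1=50, threshold2=150)
```

Note how Canny, unlike the raw gradient operators, already returns a thin binary edge map, which is why it tends to need less post-processing.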
- Serial Boundary Techniques: These methods connect individual edge points to form closed boundaries.
  - Graph-searching algorithms: Represent edge points as nodes in a graph and search for minimum-cost paths that form closed boundaries. These are computationally intensive.
  - Dynamic Programming (DP): Utilizes heuristic rules to reduce the computational cost of graph searching.
- Active Contours (Snakes): This method approximates object contours by evolving an initial closed curve (the initial contour) towards local image features (e.g., strong gradients). It minimizes an energy function that typically balances internal energy (for curve smoothness) and external energy (for alignment with image features); a minimal usage sketch follows at the end of this subsection.
  - Limitations: Sensitive to the initial contour's placement and prone to local minima, making it difficult to converge to concave boundaries. Lankton and Tannenbaum [9] proposed a framework using local segmentation energy to address some of these issues.
- Graph Cuts: An interactive segmentation method in which pixels are represented as nodes in a graph. Source nodes are marked as foreground and sink nodes as background; edge weights between nodes represent pixel similarity or fit to the foreground/background. Segmentation is achieved by finding a min-cut (a set of edges whose removal disconnects source from sink with minimum total weight), which corresponds to minimizing an energy function. This is an NP-hard problem, requiring efficient approximation algorithms such as the swap and expansion algorithms. Freedman [10] combined graph cuts with shape prior knowledge to improve accuracy.
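The snakes idea above can be exercised directly with scikit-image's `active_contour`. The circular initialization and the energy weights below are illustrative assumptions; as the limitations note warns, the result depends strongly on where the initial contour is placed.

```python
# A small sketch of active contours (snakes) using scikit-image.
import numpy as np
from skimage import data, filters
from skimage.segmentation import active_contour

img = data.astronaut()[..., 0] / 255.0   # grayscale test image
img = filters.gaussian(img, sigma=3)     # smooth so gradients are stable

# Initial closed curve: a circle that energy minimization will deform.
s = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([100 + 80 * np.sin(s), 220 + 80 * np.cos(s)])  # (row, col)

snake = active_contour(
    img,
    init,
    alpha=0.015,  # internal energy: tension (continuity of the curve)
    beta=10.0,    # internal energy: rigidity (smoothness of the curve)
    gamma=0.001,  # step size of the iterative minimization
)
# `snake` is an (N, 2) array of contour points attracted to strong gradients.
```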
4.1.2. Region Division
Region division strategies group pixels into homogeneous regions.
- Thresholding: A parallel region division algorithm that segments an image based on gray-level values. An optimal grayscale threshold is determined, often by analyzing the gray histogram to maximize the discriminability between categories (e.g., using zeroth-order or first-order cumulant moments).
- Serial Region Techniques: These involve multiple sequential steps for region segmentation.
  - Region Growing: Starts with seed pixels and iteratively adds neighboring pixels that share similar features (e.g., gray value) until no more pixels can be merged.
  - Region Merging: Similar to region growing, but typically starts with many small regions (e.g., superpixels) and merges adjacent regions if their similarity (e.g., difference in average gray value) is below a threshold. It can handle noise and occlusion but has high computational cost and difficulties in defining stopping rules.
- Watershed Algorithm: Based on topographic concepts. It divides an image by simulating water rising from local minima, forming "dams" at the watershed lines (segmentation boundaries). It provides closed contours and is efficient but prone to false segmentation (over-segmentation) in complex images, which can be mitigated by using a Gaussian mixture model (GMM). It is effective for medical images with overlapping cells; a combined thresholding-plus-watershed sketch follows this list.
- Superpixels: Superpixels are small, irregular regions composed of pixels with similar properties (e.g., brightness, color, texture, position). They serve as a preprocessing step that reduces image complexity by operating on perceptual regions rather than individual pixels. Methods include clustering and graph theory.
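The following sketch combines Otsu thresholding with marker-based watershed, the standard recipe for splitting the touching objects (e.g., overlapping cells) mentioned above. It uses scikit-image and SciPy; the footprint size is an illustrative assumption.

```python
# Threshold-then-watershed splitting of touching objects (sketch).
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def segment(gray):
    # 1. Otsu picks the gray-level threshold that best separates two classes.
    binary = gray > threshold_otsu(gray)
    # 2. Distance transform: an altitude map whose peaks sit at object centers.
    distance = ndi.distance_transform_edt(binary)
    # 3. One marker (local maximum) per object seeds the flooding.
    coords = peak_local_max(distance, footprint=np.ones((7, 7)), labels=binary)
    markers = np.zeros(gray.shape, dtype=int)
    markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)
    # 4. Flood the inverted distance map; dams form where basins meet.
    return watershed(-distance, markers, mask=binary)
```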
4.1.3. Graph Theory
Image segmentation based on graph theory maps an image to a weighted graph.
- Pixels or regions become vertices, and the similarity between them becomes edge weights.
- Segmentation is then framed as a problem of dividing the vertices of the graph to obtain an optimal partition (e.g., a min-cut).
- Graph-based Region Merging: Uses metrics to achieve optimal global grouping. Felzenszwalb et al. [11] used a minimum spanning tree (MST) to merge pixels after representing the image as a graph (see the sketch after this list).
- Markov Random Fields (MRF): Introduces probabilistic graphical models (PGMs) to represent the randomness of lower-level features. An undigraph is built in which each vertex is a feature and each edge represents a relationship. The Markov property states that a feature at any point is related only to its adjacent features. Segmentation seeks the most likely labeling of pixels under this probabilistic model.
- Spectral Graph Partitioning: Leordeanu et al. [12] used spectral graph partitioning to find correspondences between feature sets, building an adjacency matrix and using its principal eigenvectors to recover correct assignments.
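Graph-based region merging is easy to try in practice: scikit-image ships an implementation of the MST-based algorithm of Felzenszwalb et al. [11]. Parameter values below are illustrative.

```python
# Felzenszwalb MST-based graph segmentation via scikit-image (sketch).
from skimage import data
from skimage.segmentation import felzenszwalb

img = data.astronaut()  # RGB test image
labels = felzenszwalb(
    img,
    scale=100,    # larger scale -> fewer, larger regions
    sigma=0.8,    # Gaussian pre-smoothing
    min_size=50,  # post-hoc merging of regions smaller than this
)
# `labels` assigns an integer region id to every pixel.
print(labels.max() + 1, "regions")
```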
4.1.4. Clustering Method
Clustering algorithms group pixels or superpixels into segments based on feature similarity.
- K-means Clustering: An iterative algorithm that partitions data into K clusters.
  - Algorithm (a NumPy sketch of this loop follows below):
    1. Initialize K points as cluster centers.
    2. Calculate the distance between each pixel and the K cluster centers, assigning each pixel to the cluster with the minimum distance.
    3. Recalculate the cluster centers by averaging the pixels in each cluster (moving each center to its centroid).
    4. Repeat steps 2 and 3 until the algorithm converges (cluster assignments no longer change significantly).
  - Advantages: Noise robustness, quick convergence.
  - Limitations: Not suitable for non-adjacent regions, converges only to a local optimum, sensitive to initial center selection.
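A minimal NumPy implementation of the loop above, clustering pixels by RGB color. This is illustrative only; a practical version would add a better initialization (e.g., k-means++) and a convergence tolerance.

```python
# K-means over pixel colors (sketch of the four steps above).
import numpy as np

def kmeans_pixels(image, k=4, iters=20, seed=0):
    pixels = image.reshape(-1, 3).astype(float)            # (N, 3) feature space
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), k, replace=False)]  # step 1
    for _ in range(iters):
        # Step 2: assign each pixel to the nearest center.
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the centroid of its cluster.
        new_centers = np.array([
            pixels[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):              # step 4: convergence
            break
        centers = new_centers
    return labels.reshape(image.shape[:2]), centers
```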
- Mean-shift [13]: A density estimation-based clustering algorithm that models the image feature space as a probability density function and finds its modes (high-density areas).
- Fuzzy C-means [14]: An extension of K-means that assigns pixels to clusters with a degree of membership, allowing pixels to belong to multiple clusters simultaneously, and integrates spatial information.
- Spectral Clustering: A graph theory-based clustering method that divides a weighted graph into subgraphs with low coupling and high cohesion.
- Simple Linear Iterative Clustering (SLIC) [15]: Uses K-means in a 5D space (L, a, b color channels and x, y coordinates) to efficiently generate superpixels. The results are shown in Figure 3, and a usage sketch follows the figure.
- Linear Spectral Clustering (LSC) [16]: A superpixel segmentation algorithm that maps pixel coordinates and values into a high-dimensional space using a kernel function, then weights the points to obtain an optimal solution aligned with the K-means and normalized cut objectives.

The following figure (Figure 3 from the original paper) shows SLIC segmentation results with 10, 20, 50, and 100 superpixels; as the number of superpixels increases, finer image detail is captured.
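The scikit-image SLIC implementation can reproduce the settings of Figure 3 directly. The `compactness` value below is an illustrative choice; it trades color similarity against spatial proximity.

```python
# SLIC superpixels via scikit-image, mirroring Figure 3's settings.
from skimage import data
from skimage.segmentation import slic, mark_boundaries

img = data.astronaut()
for n in (10, 20, 50, 100):            # the superpixel counts shown in Figure 3
    labels = slic(img, n_segments=n, compactness=10, start_label=1)
    overlay = mark_boundaries(img, labels)   # visualize superpixel borders
```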
4.1.5. Random Walks
Random walks is a graph theory-based segmentation algorithm that assigns labels to pixels based on the probability of a random walker reaching pre-marked foreground or background seed points.
- Grady et al. [20] transformed segmentation into a discrete Dirichlet problem. The image is converted into a connected weighted undigraph, and foreground and background are marked with seed points. For each unmarked pixel, the algorithm calculates the probability that a random walk starting from that pixel reaches a foreground seed or a background seed first; the pixel is then assigned the category with the highest probability (see the sketch below).
- Yang et al. [21] proposed a constrained random walks algorithm in which user input (foreground/background scribbles, hard/soft boundary constraints) guides the segmentation.
- Lai et al. [22] extended random walks to 3D mesh images, defining edge weights by the dihedral angles between adjacent faces. Zhang et al. [23] improved this with a fast geodesic curvature flow (FGCF) algorithm, reducing the number of vertices and smoothing contours.
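The formulation of [20] is implemented in scikit-image as `random_walker`: seeds labeled 1 (foreground) and 2 (background) propagate to unlabeled pixels by solving the discrete Dirichlet problem. The intensity-based seed placement below is an illustrative assumption standing in for user scribbles.

```python
# Random-walker segmentation (Grady's discrete Dirichlet formulation).
import numpy as np
from skimage import data
from skimage.segmentation import random_walker

img = data.coins()                      # grayscale test image
markers = np.zeros(img.shape, dtype=int)
markers[img > 180] = 1                  # confident foreground seeds
markers[img < 60] = 2                   # confident background seeds

# beta controls edge weights: larger beta makes walks across strong
# gradients less likely, so labels stop at object boundaries.
labels = random_walker(img, markers, beta=130, mode="bf")
```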
4.2. Co-Segmentation Methods
Co-segmentation (or collaborative segmentation), introduced by Rother et al. [24] in 2006, aims to extract common foreground regions from a set of images without human intervention, leveraging prior knowledge shared across the images. This approach is semi-supervised or weakly supervised.
The general extended model for co-segmentation is expressed by an energy function $E$:
$
E = E_s + E_g
$
Where:
- $E_s$: Represents the energy function for seed (single-image) segmentation. This term describes the difference between the foreground and background within a single image and the smoothness of its segmentation. It is essentially the energy function from classic segmentation methods.
- $E_g$: Represents the energy function for co-segmentation. This term describes the similarity between the foregrounds across the entire set of images, enforcing consistency. The goal is to minimize $E$ to achieve good co-segmentation.
4.2.1. MRF-Based Co-Segmentation
Rother et al. [24] extended the MRF segmentation by utilizing prior knowledge to solve ill-posed problems in multiple image segmentation. They first segment a seed image and assume foreground objects across the image set are similar.
The MRF segmentation energy is typically composed of unary potential and pairwise potential:
$
E_s^{MRF} = E_u^{MRF} + E_p^{MRF}
$
Where:
- $E_u^{MRF}$: The unary potential measures the property of a pixel itself, representing the probability of the pixel belonging to a given class given its feature.
- $E_p^{MRF}$: The pairwise potential measures the relationship between adjacent pixels, representing the probability that two adjacent pixels belong to the same category.

The co-segmentation term $E_g$ is used to penalize inconsistencies in the foreground color histograms across the image set.
- Optimization: Subsequent research focused on optimizing global constraints. Vicente et al. [25] used a multiscale decomposition for an extended Boykov-Jolly model. Rubio et al. [28] introduced high-order graph matching into MRF for global terms. Chang et al. [29] proposed a universal significance measure to add foreground positional information. Yu et al. [30] combined a co-saliency model with a Gaussian mixture model (GMM) for the dissimilarity between foreground objects, minimized iteratively via graph cuts.
4.2.2. Co-Segmentation Based on Random Walks
- Collins et al. [33] extended random walks to co-segmentation, optimizing with quasiconvexity and providing a CUDA library for sparse feature operations.
- Fabijanska et al. [34] proposed an optimized random walks method for 3D voxel image segmentation, using supervoxels to save time and memory.
- Dong et al. [35] introduced a subMarkov random walks (subRW) algorithm with prior label knowledge, effective for slender objects.
4.2.3. Co-Segmentation Based on Active Contours
- Meng et al. [38] extended active contours by constructing an energy function based on foreground consistency between images and background inconsistency within images, solved by level set methods.
- Zhang et al. [39] proposed a deformable co-segmentation algorithm for brain MRI segmentation, transforming brain anatomy priors into constraints and minimizing the energy via level sets.
- Zhang et al. [40] introduced image saliency into active contours and used a level set optimization based on superpixels, hierarchical computing, and convergence judgment.
- Limitations: The unidirectional movement of active contours limits flexibility, making it difficult to segment objects with weak edges.
4.2.4. Clustering-Based Co-Segmentation
This is an extension of single-image clustering segmentation.
- Joulin et al. [41] used spectral clustering for single-image segmentation based on local spatial information, then discriminative clustering to propagate results across an image set.
- Kim et al. [42] divided images into superpixels, represented their relevance with a weighted graph and an affinity matrix, and used spectral clustering for co-segmentation. This hierarchical graph clustering is illustrated in Figure 5.
- Joulin et al. [43] used spectral clustering based on feature positions and color vectors for local information, then expectation maximization (EM) to minimize a classification discriminant function for multi-object, multi-class co-segmentation.

The following figure (Figure 5 from the original paper) shows an illustration of the hierarchical graph clustering constructed between two images: two original images of dogs are linked through intermediate computation steps (weight matrices and their constraints) that transform the originals into segmentation maps.
4.2.5. Co-Segmentation Based on Graph Theory
- Meng et al. [44] constructed a digraph whose nodes were local regions (obtained from object detection, not pixels or superpixels). Directed edges represented local region similarity and saliency maps. Co-segmentation then became a shortest path problem, solved by dynamic programming (DP). The framework is shown in Figure 6.
- Meng et al. [45] proposed a co-saliency model for pairwise-constrained images, extracting dual-constrained saliency maps (single-image and multiple-image saliency) via pairwise-constrained graph matching, solved by DP.

The following figure (Figure 6 from the original paper) shows the framework of co-segmentation based on the shortest path algorithm: local region generation, graph construction, and the shortest path search, together with the original images, saliency maps, and final outputs.
4.2.6. Co-Segmentation Based on Thermal Diffusion
- This method maximizes the system temperature by changing heat source locations to achieve optimal segmentation. Anisotropic diffusion is often used for noise reduction while preserving edges.
- Kim et al. [46] proposed CoSand, using temperature maximization modeling on anisotropic diffusion to achieve large-scale multi-category co-segmentation by maximizing segmentation confidence.
- Kim et al. [47] achieved multi-foreground co-segmentation by iteratively performing scene modeling (local feature extraction with spatial pyramid matching, a linear SVM for matching, and a GMM for classification) and region labeling.
4.2.7. Object-Based Co-Segmentation
- Alexe et al. [48] quantified the possibility of an image window containing objects of any category, using Bayesian theory to find high-scoring windows as feature calibration.
- Vicente et al. [49] measured similarity between foreground objects, extracting top-scoring features from multiple candidate classes.
- Meng et al. [50] proposed a multi-group image co-segmentation framework using MRF and a dense mapping model, solved by EM, to achieve multi-foreground recognition by generating accurate prior knowledge.

Table 1 of the original paper, comparing the main co-segmentation methods, is reproduced and analyzed in Section 6.1.
4.3. Semantic Segmentation Based on Deep Learning
With the increased complexity of images and computational power, deep learning methods have become dominant. Early approaches like patch classification [53] were limited, but Fully Convolutional Networks (FCNs) [54] revolutionized the field by enabling end-to-end semantic segmentation for arbitrary image sizes.
The following figure (Figure 7 from the original paper) shows the Fully Convolutional Networks architecture.
The figure is a schematic of the Fully Convolutional Network: an input module, encoder modules, an up-sampling stage, and the output, with skip connections integrating predictions from different layers into the final segmentation.
4.3.1. Encoder-Decoder Architecture
This architecture is fundamental to many modern semantic segmentation networks, building upon FCNs.
- Encoder Stage: Typically composed of convolutional and pooling operations.
  - Convolution Operation: Involves sliding a convolutional kernel (a small matrix of weights) over the image, performing element-wise multiplication with the pixels in the receptive field, summing the results, and applying an activation function (e.g., ReLU) to produce a feature map. This extracts hierarchical features.
  - Pooling Operation: Downsamples feature maps to reduce spatial dimensions and computation while retaining important information. Max-pooling selects the maximum value in each pooling window.
  - Backbone Networks: Pre-trained image classification CNNs (e.g., VGG [57], Inception [58,59], ResNet [60]) are commonly used as encoders to extract high-dimensional semantic features.
- Decoder Stage: Aims to generate a semantic segmentation mask by mapping the high-dimensional features back to the original image size. This involves up-sampling; the three main routes are sketched in code after this list.
  - Interpolation: A simple up-sampling method that inserts new pixels between existing ones using strategies like bilinear or bicubic interpolation. It does not require learned parameters.
  - Deconvolution (Transposed Convolution): This operation reverses the process of convolution, expanding the spatial resolution of the feature map with learned parameters. FCNs use deconvolution for up-sampling.
  - Unpooling: Used in networks like SegNet [61]. During the max-pooling operation in the encoder, the indices (locations) of the maximum values within each pooling window are recorded. Unpooling uses these indices to place the max-pooled values back into their original positions in the decoder, setting other positions to zero, thus preserving boundary information.
  - Dense Up-sampling Convolution (DUC) [62]: Converts the label mapping in a feature map into smaller label mappings with multiple channels, achieved directly by convolutions without extra interpolation.
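The three up-sampling routes above map directly onto standard PyTorch operations. A minimal sketch with illustrative shapes:

```python
# Interpolation vs. transposed convolution vs. unpooling (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 16, 16)

# Encoder-side max-pooling that also records indices, as SegNet requires.
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
pooled, indices = pool(x)                      # -> (1, 64, 8, 8)

# 1. Interpolation: no learned parameters.
up_interp = F.interpolate(pooled, scale_factor=2, mode="bilinear",
                          align_corners=False)

# 2. Deconvolution / transposed convolution: learned up-sampling (FCN-style).
deconv = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
up_deconv = deconv(pooled)

# 3. Unpooling: place pooled maxima back at their recorded positions (SegNet).
unpool = nn.MaxUnpool2d(2, stride=2)
up_unpool = unpool(pooled, indices)

assert up_interp.shape == up_deconv.shape == up_unpool.shape == x.shape
```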
4.3.2. Skip Connections
Skip connections (or shortcut connections) address the degradation problem in deep networks (where performance decreases with depth) and improve pixel positioning by providing direct pathways for information flow.
- ResNet [60] and DenseNet [63]: Introduced skip connections to allow gradients to flow more easily through the network, helping to train very deep models.
- U-Net [64]: Proposed a novel long skip connection architecture, as shown in Figure 8. It concatenates feature maps from the encoder (which contain fine-grained spatial details) to the corresponding decoder layers (which contain high-level semantic information). This fusion helps the decoder produce precise segmentations with accurate boundaries. U-Net was initially designed for biomedical image segmentation and is widely adopted in medical image analysis; a minimal sketch of such a skip connection follows below.

The following figure (Figure 8 from the original paper) shows the U-Net architecture: the input image is processed by stacked convolution and pooling layers that progressively extract features, and the corresponding segmentation map is generated at the output.
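Below is a minimal U-Net-style decoder block: the decoder upsamples, concatenates the encoder feature map of matching resolution, and convolves. It is a sketch with illustrative channel sizes, not the published U-Net configuration.

```python
# A U-Net-style long skip connection (sketch).
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                        # restore spatial resolution
        x = torch.cat([x, skip], dim=1)       # fuse fine-grained encoder detail
        return self.conv(x)

decoder_feat = torch.randn(1, 128, 32, 32)    # deep, semantically rich
encoder_feat = torch.randn(1, 64, 64, 64)     # shallow, spatially precise
out = UpBlock(128, 64, 64)(decoder_feat, encoder_feat)  # -> (1, 64, 64, 64)
```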
4.3.3. Dilated Convolution
Dilated convolution (also known as atrous convolution) expands the receptive field of convolutional kernels without increasing the number of parameters or losing spatial resolution. This is achieved by inserting "holes" or gaps between kernel elements.
- DeepLab V1 [65]: Replaced max-pooling layers with dilated convolution to maintain the resolution of feature maps and address the loss of transfer invariance. It also used fully connected Conditional Random Fields (CRFs) as post-processing to refine segmentation boundaries and capture multi-scale context.
- Multi-Scale Context Aggregation [69]: Yu et al. [69] used dilated convolution to aggregate multiscale context information with a context module applying 3x3 convolutional kernels with varying dilation factors (e.g., [1, 1, 2, 4, 8, 16, 1]).
- Dilated Residual Network (DRN) [70]: Based on ResNet, DRN removed downsampling in the later convolutional groups (G4 and G5) and instead applied dilated convolutions with rates of 2 and 4, respectively, to maintain spatial resolution.
- Hybrid Dilated Convolution (HDC) [62]: Proposed to address the gridding problem (stacked dilated convolutions with the same rate cause a sparse, checkerboard-like sampling of information). HDC uses different dilation rates for consecutive layers, ensuring that the final receptive field completely covers a square region without holes; a short PyTorch sketch follows below.
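Dilated convolution is a one-line change in PyTorch: the same 3x3 kernel (no extra parameters), with gaps between taps that widen the receptive field. The mixed-rate pattern below is an illustrative HDC-style choice, not the configuration from [62].

```python
# Dilated convolutions with HDC-style mixed rates (sketch).
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)
layers = []
for rate in (1, 2, 5):
    # padding == dilation keeps the spatial size unchanged for a 3x3 kernel.
    layers += [nn.Conv2d(64, 64, kernel_size=3, dilation=rate, padding=rate),
               nn.ReLU(inplace=True)]
hdc = nn.Sequential(*layers)
assert hdc(x).shape == x.shape

# A single 3x3 conv with dilation r covers a (2r+1) x (2r+1) window, so
# stacking mixed rates covers a dense square region without holes.
```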
4.3.4. Multiscale Feature Extraction
Capturing multiscale features and context information is crucial for segmenting objects of varying sizes and understanding scene layout.
- Spatial Pyramid Pooling (SPP) [71]: Introduced to overcome the fixed-size input requirement of CNNs by pooling features from multiple spatial bins, producing a fixed-length output regardless of input size. It proved effective for semantic segmentation and object detection.
- Pyramid Scene Parsing Network (PSPNet) [72]: Utilized a Pyramid Pooling Module (PPM) to extract and aggregate features at different scales, combining local and global context information. As shown in Figure 9, the PPM takes the final feature map from a backbone network (e.g., ResNet), applies pooling at various scales and convolutions to reduce dimensionality, then up-samples and concatenates these features. The number of pyramid layers and their sizes are adaptable.

The following figure (Figure 9 from the original paper) shows PSPNet with the pyramid pooling module: the input image, feature maps, pyramid pooling module, and final prediction.

- DeepLab V2 [66]: Introduced Atrous Spatial Pyramid Pooling (ASPP) to capture multiscale features by applying parallel dilated convolutions with different dilation rates, as shown in Figure 10; a compact ASPP-style module is sketched in code at the end of this subsection.
- DeepLab V3 [67]: Further refined ASPP by applying both cascaded and parallel modules of dilated convolution, grouping the parallel convolutions, and adding a 1x1 convolution layer and batch normalization within ASPP. It significantly improved performance without DenseCRF post-processing.
- DeepLab V3+ [68]: A new encoder-decoder structure using DeepLab V3 as the encoder and Xception as the backbone. It adopted dilated depth-wise separable convolutions for efficient feature extraction and batch normalization to refine segmentation boundaries.

The following figure (Figure 10 from the original paper) shows the atrous spatial pyramid pooling module: parallel 3x3 convolutions with dilation rates of 6, 12, 18, and 24 extract features from the input feature map.

- Feature Pyramid Network (FPN) [74]: Similar to U-Net's skip connections, FPN constructs a feature pyramid with both high resolution (from shallow layers) and strong semantics (from deep layers), beneficial for object detection with varied object sizes.
- Adaptive Pyramid Context Network (APCNet) [75]: Uses multiple Adaptive Context Modules (ACMs) to build multiscale contextual feature representations. Each ACM uses a global image representation to estimate local affinity weights for subregions and calculates optimal context vectors.
- Enhanced Feature Pyramid Network (EFPN) [76]: Combines a Semantic Enhancement Module (SEM), an Edge Extraction Module (EEM), and a Context Aggregation Module (CAM) in the decoder, and a Global Fusion Module (GFM) in the encoder, to improve the robustness of multi-level feature fusion and capture deep semantic information.
- FPANet (Feature Pyramid Aggregation Network) [77]: A real-time encoder-decoder model. The encoder uses ResNet and ASPP. The decoder uses a Semantic Bidirectional Feature Pyramid Network (SeBiFPN) with a lightweight feature pyramid fusion module (FPFM) to fuse semantic and spatial information across different levels.
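A compact ASPP-style module: parallel 3x3 convolutions with different dilation rates plus a 1x1 branch, concatenated and fused. The rates (6/12/18) follow the DeepLab convention shown in Figure 10; everything else is an illustrative sketch that omits, e.g., DeepLab V3's global-pooling branch.

```python
# ASPP-style parallel dilated convolutions (sketch).
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                      # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous branches
             for r in rates]
        )
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (1 + len(rates)), out_ch, 1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Each branch sees the same feature map at a different effective scale.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feats = torch.randn(1, 256, 33, 33)        # backbone output (illustrative)
print(ASPP(256, 128)(feats).shape)         # -> torch.Size([1, 128, 33, 33])
```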
4.3.5. Attention Mechanisms
Attention mechanisms, originally from Natural Language Processing (NLP), allow models to selectively focus on relevant parts of the input, improving the modeling of dependencies between regions (especially long-distance ones) and channels.
- RNN-based Attention [78,79]: Recurrent Neural Networks (RNNs) can model short-term dependencies. The ReSeg network [79] (based on ReNet [80]) uses RNNs sweeping horizontally and vertically across the image to capture global context. The ReSeg architecture is shown in Figure 11.
- LSTM-based Attention [81,82]: Long Short-Term Memory (LSTM) networks extend RNNs with memory cells to model long-distance dependencies. Byeon et al. [81] used LSTM for pixel-for-pixel segmentation. Liang et al. [82] proposed graph LSTM to enhance global-context visual features.

The following figure (Figure 11 from the original paper) shows the ReSeg architecture: the input image passes through a sequence of convolutional feature maps and recurrent sweeps, producing a 32x32 segmentation output.

- Attention U-Net [83]: Introduced attention gates (AGs) into the U-Net architecture, as shown in Figure 12. Before concatenating encoder features with decoder features, the AG modules supervise the encoder features using the decoder features, adaptively readjusting the output features. AGs generate a gated signal that suppresses irrelevant background regions and highlights salient features.
- Attention UW-Net [84]: Improves U-Net with dense skip connections and modified attention gates for medical chest X-ray images, enhancing attention to salient regions and suppressing background.

The following figure (Figure 12 from the original paper) shows the attention U-Net architecture: the input image is processed by stacked convolutions, up-sampling, max-pooling, and attention gates, with skip connections preserving feature information.

- Self-Attention Mechanisms: Used in encoder networks to model correlations between different pixels or channels within a single feature map. Self-attention computes a weighted sum over all positions using pairwise affinities (see the sketch after this list). Influential achievements include PSANet [85], DANet [86], APCNet [75], CARAFE [87], and CARAFE++ [88].
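A sketch of spatial self-attention over a feature map, in the spirit of position-attention modules such as DANet's: every spatial position attends to every other. Channel sizes are illustrative, and this is not any specific published module.

```python
# Spatial self-attention over a CNN feature map (sketch).
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)   # query projection
        self.k = nn.Conv2d(ch, ch // 8, 1)   # key projection
        self.v = nn.Conv2d(ch, ch, 1)        # value projection

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)      # (B, HW, C/8)
        k = self.k(x).flatten(2)                      # (B, C/8, HW)
        v = self.v(x).flatten(2)                      # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)           # (B, HW, HW) affinities
        out = v @ attn.transpose(1, 2)                # weighted sum over positions
        return out.view(b, c, h, w) + x               # residual connection

x = torch.randn(2, 64, 16, 16)
print(SpatialSelfAttention(64)(x).shape)              # -> torch.Size([2, 64, 16, 16])
```

The (HW x HW) affinity matrix is what gives self-attention its long-range modeling power, and also its quadratic memory cost, which is exactly what Swin's windowed attention (below) mitigates.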
- Transformer [89]: A deep neural network based solely on self-attention, entirely dispensing with convolutions and recurrence.
  - Vision Transformer (ViT) [92]: Applied Transformers to image recognition. It divides images into fixed-size patches, flattens them into a sequence of vectors, and inputs them into a Transformer encoder (composed of multi-head attention layers and multi-layer perceptrons (MLPs)). The ViT model is shown in Figure 13, and a patch-embedding sketch follows the figures below.
  - Swin Transformer [93]: Achieved impressive performance in image semantic segmentation and instance segmentation. It introduced a shifted windowing approach, calculating self-attention within local windows and using cyclic-shifting window partitions to introduce cross-window connections between neighboring non-overlapping windows. The Swin Transformer architecture is shown in Figure 14.

The following figure (Figure 13 from the original paper) shows the ViT model: the input is linearly projected and flattened into patch embeddings fed to the Transformer encoder, whose main components are multi-head attention and feed-forward (MLP) blocks with layer normalization.

The following figure (Figure 14 from the original paper) shows the architecture of the Swin Transformer: patch partition, Swin Transformer blocks, and patch merging stages that progressively build hierarchical feature representations.
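To make the ViT pipeline of Figure 13 concrete, here is a minimal patch-embedding and encoder sketch in PyTorch. The patch size and embedding dimension follow the common ViT-Base convention; the encoder depth and input are illustrative assumptions.

```python
# ViT-style patch embedding + Transformer encoder (sketch).
import torch
import torch.nn as nn

img_size, patch, dim = 224, 16, 768
n_patches = (img_size // patch) ** 2                       # 14 * 14 = 196

# A strided convolution implements "split into patches, flatten, project".
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

x = torch.randn(2, 3, img_size, img_size)
tokens = to_patches(x).flatten(2).transpose(1, 2)          # (2, 196, 768)
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1) + pos_embed

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2,   # illustrative depth; ViT-Base uses 12
)
out = encoder(tokens)                                      # (2, 197, 768)
```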
5. Experimental Setup
As a review paper, this document does not present new experimental results but summarizes the experimental setups and outcomes of the reviewed works, particularly in the semantic segmentation section.
5.1. Datasets
The paper mentions and lists various datasets used by the reviewed deep learning models in Table 2. These datasets cover different domains and characteristics:
- PASCAL VOC (Visual Object Classes) [54,65,66,67,68,70,72,75,76,85,86]: A widely used dataset for object detection, segmentation, classification, and action recognition. It contains natural images with annotations for 20 foreground object classes and one background class.
- NYUDv2 [54]: An indoor scene dataset providing RGB-D (color and depth) images.
- PhC-U373 [64]: A dataset of phase contrast microscopy images of glioblastoma-astrocytoma cells, used in medical image segmentation.
- DIC-HeLa [64]: A dataset of differential interference contrast microscopy images of HeLa cells, also for medical image segmentation.
- CamVid [61,64,73,79,77]: A road scene understanding dataset with pixel-level semantic labels for driving scenarios, crucial for autonomous vehicles.
- SUN RGBD [61]: A dataset for 3D scene understanding in indoor environments, providing RGB-D images and semantic segmentation labels.
- Cityscapes [61,62,66,67,70,72,77,86]: A large-scale dataset of urban street scenes, providing semantic, instance, and panoptic annotations for understanding driving environments.
- ADE20K [75,85,87,88]: A comprehensive dataset for scene parsing, containing diverse images with detailed pixel-level annotations for various object and stuff categories.
- PASCAL Context [75,86]: Extends PASCAL VOC by providing pixel-level labels for the entire image, including both objects and "stuff" (e.g., grass, sky).
- COCO Stuff [86]: Similar to PASCAL Context, provides pixel-level annotations for both object and "stuff" categories from the COCO dataset.
- TCIA Pancreas CT-82 [83]: A medical imaging dataset of CT scans for pancreas segmentation.
- NIH Chest X-ray [84]: A medical imaging dataset of chest X-ray images, often used for tasks like lung segmentation or pathology detection.

These datasets were chosen to validate methods across different segmentation tasks (general natural scenes, medical, autonomous driving, indoor 3D) and data modalities (RGB, RGB-D, microscopy, CT, X-ray). They are widely accepted benchmarks in their respective domains for evaluating image segmentation performance.
5.2. Evaluation Metrics
The primary evaluation metric used in the context of semantic segmentation results presented in Table 2 is Mean Intersection over Union (mIoU).
- Conceptual Definition: Intersection over Union (IoU), also known as the Jaccard Index, is a standard metric for evaluating the accuracy of an object detector or segmenter on a particular dataset. It quantifies the overlap between the predicted segmentation mask and the ground truth mask. Mean IoU (mIoU) extends this by calculating the IoU for each class present in the dataset and then averaging these IoU values over all classes. It provides a robust measure of segmentation quality, balancing the detection of objects and the accuracy of their boundaries. A higher mIoU indicates better segmentation performance.
- Mathematical Formula:
The IoU for a single class $C$ is calculated as:
$
\mathrm{IoU}_C = \frac{|P_C \cap G_C|}{|P_C \cup G_C|}
$
Where:
- $P_C$: The set of pixels predicted as belonging to class $C$.
- $G_C$: The set of pixels actually belonging to class $C$ (ground truth).
- $|\cdot|$: Denotes the cardinality (number of pixels) of a set.
- $P_C \cap G_C$: The intersection of the two sets (correctly predicted pixels for class $C$).
- $P_C \cup G_C$: The union of the two sets (all pixels that are either predicted as $C$ or actually belong to $C$).

The Mean IoU (mIoU) is then calculated by averaging the IoU values over all classes:
$
\mathrm{mIoU} = \frac{1}{N_c} \sum_{C=1}^{N_c} \mathrm{IoU}_C
$
Where:
- $N_c$: The total number of classes.
- $\mathrm{IoU}_C$: The Intersection over Union for class $C$.
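The formulas above translate directly into a confusion-matrix computation. The following is a standard sketch, not code from the reviewed paper:

```python
# mIoU from integer label maps via a confusion matrix (sketch).
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label maps of identical shape."""
    # Confusion matrix: rows = ground-truth class, cols = predicted class.
    cm = np.bincount(gt.ravel() * num_classes + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm)                        # |P_C intersect G_C| per class
    union = cm.sum(0) + cm.sum(1) - inter      # |P_C union G_C| per class
    iou = inter / np.maximum(union, 1)         # avoid division by zero
    return iou.mean()                          # average over classes

pred = np.random.randint(0, 3, (4, 4))
gt = np.random.randint(0, 3, (4, 4))
print(mean_iou(pred, gt, num_classes=3))
```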
5.3. Baselines
The paper, being a review, compares various proposed methods against each other and implicitly against their predecessor architectures. For instance, FCN is a foundational baseline for subsequent deep learning models. U-Net, SegNet, and DeepLab series models serve as baselines for more recent CNN-based architectures. The inclusion of diverse methods (e.g., MRF-Based, Random Walks, Active Contours, Clustering, Graph Theory, Thermal Diffusion, Object-Based for co-segmentation) shows a comparison against different algorithmic paradigms. ResNet and VGG are frequently mentioned as backbone networks, implying that models using these as components are compared against other models or variations of themselves.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents two main tables summarizing the characteristics and performance of the reviewed co-segmentation and semantic segmentation methods.
The following are the results from Table 1 of the original paper, comparing and analyzing main co-segmentation methods. This table details the features, co-information strategies, and optimization techniques used across different co-segmentation paradigms.
| Methods | Ref. | Foreground Feature | Co-Information | Optimization |
|---|---|---|---|---|
| MRF-Based Co-Segmentation | [24] | color histogram | L1 norm | graph cuts |
| | [26] | color histogram | L2 norm | quadratic pseudo-Boolean |
| | [27] | color and texture histograms | reward model | maximum flow |
| | [25] | color histogram | Boykov-Jolly model | dual decomposition |
| | [28] | color and SIFT features | region matching | graph cuts |
| | [29] | SIFT feature | K-means + L1,2 | graph cuts |
| | [30] | SIFT feature | Gaussian mixture model (GMM) constraint | graph cuts |
| Co-Segmentation Based on Random Walks | [33] | color and texture histograms | improved random walk global term | gradient projection and conjugate gradient (GPCG) |
| | [34] | intensity and gray difference | improved random walk global term | graph size reduction |
| | [35] | label prior from user scribbles | GMMs | minimize the average reaching probability |
| Co-Segmentation Based on Active Contours | [38] | color histogram | reward model | level set function |
| | [39] | co-registered atlas and statistical features | k-means | level set function |
| | [40] | saliency information | improved Chan-Vese (C-V) model | level set function |
| Clustering-Based Co-Segmentation | [41] | SIFT, Gabor filter, color histogram | Chi-square distance | low-rank |
| | [43] | color and location information | discriminant clustering | expectation maximization (EM) |
| | [42] | pyramid of LAB colors, HOG textures, SURF features histogram | hierarchical clustering | normalized cut criterion |
| Co-Segmentation Based on Graph Theory | [44] | color histogram | digraphs built according to region similarity and saliency | shortest path |
| | [45] | color and shape information | global terms built from digraphs and saliency | shortest path |
| Co-Segmentation Based on Thermal Diffusion | [46] | Lab-space color and texture information | Gaussian consistency | sub-modularity optimization |
| | [47] | color and texture histograms | GMM & SPM (spatial pyramid matching) | dynamic programming |
| Object-Based Co-Segmentation | [48] | multi-scale saliency, color contrast, edge density, and superpixels straddling | Bayesian framework | maximizing the posterior probability |
| | [49] | 33 types of features | random forest classifier | A-star search algorithm |
Analysis of Table 1 (Co-Segmentation):
- Feature Evolution: Initially, MRF-based methods often relied on basic color histograms. Later MRF and other co-segmentation methods incorporated richer features such as texture histograms, SIFT features, Gabor filters, LAB colors, HOG textures, and SURF features, showing a clear trend toward more robust and descriptive foreground features. Saliency information and multi-scale saliency also emerged as important cues.
- Co-Information Strategies: Various strategies were developed to model co-information (similarity across images). These include L1/L2 norms, reward models, the Boykov–Jolly model, region matching, K-means feature grouping, GMM constraints, improved random walk global terms, Chi-square distance, discriminant clustering, hierarchical clustering, digraphs built from region similarity and saliency, Gaussian consistency, and Bayesian frameworks. This diversity highlights the difficulty of effectively capturing shared object characteristics across different images.
- Optimization Techniques: Graph cuts is the dominant optimization method, especially for MRF-based approaches, while level set functions are key to active-contour-based methods. Other techniques include quadratic pseudo-Boolean optimization, maximum flow, dual decomposition, gradient projection and conjugate gradient (GPCG), graph size reduction, minimizing the average reaching probability, low-rank approximation, expectation maximization (EM), the normalized cut criterion, shortest-path algorithms, sub-modularity optimization, dynamic programming, and A-star search. The choice of optimizer generally depends on the mathematical form of the energy or objective function; the graph-cut case is sketched below.
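To illustrate the energy-minimization view, here is a minimal sketch (our own toy construction, not any of the cited papers' formulations) that exactly minimizes a binary MRF energy, unary intensity terms plus a Potts smoothness term, by computing a minimum s-t cut with networkx. The 1-D "image", the intensity-based unary costs, and the weight `w` are all illustrative assumptions:

```python
import networkx as nx

# Binary MRF energy: E(x) = sum_i U_i(x_i) + w * sum_{i~j} [x_i != x_j],
# minimized exactly by a minimum s-t cut on the graph built below.

pixels = [0.9, 0.8, 0.7, 0.2, 0.1]   # toy 1-D "image" of intensities
w = 0.5                              # pairwise smoothness weight

G = nx.DiGraph()
for i, v in enumerate(pixels):
    # Cutting s->i assigns pixel i to background and pays U_i(bg) = v
    # (bright pixels resist the background label); cutting i->t assigns
    # it to foreground and pays U_i(fg) = 1 - v.
    G.add_edge("s", i, capacity=v)
    G.add_edge(i, "t", capacity=1 - v)
for i in range(len(pixels) - 1):
    # Potts pairwise term: pay w whenever neighbors take different labels.
    G.add_edge(i, i + 1, capacity=w)
    G.add_edge(i + 1, i, capacity=w)

energy, (fg, bg) = nx.minimum_cut(G, "s", "t")
labels = [1 if i in fg else 0 for i in range(len(pixels))]
print("energy:", energy, "labels:", labels)   # expected labels: [1, 1, 1, 0, 0]
```

The same construction scales to 2-D grids with 4- or 8-connected neighborhoods; specialized max-flow solvers are what make graph cuts practical at image resolution.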
The following are the results from Table 2 of the original paper, comparing and analyzing semantic segmentation methods based on deep learning. This table focuses on the publication year, backbone networks, datasets, mIoU performance, and major contributions of influential deep learning models. Where a cell lists multiple datasets, the mIoU values follow the same order.

| Algorithms | Pub. Year | Backbone | Datasets | mIoU (%) | Major Contributions |
|---|---|---|---|---|---|
| FCN [54] | 2015 | VGG-16 | PASCAL VOC 2011; NYUDv2 | 62.7; 34.0 | The forerunner of end-to-end semantic segmentation |
| U-Net [64] | 2015 | VGG-16 | PhC-U373; DIC-HeLa | 92.03; 77.56 | Encoder-decoder structure, skip connections |
| SegNet [61] | 2016 | VGG-16 | CamVid; SUN RGBD | 60.4; 28.27 | Transferred the max-pooling indices to the decoder |
| DeepLabv1 [65] | 2016 | VGG-16 | PASCAL VOC 2012 | 71.6 | Atrous convolution, fully connected CRFs |
| MSCA [88] | 2016 | VGG-16 | PASCAL VOC 2012 | 75.3 | Dilated convolutions, multi-scale context aggregation, front-end context module |
| LRR [73] | 2016 | ResNet/VGG-16 | PASCAL VOC 2011; Cityscapes; CamVid | 77.5; 69.7; 91.6 | Reconstruction up-sampling module, Laplacian pyramid refinement |
| ReSeg [79] | 2016 | VGG-16 & ReNet | Oxford Flowers; CamVid | 93.7; 58.8 | Extension of ReNet to semantic segmentation |
| DRN [70] | 2017 | ResNet-101 | Cityscapes | 70.9 | Modified Conv4/5 of ResNet, dilated convolution |
| PSPNet [72] | 2017 | ResNet50 | PASCAL VOC 2012; Cityscapes | 85.4; 80.2 | Spatial pyramid pooling (SPP) |
| DeepLab V2 [66] | 2017 | VGG-16/ResNet-101 | PASCAL VOC 2012; Cityscapes | 79.7; 70.4 | Atrous spatial pyramid pooling (ASPP), fully connected CRFs |
| DeepLab V3 [67] | 2017 | ResNet-101 | PASCAL VOC 2012; Cityscapes | 86.9; 81.3 | Cascaded or parallel ASPP modules |
| DeepLab V3+ [68] | 2018 | Xception | PASCAL VOC 2012; Cityscapes | 89.0; 82.1 | A new encoder-decoder structure with DeepLab V3 as the encoder |
| DUC-HDC [62] | 2018 | ResNet-101/ResNet-152 | PASCAL VOC 2012; Cityscapes | 83.1; 80.1 | HDC (hybrid dilated convolution) proposed to solve the gridding caused by dilated convolutions |
| Attention U-Net [83] | 2018 | VGG-16 with AGs | TCIA Pancreas CT-82; CT-150 | n/a | A novel self-attention gating (AGs) filter, skip connections |
| PSANet [85] | 2018 | ResNet-101 | ADE20K; PASCAL VOC 2012; Cityscapes | 81.51; 85.7; 81.4 | Point-wise spatial attention maps from two parallel branches, bi-directional information propagation model |
| APCNet [75] | 2019 | ResNet-101 | PASCAL VOC 2012; PASCAL Context; ADE20K | 84.2; 54.7; 45.38 | Multi-scale, global-guided local affinity (GLA), adaptive context modules (ACMs) |
| DANet [86] | 2019 | ResNet-101 | Cityscapes; PASCAL VOC 2012; PASCAL Context; COCO Stuff | 81.5; 82.6; 52.6; 39.7 | Dual attention: position attention module and channel attention module |
| CARAFE [87] | 2019 | ResNet-50 | ADE20K | 42.23 | Pyramid pooling module (PPM), feature pyramid network (FPN), multi-level feature fusion (FUSE) |
| EFPN [76] | 2021 | VGG-16 | PASCAL VOC 2012; Cityscapes; PASCAL Context | 86.4; 82.3; 53.9 | PPM, multi-scale feature fusion module with a parallel branch |
| CARAFE++ [88] | 2021 | ResNet-101 | ADE20K | 43.94 | PPM, FPN, FUSE, adaptive kernels on-the-fly |
| Swin Transformer [93] | 2021 | Swin-L | ADE20K | 53.5 | A novel shifted windowing scheme, a general backbone network for computer vision |
| Attention UW-Net [84] | 2022 | ResNet50 | NIH Chest X-ray | n/a | Skip connections, an intermediate layer that combines the fourth-layer encoder feature maps with the last-layer encoder feature maps, attention mechanism |
| FPANet [77] | 2022 | ResNet18 | Cityscapes; CamVid | 75.9; 74.7 | Bilateral directional FPN, lightweight ASPP, feature pyramid fusion module (FPFM), border refinement module (BRM) |
Analysis of Table 2 (Semantic Segmentation):
- Performance Trend: There is a clear trend of increasing mIoU over time, indicating continuous improvement in semantic segmentation accuracy. Early models like FCN achieve around 62.7% mIoU on PASCAL VOC 2011, while the later DeepLab V3+ reaches 89.0% on PASCAL VOC 2012. PSPNet and the DeepLab series consistently perform well on challenging datasets such as Cityscapes.
- Backbone Evolution: VGG-16 was the common backbone in earlier models (FCN, U-Net, SegNet, DeepLabv1). As deep learning progressed, ResNet (e.g., ResNet50, ResNet-101, ResNet-152) became the dominant backbone thanks to its ability to train deeper networks and mitigate vanishing gradients through skip connections. More specialized backbones such as Xception (DeepLab V3+) and Transformer-based backbones (Swin-L for the Swin Transformer) represent the cutting edge.
- Architectural Innovations:
  - Encoder-Decoder & Skip Connections: U-Net (2015) highlighted the importance of encoder-decoder structures and skip connections for precise segmentation, especially in medical imaging.
  - Dilated Convolution & Multi-scale Context: The DeepLab series (from 2016) extensively used atrous convolution and ASPP to capture multiscale context without sacrificing resolution. PSPNet (2017) introduced PPM toward similar goals, and DUC-HDC specifically tackled the gridding problem of dilated convolutions. A minimal sketch combining skip connections with an ASPP bottleneck follows this list.
  - Attention Mechanisms: Attention U-Net (2018) and PSANet/DANet (2018-2019) demonstrated the effectiveness of attention mechanisms in enhancing feature representation and modeling long-range dependencies.
  - Transformers: The Swin Transformer (2021) marked a significant shift, showing that Transformer-based architectures can serve as general backbones for computer vision and achieve strong performance in segmentation tasks.
- Dataset Specialization: While PASCAL VOC and Cityscapes remain the standard benchmarks for general and street-scene segmentation, U-Net and Attention U-Net perform strongly on specialized medical imaging datasets such as PhC-U373, DIC-HeLa, TCIA Pancreas CT-82, and NIH Chest X-ray, demonstrating the adaptability of these architectures.
- Real-time Considerations: The entry for FPANet (2022) highlights the ongoing focus on real-time semantic segmentation, emphasizing lightweight architectures and efficient modules such as the lightweight ASPP and FPFM.
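The recurring building blocks above are easy to see in miniature. The following PyTorch sketch is our own toy construction, not one of the reviewed architectures: a two-level encoder-decoder with a U-Net-style skip connection and a simplified ASPP bottleneck whose parallel dilated 3x3 convolutions enlarge the receptive field without reducing resolution. All layer sizes and dilation rates here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Simplified atrous spatial pyramid pooling: parallel dilated 3x3 branches."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Every branch keeps the input's spatial size; the rates only
        # change how far apart the sampled pixels are (receptive field).
        return self.project(torch.cat([F.relu(b(x)) for b in self.branches], dim=1))

class TinySegNet(nn.Module):
    """Toy encoder-decoder with one skip connection and an ASPP bottleneck."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 16, 3, padding=1)
        self.enc2 = nn.Conv2d(16, 32, 3, padding=1)
        self.aspp = ASPP(32, 32)
        self.dec = nn.Conv2d(32 + 16, num_classes, 3, padding=1)  # skip concat

    def forward(self, x):
        e1 = F.relu(self.enc1(x))                      # full resolution
        e2 = F.relu(self.enc2(F.max_pool2d(e1, 2)))    # 1/2 resolution
        b = self.aspp(e2)                              # multi-scale context
        up = F.interpolate(b, scale_factor=2, mode="bilinear", align_corners=False)
        return self.dec(torch.cat([up, e1], dim=1))    # U-Net-style skip connection

logits = TinySegNet()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 3, 64, 64]): per-pixel class scores
```

Because each dilated branch uses padding equal to its dilation rate, all branches preserve the input's spatial size and can be concatenated directly; this resolution-preserving property is what the DeepLab series exploits at scale.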
6.2. Ablation Studies / Parameter Analysis
As a comprehensive review paper, this document synthesizes findings from numerous primary research articles. Therefore, it does not conduct its own ablation studies or parameter analyses. Instead, it implicitly reports on the results of such studies from the original papers by detailing the "Major Contributions" and observed performance (mIoU) improvements attributed to specific architectural components (e.g., skip connections in U-Net, ASPP in DeepLab, attention gates in Attention U-Net, shifted windowing scheme in Swin Transformer). The improvements in mIoU shown in Table 2 for successive versions of models (e.g., DeepLab V1 to DeepLab V3+) are indirect evidence of component effectiveness and hyper-parameter optimization performed by the original authors.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper provides a valuable, systematic review of image segmentation techniques, tracing their evolution from classic segmentation through co-segmentation to semantic segmentation based on deep learning. It highlights a clear developmental trajectory from coarse-grained to fine-grained analysis, manual feature extraction to adaptive learning, and single-image-oriented methods to approaches that leverage big data for common feature learning. The paper thoroughly elaborates on the main algorithms and key techniques at each stage, offering a comprehensive overview of influential models and their respective advantages and limitations. The Fully Convolutional Network (FCN) is identified as the foundational breakthrough for deep learning in segmentation, leading to a proliferation of CNN-based architectures that are now transitioning into Transformer-based models.
7.2. Limitations & Future Work
The authors identify several key challenges and future research directions for image segmentation techniques:
- Complexity of Segmentation Tasks: Semantic segmentation, instance segmentation, and panoramic segmentation remain active research hotspots. Panoramic segmentation is particularly challenging due to the need to simultaneously recognize countable instances and uncountable stuff regions within a single workflow, requiring robust networks that can handle both large inter-category and small intra-category differences.
- 3D Data Segmentation: With the rise of 3D acquisition equipment (e.g., LiDAR cameras), segmentation of RGB-depth data, 3D point clouds, voxels, and meshes is gaining importance for applications such as face recognition, autonomous vehicles, VR/AR, and architectural modeling. However, representing and processing inherently unstructured, redundant, disordered, and unevenly distributed 3D data poses significant challenges.
- Data Scarcity and Annotation Limitations: Many fields lack large, fine-grained annotated datasets, which hinders the training of supervised deep learning algorithms. Future work needs to explore semi-supervised, unsupervised, transfer learning, and few-shot approaches to image semantic segmentation, which can learn effectively from limited labeled samples. Reinforcement learning is also noted as a possible, though less explored, solution.
- Computational Efficiency and Real-time Performance: Deep learning networks demand substantial computing resources during training and inference. Achieving real-time segmentation (e.g., >25 fps for video processing) is a critical requirement for many applications. Balancing model accuracy with real-time performance remains a significant challenge, despite progress with lightweight networks.
- Explicability of Deep Learning: The "black box" nature of deep learning models limits their robustness, reliability, and performance optimization in critical downstream tasks. Improving model interpretability is a crucial long-term goal.
7.3. Personal Insights & Critique
This review paper provides an excellent, structured overview of image segmentation, which is highly beneficial for beginners in the field. The chronological categorization into classic, co-segmentation, and deep learning-based methods effectively illustrates the progression of research and the increasing complexity of techniques. The detailed breakdown of deep learning architectures, including encoder-decoder, skip connections, dilated convolution, multiscale feature extraction, and attention mechanisms, is particularly helpful for understanding the building blocks of modern segmentation models. The authors' discussion of challenges and future trends is insightful, accurately reflecting the current research landscape, especially the pivotal shift towards Transformers.
One area for potential improvement, understandable given the breadth of a review, is that some classic segmentation algorithms are described only conceptually, without their mathematical formulations. For instance, the active contour energy functional or the graph-cut energy minimization could have been presented with their core equations, even in simplified form, to offer a deeper technical understanding of their mechanics. For a beginner-friendly overview, however, the current level of detail is appropriate.
The paper's strong emphasis on the evolution from CNNs to Transformers is timely and relevant, as Transformers are indeed reshaping the computer vision landscape. The identified limitations, such as the challenges of 3D data and the need for unsupervised/semi-supervised learning due to data scarcity, point to crucial directions for future innovation. The problem of deep learning explicability is also a critical, cross-disciplinary challenge that segmentation research, like many other AI fields, must confront to achieve broader trust and deployment in sensitive applications such as medical imaging. Overall, this paper serves as a valuable resource for navigating the diverse and rapidly evolving field of image segmentation.