
A Survey of Deep Learning-Based Image Tampering Detection Methods (基于深度学习的图像篡改检测方法综述)

Paper status: completed

Total: 2
Completed: 2 · In progress: 0 · Failed: 0
gemini-2.5-pro
10/8/2025, 8:18:10 PM submitted to queue · 10/8/2025, 8:19:05 PM completed
Completed

Chinese Review



Synopsis of the paper

This paper surveys deep learning-based methods for image tampering detection. It first outlines the research background and current state of the field, then systematically categorizes deep learning methods from the past five years, covering techniques such as multi-stream information fusion, multi-scale feature extraction, contrastive learning, and Transformers. The paper also introduces the datasets and evaluation metrics commonly used in the field and compares the performance of different methods. Finally, it summarizes the limitations of current approaches and offers an outlook on future research directions. The work's core contribution is a structured overview and classification framework of deep learning-based image tampering detection techniques for researchers.

Summary of Review

This manuscript provides a comprehensive, well-structured survey of deep learning-based image tampering detection. Its main strengths are a clear methodological taxonomy (see Figure 5) and effective diagrammatic explanations of key concepts (see Figures 6 and 9). However, the paper falls short in several key respects. The most significant problem is the lack of quantitative performance comparison: only qualitative visual results are provided (see Figure 12), which weakens the survey's objectivity and practical utility. In addition, the exposition of core techniques (such as attention mechanisms and contrastive learning) lacks the necessary mathematical formalization (No direct evidence found in the manuscript). Finally, although the paper categorizes the methods, it offers little in-depth critical analysis of the relative strengths and weaknesses of each category (No direct evidence found in the manuscript).

Strengths

  • Systematic methodological taxonomy

    • The manuscript proposes a clear classification framework for organizing recent deep learning methods (Figure 5), dividing existing work into categories such as multi-stream fusion, multi-scale extraction, edge information, contrastive learning, and Transformer architectures. This structured taxonomy helps readers quickly grasp the landscape of the field and reflects the authors' systematic command of the literature.
    • The taxonomy (Figure 5) covers not only mainstream CNN architectures but also the latest Transformer models, reflecting the field's technical evolution; the survey is thus well timed.
    • The paper uses multiple flowcharts (e.g., Figure 6 for multi-domain fusion, Figure 9 for exploiting edge features) to illustrate the key branches of the taxonomy, making the abstract classification easier to understand and improving the paper's clarity.
  • Excellent visual explanations

    • The paper makes extensive use of diagrams to explain complex concepts, which greatly improves readability. For example, Figure 1 intuitively shows the three main tampering types (splicing, copy-move, removal) together with their ground-truth masks, giving newcomers a clear introduction.
    • For specific technical families, Figure 11 uses a standard flowchart to explain the core computation of the attention mechanism (generation and interaction of the Q, K, and V matrices), helping readers understand how it works; the depiction is technically sound.
    • Figure 12 shows side-by-side detection results of several SOTA methods on four different datasets, giving readers an intuitive qualitative performance comparison and effectively demonstrating how detection capability differs across scenarios; the experimental presentation is persuasive.
  • Broad coverage

    • Beyond the core algorithms, the manuscript devotes space to the other important components of image tampering detection. The abstract explicitly mentions an introduction to the main datasets and evaluation metrics, which are essential elements of a complete survey.
    • The annotations in Figure 12 show that the paper covers at least four commonly used datasets (CASIA, NIST16, Coverage, and Columbia), indicating comprehensive dataset coverage.
    • The taxonomy (Figure 5) includes a branch on "methods under lossy post-processing," showing that the paper attends to challenges common in the real world, such as JPEG compression and noise, which strengthens the survey's practical relevance and impact.

Weaknesses

  • Lack of quantitative performance comparison

    • The manuscript's key performance-comparison component (Figure 12) provides only visual results, which makes the comparison highly subjective. Readers cannot draw objective conclusions from the figure about which method is better on any particular metric.
    • The abstract claims a "performance comparison of various methods," yet the manuscript contains no summary table with key evaluation metrics such as F1-score, IoU, AUC, or MCC. For a survey that aims to summarize the state of the art, this is a major flaw that undermines its scientific rigor.
    • Without quantitative data, readers cannot assess the fine-grained performance differences of methods such as MVSS-Net and CAT-Net across tampering types or datasets, which greatly limits the paper's practical guidance value.
  • Insufficient mathematical formalization

    • The manuscript does not provide the key mathematical formulas needed to support its descriptions of core techniques. For example, when introducing contrastive learning (illustrated in Figure 10), it does not give the mathematical expression of the core contrastive loss (e.g., the InfoNCE loss). No direct evidence found in the manuscript.
    • When explaining the attention mechanism (Figure 11), the diagram is clear, but the core computation, scaled dot-product attention (e.g., Softmax(QKᵀ/√dₖ)V), is missing, leaving the explanation superficial. No direct evidence found in the manuscript.
    • Where frequency-domain or noise-domain features are involved (Figures 6 and 7), the manuscript gives no concrete mathematical definition of the feature-extraction transforms (e.g., DCT) or filters (e.g., SRM), which hurts the precision and reproducibility of the technical descriptions. No direct evidence found in the manuscript.
  • Lack of in-depth critical analysis

    • The paper succeeds at categorizing the methods (Figure 5) but seems to stop at description, without probing the internal connections among categories, the logic of their evolution, or their respective strengths and weaknesses. For example, it does not analyze the fundamental advantages and potential costs (e.g., computational complexity, data dependence) of Transformer-based methods relative to CNN-based ones for tampering detection. No direct evidence found in the manuscript.
    • The "limitations" discussion mentioned in the abstract appears, from the available information, to be rather general. There is no evidence that the authors dissect the specific failure modes of any particular method (e.g., OSN) or method family (e.g., multi-stream fusion). No direct evidence found in the manuscript.
    • The paper lacks a meta-analysis of the field's development. For example, it does not discuss how the characteristics of different datasets drove algorithmic progress, or why certain techniques (such as contrastive learning) became popular in recent years. Such macro-level insight is essential for a high-quality survey. No direct evidence found in the manuscript.

Suggestions for Improvement

  • Add quantitative performance-comparison tables

    • Alongside the qualitative results (Figure 12), add one or more summary tables listing the key performance metrics (F1-score, IoU, etc.) of mainstream methods on all the standard datasets mentioned (CASIA, NIST16, etc.).
    • The tables can be organized according to the paper's taxonomy (Figure 5), so that readers can clearly compare the performance of the different technical families.
    • In the main text, discuss these tables in depth, analyzing which methods perform better under which conditions and explaining the likely reasons, in place of the current, largely subjective visual assessment.
  • Supply the key mathematical formulas

    • In the contrastive-learning section, write out the mathematical expression of the loss function it relies on and explain each term, to support the understanding of Figure 10 (the standard forms are sketched after this list for reference).
    • When explaining the attention mechanism (Figure 11), provide its core computation formula and define the matrices (Q, K, V) and operations involved.
    • When discussing frequency-domain or noise-domain feature extraction (Figures 6 and 7), give the concrete formulas of the transforms or filters used, to strengthen the rigor of the technical description.
  • Strengthen critical analysis and discussion

    • At the end of each method-category section, add a short subsection that discusses the category's advantages (e.g., performance, efficiency) and disadvantages (e.g., computational cost, generalization), with supporting evidence from the literature.
    • Make the "limitations and future outlook" section more specific. For example, instead of merely stating "insufficient robustness," state that "existing methods are not robust to repeated JPEG compression," and derive concrete future directions from it, such as "developing tampering features that are invariant to compression history."
    • Add a discussion of the field's development trajectory, e.g., analyzing the evolution from early hand-crafted features to multi-stream CNNs and then to attention mechanisms and Transformers, and the driving forces behind it.
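
For reference, the standard textbook forms of the two formulas this review asks for are sketched below; they do not appear in the manuscript and are supplied here only to make the suggestion concrete. Here sim(·,·) is a similarity such as cosine, τ a temperature, and d_k the query/key dimension:

    % InfoNCE contrastive loss for a positive pair (z_i, z_j) among 2N samples
    \mathcal{L}_{\mathrm{InfoNCE}} =
      -\log \frac{\exp\bigl(\mathrm{sim}(z_i, z_j)/\tau\bigr)}
                 {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\bigl(\mathrm{sim}(z_i, z_k)/\tau\bigr)}

    % Scaled dot-product attention
    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V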

References

None

英文审稿

Synopsis of the paper

This survey paper provides a review of deep learning-based methods for image tamper detection. The authors begin by introducing the problem of image manipulation and the goals of forgery detection. They then propose a taxonomy for deep learning methods from the last five years, categorizing them into approaches based on multi-stream information fusion, multi-scale feature extraction, edge information, contrastive learning, Transformer architectures, and robustness to post-processing. The paper summarizes common datasets and evaluation metrics used in the field. It presents a performance comparison of several representative methods on these datasets. Finally, the authors discuss the limitations of current techniques and suggest potential directions for future research, highlighting challenges in robustness and generalizability.

Summary of Review

This paper provides a well-structured and timely survey of a rapidly evolving research area. Its primary strengths are a clear and logical taxonomy of recent deep learning methods (see Figure 2; Section 2) and a comprehensive overview of relevant datasets and evaluation metrics (see Table 1; Section 3). However, the review lacks critical depth, often summarizing methods without a comparative analysis of their underlying principles or trade-offs (see Section 2). The quantitative performance comparison is presented as a compilation of reported scores, which limits its utility for direct comparison due to varying experimental setups (see Table 2). Finally, the discussion on future directions is somewhat generic and could be more insightful (see Section 5).

Strengths

  • Systematic and Clear Taxonomy of Methods

    • The paper organizes recent works into a coherent taxonomy, which is a valuable contribution for readers navigating this complex field. This classification provides a clear mental model for understanding the evolution of techniques.
    • The taxonomy is visually presented in a helpful flowchart (Figure 2), which effectively summarizes the main categories discussed in the paper.
    • The categories are logical and cover major contemporary research trends, including multi-stream fusion (Section 2.1, Figure 6), Transformers (Section 2.5, Figure 11), and contrastive learning (Section 2.4, Figure 10). This structure makes the survey easy to follow.
  • Broad Coverage of Recent Literature and Concepts

    • The review covers a wide range of state-of-the-art methods published within the last five years, demonstrating a thorough literature search.
    • It successfully captures emerging trends, dedicating sections to recent architectural shifts like the adoption of Vision Transformers (Section 2.5) and novel learning paradigms like contrastive learning (Section 2.4).
    • The paper introduces foundational concepts clearly, such as the different types of image tampering (splicing, copy-move, removal) illustrated with clear examples (Figure 1), making it accessible to newcomers.
  • Helpful Summary of Benchmarks and Evaluation Standards

    • The paper provides a concise yet comprehensive summary of the most commonly used public datasets for image tamper detection research (Section 3.1).
    • Key details for each dataset, such as the number of images and typical tampering types, are neatly organized in a table (Table 1), serving as a quick reference for researchers.
    • It clearly defines standard evaluation metrics like Precision, Recall, F1-score, and MCC (Section 3.2), which is essential for understanding and comparing performance results reported in the literature (the standard forms of F1 and MCC are restated after this list for reference).
  • Effective Use of Visualizations

    • The manuscript includes numerous diagrams that effectively illustrate complex concepts and model architectures, enhancing reader comprehension.
    • For example, Figure 6 provides a clear high-level view of a multi-domain fusion network, while Figure 9 effectively breaks down the pipeline of an edge-guided detection method.
    • The qualitative comparison in Figure 12, showing visual results from several models (MVSS-Net, CAT-Net, etc.) on different datasets, offers a practical demonstration of the methods' performance on various forgery types.
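
For reference, the standard definitions of the two less self-explanatory metrics named above are given below; these are textbook forms, not quotations from the manuscript. P and R denote precision and recall, and TP/TN/FP/FN the confusion-matrix counts:

    F_1 = \frac{2PR}{P + R}

    \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                        {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}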

Weaknesses

  • Lack of Critical Analysis and Comparative Insight

    • The paper's review of different methods (Section 2) is largely descriptive. It summarizes what each method does but rarely delves into a critical analysis of why certain design choices were made, their specific advantages over alternatives, or their inherent limitations.
    • For example, in the discussion of multi-stream methods (Section 2.1), different fusion strategies are mentioned, but there is no comparative discussion on the trade-offs between early vs. late fusion or the computational costs involved.
    • The survey would be more impactful if it went beyond summarizing individual papers to synthesizing common principles, conflicting findings, or open questions within each category.
  • Superficial and Potentially Misleading Quantitative Comparison

    • The performance comparison presented in Table 2 aggregates scores reported in original papers. This is problematic because these results are often not directly comparable due to differences in training data, hyperparameters, data augmentation, and post-processing settings, a critical caveat that is not sufficiently emphasized.
    • The table contains several empty cells ("-"), indicating an incomplete data collection effort. It also lacks columns specifying crucial experimental conditions (e.g., JPEG compression level, resizing factors) that significantly impact performance.
    • Without a controlled, re-implemented benchmark or a more nuanced discussion of the evaluation settings, the table risks being misinterpreted as a direct, fair comparison of model capabilities.
  • Absence of Key Mathematical Details

    • As a survey, the paper appropriately avoids overly dense mathematics, but it sometimes omits foundational equations that are central to understanding the described techniques.
    • For instance, when discussing high-pass filters like SRM (Section 2.1), it provides a visual of the kernels (Figure 7) but does not include the convolution equations that define their application.
    • Similarly, the section on Transformer-based methods (Section 2.5) illustrates the attention mechanism with a diagram (Figure 11) but omits the well-known scaled dot-product attention formula, which is fundamental to its operation. This lack of formalization makes the descriptions less precise for a technical audience.
  • Generic Discussion of Future Directions

    • The "Conclusion and Outlook" section (Section 5) offers future directions that are quite high-level and common to many computer vision tasks, such as improving model robustness, enhancing generalization, and creating large-scale datasets.
    • The discussion lacks specific, actionable research questions derived from the survey's findings. For example, instead of just "improving robustness," it could have proposed exploring specific adversarial attacks prevalent in forensics or investigating few-shot learning for detecting novel manipulation techniques.
    • The outlook would be stronger if it provided a more opinionated and forward-looking perspective on which of the surveyed approaches seem most promising and why.

Suggestions for Improvement

  • Enhance Critical Analysis with Deeper Comparisons

    • Within each subsection of Section 2, add a paragraph that compares and contrasts the presented methods. Discuss their relative strengths, weaknesses, and the specific problems they aim to solve.
    • For multi-stream methods (Section 2.1), for example, explicitly compare the different domains (RGB, noise, frequency) and discuss when one might be more beneficial than others, citing evidence from the papers.
    • Incorporate a discussion of computational complexity and efficiency trade-offs, as this is a practical consideration for real-world deployment.
  • Reframe the Quantitative Comparison for Clarity and Accuracy

    • Restructure Table 2 to group methods by dataset and, where possible, by specific post-processing challenges (e.g., a sub-table for performance on JPEG-compressed images).
    • Add columns to Table 2 to include essential context, such as the backbone architecture, training dataset, and any non-standard evaluation protocols used in the original paper. If this information is unavailable, explicitly state it.
    • Add a clear and prominent disclaimer before the table explaining that the results are compiled from different sources and should not be treated as a direct head-to-head comparison.
  • Incorporate Essential Mathematical Formulations

    • To improve technical rigor, add the standard mathematical equations for core concepts that are central to the methods being reviewed.
    • For example, in Section 2.1, briefly state the 2D convolution operation when discussing SRM filters (a short illustrative sketch follows this list).
    • In Section 2.5, include the formula for scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, as it is fundamental to the Transformer-based models discussed.
  • Provide More Specific and Insightful Future Directions

    • Revise Section 5 to offer more concrete and forward-looking research questions grounded in the survey's content.
    • For example, suggest investigating the intersection of forgery detection and generative models (e.g., using diffusion models for realistic data augmentation or as a detection mechanism). Propose exploring self-supervised or weakly supervised methods to reduce the reliance on pixel-perfect ground truth masks.
    • Offer a perspective on which research avenues appear most promising. For instance, argue whether Transformer-based, end-to-end models are likely to supersede multi-stream, feature-engineering approaches and discuss the remaining challenges.
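
As a concrete illustration of the SRM suggestion above, here is a minimal NumPy/SciPy sketch of noise-residual extraction via 2D convolution. The 5×5 kernel is one widely cited filter from the SRM family; the kernel choice and the helper name srm_residual are assumptions for illustration, not definitions taken from the manuscript:

    import numpy as np
    from scipy.signal import convolve2d

    # One widely used fixed high-pass kernel from the SRM family
    # (the 5x5 second-order "KV" filter, scaled by 1/12).
    SRM_KV = np.array([
        [-1,  2,  -2,  2, -1],
        [ 2, -6,   8, -6,  2],
        [-2,  8, -12,  8, -2],
        [ 2, -6,   8, -6,  2],
        [-1,  2,  -2,  2, -1],
    ], dtype=np.float32) / 12.0

    def srm_residual(gray: np.ndarray) -> np.ndarray:
        """Noise residual: 2D convolution of a grayscale image with the fixed kernel."""
        return convolve2d(gray, SRM_KV, mode="same", boundary="symm")

    # Usage: residual = srm_residual(image.astype(np.float32))

In a survey, stating this single operation and kernel alongside Figure 7 would make the multi-stream discussion precise at negligible cost in space.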

References

None

gpt-5
10/8/2025, 8:18:10 PM submitted to queue · 10/8/2025, 8:19:28 PM completed
Completed

Chinese Review

Synopsis of the paper
This paper systematically reviews recent progress in deep learning-based image tampering detection. It first introduces the definition of the image-tampering problem and the research background (Sec. 1), then classifies existing methods by network structure and design philosophy, e.g., multi-modal feature fusion, multi-scale feature extraction, edge-guided detection, contrastive learning, and Transformer-based models (Figure 5; Sec. 3). It then summarizes the commonly used datasets and evaluation metrics and, through experimental comparisons, shows how the performance of different methods varies across multiple datasets (Figure 12; Sec. 4). Finally, the paper outlines the main challenges and proposes future research directions (Sec. 5).


Summary of Review
Overall, the manuscript gives a fairly comprehensive account of the state of research in deep learning-based image tampering detection (see Figure 5; Sec. 3), with a clear taxonomy, detailed figures, and a degree of systematic rigor. However, some sections lack sufficient analytical depth; for example, the dimensions along which the method categories are compared and their innovative mechanisms are not fully developed (see Sec. 3.2; Table—No direct evidence found in the manuscript). In addition, mathematical formulas are scarce and the descriptions abstract, and definitions and notation lack consistency. The manuscript is complete in content but needs improvement in structural rigor and depth of analysis.


Strengths

  • Systematic survey framework

    • Evidence: Figure 5 presents a systematic classification hierarchy; significance: demonstrates the clear logic and coverage of the survey's structure.
    • Evidence: Sec. 3 carefully divides methods into categories such as multi-stream fusion, contrastive learning, and Transformers; significance: helps readers follow the field's line of development.
    • Evidence: the Abstract outlines the survey's goals and framework; significance: helps delimit the overall research scope.
  • Rich visual examples

    • Evidence: Figures 1 and 12 intuitively show the different tampering types and multi-model comparison results; significance: strengthens readers' intuition and comparative understanding.
    • Evidence: Figures 6, 8, and 9 explain the feature-fusion and detection pipelines in detail; significance: raises the survey's pedagogical and reference value.
    • Evidence: Figure 11 uses an attention-mechanism diagram to explain a key structure; significance: aids feature comparison across models.
  • Coverage of multiple mainstream methods and datasets

    • Evidence: Secs. 3–4 compile representative methods such as MVSS-Net, CAT-Net, and OSN along with four public datasets; significance: shows strong coverage and ability to integrate information.
    • Evidence: Figure 12 compares results across multiple datasets; significance: connects the survey's breadth to actual performance.
    • Evidence: Sec. 4 notes how methods differ between splicing and removal scenarios; significance: underscores the value of fine-grained, task-level analysis.
  • Outlook on future directions

    • Evidence: Sec. 5 mentions the impact of lossy compression, robustness across real-world scenarios, and interpretability; significance: points to research frontiers.
    • Evidence: No direct evidence found in the manuscript regarding citations of specific works; significance: somewhat generic, but the directions are reasonable.
    • Evidence: the Abstract also mentions "future development directions"; significance: indicates the paper's intent to conclude with a synthesis.

Weaknesses

  • Insufficient analytical depth

    • Evidence: Sec. 3 merely lists the structure of most methods without discussing what drives their performance; significance: weakens the value of the comparison.
    • Evidence: the tabular comparisons lack any statement of statistical significance; significance: reduces argumentative rigor.
    • Evidence: No direct evidence found explaining failure cases or model limitations; significance: leaves the survey short on insight.
  • Missing derivations and inconsistent notation

    • Evidence: almost no explicit definitions or derivations appear in the text (Eq.—No direct evidence found in the manuscript); significance: detracts from scholarly precision.
    • Evidence: some figures (Figure 11) involve matrix multiplication and Softmax operations but come with no explicit formulas; significance: reduces technical reproducibility.
    • Evidence: a notation section is missing, and Sec. 3.3 describes the attention mechanism without a unified symbol system; significance: hurts reading consistency.
  • Experimental comparison insufficiently quantitative

    • Evidence: Figure 12 shows visual results but no quantitative metrics (e.g., F1, IoU); significance: prevents objective assessment of the methods' relative merits.
    • Evidence: Sec. 4 does not detail the experimental settings or evaluation standards; significance: weakens reproducibility.
    • Evidence: No direct evidence found on statistical variance or confidence intervals; significance: limits the reliability of the results.
  • Limited literature coverage and citation analysis

    • Evidence: Sec. 2's citations concentrate on the past five years and lack a bridge to earlier traditional methods; significance: the survey lacks longitudinal continuity.
    • Evidence: the future-directions section does not point to specific existing studies (Sec. 5); significance: reduces reference depth.
    • Evidence: No direct evidence found linking each method category to a list of representative references; significance: affects the survey's completeness.

Suggestions for Improvement

  • Deepen method comparison and performance analysis

    • Add qualitative and quantitative analyses of how key models differ across scenarios (extending Sec. 3.2).
    • Add discussion of failure cases or boundary conditions to deepen insight.
    • Summarize the strengths and weaknesses of the different architectures in a table, with notes on statistical significance.
  • Supply formulas and symbol definitions

    • Give complete mathematical descriptions and a symbol table for the attention and feature-fusion modules (Figures 11 and 6).
    • Unify symbol naming and equation numbering to ease cross-method comparison.
    • If there is no room for derivations, present the computation logic as pseudocode to improve rigor (an illustrative sketch follows this list).
  • Strengthen the quantitative experimental comparison

    • Supplement Figure 12 with quantitative metrics (e.g., Precision, F1-score) and describe the evaluation protocol.
    • In Sec. 4, describe the data preprocessing and randomness controls needed to ensure reproducibility.
    • Report result variances or confidence intervals to increase credibility.
  • Expand and organize the reference analysis

    • In Sec. 2, add an overview of traditional image-forensics methods to provide longitudinal context.
    • In Sec. 5, cite representative works related to the proposed future directions.
    • Add a "representative studies" table summarizing the key references of each method category and their contributions.
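
As a concrete example of the pseudocode suggested above, the following is a minimal single-head scaled dot-product attention in plain NumPy. It is an illustrative sketch of the standard computation behind Figure 11, not code from the manuscript:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, single head."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity scores
        scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                              # weighted sum of the values

    # Toy usage: 4 tokens with d_k = d_v = 8 (shapes are illustrative only)
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    out = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)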

References
None

English Review

Synopsis of the paper
The manuscript presents a comprehensive review of deep learning–based methods for digital image tampering detection. It begins with a background discussion of tampering types (splicing, copy-move, deletion; see Image 1) and the motivations for authenticity verification. Subsequently, it categorizes recent (past five years) convolutional, attention-, and transformer-based methods according to architectural principles such as multi-stream fusion, multi-scale feature extraction, edge enhancement, contrastive learning, and robustness under post-compression (Image 5). It further surveys public datasets (CASIA, NIST16, Coverage, Columbia) and standard metrics, compares representative methods (MVSS-Net, PCSS-Net, CAT-Net, OSN, IML-VIT; Image 12), and concludes with current challenges and future outlooks regarding generalization and interpretability.


Summary of Review
This paper offers a well-structured taxonomy of deep learning approaches for tampering detection and provides clear schematic depictions (Image 5–11) that aid reader understanding. The review covers major architectures and datasets in a coherent chronological framework (Section 3). However, the manuscript lacks quantitative summaries of performance (no comparative Table of metrics found) and provides limited critical evaluation of design trade-offs. The discussion on limitations and open challenges (final section) is insightful but short and mostly descriptive. Several equations and mathematical formulations are mentioned only illustratively (e.g., attention mechanism in Image 11) without formal derivations, reducing technical rigor.


Strengths

  • Comprehensive architectural categorization
    • The taxonomy in Image 5 and accompanying text in Section 3 organize deep models into meaningful families—multi-stream, multi-scale, edge-aware, transformer-based—which clarifies the field’s evolution.
    • This classification connects each architectural motif to its tampering-detection rationale (e.g., multi-domain feature fusion in Image 6 improves robustness across manipulation types).
    • The structured overview enhances pedagogical and survey value, addressing a diverse set of prior approaches within a unified framework.

  • Clear visual illustrations and process schematics
    • Multiple pipeline diagrams (Images 6–11) visually formalize the detection process, including attention modules (Image 11) and edge-enhancement mechanisms (Image 9).
    • These visuals facilitate comprehension of data flow and feature interaction, which strengthens clarity.
    • In contrast with text-only reviews, the inclusion of modality-specific diagrams positions this paper as a helpful educational reference.

  • Coverage of benchmark datasets and visual comparison of results
    • Dataset descriptions (CASIA v1, NIST16, Coverage, Columbia) alongside Image 12 position methods within standard evaluation contexts.
    • The visual comparison in Image 12 demonstrates cross-method capability differences across tampering types, giving practical insight into generalization performance.
    • This consolidation of results, even visually, is valuable for newcomers assessing dataset–method relevance.

  • Discussion of practical challenges and real-world constraints
    • The concluding section acknowledges degradation by compression and other post-processing factors (see the branch labeled "有损后处理背景" (lossy post-processing setting) in Image 5).
    • Highlighting these conditions connects academic research to operational deployment settings.
    • The inclusion of this dimension reflects awareness of robustness and reliability considerations.

  • Readable organization and neutral tone
    • Section-level structure (background → methods → datasets → comparisons → challenges) mirrors established survey standards.
    • Terminology is consistent and accessible despite technical breadth.
    • The review achieves breadth without overwhelming jargon, aiding accessibility to non-specialists.


Weaknesses

  • Lack of quantitative synthesis of comparative performance
    • No tables summarizing metrics (e.g., accuracy, IoU, F1) across models were found—see Image 12 (visual comparison only); quantitative aggregation would substantiate claims.
    • Without explicit numeric evidence, performance superiority or trade-offs among architectures remain unsupported.
    • This limits the paper’s analytical depth and reproducibility for meta-analysis.

  • Limited methodological analysis of core algorithms
    • Descriptions of network architectures (e.g., attention mechanism in Image 11, multi-fusion scheme in Image 6) remain conceptual without parameterization or complexity analysis.
    • No equations, pseudocode, or computational detail (e.g., Eq. defining loss components) are given—“No direct evidence found in the manuscript.”
    • As a result, technical insight into why specific modules enhance detection is reduced.

  • Weak discussion of generalization and failure modes
    • Although robustness under lossy settings is mentioned (Image 5), there is no explicit section quantifying failure cases or cross-dataset testing.
    • No indication of how methods handle unseen manipulations or differing compression rates.
    • This omission obscures real-world performance considerations.

  • Inconsistent depth in literature coverage
    • The stated scope, "近五年" (the past five years), is covered, but citation density varies; some architectures (e.g., contrastive learning–based ones from 2021–2023) are briefly noted without comparative discussion.
    • No empirical grouping by methodological lineage (CNN → attention → transformer → diffusion) is analytically explored beyond the taxonomy diagram.
    • Consequently, readers cannot discern which trendlines dominate quantitatively.

  • Equation and notation clarity
    • The attention formulation (Image 11) is explained qualitatively (Q, K, V, Softmax) but lacks explicit mathematical expressions for the scoring and normalization functions.
    • No loss-function equations or symbol tables were found; see Image 9 where loss components are described verbally only.
    • Absence of formal definitions hinders clarity for technically trained readers.


Suggestions for Improvement

  • Provide quantitative performance tables and analyses
    • Include summary tables comparing recent methods on shared datasets (CASIA v1, NIST16, Coverage, Columbia) with standard metrics (accuracy, F1, IoU).
    • Add numerical anchors to complement Image 12’s qualitative visuals, enabling statistical interpretation.
    • Discuss metric variability or standard deviations to convey robustness.

  • Deepen methodological exposition
    • Augment textual descriptions of major architectures (e.g., multi-domain fusion, attention-based, transformer-based) with simplified equations, module structures, or computational cost comparisons.
    • Incorporate representative pseudocode for a canonical framework (e.g., MVSS-Net) to improve reproducibility (an illustrative two-stream sketch follows this list).
    • Explicitly discuss how different network designs handle boundary localization versus global context.

  • Expand generalization and robustness analysis
    • Introduce a subsection summarizing cross-dataset testing or domain-shift benchmarks, if available.
    • Discuss resilience to compression artifacts quantitatively; provide figure or case-study references where performance degrades.
    • Highlight open research directions for handling novel manipulations and adversarial perturbations.

  • Harmonize literature depth and selection criteria
    • Clarify inclusion/exclusion criteria for surveyed works (years 2019–2024, citation threshold, dataset relevance) to convey methodological balance.
    • Ensure parity in discussion length across architecture families to prevent perceived bias.
    • Add a compact table mapping each cited paper to its network type, dataset, and key contribution.

  • Strengthen mathematical formalism and notation
    • Formally define key variables in attention (Q, K, V) and fusion losses mentioned in Image 9–11 using explicit equations.
    • Provide consistent notation for predicted masks, losses, and evaluation metrics.
    • Add a notation table or appendix clarifying abbreviations and symbols to improve technical readability.
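
To make the pseudocode suggestion above concrete, here is a deliberately minimal two-stream fusion sketch in PyTorch. It illustrates the generic RGB-plus-noise-residual pattern discussed in this review, not MVSS-Net's actual architecture; the toy conv stacks, the concatenation-based late fusion, and the 1×1 prediction head are placeholder assumptions:

    import torch
    import torch.nn as nn

    class TwoStreamFusionSketch(nn.Module):
        """Illustrative RGB + noise-residual fusion; NOT the real MVSS-Net."""
        def __init__(self):
            super().__init__()
            # Toy encoders standing in for real backbones (e.g., ResNet stages).
            self.rgb_stream = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
            self.noise_stream = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
            # Late fusion: concatenate channel-wise, then a 1x1 conv prediction head.
            self.fuse = nn.Conv2d(32, 1, kernel_size=1)

        def forward(self, rgb, noise_residual):
            feats = torch.cat(
                [self.rgb_stream(rgb), self.noise_stream(noise_residual)], dim=1)
            return self.fuse(feats)  # (B, 1, H, W) per-pixel tampering logits

    # Usage: logits = TwoStreamFusionSketch()(rgb_batch, residual_batch),
    # where both inputs have shape (B, 3, H, W).

Even a sketch at this level of abstraction, placed next to the architectural description, would let readers see where boundary supervision or attention modules attach.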


References
None