
PatchWiper: Leveraging Dynamic Patch-Wise Parameters for Real-World Visible Watermark Removal

Published: 10/25/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The PatchWiper framework combines an independent watermark segmentation network with a dynamic patch-wise restoration network for effective visible watermark removal. It generates unique parameters for each image patch and leverages a diverse dataset for comprehensive evaluation, showing superior performance over existing methods, particularly on complex real-world images.

Abstract

Visible watermark removal is crucial for evaluating watermark robustness and advancing more resilient protection techniques. Current methods face challenges in real-world scenarios due to architectural constraints in multi-task frameworks and limited dataset diversity. To address these challenges, we first propose a novel two-stage framework, PatchWiper, consisting of an independent watermark segmentation network and a highly dynamic patch-wise restoration network. This framework decouples watermark localization from background restoration, allowing each network to focus on its designated task. Our restoration network dynamically generates unique parameters for each image patch, enabling fine-grained adaptation to different watermark distortions. Second, we construct the Pixabay Real-world Watermark Dataset (PRWD), which incorporates diverse background images and over 1,000 distinct watermark types, providing a more comprehensive benchmark for evaluating watermark removal methods. Extensive experiments on PRWD, ILAW, and real-world testing images demonstrate our method’s superior performance over existing approaches, particularly in handling complex real-world cases.


In-depth Reading


1. Bibliographic Information

1.1. Title

PatchWiper: Leveraging Dynamic Patch-Wise Parameters for Real-World Visible Watermark Removal

The title clearly states the paper's core contributions: a method named PatchWiper that uses dynamic parameters generated on a patch-by-patch basis to tackle the problem of visible watermark removal, with a specific focus on performing well in real-world scenarios.

1.2. Authors

  • Zihao Mo: Sun Yat-Sen University

  • Junye Chen: Sun Yat-Sen University

  • Chaowei Fang: Xidian University (Corresponding author)

  • Guanbin Li: Sun Yat-Sen University (Corresponding author)

    The authors are affiliated with well-regarded research institutions in China, specifically Sun Yat-Sen University and Xidian University. The research group has a history of publications in top-tier computer vision and multimedia conferences, focusing on image processing, video analysis, and deep learning.

1.3. Journal/Conference

ACM International Conference on Multimedia (MM '25)

This paper is published in the proceedings of ACM MM, which is a premier international conference in the field of multimedia. It is highly competitive and respected, covering topics that integrate multiple media types, such as images, text, video, and audio. Publication at ACM MM signifies that the work is considered to be of high quality and significant impact by peers in the field.

1.4. Publication Year

2025 (scheduled for presentation between October 27-31, 2025).

1.5. Abstract

The abstract introduces visible watermark removal as a critical task for assessing the robustness of watermarking techniques. It identifies two primary challenges with existing methods: 1) architectural limitations in multi-task frameworks that handle both watermark localization and background restoration, and 2) the lack of diverse datasets that reflect real-world complexity. To address these issues, the paper proposes two main contributions. First, a new two-stage framework called PatchWiper, which separates watermark localization and background restoration into two independent networks. The restoration network is highly dynamic, generating unique parameters for each image patch to achieve fine-grained restoration. Second, the authors introduce a new large-scale dataset, the Pixabay Real-world Watermark Dataset (PRWD), which contains over 1,000 different watermark types and a wide variety of background images. The abstract concludes that extensive experiments show PatchWiper's superiority over existing methods, especially on complex real-world images.

The analysis links to a copy of the paper at /files/papers/6919e336110b75dcc59ae30f/paper.pdf; this is a relative link hosted on this platform, while the paper itself appears in the ACM MM '25 proceedings.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper addresses is the removal of visible watermarks from images. Visible watermarks are semi-transparent logos, text, or patterns overlaid on an image to protect copyright. While essential for intellectual property protection, their removal is studied to test the vulnerability of watermarking schemes and develop more robust ones.

The motivation for this work stems from significant gaps in prior research:

  1. Task Interference in Multi-task Frameworks: Many existing methods use a single network with a shared encoder to simultaneously locate the watermark (a segmentation task) and restore the background (a generation/inpainting task). The authors argue that these conflicting objectives cause mutual interference, where the features learned for localization are compromised by the need to learn restoration features, and vice-versa. This leads to imprecise watermark masks and, consequently, poor restoration quality.

  2. Lack of Adaptability in Restoration: Watermarks can have varying shapes, colors, opacities, and can interact with diverse background textures. This means the distortion is not uniform across the image. Previous methods often use static filters (CNNs) or simplistic dynamic mechanisms that apply the same restoration logic across large regions. These approaches lack the fine-grained adaptability needed to handle the complex, patch-specific variations in distortion, often resulting in blurriness or artifacts.

  3. Limited and Homogeneous Datasets: Existing datasets like CLWD and ILAW are limited in scale and diversity. They often use backgrounds from a single source (e.g., real-world photos) and a limited number of simple watermark patterns. Models trained on these datasets fail to generalize to the wide variety of watermarks and image types seen in the real world (e.g., AI-generated images, graphics with full-screen watermarks).

    The paper's innovative entry point is to tackle these challenges directly with a decoupled, highly dynamic, and data-centric approach.

2.2. Main Contributions / Findings

The paper makes three primary contributions:

  1. A Novel Two-Stage Framework (PatchWiper): They propose a framework that decouples the problem into two distinct stages:

    • Independent Watermark Localization: A dedicated segmentation network is used solely for predicting the watermark mask, avoiding interference from the restoration task.
    • Dynamic Patch-Wise Restoration: An innovative restoration network that generates unique neural network parameters for each individual image patch. This allows the model to adapt its restoration strategy with extreme granularity, effectively handling the varying levels of distortion across the watermarked region.
  2. A New Large-Scale Benchmark Dataset (PRWD): To address the data gap, they constructed the Pixabay Real-world Watermark Dataset (PRWD). This dataset is significantly larger (nearly 4x existing ones) and more diverse, featuring:

    • A wide range of background images, including natural photos, human-designed graphics, and AI-generated content.
    • Over 1,000 distinct watermark types, categorized as single logos, long-strip text, and full-screen patterns. This dataset serves as a more realistic and challenging benchmark for training and evaluating watermark removal models.
  3. State-of-the-Art Performance: Through extensive experiments, the paper demonstrates that PatchWiper significantly outperforms existing methods on both standard benchmarks (ILAW) and their new PRWD dataset. The model shows superior generalization capabilities when tested on real-world watermarked images sourced directly from the internet.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Visible Watermark Removal: This is a specialized task within the field of image restoration. The goal is to take an image with a visible, semi-transparent overlay (the watermark) and restore it to its original, clean state. It is a "blind" problem because the original clean image, the watermark pattern, and its opacity are unknown at test time. The task involves two sub-problems: localizing the watermark and inpainting the background content underneath it.

  • Multi-task Learning: A machine learning paradigm where a single model is trained to perform multiple related tasks simultaneously. The idea is that the model can leverage shared knowledge across tasks to improve performance. In the context of watermark removal, the two tasks are typically segmentation (finding the watermark mask) and image restoration (inpainting the background). While efficient, it can suffer from negative transfer or task interference, where optimizing for one task degrades performance on another. This paper's decoupled design is a direct response to this problem.

  • Dynamic Neural Networks (or Hypernetworks): A standard neural network learns a fixed set of weights (parameters) during training. In contrast, a dynamic neural network's parameters are generated on-the-fly, conditioned on the input. This is often achieved using a secondary network, called a hypernetwork, which takes the input data and outputs the weights for the main network. This allows the model to adapt its behavior for each specific input, making it highly flexible. PatchWiper uses this principle to generate unique parameters for each image patch (see the toy sketch after this list).

  • U-Net Architecture: An encoder-decoder architecture widely used for biomedical image segmentation, but now common in many image-to-image translation tasks. It consists of a "contracting path" (encoder) that captures context and a "symmetric expanding path" (decoder) that enables precise localization. A key feature is the use of skip connections, which connect layers from the encoder to corresponding layers in the decoder, helping the network recover fine-grained details that might be lost during downsampling. PatchWiper uses a U-Net-like structure for its watermark localization network.

  • Transformers in Vision: Originally developed for natural language processing, Transformers have been successfully adapted for computer vision tasks. Their core mechanism, self-attention, allows the model to weigh the importance of all other pixels/patches when processing a given pixel/patch. This enables the model to capture long-range dependencies and global context, which is very useful for image restoration tasks where information from distant, uncorrupted regions can help restore a damaged area. The paper uses Restormer blocks, a transformer variant optimized for image restoration.
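To make the hypernetwork principle concrete, here is a minimal toy sketch (not from the paper; the class name `HyperLinear` and all dimensions are illustrative): a small hypernetwork emits the weights and bias of a linear layer from a conditioning vector, so every sample is processed by its own parameters; PatchWiper applies the same idea at the level of individual image patches.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """Toy dynamic layer: a hypernetwork generates per-sample weights for a linear layer."""
    def __init__(self, in_dim=16, out_dim=16, cond_dim=32):
        super().__init__()
        # hypernetwork: conditioning vector -> (weight, bias) of the main layer
        self.hyper = nn.Linear(cond_dim, in_dim * out_dim + out_dim)
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, x, cond):
        params = self.hyper(cond)                              # (B, in*out + out)
        w = params[:, : self.in_dim * self.out_dim]
        b = params[:, self.in_dim * self.out_dim :]
        w = w.view(-1, self.out_dim, self.in_dim)              # one weight matrix per sample
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1) + b   # per-sample dynamic linear map

layer = HyperLinear()
x = torch.randn(4, 16)       # main input
cond = torch.randn(4, 32)    # conditioning signal (e.g., image features)
y = layer(x, cond)           # each sample is transformed by its own generated weights
```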

3.2. Previous Works

  • Early GAN-based Methods: Initial deep learning approaches treated watermark removal as a general image-to-image translation problem, using Generative Adversarial Networks (GANs). However, without explicit guidance on where the watermark is, these methods often produced blurry results or failed to remove the watermark completely.

  • Multi-task Frameworks: Later works recognized the importance of explicitly localizing the watermark.

    • WDNet [20]: A pioneering work that proposed a multi-task network to decompose an image into a watermark layer and a watermark-free background layer.
    • SplitNet [5]: Introduced a two-stage architecture to refine the removal process but still relied on a shared feature space, making it susceptible to task interference.
    • SLBR [17]: Proposed a Self-calibrated Mask Refinement (SMR) module to improve localization accuracy and used a multi-stage refinement process for better restoration. PatchWiper adopts the SMR module in its own dedicated localization network, demonstrating its effectiveness.
  • Dynamic Parameter Methods: More recent methods have explored dynamic networks to adapt to different watermarks.

    • DKSP [34]: This method generates dynamic convolutional kernel parameters based on the input image, allowing for some level of adaptation.
    • Li et al. [22]: This work improved upon DKSP by using a part-aware module for more region-specific processing.
    • Limitation: These methods generate a limited number of adaptive parameters (e.g., a few sets of kernels) that are applied to entire regions or the whole image. They lack the fine-grained, patch-level adaptability that PatchWiper introduces.
  • Image Inpainting Methods:

    • CoordFill [19]: An image inpainting method that operates on a patch-wise basis. It uses a coordinate-based query system where each patch's coordinates are used to query a learned feature space, and an MLP predicts the pixel values for that patch. PatchWiper's restoration network is heavily inspired by this patch-wise query paradigm but advances it by generating fully dynamic, patch-specific representations (parameters) instead of querying a static feature space.

3.3. Technological Evolution

The field has evolved from generic image-to-image translation to more specialized and adaptive solutions:

  1. GAN-based Translation: Initial, non-targeted approaches.
  2. Explicit Localization: Recognition that finding the watermark is a crucial first step, leading to multi-task learning frameworks (WDNet, SLBR).
  3. Dynamic Adaptation: Introduction of dynamic networks to handle watermark variety (DKSP).
  4. Decoupling and Fine-Grained Dynamics (PatchWiper): The current paper represents the next step, proposing to completely separate the localization and restoration tasks to eliminate interference and introducing an extremely adaptive restoration mechanism that generates unique parameters for every single patch.

3.4. Differentiation Analysis

Compared to previous methods, PatchWiper introduces two key innovations:

  • Complete Decoupling vs. Shared Encoders: While methods like SLBR refined localization, they still operated within a multi-task framework with shared features. PatchWiper uses two entirely separate networks, ensuring that the localization network is exclusively optimized for segmentation precision without being compromised by restoration objectives.
  • Patch-Wise Dynamic Parameters vs. Limited Dynamic Kernels: While DKSP and others introduced dynamic parameters, they were coarse-grained. PatchWiper's RGN-PQN architecture generates a unique set of parameters for every $4 \times 4$ patch in the image. This provides an unprecedented level of adaptability, allowing the model to tailor its restoration process to the specific content and distortion level of each tiny local region.

4. Methodology

4.1. Principles

The core idea behind PatchWiper is to divide and conquer. The complex problem of blind watermark removal is broken down into two more manageable, specialized sub-problems: precise localization and adaptive restoration.

  1. For localization: The principle is that a network dedicated solely to segmentation will outperform a multi-task network by avoiding conflicting learning signals. By using multi-scale supervision, it further refines its ability to capture watermark boundaries accurately.

  2. For restoration: The principle is that watermark distortion is highly localized and variable. A "one-size-fits-all" restoration filter is suboptimal. Therefore, the model should generate a custom restoration filter (represented by MLP parameters) for every small patch, using both global image context and local patch information.

    The overall architecture is depicted in the paper's Figure 2.

    Figure 2: Schematic of the PatchWiper framework, comprising the watermark localization network (WLN) and the restoration network. The figure illustrates the positional encoding, the reshaping step, and the MLPs whose parameters are generated dynamically, enabling fine-grained adaptation to different watermark distortions.

4.2. Core Methodology In-depth (Layer by Layer)

Given a watermarked image $I^w \in \mathbb{R}^{H \times W \times 3}$, the goal is to produce a watermark-free image $\hat{I} \in \mathbb{R}^{H \times W \times 3}$. This is achieved in two stages.

4.2.1. Stage 1: Watermark Localization

This stage uses a dedicated segmentation network to predict a binary watermark mask $\hat{M} \in \mathbb{R}^{H \times W \times 1}$.

  • Architecture: The network has a U-Net-like encoder-decoder structure. The decoder is enhanced with Self-Calibrated Mask Refinement (SMR) modules from SLBR [17] in its final three layers. The SMR modules use attention mechanisms to refine the mask prediction by better capturing global context and boundary details.

  • Multi-Scale Supervision and Loss Function: To improve accuracy, the training process supervises not only the final output but also the intermediate outputs from the SMR layers. Each SMR layer produces a primary mask $M^p$ and a self-calibrated mask $M^{sc}$. The total loss for the localization network, $\mathcal{L}_{\mathrm{mask}}$, is a combination of three components (a code sketch follows this list):

    1. Primary Mask Loss ($\mathcal{L}_{\mathrm{p}}$): This is a binary cross-entropy (BCE) loss applied to the primary masks from the three SMR layers. A decay factor $\gamma$ is used to give more weight to the deeper, more refined layers.
$$\mathcal{L}_{\mathrm{p}} = - \sum_{i=1}^{3} \gamma^{i-1} \sum_{j=1}^{N} \left[ M_j \log\big(M_{i,j}^{p}\big) + (1 - M_j) \log\big(1 - M_{i,j}^{p}\big) \right],$$

      • $i$: Index of the SMR layer (from 1 to 3).
      • $N$: Total number of pixels in the image.
      • $M_j$: The ground-truth binary value (0 or 1) for pixel $j$.
      • $M_{i,j}^{p}$: The predicted probability from the primary mask of the $i$-th layer for pixel $j$.
      • $\gamma$: A decay factor (set to 0.5) to control the supervision intensity at different layers.
    2. Self-Calibrated Mask Loss ($\mathcal{L}_{\mathrm{sc}}$): This loss is applied to the refined masks from the SMR layers. It combines BCE loss with an Intersection-over-Union (IoU) loss to improve boundary precision.
$$\mathcal{L}_{\mathrm{sc}} = - \sum_{i=1}^{3} \gamma^{i-1} \sum_{j=1}^{N} \left[ M_j \log\big(M_{i,j}^{sc}\big) + (1 - M_j) \log\big(1 - M_{i,j}^{sc}\big) \right] + \lambda_{\mathrm{iou}} \sum_{i=1}^{3} \gamma^{i-1} \left[ 1 - \frac{\sum_{j=1}^{N} M_j M_{i,j}^{sc}}{\sum_{j=1}^{N} \big(M_j + M_{i,j}^{sc} - M_j M_{i,j}^{sc}\big)} \right],$$

      • $M_{i,j}^{sc}$: The predicted probability from the self-calibrated mask of the $i$-th layer.
      • The second term is the IoU loss, where the fraction computes the IoU between the predicted and ground-truth masks.
      • $\lambda_{\mathrm{iou}}$: A weight (set to 0.25) to balance the BCE and IoU terms.
    3. Final Mask Loss ($\mathcal{L}_{\mathrm{f}}$): A standard BCE loss applied to the final output mask $\hat{M}$ of the network.

      The complete loss for the localization network is the weighted sum of these three losses:
$$\mathcal{L}_{\mathrm{mask}} = \mathcal{L}_{\mathrm{f}} + \mathcal{L}_{\mathrm{sc}} + \lambda_p \mathcal{L}_{\mathrm{p}},$$

    • $\lambda_p$: A balancing weight (set to 0.01).
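The sketch below is one plausible PyTorch rendering of this multi-scale supervision; it is illustrative only (the argument names and the use of mean-reduced BCE rather than an explicit pixel-wise sum are assumptions, not the authors' code).

```python
import torch
import torch.nn.functional as F

def soft_iou_loss(pred, gt, eps=1e-6):
    # 1 - |pred ∩ gt| / |pred ∪ gt|, computed on soft (probability) masks
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred + gt - pred * gt).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()

def mask_loss(primary_masks, sc_masks, final_mask, gt,
              gamma=0.5, lambda_iou=0.25, lambda_p=0.01):
    """primary_masks / sc_masks: lists of three predicted masks (deepest layer first),
    final_mask: the network's final prediction, gt: ground-truth binary mask."""
    l_p, l_sc = 0.0, 0.0
    for i, (mp, msc) in enumerate(zip(primary_masks, sc_masks)):
        w = gamma ** i                                         # layer-wise decay gamma^(i-1)
        l_p += w * F.binary_cross_entropy(mp, gt)
        l_sc += w * (F.binary_cross_entropy(msc, gt) + lambda_iou * soft_iou_loss(msc, gt))
    l_f = F.binary_cross_entropy(final_mask, gt)
    return l_f + l_sc + lambda_p * l_p
```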

4.2.2. Stage 2: Dynamic Background Restoration

This stage takes the watermarked image $I^w$ and the predicted mask $\hat{M}$ as input and restores the background content. It consists of two coordinated networks: the Representation Generation Network (RGN) and the Patch Query Network (PQN).

Representation Generation Network (RGN)

The RGN's job is to generate a unique set of parameters (called a "representation") for every patch in the image.

  • Process:

    1. The watermarked image $I^w$ and the predicted mask $\hat{M}$ are concatenated along the channel dimension.
    2. This combined input is fed into a stack of hierarchical Restormer blocks [32]. Restormer is a Transformer-based model designed for image restoration that excels at capturing global dependencies and contextual features.
    3. The output feature map from the Restormer blocks is reshaped and passed through a Feed-Forward Network (FFN) to produce the final set of representations $\Theta$.
  • Formula (a sketch of this step follows below):
$$\Theta = \operatorname{FFN}\big(\mathcal{R}(f([I^w, \hat{M}]))\big),$$

    • $[I^w, \hat{M}]$: Concatenation of the image and mask.
    • $f(\cdot)$: The stack of Restormer blocks.
    • $\mathcal{R}(\cdot)$: A reshape operation.
    • $\Theta$: The set of generated representations, $\Theta = \{ \theta_i \mid i = 1, 2, \ldots, \tfrac{HW}{16} \}$, where each $\theta_i$ contains the parameters of the MLP that restores the $i$-th patch.
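A rough sketch of this representation-generation step, assuming a backbone whose output has one spatial cell per $4 \times 4$ patch (the paper uses Restormer blocks; `backbone`, `feat_dim`, and `param_dim` here are placeholders, not the authors' configuration):

```python
import torch
import torch.nn as nn

class RGN(nn.Module):
    """Illustrative Representation Generation Network; not the authors' implementation."""
    def __init__(self, backbone, feat_dim=64, param_dim=2048):
        super().__init__()
        self.backbone = backbone          # e.g., a stack of Restormer-style blocks
        # FFN maps each spatial feature vector to the MLP parameters of one patch
        self.ffn = nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(),
                                 nn.Linear(256, param_dim))

    def forward(self, img_w, mask):
        x = torch.cat([img_w, mask], dim=1)                   # (B, 4, H, W): image + mask
        feat = self.backbone(x)                               # assumed (B, C, H/4, W/4)
        B, C, h, w = feat.shape
        feat = feat.permute(0, 2, 3, 1).reshape(B, h * w, C)  # reshape: one vector per patch
        return self.ffn(feat)                                 # Theta: (B, HW/16, param_dim)
```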

Patch Query Network (PQN)

The PQN uses the representations generated by the RGN to perform the actual restoration for each patch.

  • Process:
    1. Input Embedding Creation: For each patch, a rich input embedding is created. This embedding provides the network with information about the patch's location, color, and whether it's part of the watermark.

      • Positional Encoding: Sinusoidal positional encoding is used to make the network aware of each pixel's coordinates. The formula for the position embedding $E_{\hat{P}}$ is:
$$E_{\hat{P}} = \left( \sin\!\left( \tfrac{2\pi (\rho_x \bmod H_{\hat{P}})}{H_{\hat{P}}} \right), \cos\!\left( \tfrac{2\pi (\rho_x \bmod H_{\hat{P}})}{H_{\hat{P}}} \right), \sin\!\left( \tfrac{2\pi (\rho_y \bmod W_{\hat{P}})}{W_{\hat{P}}} \right), \cos\!\left( \tfrac{2\pi (\rho_y \bmod W_{\hat{P}})}{W_{\hat{P}}} \right) \right),$$
        • $(\rho_x, \rho_y)$: The 2D coordinate of a pixel within a patch.
        • $(H_{\hat{P}}, W_{\hat{P}})$: The height and width of the patch ($4 \times 4$ in the final model).
      • This position embedding is concatenated with the original RGB values from $I^w$ and the mask value from $\hat{M}$ for each pixel, forming a comprehensive patch embedding map $E$.
    2. Patch Division: The embedding map $E$ is divided into a grid of non-overlapping $4 \times 4$ patches. Each patch is flattened to form a query vector $q_i$, yielding a set of queries $Q = \{ q_i \}$.

    3. Dynamic Restoration: For the $i$-th patch, its query vector $q_i$ is fed into a Multi-Layer Perceptron (MLP). Crucially, the weights and biases of this MLP are not fixed; they are the representation $\theta_i$ generated by the RGN for this specific patch (a sketch of this patch-wise query follows below).
$$P_i^b = \mathrm{MLP}(q_i, \theta_i).$$

      • $q_i$: The input query vector for the $i$-th patch.
      • $\theta_i$: The dynamically generated MLP parameters from the RGN.
      • $P_i^b$: The restored $4 \times 4$ background patch.
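The following sketch shows how the patch queries and the dynamically generated MLP could fit together; the parameter layout inside each $\theta_i$, the hidden width, and the embedding ordering are all assumptions made for illustration.

```python
import math
import torch

def pqn_restore(img_w, mask, theta, patch=4, hidden=32):
    """Illustrative patch-wise restoration with per-patch MLP parameters.
    theta: (B, N, D) parameter vectors from the RGN; D must cover the slices below."""
    B, _, H, W = img_w.shape
    # per-pixel embedding: sinusoidal position within the patch + RGB + mask value
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    px, py = (xs % patch).float(), (ys % patch).float()
    pos = torch.stack([torch.sin(2 * math.pi * px / patch),
                       torch.cos(2 * math.pi * px / patch),
                       torch.sin(2 * math.pi * py / patch),
                       torch.cos(2 * math.pi * py / patch)], dim=0)        # (4, H, W)
    emb = torch.cat([pos.expand(B, -1, -1, -1), img_w, mask], dim=1)       # (B, 8, H, W)

    # flatten each non-overlapping patch into a query vector q_i
    C = emb.shape[1]
    q = emb.unfold(2, patch, patch).unfold(3, patch, patch)                # (B, C, H/p, W/p, p, p)
    q = q.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)      # (B, N, q_dim)

    # a tiny two-layer MLP per patch, its weights sliced out of theta_i
    q_dim, out_dim = q.shape[-1], 3 * patch * patch
    w1 = theta[..., : q_dim * hidden].reshape(B, -1, hidden, q_dim)
    b1 = theta[..., q_dim * hidden : q_dim * hidden + hidden]
    o = q_dim * hidden + hidden
    w2 = theta[..., o : o + hidden * out_dim].reshape(B, -1, out_dim, hidden)
    b2 = theta[..., o + hidden * out_dim : o + hidden * out_dim + out_dim]

    h = torch.relu(torch.einsum("bnij,bnj->bni", w1, q) + b1)
    patches = torch.einsum("bnij,bnj->bni", w2, h) + b2                    # restored 4x4 patches
    return patches  # to be folded back into the background image I^b
```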

4.2.3. Final Image Assembly

After all patches are restored, they are reassembled to form the complete restored background image $I^b$. The final watermark-free image $\hat{I}$ is then composed by blending the restored background with the original image, using the predicted mask $\hat{M}$ to combine them. This ensures that only the watermarked areas are replaced, preserving the original quality of the un-watermarked regions.

  • Formula:
$$\hat{I} = I^b \odot \hat{M} + I^w \odot (1 - \hat{M}),$$

    • $\odot$: Element-wise multiplication.
    • $I^b \odot \hat{M}$: The restored background, kept only where the watermark exists.
    • $I^w \odot (1 - \hat{M})$: The original image, kept only where there is no watermark.
  • Supervision: The entire restoration network (RGN and PQN) is trained end-to-end by minimizing the $\mathcal{L}_1$ loss between the final output $\hat{I}$ and the ground-truth watermark-free image (see the sketch below).
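In code, this blending and the $\mathcal{L}_1$ supervision amount to a few lines (a hedged sketch; tensor names are illustrative):

```python
import torch.nn.functional as F

def compose_and_supervise(restored_bg, img_w, mask_pred, img_gt):
    # keep restored pixels only inside the predicted watermark region
    img_hat = restored_bg * mask_pred + img_w * (1.0 - mask_pred)
    # end-to-end L1 supervision against the clean ground-truth image
    return img_hat, F.l1_loss(img_hat, img_gt)
```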

4.2.4. The PRWD Benchmark

A key contribution is the new dataset, designed to be more realistic and challenging.

  • Data Collection:
    • Backgrounds: 223,278 images were collected from Pixabay, covering a wide range of categories including real photos, graphics, and AI-generated art.
    • Watermarks: 1,129 distinct watermarks were collected or created, falling into three types: single logos, long-strip texts (like URLs), and full-screen tiled patterns.
  • Synthesis Process: The watermarked images are synthesized using alpha blending (a brief sketch follows below).
$$I^w = \mathcal{P}\big( I^b \odot (1 - \alpha M) + w \odot \alpha M \big),$$
    • $I^w, I^b, w, M, \alpha$: The watermarked image, background, watermark pattern, watermark mask, and opacity level, respectively.

    • $\mathcal{P}(\cdot)$: A post-processing step that applies JPEG compression to simulate real-world artifacts.

      The paper provides examples of the diverse images in PRWD in Figure 3.

      Figure 3: Visualization of watermarked images from the PRWD dataset. Rows show (top to bottom) real-world photographs, human-designed graphics, and AI-generated content backgrounds. Columns show (left to right) single-logo, long-strip text, and full-screen watermarks.
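A hedged sketch of the alpha-blending synthesis described above (assuming the watermark is already scaled and positioned as an RGBA layer matching the background; the opacity range and JPEG quality actually used in PRWD are not specified here):

```python
import io
import numpy as np
from PIL import Image

def synthesize_watermarked(background, watermark_rgba, alpha=0.5, jpeg_quality=80):
    bg = np.asarray(background, dtype=np.float32) / 255.0          # I^b, shape (H, W, 3)
    wm = np.asarray(watermark_rgba, dtype=np.float32) / 255.0      # RGBA watermark layer
    m = wm[..., 3:4]                                               # watermark mask M from the alpha channel
    blended = bg * (1.0 - alpha * m) + wm[..., :3] * (alpha * m)   # I^b*(1-alpha*M) + w*alpha*M
    img = Image.fromarray((blended * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_quality)             # P(.): JPEG compression artifacts
    buf.seek(0)
    return Image.open(buf)
```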

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three benchmark datasets to demonstrate the method's effectiveness and generalization ability.

  • CLWD (Colored Large-scale Watermark Dataset): Contains 60,000 training images and 10,000 test images. Backgrounds are from the PASCAL VOC2012 dataset. Watermarks are primarily logo-based with randomized size, location, rotation, and opacity.
  • ILAW (Images with Large Area Watermarks): A more challenging dataset with 60,000 training images and 10,000 test images. Backgrounds are from the Places365 dataset. The watermarks cover a larger area of the image and have higher opacity, making removal more difficult.
  • PRWD (Pixabay Real-world Watermark Dataset): The new dataset proposed in this paper. It has 189,768 training images and 33,492 test images. It is distinguished by its diversity in both backgrounds (natural, designed, AI-generated) and watermark types (logo, text, full-screen). It also includes a small real-world test set of 49 images with no ground truth for user studies.

5.2. Evaluation Metrics

A comprehensive set of metrics was used to evaluate performance from different perspectives; a short computational sketch of a few of them follows the metric definitions below.

For Image Restoration Quality:

  • PSNR (Peak Signal-to-Noise Ratio):

    1. Conceptual Definition: Measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects its fidelity. In image processing, it quantifies the quality of reconstruction by measuring the difference between the original and reconstructed images. Higher PSNR generally indicates better reconstruction quality.
    2. Mathematical Formula: $\text{PSNR} = 10 \cdot \log_{10}\!\left(\dfrac{\text{MAX}_I^2}{\text{MSE}}\right)$
    3. Symbol Explanation:
      • $\text{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit grayscale image).
      • $\text{MSE}$: The Mean Squared Error between the ground-truth image $I$ and the predicted image $\hat{I}$.
  • SSIM (Structural Similarity Index Measure):

    1. Conceptual Definition: A perceptual metric that measures image quality degradation as a perceived change in structural information. It compares three components: luminance, contrast, and structure, making it closer to human perception of quality than PSNR. The value ranges from -1 to 1, where 1 indicates perfect similarity.
    2. Mathematical Formula: $\text{SSIM}(x, y) = \dfrac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
    3. Symbol Explanation:
      • $x, y$: The two image windows being compared.
      • $\mu_x, \mu_y$: The means of $x$ and $y$.
      • $\sigma_x^2, \sigma_y^2$: The variances of $x$ and $y$.
      • $\sigma_{xy}$: The covariance of $x$ and $y$.
      • $c_1, c_2$: Small constants to stabilize the division.
  • RMSE (Root-Mean-Square Error):

    1. Conceptual Definition: Measures the standard deviation of the residuals (prediction errors). It quantifies the average magnitude of the error between predicted and actual pixel values. Lower RMSE indicates better performance.
    2. Mathematical Formula: $\text{RMSE} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}(I_i - \hat{I}_i)^2}$
    3. Symbol Explanation:
      • $I_i, \hat{I}_i$: The ground-truth and predicted values for the $i$-th pixel.
      • $N$: The total number of pixels.
  • RMSEw (Weighted RMSE): This is the same as RMSE, but the calculation is performed only within the watermarked area defined by the ground-truth mask. This specifically measures how well the model restores the corrupted regions.

  • LPIPS (Learned Perceptual Image Patch Similarity):

    1. Conceptual Definition: A metric that aims to better align with human perception of image similarity. It uses a pre-trained deep neural network (like VGG or AlexNet) to extract features from two images at multiple layers. The distance between these feature representations is then calculated. A lower LPIPS score indicates that two images are more perceptually similar.
    2. Mathematical Formula: The distance is typically computed as a weighted sum of $\ell_2$ distances over feature stacks from different layers: $d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (f^l_{hw} - f^l_{0,hw}) \|_2^2$
    3. Symbol Explanation:
      • $f^l_{hw}, f^l_{0,hw}$: The feature activations at layer $l$ for images $x$ and $x_0$.
      • $w_l$: A learned weight for each channel at layer $l$.

For Watermark Localization Quality:

  • IoU (Intersection over Union): A standard metric for segmentation tasks. It measures the overlap between the predicted mask and the ground-truth mask.
  • F1 Score: The harmonic mean of precision and recall, providing a balanced measure of a model's accuracy in mask prediction.
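As referenced above, here is a small NumPy sketch of how a few of these metrics can be computed (PSNR on full images, RMSEw restricted to the ground-truth watermark region, IoU/F1 on binarized masks); function names and thresholds are illustrative rather than taken from the paper's evaluation code.

```python
import numpy as np

def psnr(gt, pred, max_val=255.0):
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def rmse_w(gt, pred, wm_mask):
    # RMSE computed only over pixels inside the ground-truth watermark region
    sq = (gt.astype(np.float64) - pred.astype(np.float64)) ** 2
    return float(np.sqrt(sq[wm_mask > 0].mean()))

def iou_f1(pred_mask, gt_mask, thresh=0.5):
    p, g = pred_mask > thresh, gt_mask > thresh
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    iou = inter / max(union, 1)
    precision = inter / max(p.sum(), 1)
    recall = inter / max(g.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return iou, f1
```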

5.3. Baselines

PatchWiper was compared against a diverse set of models:

  • Specialized Watermark Removal Methods: Li et al. [16] (a cGAN-based method), BVMR [10], WDNet [20], SplitNet [5], DENet [26], and SLBR [17]. These represent the primary state-of-the-art in the field.
  • General Image Restoration Methods: U-Net [24] and Restormer [32]. These models are powerful baselines for general restoration but are not specifically designed for watermark removal.
  • Image Inpainting Methods: CoordFill [19] and LaMa [27]. These are state-of-the-art inpainting methods, relevant because watermark removal is a form of inpainting.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents quantitative results on the PRWD and ILAW datasets, qualitative comparisons, and a user study on real-world images.

Quantitative Evaluation on PRWD and ILAW

The results from Tables 1 and 2 demonstrate PatchWiper's superior performance across all metrics.

  • On PRWD (Table 1): PatchWiper achieves the highest PSNR (39.38) and SSIM (0.9839), and the lowest RMSE (3.27) and RMSEw (12.20). This shows its effectiveness on the new, diverse benchmark. It significantly outperforms both general restoration methods like Restormer and specialized methods like SLBR.
  • On ILAW (Table 2): The performance gap is even more pronounced on the challenging ILAW dataset. PatchWiper achieves a PSNR of 32.00, a massive improvement over the next best method, Leng et al. [13] (25.77). This large margin highlights the strength of the dynamic patch-wise approach in handling large-area, high-opacity watermarks, where fine-grained adaptation is critical.

Qualitative Evaluation

Figure 4 provides a visual comparison of the results.

Figure 4: Visual comparison of watermark removal methods, showing the input image, ground truth (GT), PatchWiper, SLBR, Restormer, and other baselines. The examples illustrate performance differences across varied backgrounds and watermarks.

In these examples, baseline methods exhibit common failure modes:

  • Incomplete Removal: Methods like BVMR and Restormer often leave behind color casts or textural remnants of the watermark (e.g., the green pixels on the lizard's face in the fourth row).
  • Detail Destruction: In an attempt to remove the watermark, some methods inadvertently destroy fine background details. For instance, in the second row, baselines remove the overhead wires along with the dark watermark, while PatchWiper successfully preserves them. PatchWiper consistently produces cleaner results with better-preserved background textures and fewer artifacts, validating the effectiveness of its fine-grained restoration.

Watermark Localization Evaluation

Table 4 shows that the dedicated localization network in PatchWiper achieves the highest IoU and F1 scores on both PRWD and CLWD datasets. This confirms the hypothesis that decoupling the localization and restoration tasks leads to more precise mask prediction, which is a crucial prerequisite for high-quality restoration.

Real-world Watermark Removal and User Study

The user study (Table 5) provides compelling evidence of PatchWiper's generalization to real-world scenarios.

  • Impact of Dataset: Both PatchWiper and SLBR perform much better when trained on the new PRWD dataset compared to CLWD. This proves that the diversity of PRWD is key to improving real-world generalization.

  • Superiority of PatchWiper: When both are trained on PRWD, PatchWiper was preferred by users in 36 out of 49 images and received significantly more total votes (2183 vs. 1521 for SLBR). The visual results in Figure 5 corroborate this, showing PatchWiper producing sharper details and fewer artifacts on unseen internet images.

    Figure 5: Visualization results on real-world watermark removal. Odd rows show restoration results while even rows display the corresponding predicted watermark masks.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Table 1: The results of different methods on PRWD dataset. The method marked with * is provided with precise masks. The best results are denoted in boldface.
Method PSNR↑ SSIM↑ RMSE↓ RMSEw↓
CoordFill [19]* 35.90 0.9732 5.51 23.04
U-Net [24] 28.16 0.9354 12.16 51.67
Restormer [32] 37.69 0.9817 4.09 16.08
Li et al. [16] 26.36 0.8847 13.31 40.82
BVMR [10] 29.88 0.9294 8.79 16.50
WDNet [20] 36.68 0.9803 4.60 14.95
SLBR [17] 38.74 0.9805 3.46 12.80
PatchWiper(Ours) 39.38 0.9839 3.27 12.20

The following are the results from Table 2 of the original paper:

Table 2: The results of different methods on ILAW dataset. The method marked with * is provided with coarse masks. The best results are denoted in boldface.
Method Params(M) PSNR↑ SSIM↑ LPIPS↓
LaMa [27]* 88.09 17.97 0.677 0.326
CoordFill [19]* 34.50 22.66 0.819 0.149
DENet [26] 22.42 19.66 0.814 0.236
WDNet [20] 21.00 24.37 0.887 0.166
SplitNet [5] 32.61 25.72 0.892 0.156
SLBR [17] 21.39 25.02 0.890 0.154
Leng et al. [13] 97.95 25.77 0.916 0.100
PatchWiper(Ours) 33.25 32.00 0.947 0.068

The following are the results from Table 4 of the original paper:

Table 4: Quantitative evaluation of watermark masks predicted by our method and baselines on PRWD and CLWD [20] datasets.
Method PRWD CLWD
IoU↑ F1↑ IoU↑ F1↑
WDNet [20] 0.7688 0.8688 0.6120 0.7240
BVMR [10] 0.7571 0.8612 0.7021 0.7871
SplitNet [5] - - 0.7196 0.8027
SLBR [17] 0.8083 0.8798 0.7463 0.8234
DKSP [34] - - 0.7730 0.8480
Li et al. [22] - - 0.7909 0.8634
PatchWiper(Ours) 0.8177 0.8996 0.8042 0.8914

The following are the results from Table 5 of the original paper:

Table 5: User study on the real-world dataset.
Method & Dataset Vote Top-Ranked Count
SLBR [17] trained on CLWD [29] 180 0
PatchWiper trained on CLWD [29] 316 3
SLBR [17] trained on PRWD 1521 10
PatchWiper trained on PRWD 2183 36

6.3. Ablation Studies / Parameter Analysis

The ablation studies in Table 3 systematically validate each design choice in the dynamic restoration network.

The following are the results from Table 3 of the original paper:

Table 3: Ablation studies for the dynamic restoration network on PRWD. The best results are in bold.
Encoding Patch Size Backbone Decoder Params(M) PSNR↑ SSIM↑ RMSE↓ RMSEw↓
Position 8 AttFFC [19] PQN 51.50 35.28 0.9725 5.87 24.83
Position 8 Transformer Block PQN 40.30 38.70 0.9819 3.52 13.33
- - Transformer Block CNN 55.46 28.23 0.9393 12.11 51.91
- - Transformer Block Transformer Block 33.41 38.28 0.9837 3.87 15.22
Position 4 Transformer Block PQN 33.19 39.07 0.9830 3.37 12.65
Position + Mask 4 Transformer Block PQN 33.20 39.17 0.9831 3.33 12.48
Position + RGB 4 Transformer Block PQN 33.24 39.21 0.9835 3.30 12.39
Position + RGB + Mask 4 Transformer Block PQN 33.25 39.38 0.9839 3.27 12.20
  • Effect of Transformer Backbone: Replacing the AttFFC backbone (Row 1) with Restormer transformer blocks (Row 2) leads to a massive PSNR jump from 35.28 to 38.70. This confirms that the transformer's ability to model long-range dependencies is superior for generating high-quality representations from the global image context.
  • Effect of PQN Decoder: Replacing the dynamic PQN decoder with a standard CNN (Row 3) or Transformer Block decoder (Row 4) results in significantly worse or slightly inferior performance. The CNN decoder fails completely (28.23 PSNR), showing that a static decoder is unsuitable. The PQN outperforms the transformer decoder, proving that the dynamic, per-patch MLP generation is more effective than a standard attention-based decoder for this task.
  • Effect of Patch Size: Reducing the patch size from $8 \times 8$ (Row 2) to $4 \times 4$ (Row 5) improves the PSNR from 38.70 to 39.07. This indicates that smaller patches allow for a more fine-grained and accurate restoration of watermark details and boundaries.
  • Effect of Input Encoding: Rows 5-8 show the cumulative benefit of adding different information to the PQN's input query.
    • Adding the mask (Row 6) provides explicit spatial guidance on where to restore.
    • Adding the original RGB values (Row 7) allows the network to use uncorrupted pixels at patch boundaries as strong priors for restoration.
    • Combining all three—Position + RGB + Mask (Row 8)—achieves the best results. This shows that the network synergistically uses location, color, and mask information for the most precise patch-wise recovery.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper makes significant contributions to the field of visible watermark removal. It introduces PatchWiper, a novel two-stage framework that effectively addresses the core challenges of existing methods. By decoupling watermark localization from background restoration, it achieves state-of-the-art precision in identifying watermarked regions. The key innovation lies in its dynamic patch-wise restoration network, which generates unique parameters for each image patch, enabling an unprecedented level of adaptability to local variations in distortion. Furthermore, the paper introduces the PRWD benchmark, a large-scale and diverse dataset that better reflects real-world complexities and will facilitate future research. The extensive experimental results and user studies conclusively demonstrate that PatchWiper outperforms previous methods, especially in challenging real-world scenarios.

7.2. Limitations & Future Work

The paper does not explicitly state its limitations. However, we can infer some potential areas for future work:

  • Computational Complexity: The dynamic generation of MLP parameters for every single patch in an image could be computationally expensive, potentially limiting its application in real-time or resource-constrained environments. Future work could explore more efficient ways to generate or share parameters without sacrificing performance.
  • Error Propagation: As a two-stage method, errors made in the first (localization) stage will inevitably propagate to the second (restoration) stage. A small error in the predicted mask could lead to either parts of the watermark being left behind or clean background areas being unnecessarily restored, potentially introducing artifacts.
  • Extremely Challenging Cases: The method might still struggle with watermarks that are almost completely opaque, or those that are semantically entangled with the background (e.g., a text watermark over a book page), where distinguishing the watermark from the content is nearly impossible even for humans.

7.3. Personal Insights & Critique

  • Strengths:

    • The decoupling principle is a simple yet powerful idea that directly addresses a well-known weakness in multi-task learning. This is a solid engineering choice that leads to demonstrably better localization.
    • The patch-wise dynamic network is the most impressive innovation. It provides a highly intuitive solution to the problem of non-uniform image distortion. This concept of "hyper-local" adaptation is very powerful and could be transferred to other image restoration tasks like scratch removal, artifact correction, or rain streak removal, where degradations are often localized and varied.
    • The creation of the PRWD dataset is a major contribution in its own right. High-quality, large-scale, and diverse datasets are often the primary drivers of progress in deep learning. By including AI-generated content and more complex watermark types, PRWD pushes the field to build more robust and generalizable models.
  • Critique and Areas for Improvement:

    • The two-stage pipeline, while effective, lacks the elegance of an end-to-end solution. Exploring ways to achieve the benefits of decoupling (e.g., through carefully designed loss functions or architectural constraints) within a single end-to-end framework could be a fruitful research direction.

    • The paper focuses on visible watermarks. The techniques, especially the dynamic restoration network, could potentially be adapted for removing invisible watermarks or other forms of steganography, although this would require a different localization approach.

    • A deeper analysis of the failure cases would be beneficial. Understanding precisely when and why PatchWiper fails could provide valuable insights for designing the next generation of removal algorithms.

      Overall, "PatchWiper" is a strong paper that presents a comprehensive and well-executed solution to a practical problem, backed by a significant data contribution that will benefit the entire research community.
