VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis
TL;DR Summary
VisionGRU introduces a novel linear-complexity RNN architecture that uses a simplified minGRU and hierarchical bidirectional scanning to efficiently analyze high-resolution images. The method achieves performance superior to ViTs while significantly reducing computational and memory costs.
Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two dominant models for image analysis. While CNNs excel at extracting multi-scale features and ViTs effectively capture global dependencies, both suffer from high computational costs, particularly when processing high-resolution images. Recently, state-space models (SSMs) and recurrent neural networks (RNNs) have attracted attention due to their efficiency. However, their performance in image classification tasks remains limited. To address these challenges, this paper introduces VisionGRU, a novel RNN-based architecture designed for efficient image classification. VisionGRU leverages a simplified Gated Recurrent Unit (minGRU) to process large-scale image features with linear complexity. It divides images into smaller patches and progressively reduces the sequence length while increasing the channel depth, thus facilitating multi-scale feature extraction. A hierarchical 2DGRU module with bidirectional scanning captures both local and global contexts, improving long-range dependency modeling, particularly for tasks like semantic segmentation. Experimental results on the ImageNet and ADE20K datasets demonstrate that VisionGRU outperforms ViTs, significantly reducing memory usage and computational costs, especially for high-resolution images. These findings underscore the potential of RNN-based approaches for developing efficient and scalable computer vision solutions. Codes will be available at https://github.com/YangLiu9208/VisionGRU.
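To make the linear-complexity recurrence concrete, below is a minimal PyTorch sketch of a simplified GRU (minGRU) layer applied to a flattened sequence of image patches. The update gate and candidate state depend only on the current input, which is what keeps the cost linear in sequence length. The module name, layer sizes, and the explicit Python loop (rather than a parallel scan) are illustrative assumptions and not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """Minimal sketch of a simplified GRU (minGRU) over a patch sequence.

    The update gate z_t and candidate state h~_t are computed from the
    current input only (not from h_{t-1}), so the recurrence
    h_t = (1 - z_t) * h_{t-1} + z_t * h~_t costs O(sequence length) and
    could also be evaluated with a parallel scan. Names and sizes here
    are illustrative, not the paper's code.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_gate = nn.Linear(dim, dim)        # produces update gate z_t
        self.to_candidate = nn.Linear(dim, dim)   # produces candidate state h~_t

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. a flattened sequence of image patches
        b, n, d = x.shape
        z = torch.sigmoid(self.to_gate(x))        # (b, n, d)
        h_tilde = self.to_candidate(x)            # (b, n, d)
        h = torch.zeros(b, d, device=x.device, dtype=x.dtype)
        outputs = []
        for t in range(n):                        # sequential form of the scan
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)        # (b, n, d)


# Toy usage: 196 patch tokens of width 96, as in a 14x14 patch grid.
tokens = torch.randn(2, 196, 96)
features = MinGRU(96)(tokens)
print(features.shape)  # torch.Size([2, 196, 96])
```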
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis
- Authors: Shicheng Yin, Kaixuan Yin, Weixing Chen, Enbo Huang, Yang Liu
- Affiliations: Sun Yat-sen University, Nanning Normal University. The authors are primarily from academic institutions, suggesting a research-focused contribution.
- Journal/Conference: The paper is available on arXiv, which is a preprint server. This means it has not yet undergone formal peer review for a conference or journal but is shared to disseminate findings quickly.
- Publication Year: 2024 (based on the arXiv ID).
- Abstract: The paper introduces VisionGRU, a novel architecture based on Recurrent Neural Networks (RNNs) for efficient image analysis. It aims to solve the high computational cost of dominant models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), especially on high-resolution images. The core of VisionGRU is a simplified Gated Recurrent Unit (minGRU) that processes image features with linear complexity. The architecture is hierarchical, reducing sequence length while increasing channel depth to extract multi-scale features. A key component is the 2DGRU module, which uses bidirectional scanning to capture both local and global context; a sketch of this idea follows the list below. Experiments on ImageNet (classification) and ADE20K (segmentation) show that VisionGRU outperforms ViTs with significantly lower memory and computational costs.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2412.18178
- PDF Link: http://arxiv.org/pdf/2412.18178
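As a companion to the minGRU sketch above, the snippet below illustrates the bidirectional-scanning idea behind the 2DGRU module: a 2D feature map is flattened into a patch sequence, scanned once in raster order and once in reverse, and the two outputs are merged so every position receives context from both directions. Sharing the gate/candidate projections across directions and the simple additive merge are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


def min_gru_scan(z: torch.Tensor, h_tilde: torch.Tensor) -> torch.Tensor:
    """Sequential minGRU-style scan: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t."""
    b, n, d = z.shape
    h = torch.zeros(b, d, device=z.device, dtype=z.dtype)
    outputs = []
    for t in range(n):
        h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
        outputs.append(h)
    return torch.stack(outputs, dim=1)


class BidirectionalScan2D(nn.Module):
    """Illustrative bidirectional scan over a 2D feature map (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_gate = nn.Linear(dim, dim)
        self.to_candidate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) feature map from an earlier stage
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)                       # (b, h*w, c)
        z = torch.sigmoid(self.to_gate(seq))
        h_tilde = self.to_candidate(seq)
        fwd = min_gru_scan(z, h_tilde)                           # raster-order scan
        bwd = min_gru_scan(z.flip(1), h_tilde.flip(1)).flip(1)   # reversed scan
        merged = fwd + bwd                                       # simple additive merge
        return merged.transpose(1, 2).reshape(b, c, h, w)


# Toy usage: a 96-channel, 14x14 feature map.
fmap = torch.randn(2, 96, 14, 14)
out = BidirectionalScan2D(96)(fmap)
print(out.shape)  # torch.Size([2, 96, 14, 14])
```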