Papers
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Published:11/26/2025
Long Video Reasoning Framework · Multimodal Chain-of-Tool Reasoning · Long Video Question Answering Dataset · Visual Evidence Retrieval and Processing · Large Multimodal Model
LongVT introduces an end-to-end framework that enhances long video reasoning via interleaved Multimodal Chain-of-Tool-Thought, leveraging LMMs' temporal grounding. It releases the VideoSIAH dataset for training and evaluation, significantly improving performance on various benchmarks.
DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning
Published:2/24/2025
Synthetic Demonstration Generation · Data-Efficient Visuomotor Policy Learning · Spatially Augmented Demonstration Generation · 3D Point Cloud Synthesis · Robotic Manipulation with Machine Learning
DemoGen is a low-cost, fully synthetic method for generating augmented demonstrations to enhance visuomotor policy learning in robotics. It adapts a single human demonstration to new object configurations, significantly improving policy performance across challenging real-world tasks.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Published:11/26/2023
Diffusion Models · Video Generation Models · Text-to-Video Generation · High-Quality Video Fine-Tuning · Video Dataset Curation
The paper presents Stable Video Diffusion (SVD), a model for high-resolution text-to-video and image-to-video generation. It evaluates a three-stage training process and highlights the importance of well-curated datasets for high-quality video generation, demonstrating strong performance.
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Published:3/22/2024
Text-to-Video Generation · Long Video Generation · Autoregressive Video Generation · Conditional Attention Mechanism · Video Enhancement Application
This paper presents StreamingT2V, an autoregressive method for long video generation that addresses limitations of existing text-to-video models. It utilizes a Conditional Attention Module and an Appearance Preservation Module for smooth transitions and scene-feature retention, alongside a video enhancement stage.
SecureGPT: A Framework for Multi-Party Privacy-Preserving Transformer Inference in GPT
Published:1/1/2024
Multi-Party Privacy-Preserving Transformer Inference · Secure Natural Language Processing Framework · Generative Pretrained Transformer · Multi-Party Private Protocol Design · Privacy Preservation Mechanism Verification
SecureGPT is introduced as a multi-party privacy-preserving framework for GPT inference, featuring protocols such as M2A, truncation, division, softmax, and GELU. Demonstrating security under a semi-honest model, it achieves up to 100× performance improvements, advancing privacy-preserving NLP.
Masked Diffusion for Generative Recommendation
Published:11/28/2025
Generative Recommendation Systems · Masked Diffusion Model · Semantic ID Modeling · Sequential Recommender Systems · Autoregressive Modeling
This paper introduces Masked Diffusion for Generative Recommendation (MADRec), which models user-interaction sequences using discrete masking noise, outperforming traditional autoregressive models in efficiency and performance, particularly in data-restricted and coarse-grained recommendation settings.
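The MADRec summary above hinges on corrupting user-interaction sequences with discrete masking noise. As a rough illustration only (this is not MADRec's implementation; the mask token ID, mask ratio, and function name are assumptions), the forward corruption step of a masked discrete diffusion process over item IDs can be sketched as:

```python
import random

def mask_sequence(item_ids, mask_ratio, mask_token=0, seed=None):
    """Forward corruption step of masked discrete diffusion:
    each item ID is independently replaced by the mask token
    with probability `mask_ratio`."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_ratio else i for i in item_ids]

seq = [101, 205, 307, 412, 518]      # a toy user-interaction sequence
noised = mask_sequence(seq, mask_ratio=0.6, seed=42)
```

A model trained under this scheme learns to reconstruct the masked positions, which is what lets it decode items in parallel rather than strictly left to right as an autoregressive model would.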
CoFiRec: Coarse-to-Fine Tokenization for Generative Recommendation
Published:11/28/2025
Generative Recommendation Systems · Fine-grained User Preference Modeling · Autoregressive Recommendation Generation · Coarse-to-Fine Semantic Hierarchy · Evaluation on Public Benchmarks
The CoFiRec framework enhances generative recommendation by incorporating a coarse-to-fine semantic hierarchy into the tokenization process, allowing for better modeling of user intent. Experiments demonstrate its superiority over existing methods across multiple benchmarks.
EAMamba: Efficient All-Around Vision State Space Model for Image Restoration
Published:6/27/2025
Image Restoration · Efficient Vision State Space Model · Mamba Framework · Multi-Head Selective Scan Module · Low-Level Computer Vision
EAMamba integrates a Multi-Head Selective Scan Module and an all-around scanning mechanism to tackle the computational complexity and local pixel forgetting issues in image restoration. Experiments show EAMamba significantly reduces FLOPs by 31–89% while maintaining comparable performance.
NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution
Published:6/3/2025
RAW Image Restoration and Super-Resolution · Image Signal Processing Pipelines · NTIRE 2025 Challenge · Bayer Image Upscaling · Image Denoising and Deblurring
This paper reviews the NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution, detailing proposed methods for restoring noisy RAW images and upscaling Bayer images; the challenge attracted 230 entrants, 45 of whom submitted solutions.
Attention-Guided Progressive Neural Texture Fusion for High Dynamic Range Image Restoration
Published:7/14/2021
High Dynamic Range Image Restoration · Attention-Guided Texture Fusion · Multi-Exposure Fusion · Neural Feature Transfer Mechanism · Progressive Texture Blending Module
The proposed Attention-guided Progressive Neural Texture Fusion (APNT-Fusion) model for HDR image restoration effectively handles content-association ambiguities caused by saturation, motion, and artifacts, outperforming existing methods through a two-stream structure and progressive texture blending.
Effective Measures to Improve Current Collection Quality for Double Pantographs and Catenary Based on Wave Propagation Analysis
Published:4/6/2020
Double Pantograph-Catenary Interaction · Wave Propagation Analysis Optimization Measures · Finite Element Method Modeling · Current Collection Quality Improvement · Pantographs and Catenary in High-Speed Trains
This study proposes measures based on wave propagation analysis to improve the current collection quality of double pantographs. Finite element model analysis reveals that optimal pantograph spacing is linked to contact wire uplift velocity, and that adding damping to the steady arm effectively reduces wave effects.
Emerging Properties in Self-Supervised Vision Transformers
Published:4/29/2021
Self-Supervised Vision Transformers · ViT Feature Learning · Image Semantic Segmentation · DINO Self-Distillation Method · ImageNet Dataset Evaluation
This paper investigates the unique contributions of self-supervised learning (SSL) to Vision Transformers (ViTs) and introduces the DINO method. It finds that self-supervised ViT features contain explicit semantic segmentation information and achieve 78.3% top-1 accuracy on ImageNet with a k-NN classifier.
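The 78.3% figure above comes from a k-NN evaluation protocol: each image is classified by the labels of its nearest neighbors in frozen feature space, with no fine-tuning of the backbone. A minimal sketch of that protocol on toy 2-D features follows (DINO's actual evaluation uses ViT embeddings on ImageNet and weights neighbors by similarity; this plain majority vote is a simplification, and all names here are illustrative):

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=3):
    """Classify queries by cosine similarity to frozen features:
    each query takes the majority label among its k nearest
    training features."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = b @ a.T                            # (n_query, n_train) cosine sims
    nn = np.argsort(-sims, axis=1)[:, :k]     # indices of k nearest neighbors
    return np.array([np.bincount(train_labels[row]).argmax() for row in nn])

# toy features: two well-separated clusters
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[0.95, 0.05], [0.05, 0.95]])
pred = knn_predict(train, labels, queries, k=3)   # → array([0, 1])
```

The appeal of this protocol is that it probes feature quality directly: no classifier head is trained, so the score reflects only how well the self-supervised features separate classes.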
LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders
Published:5/7/2025
Long Sequence Modeling · Industrial Recommender Systems · Efficient Transformer Architecture · Hybrid Attention Mechanisms · GPU-Optimized Recommender Systems
LONGER is a Transformer model designed for industrial recommender systems that captures ultra-long user behavior sequences. Key innovations include a global token mechanism, a token merge module to reduce complexity, and engineering optimizations, demonstrating strong performance.
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
RL Training for Large Language Models · Long-Context Modeling · LLM Reasoning Capacity Enhancement · Sparse Attention Mechanism
DeepSeek-V3.2 is presented, balancing computational efficiency and reasoning capability through three innovations: a sparse attention mechanism that reduces complexity, a scalable reinforcement learning framework rivaling GPT-5, and a synthesis pipeline enhancing generalization.
MAGI-1: Autoregressive Video Generation at Scale
Published:5/19/2025
Autoregressive Video Generation Model · Image-to-Video Generation · Efficient Temporal Consistency Modeling · Controllable Video Generation · Long-Context Video Generation
MAGI-1 is a large-scale autoregressive video generation system that ensures causal temporal modeling and supports streaming. The model, with 24 billion parameters, achieves state-of-the-art performance in image-to-video tasks while maintaining memory-efficient inference.
Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement
Published:1/19/2020
Low-Light Image Enhancement · Zero-Reference Deep Curve Estimation · DCE-Net · Image-Specific Curve Estimation · Non-Reference Loss Functions
This study introduces Zero-DCE, a method that uses a lightweight network for image-specific curve estimation to enhance low-light images without requiring paired data. It employs non-reference loss functions to effectively improve image quality, demonstrating good generalization across diverse scenes.
Pack and Force Your Memory: Long-form and Consistent Video Generation
Published:10/2/2025
Long-form Video Generation · MemoryPack Mechanism · Context Modeling · Autoregressive Video Models · Error Propagation Mitigation
This paper introduces a new method for long-form video generation, addressing long-range dependency capture and error accumulation in autoregressive decoding. The implementation of MemoryPack and Direct Forcing significantly enhances consistency and reliability in video generation.
CAMformer: Associative Memory is All You Need
Published:11/25/2025
CAMformer Architecture · Memory-Based Attention Mechanism · Energy-Efficient Transformer · Optimization of BERT and Vision Transformer · Physical Similarity Sensing
CAMformer is a novel hardware accelerator that reinterprets attention as an associative memory operation using BACAM, achieving constant-time similarity search. It demonstrates over 10× energy efficiency and up to 4× higher throughput on BERT and Vision Transformer workloads while maintaining accuracy.
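To make "attention as an associative memory operation" concrete: softmax attention is a soft relaxation of content-addressable lookup, where a query retrieves the value stored under its best-matching key. A small numeric sketch follows (illustrative only; it does not model CAMformer's binary CAM hardware, and all function names are assumptions), showing that attention at low temperature approaches the hard lookup:

```python
import numpy as np

def associative_lookup(query, keys, values):
    """Hard content-addressable retrieval: return the value whose
    key best matches the query (argmax over match scores). A CAM
    array evaluates all key comparisons in parallel."""
    return values[np.argmax(keys @ query)]

def soft_attention(query, keys, values, temperature=1.0):
    """Standard softmax attention: a weighted mix of all values,
    converging to the hard lookup as temperature -> 0."""
    scores = keys @ query / temperature
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[10.0], [20.0]])
q = np.array([0.9, 0.1])
hard = associative_lookup(q, keys, values)                 # exact best match
soft = soft_attention(q, keys, values, temperature=0.05)   # ≈ hard lookup
```

Viewed this way, replacing the dot-product/softmax pipeline with a physical best-match search is what turns the quadratic comparison step into a constant-time memory operation.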
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Published:11/27/2025
Multimodal Image Generation · Joint Training of Diffusion Models · Text-to-Image Generation Framework · Canvas Image Generation · Multi-Task Datasets
The paper presents Canvas-to-Image, a unified framework for high-fidelity compositional image generation with multimodal controls, encoding diverse signals into a single composite canvas image. It introduces a Multi-Task Canvas Training strategy, enhancing the model's understanding of diverse control signals.
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Published:10/6/2022
Text-to-Video Generation · Long Video Generation · Variable Length Video Generation · Joint Training on Image-Text Pairs · Temporal Encoding with Transformer
Phenaki is a model designed for generating variable-length videos from text prompts. It introduces a novel video representation method using causal attention and a bidirectional masked transformer, overcoming challenges of computational cost and data scarcity while improving temporal coherence.
……