Papers
MAGVIT: Masked Generative Video Transformer
Published: 12/10/2022
Generative Video Transformer, Video Synthesis Tasks, Kinetics-600 Benchmark, Spatial-Temporal Visual Tokens, Multi-Task Learning
MAGVIT is introduced as a unified model for various video synthesis tasks. It employs a 3D tokenizer for efficient video quantization and a masked modeling approach for multitask learning, showcasing superior quality and efficiency across benchmarks.
Learning-based legged locomotion: State of the art and future perspectives
Published: 1/22/2025
Learning-based Quadrupedal Locomotion, Bipedal Locomotion for Humanoid Robots, Deep Learning and Robotic System Simulation, History and Current Status of Learning Locomotion Skills, Application of Action Learning in Robotics
This paper reviews recent advances in learning-based legged locomotion, addressing its history, current status, and future directions. It highlights the roles of deep learning, robotic simulation, and hardware improvements in learning skills for quadrupeds and bipeds, emphasizing
Taming Transformers for High-Resolution Image Synthesis
Published: 12/18/2020
Generative Adversarial Policy Optimization, Diffusion Models, Image Super-resolution, Image Synthesis
This study combines the inductive bias of CNNs with the expressivity of Transformers to synthesize high-resolution images. It first learns a context-rich vocabulary of image constituents with CNNs, then models their composition using Transformers, achieving state-of-the-art results in semantic
VideoGPT: Video Generation using VQ-VAE and Transformers
Published: 4/21/2021
Video Generation Models, VQ-VAE and Transformer Combined Application, BAIR Robot Dataset, UCF-101 Dataset, Autoregressive Generation Models
This paper introduces VideoGPT, a video generation model utilizing VQ-VAE and a simple transformer architecture, achieving competitive sample quality with state-of-the-art GANs on various datasets and providing a reproducible reference for future research.
CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image
Published: 2/18/2025
3D Gaussian Splatting, Sparse Voxel-based 3D Generative Model, Physics-Informed Neural Networks, Dynamic Graph Storage System
CAST is a novel method for 3D scene reconstruction from a single RGB image, extracting object-level segmentation and depth information, analyzing spatial relationships with a GPT model, and employing an occlusion-aware generation model for accurate object geometry, ensuring physi
UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration
Published: 10/28/2025
Generative Recommendation Systems, Knowledge Graph-based Recommendation, Personalized Recommendation System, Multimodal Recommendation Systems, Online Recommendation System Optimization
The paper introduces UNGER, a generative recommendation approach that integrates semantic and collaborative information into a unified code to reduce storage and inference costs. Utilizing a two-phase framework for effective code construction, it demonstrates significant improvem
REB-former: RWKV-enhanced E-branchformer for Speech Recognition
Published: 8/17/2025
Speech Recognition Model, E-Branchformer, RWKV Enhancement Mechanism, LibriSpeech Dataset, Attention Mechanism Optimization
The REB-former model enhances E-Branchformer using RWKV to address quadratic complexity in self-attention. By interleaving E-Branchformer and RWKV layers and introducing the GroupBiRWKV module, it achieves state-of-the-art performance with up to 7.1% lower WER on the LibriSpeech
FICLRec: Frequency enhanced intent contrastive learning for sequential recommendation
Published: 6/11/2025
Sequential Recommender Systems, Frequency Enhanced Intent Contrastive Learning, User Purchasing Behavior Modeling, Data Sparsity Issue, Real-World Recommendation Datasets
FICLRec uses frequency-enhanced intent contrastive learning to address limitations in capturing high-frequency intents in sequential recommendation. It significantly improves performance across five real-world datasets.
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Published: 11/15/2025
Unified Multimodal Models, Interactive Image Comprehension and Generation, Cross-Modality Reasoning and Memory, WEAVE-100k Dataset, WEAVEBench Benchmark
This paper introduces WEAVE, a pioneering suite for in-context interleaved comprehension and generation, comprising the WEAVE-100k dataset and WEAVEBench benchmark, which significantly improves models' visual understanding, image editing, and collaborative generation capabilities
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
Published: 11/3/2025
Unified Multimodal Model Evaluation, Interactive Cross-Modal Reasoning, Human-Annotated Benchmarking, Multimodal Generation Tasks, Interleaved Reasoning Models
The paper introduces ROVER, a benchmark for evaluating reciprocal cross-modal reasoning in Unified Multimodal Models. It features 1,312 tasks assessing how one modality guides another, revealing significant performance differences in physical and symbolic reasoning.
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Published: 10/31/2025
Zero-Shot Reasoning Evaluation of Video Models, MME-CoF Benchmark Dataset, Reasoning Abilities of Video Generation Models, Temporal Consistency Modeling, Chain-of-Frame Reasoning
This study assesses whether video generation models like Veo-3 can function as zero-shot reasoners, introducing the Chain-of-Frame reasoning concept and creating the MME-CoF benchmark. Results reveal strong short-term coherence but significant limitations in long-term reasoning a
Spatial Context Energy Curve-Based Multilevel 3-D Otsu Algorithm for Image Segmentation
Published: 6/4/2019
3D Image Segmentation Algorithm, Otsu Threshold Segmentation, Contextual Feature Modeling, Image Quality Enhancement, Low SNR Image Processing
This paper presents a multilevel 3-D Otsu algorithm based on spatial context energy curves, improving segmentation results for low-contrast and low-SNR images by integrating pixel intensity with spatial information. Experimental results show superior performance across various me
FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models
Published: 10/2/2025
Vision-Language-Action Model, Robotic Manipulation Failure Recovery, Failure Generation and Recovery System, Robotic Manipulation Dataset, Large-Scale Robot Training Data
The paper introduces FailSafe, a system that generates diverse failure scenarios and executable recovery actions for Vision-Language-Action (VLA) models, achieving up to 22.6% performance improvement in robotic failure recovery across various tasks.
Multilevel Thresholding for Image Segmentation Using Mean Gradient
Published: 2/22/2022
Image Segmentation Algorithm, Multilevel Thresholding, Gradient Vector Image Processing, Image Binarization Technique, Parametric Preprocessing Method
This study introduces a simple, effective non-iterative global and bilevel thresholding method using image gradient vectors for binarizing images into three clusters, alongside a parametric preprocessing technique for image restoration, outperforming traditional methods in challe
A Comprehensive Survey of Multi‑Level Thresholding Segmentation Methods for Image Processing
Published: 3/27/2024
Multi-Level Thresholding Segmentation, Image Processing Techniques, Metaheuristic Algorithms, Automated Threshold Selection, Complex Image Processing
The paper reviews multi-level thresholding methods in image processing, focusing on capturing image complexity through multi-range intensity partitioning. It discusses metaheuristic algorithms for optimizing threshold values and outlines advantages, limitations, and future resear
Analysis of Image Processing Using Morphological Erosion and Dilation
Published: 10/1/2021
Morphological Image Processing Techniques, Image Feature Extraction, Image Noise Removal, Morphological Erosion and Dilation
This paper addresses the challenge of improving image quality through morphological erosion and dilation techniques, employing experimental analysis to assess their effectiveness in noise reduction and feature extraction, ultimately revealing enhanced clarity and interpretability
$\mathcal{D(R,O)}$ Grasp: A Unified Representation of Robot and Object Interaction for Cross-Embodiment Dexterous Grasping
Published: 10/3/2024
Interaction Model for Robotic Manipulation, Dexterous Grasping in Complex Environments, Point Cloud-based Grasp Prediction, Adaptability and Generalization for Robotic Hands, Cross-Embodiment Dexterous Manipulation Framework
The D(R,O) Grasp framework models robot-hand and object interaction, enabling broad generalization across various robot hands and object geometries. It efficiently predicts stable grasps, achieving 87.53% and 89% success rates in simulations and real-world tests, respectively.
A comprehensive review of slaughterhouse wastewater treatment and concomitant resource recovery
Published: 1/1/2024
Slaughterhouse Wastewater Treatment, Water Resource Recovery, Food Processing Wastewater Management, Meat Processing Industry, Industrial Wastewater Treatment Technologies
This paper reviews slaughterhouse wastewater treatment and resource recovery, identifying slaughterhouses as major water consumers. It analyzes wastewater characteristics and global regulations, and evaluates various treatment technologies while exploring potential water reuse and r
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Published: 11/27/2025
Audio-Visual Generation Synchronization, Cross-Task Synergy Training, Global-Local Decoupled Interaction Module, Synchronization-Enhanced Classifier-Free Guidance (SyncCFG), Joint Diffusion Process Optimization
Harmony introduces a novel framework to address synchronization challenges in audio-visual generation, leveraging cross-task synergy, a Global-Local Decoupled Interaction Module, and SyncCFG to enhance alignment and generation fidelity significantly.
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Published: 11/25/2025
Block-Diffusion Video Generation, World Model Simulation, Semi-Autoregressive Decoding, Interactive Video Streaming, High-Quality Video Synthesis
Inferix is a block-diffusion based inference engine designed for high-quality, variable-length immersive world simulations. Utilizing a semi-autoregressive decoding paradigm, it integrates diffusion and autoregressive strengths, enhancing real-time interaction and supporting fine