Papers

Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation
Published:9/11/2025
Multimodal Recommendation Systems · Fine-Grained Cross-Modal Association Modeling · Bidirectional Attention Mechanism · Global Distribution Consistency Regularization · Dilated Refinement Attention Module
This study presents MambaRec, an innovative multimodal recommendation framework addressing fine-grained cross-modal association modeling and global consistency issues through attention-guided learning. Its core contribution is the Dilated Refinement Attention Module, enhancing …
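The summary above is cut off, but the named Dilated Refinement Attention Module invites a concrete picture. Below is a minimal, hypothetical sketch (PyTorch assumed; the layer shapes, dilation rates, and channel-gating form are my own illustrative choices, not MambaRec's published design) of how dilated convolutions plus a channel-attention gate could refine fused multimodal item features:

```python
import torch
import torch.nn as nn

class DilatedRefinementAttention(nn.Module):
    """Illustrative refinement block (assumed design): dilated convs widen the
    receptive field, then a channel-attention gate re-weights the result."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.dilated = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=2, dilation=2),
            nn.GELU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=4, dilation=4),
        )
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),                       # squeeze: per-channel summary
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.Sigmoid(),                                  # excitation: per-channel gate in (0, 1)
        )

    def forward(self, fused):                              # fused: (batch, channels, items)
        refined = self.dilated(fused)
        return fused + refined * self.channel_gate(refined)   # gated residual refinement

fused_features = torch.randn(8, 64, 32)                    # e.g. fused image/text item features
print(DilatedRefinementAttention()(fused_features).shape)  # -> torch.Size([8, 64, 32])
```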
Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals
Published:10/8/2025
Multimodal Generative Recommendation System · Fusion of Collaborative and Semantic Signals · Self-Supervised Quantization Learning · Sequential Recommender Systems · DINO Framework
The paper introduces MSCGRec, a generative recommendation system that addresses limitations in current sequential recommenders by integrating multiple semantic modalities and collaborative features. Empirical results show superior performance on three real-world datasets …
MorphQPV: Exploiting Isomorphism in Quantum Programs to Facilitate Confident Verification
Published:4/24/2024
Quantum Program Verification · Isomorphism Method · Confident Assertion-Based Verification Method · Constraint Optimization Problem · Quantum Algorithm Debugging
MorphQPV is a confident assertion-based verification method for quantum programs, leveraging isomorphism to establish structural preservation relations among runtime states. It transforms verification into a constraint optimization problem, significantly improving efficiency and …
ModRWKV: Transformer Multimodality in Linear Time
Published:11/1/2025
ModRWKV Multimodal Framework · RWKV Architecture · Linear-Time Transformer · Multimodal Large Language Models · Dynamically Adaptable Heterogeneous Modality Encoders
This study introduces ModRWKV, a framework based on the RWKV architecture that achieves multimodal processing with linear time complexity, outperforming traditional quadratic-complexity Transformer models. It balances performance and computational efficiency for multi-source information …
Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation
Published:9/11/2024
Generative Recommendation Systems · Multi-Aspect Semantic Tokenization · Text-Based Reconstruction Tasks · Long-Tailed Recommendation Issues · Cold-Start Recommendation
This paper introduces LAMIA, a novel multi-aspect semantic tokenization framework that enhances generative recommendation systems. Unlike traditional methods, it learns independent embeddings capturing multiple facets of items, significantly improving recommendation accuracy for …
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Published:12/16/2017
Tacotron 2 Speech Synthesis · WaveNet Vocoder · Mel Spectrogram Prediction · Sequence-to-Sequence Feature Prediction · Neural Network Speech Synthesis Architecture
This paper presents Tacotron 2, a neural network architecture for direct text-to-speech synthesis. It features a sequence-to-sequence network predicting mel spectrograms and a modified WaveNet as a vocoder. The model achieves a mean opinion score of 4.53, comparable to professionally recorded speech.
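The two-stage pipeline described here (characters → mel spectrogram → waveform) can be sketched at the interface level. The modules below are deliberately tiny stand-ins, not the paper's architecture; only the overall data flow follows the summary.

```python
import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    """Stand-in for the attention-based seq2seq feature prediction network."""
    def __init__(self, vocab_size=128, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)    # the real model decodes frames autoregressively

    def forward(self, char_ids):                   # char_ids: (batch, chars)
        h, _ = self.encoder(self.embed(char_ids))
        return self.to_mel(h)                      # (batch, frames, n_mels)

class Vocoder(nn.Module):
    """Stand-in for the modified WaveNet that conditions on mel frames."""
    def __init__(self, n_mels=80, upsample=256):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=upsample)   # frames -> sample rate
        self.head = nn.Conv1d(n_mels, 1, kernel_size=1)

    def forward(self, mel):                        # mel: (batch, frames, n_mels)
        return self.head(self.upsample(mel.transpose(1, 2)))  # (batch, 1, samples)

chars = torch.randint(0, 128, (1, 40))             # a short character sequence
waveform = Vocoder()(MelPredictor()(chars))
print(waveform.shape)                              # -> torch.Size([1, 1, 10240])
```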
Tacotron: Towards End-to-End Speech Synthesis
Published:3/30/2017
End-to-End Speech Synthesis Model · Tacotron Model · Sequence-to-Sequence Learning · Text-to-Speech Synthesis · Generative Models in NLP
Tacotron is an end-to-end text-to-speech model that synthesizes speech directly from characters, simplifying complex traditional TTS pipelines. Trained from scratch, it achieves a mean opinion score of 3.82, outperforming existing systems in naturalness and offering faster generation speeds.
WaveNet: A Generative Model for Raw Audio
Published:9/13/2016
Audio Generation Model · WaveNet Architecture · Text-to-Speech Synthesis · Autoregressive Modeling · Music Generation
WaveNet is introduced as a deep neural network for raw audio generation, featuring probabilistic and autoregressive properties. It excels in text-to-speech tasks, surpassing existing systems in naturalness, and shows high realism in music generation while also achieving promising results in phoneme recognition.
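The autoregressive property rests on dilated causal convolutions, which WaveNet stacks with gated activations and residual connections. A minimal PyTorch sketch of one such block (channel sizes and stack depth are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalBlock(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # Left-pad so each output sample depends only on past samples (causality).
        self.pad = dilation                        # (kernel_size - 1) * dilation with kernel_size=2
        self.filter = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.residual = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                          # x: (batch, channels, samples)
        h = F.pad(x, (self.pad, 0))
        # Gated activation unit: tanh(filter) * sigmoid(gate), as in WaveNet.
        out = torch.tanh(self.filter(h)) * torch.sigmoid(self.gate(h))
        return x + self.residual(out)              # residual connection

# Doubling the dilation per layer grows the receptive field exponentially.
stack = nn.Sequential(*[DilatedCausalBlock(64, d) for d in (1, 2, 4, 8, 16)])
audio_features = torch.randn(1, 64, 16000)         # (batch, channels, samples)
print(stack(audio_features).shape)                 # -> torch.Size([1, 64, 16000])
```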
End-to-End Speech Recognition Contextualization with Large Language Models
Published:9/20/2023
LLM-based Speech Recognition · Text Context Augmented Speech Recognition · Mixed-Modal Language Modeling · Decoder-Only Speech Recognition · Low-Parameter Adapter Method
The paper introduces a novel speech recognition contextualization method using Large Language Models, reframing ASR as mixed-modal language modeling. Adding textual context reduces word error rate by 6%, achieving a 7.5% improvement over the baseline system.
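A rough sketch of the mixed-modal setup the summary describes: a small adapter projects acoustic features into the LLM's embedding space, and the embedded textual context is prepended so a decoder-only model transcribes conditioned on both. All module names, dimensions, and the stride-based downsampling are assumptions for illustration, not the paper's exact components.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Low-parameter adapter mapping acoustic features into LLM token-embedding space."""
    def __init__(self, audio_dim=512, llm_dim=2048, stride=4):
        super().__init__()
        # Strided conv both projects and shortens the acoustic sequence.
        self.downsample = nn.Conv1d(audio_dim, llm_dim, kernel_size=stride, stride=stride)

    def forward(self, feats):                                 # (batch, frames, audio_dim)
        return self.downsample(feats.transpose(1, 2)).transpose(1, 2)

def build_prompt(context_emb, audio_emb, bos_emb):
    # [embedded textual context] + [adapted audio] + [BOS] -> decoder-only LM emits the transcript.
    return torch.cat([context_emb, audio_emb, bos_emb], dim=1)

adapter = AudioAdapter()
audio_emb = adapter(torch.randn(1, 400, 512))                 # ~4 s of acoustic features
context_emb = torch.randn(1, 32, 2048)                        # embedded contextual text tokens
bos_emb = torch.randn(1, 1, 2048)
print(build_prompt(context_emb, audio_emb, bos_emb).shape)    # -> torch.Size([1, 133, 2048])
```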
End-to-End Speech Recognition: A Survey
Published:3/3/2023
End-to-End Speech Recognition Architectures · Deep Learning in Speech Recognition · All-Neural ASR Models · Taxonomy of Automatic Speech Recognition Models · Training and Decoding of Speech Recognition Models
This survey reviews advancements in end-to-end automatic speech recognition (ASR) models, highlighting deep learning's impact on reducing word error rates. It provides a taxonomy of E2E models, discusses their properties, relates them to traditional hidden Markov models, and covers their training and decoding.
Fun-ASR Technical Report
Published:9/16/2025
LLM-based Automatic Speech Recognition System · Optimizations for Practical Deployment in Speech Recognition · Reinforcement Learning Applications in Speech Recognition · Data-Driven Large-Scale Speech Recognition · Streaming Capability in Speech Recognition
Fun-ASR combines large-scale data, large model capacity, and deep LLM integration, optimized through reinforcement learning for real-world challenges. It achieves state-of-the-art performance on industrial datasets, demonstrating effectiveness in streaming, noise robustness, and code-switching.
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Published:7/5/2024
Multilingual Speech Recognition and Generation · Emotion Recognition in Speech · Speech-to-Speech Translation · Zero-Shot Voice Cloning · Natural Voice Interaction between Humans and LLMs
This report presents FunAudioLLM, a model family enhancing natural voice interaction between humans and LLMs. It features SenseVoice for multilingual speech and emotion recognition and CosyVoice for natural voice generation, supporting applications like speech translation and emotional voice chat.
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
Published:1/1/1990
Hidden Markov Models · Statistical Modeling in Speech Recognition · Applications of Markov Source Models · Review of Statistical Methods
This tutorial reviews the theory of Hidden Markov Models (HMMs) and their applications in speech recognition, highlighting their rich mathematical structure and effectiveness in practical applications.
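A core piece of the machinery the tutorial reviews is the forward algorithm, which scores an observation sequence under a discrete HMM by dynamic programming. A small NumPy example with toy parameters:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(obs | HMM) via the forward recursion.
    pi: (N,) initial state probs, A: (N, N) transition probs,
    B: (N, M) emission probs, obs: list of observed symbol indices."""
    alpha = pi * B[:, obs[0]]              # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction: sum over predecessor states, then emit
    return alpha.sum()                     # termination

pi = np.array([0.6, 0.4])                  # toy 2-state, 2-symbol HMM
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(forward_likelihood(pi, A, B, [0, 1, 1]))
```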
Jenga: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
Large Language Model Fine-Tuning · Long-Context Modeling · Sparse Attention Mechanism
Jenga is a novel LLM fine-tuning system that optimizes activation memory usage in long-context applications using Contextual Token Sparsity. It employs token elimination, pattern prediction, and kernel optimization, achieving up to 1.93x memory reduction and 1.36x acceleration over …
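As an illustration of the contextual-token-sparsity idea (not Jenga's actual predictor or kernels), one can keep only the activations of the tokens that receive the most attention, shrinking what has to be stored for the backward pass; the attention-mass score below is an assumed proxy for importance.

```python
import torch

def prune_activations(hidden, attn_weights, keep_ratio=0.25):
    """hidden: (batch, seq, dim); attn_weights: (batch, heads, seq, seq)."""
    # Score each token by the attention mass it receives, averaged over heads and queries.
    importance = attn_weights.mean(dim=1).mean(dim=1)                  # (batch, seq)
    k = max(1, int(hidden.size(1) * keep_ratio))
    keep_idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values  # keep original token order
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    return torch.gather(hidden, 1, idx), keep_idx                      # store these, not the full tensor

hidden = torch.randn(1, 1024, 512)                          # long-context activations
attn = torch.softmax(torch.randn(1, 8, 1024, 1024), dim=-1)
pruned, kept = prune_activations(hidden, attn)
print(pruned.shape)                                         # -> torch.Size([1, 256, 512])
```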
Objaverse-XL: A Universe of 10M+ 3D Objects
Published:7/11/2023
Objaverse-XL Dataset · 3D Vision Tasks · Multi-View Rendered Images · Zero-Shot Generalization Capability · Acquisition of High-Quality 3D Data
The paper introduces Objaverse-XL, a dataset of over 10 million 3D objects, addressing the scarcity of high-quality data in 3D vision tasks. Training on 100 million multi-view images achieved significant zero-shot generalization, promoting innovation in 3D vision.
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Published:12/4/2025
Training-Free Video Generation · Video Extrapolation Generation · Temporal Attention Mechanism · Importance-Aware KV Cache Pruning · Video Diffusion Models
The paper presents the Deep Forcing mechanism, which addresses temporal repetition, drift, and motion deceleration in autoregressive video diffusion. Using training-free Deep Sink and Participative Compression, it achieves over 12x extrapolation, enhancing video quality and consistency.
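A hedged sketch of importance-aware KV cache pruning in this spirit: during autoregressive extrapolation, cached key/value entries that attract little attention are evicted so the cache stays within a fixed budget. The scoring rule and eviction policy below are assumptions, not the paper's Participative Compression.

```python
import torch

def compress_kv_cache(keys, values, attn_to_cache, budget):
    """keys/values: (heads, cache_len, dim); attn_to_cache: (heads, queries, cache_len)."""
    importance = attn_to_cache.sum(dim=(0, 1))               # total attention each cached entry received
    keep = importance.topk(min(budget, keys.size(1))).indices.sort().values
    return keys[:, keep], values[:, keep]                    # bounded cache, temporal order preserved

keys = torch.randn(8, 2048, 64)
values = torch.randn(8, 2048, 64)
attn = torch.softmax(torch.randn(8, 16, 2048), dim=-1)       # recent queries attending to the cache
k_small, v_small = compress_kv_cache(keys, values, attn, budget=512)
print(k_small.shape, v_small.shape)                          # (8, 512, 64) each
```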
ASTNet: Asynchronous Spatio-Temporal Network for Large-Scale Chemical Sensor Forecasting
Published:8/3/2025
Large-Scale Chemical Sensor Forecasting · Spatiotemporal Dependency Modeling · Asynchronous Spatio-Temporal Network · Graph Fusion Mechanism · Chemical Engineering Applications
ASTNet is an asynchronous spatio-temporal network addressing high latency and complexity in large-scale chemical sensor forecasting. It integrates temporal and spatial encoders for concurrent learning and employs a gated graph fusion mechanism for static and dynamic sensor graphs.
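The gated graph fusion mentioned above can be pictured as a learned per-edge gate mixing a static adjacency (e.g., fixed plant topology) with a dynamic one inferred from recent readings. The sketch below is an illustrative reading of the summary; shapes and the gating form are assumptions rather than ASTNet's exact design.

```python
import torch
import torch.nn as nn

class GatedGraphFusion(nn.Module):
    def __init__(self, n_sensors: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(n_sensors, n_sensors))   # learned per-edge gate logits

    def forward(self, static_adj, dynamic_adj):
        g = torch.sigmoid(self.gate)
        return g * static_adj + (1.0 - g) * dynamic_adj               # convex per-edge mix of the two graphs

n = 128
static_adj = (torch.rand(n, n) > 0.95).float()               # fixed sensor topology
readings = torch.randn(n, 64)                                # recent reading window per sensor
dynamic_adj = torch.softmax(readings @ readings.T, dim=-1)   # similarity-based dynamic graph
fused = GatedGraphFusion(n)(static_adj, dynamic_adj)
print(fused.shape)                                           # -> torch.Size([128, 128])
```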
STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models
Published:11/24/2025
Scalable Ranking Models · Semantic Tokenization · Orthogonal Rotation Transformation · High-Dimensional Feature Sparsity · Efficient Attention Mechanism
The paper introduces STORE, a scalable ranking framework addressing representation and computational bottlenecks in personalized recommendation systems through semantic tokenization, efficient attention, and orthogonal rotation.
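One standard way to realize an orthogonal rotation transformation like the one named here is to parameterize it as the matrix exponential of a skew-symmetric matrix, which stays orthogonal throughout training; STORE's actual construction may differ, so treat this as a generic sketch.

```python
import torch
import torch.nn as nn

class OrthogonalRotation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, x):
        skew = self.raw - self.raw.T                 # skew-symmetric => exp(skew) is orthogonal
        rotation = torch.linalg.matrix_exp(skew)
        return x @ rotation

rot = OrthogonalRotation(64)
tokens = torch.randn(32, 64)                         # semantic-token embeddings
rotated = rot(tokens)
# An orthogonal transform preserves norms, so the rotation only redistributes energy across dimensions:
print(torch.allclose(tokens.norm(dim=-1), rotated.norm(dim=-1), atol=1e-4))
```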
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
Published:1/12/2025
AIOps Framework · Application of Large Language Models in AIOps · Microservice Cloud Environment · Proactive Fault Management · Holistic AI Agent Evaluation
AIOpsLab is introduced as a framework to evaluate AI agents for automating IT operations in complex cloud environments. It integrates fault injection, workload generation, and telemetry export, enabling the design and assessment of end-to-end AI solutions and showcasing the potential …
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Published:1/1/2025
Fault Detection for Large-Scale Distributed Model Training · Automated Fault Detection System · Monitoring Metric Pattern Recognition · Distributed Training Task Monitoring · Optimization of Machine Fault Reaction Time
Minder is an automated faulty machine detection system for large-scale distributed model training. It accurately identifies fault patterns with an average response time of 3.6 seconds, achieving 0.904 precision and a 0.893 F1-score, significantly reducing manual diagnosis time and …
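The metric-pattern-recognition idea behind Minder can be illustrated with a toy detector: in synchronized distributed training, a faulty machine's monitoring metrics drift away from its peers, so flagging the per-machine outlier is a reasonable proxy. The similarity measure and threshold below are illustrative assumptions, not Minder's actual models.

```python
import numpy as np

def flag_outlier_machine(metrics, z_threshold=3.0):
    """metrics: (machines, window) array of one monitoring metric per machine."""
    profile = metrics.mean(axis=1)                            # per-machine summary over the window
    z = np.abs(profile - np.median(profile)) / (profile.std() + 1e-8)
    suspect = int(z.argmax())                                 # most deviant machine
    return suspect if z[suspect] > z_threshold else None      # only flag clear outliers

metrics = np.random.normal(1.0, 0.05, size=(64, 120))         # 64 machines, 120 samples each
metrics[17] *= 0.2                                            # machine 17 degrades
print(flag_outlier_machine(metrics))                          # -> 17
```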