Papers

Tag filter: Transformer architecture
MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers
Published: 11/6/2024
Tags: Transformer architecture · Efficient Attention Mechanism · In-Memory Lookup Tables · Reduction of Computational Complexity · Multi-Head Attention Operation
MemoryFormer is a novel transformer architecture that reduces computational complexity by eliminating most fully-connected layers while retaining the necessary multi-head attention operations, using in-memory lookup tables and hash algorithms for dynamic vector retrieval.
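To make the lookup-table idea concrete, here is a rough, hypothetical sketch (not the paper's actual Memory Layer design; all names and hyperparameters are illustrative): the input is split into chunks, each chunk is hashed with a fixed random projection to a bucket index, and a learned vector is retrieved from that bucket instead of computing a dense projection.

```python
import torch
import torch.nn as nn

class HashMemoryLayer(nn.Module):
    """Rough sketch: replace a dense projection with hashed table lookups.

    Each input chunk is hashed with a fixed random signed projection (a simple
    locality-sensitive hash) to a bucket index, and the learned vector stored
    in that bucket is retrieved. Retrieved vectors are summed, so no large
    matrix multiply is needed at inference. Only the tables are trained here;
    the paper uses additional machinery to make the whole layer trainable.
    """

    def __init__(self, d_in: int, d_out: int, num_chunks: int = 8, bits: int = 8):
        super().__init__()
        assert d_in % num_chunks == 0
        chunk_dim = d_in // num_chunks
        self.num_chunks = num_chunks
        # fixed random hyperplanes acting as the hash function
        self.register_buffer("hyperplanes", torch.randn(num_chunks, chunk_dim, bits))
        self.register_buffer("powers", 2 ** torch.arange(bits))
        # one table of 2**bits learned output vectors per chunk
        self.tables = nn.Parameter(torch.randn(num_chunks, 2 ** bits, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (B, d_in)
        chunks = x.view(x.shape[0], self.num_chunks, -1)             # (B, C, d_in/C)
        signs = (torch.einsum("bcd,cdk->bck", chunks, self.hyperplanes) > 0).long()
        idx = (signs * self.powers).sum(-1)                          # (B, C) bucket ids
        chunk_ids = torch.arange(self.num_chunks, device=x.device)
        retrieved = self.tables[chunk_ids, idx]                      # (B, C, d_out)
        return retrieved.sum(dim=1)                                  # (B, d_out)
```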
Scalable Diffusion Models with Transformers
Published: 12/20/2022
Tags: Diffusion Models · Transformer architecture · Image Generation · Scalable Diffusion Models · Class-Conditional Image Generation
This study introduces Diffusion Transformers (DiTs), which replace the U-Net with a transformer architecture for image generation. Higher Gflops correlate with better performance (lower FID), with the largest model achieving state-of-the-art results on ImageNet benchmarks.
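A toy, hypothetical illustration of the core swap (heavily simplified; the paper's DiT blocks condition on timestep and class via adaLN-Zero and use positional embeddings, both omitted here): the denoiser is a plain transformer operating on patchified latents rather than a U-Net.

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Toy transformer denoiser over patchified latents (not the paper's exact block)."""

    def __init__(self, latent_ch=4, patch=2, dim=256, depth=4, heads=4, num_classes=1000):
        super().__init__()
        self.patch = patch
        self.proj_in = nn.Linear(latent_ch * patch * patch, dim)    # "patchify" projection
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.c_embed = nn.Embedding(num_classes, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(dim, latent_ch * patch * patch)

    def forward(self, z, t, y):
        # z: (B, C, H, W) noisy latent, t: (B,) timestep, y: (B,) class label
        B, C, H, W = z.shape
        p = self.patch
        patches = z.unfold(2, p, p).unfold(3, p, p)                  # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj_in(patches)                                    # positional emb. omitted
        cond = self.t_embed(t.float()[:, None]) + self.c_embed(y)    # simplistic conditioning
        x = self.blocks(x + cond[:, None, :])
        return self.proj_out(x)    # predicted noise per patch; unpatchify omitted for brevity
```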
LoRA: Low-Rank Adaptation of Large Language Models
Published: 6/18/2021
Tags: Low-Rank Adaptation for Large Language Models · Transformer architecture · Large Language Model Fine-Tuning · Parameter Efficiency Optimization · RoBERTa and Its Derivatives
LoRA introduces a low-rank adaptation method for fine-tuning large language models, significantly reducing trainable parameters by injecting rank decomposition matrices while freezing the pretrained weights. It achieves comparable or better performance on RoBERTa, DeBERTa, GPT-2, and GPT-3.
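A minimal sketch of the mechanism, assuming a PyTorch setting (hypothetical module, not the authors' released implementation): the pretrained weight is frozen and a trainable low-rank product B·A is added to its output, so only the two small matrices receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update (illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # freeze the pretrained weight and bias
            p.requires_grad = False
        # rank-decomposition matrices: A starts small-random, B starts at zero
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = frozen projection + scaled low-rank residual (B @ A) applied to x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

Because `lora_B` is initialized to zero, the adapted layer starts out identical to the frozen model, and only the two small matrices accumulate gradients during fine-tuning.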
Octo: An Open-Source Generalist Robot Policy
Published: 5/21/2024
Tags: Generalist Robot Policies · Multi-modal action representation and modeling · Transformer architecture · Large-Scale Robot Demonstration Dataset · Robotic Action Learning
Octo is an open-source, transformer-based generalist robot policy pretrained on 800K trajectories. It supports fast fine-tuning across diverse sensors and robots, can be guided by language or images, and demonstrates strong generalization on nine robot platforms.
Large Language Diffusion Models
Published: 2/14/2025
Tags: Large Language Diffusion Models · Auto-Regressive Diffusion Model · Large Language Model Fine-Tuning · Transformer architecture · Probabilistic Inference Generation
LLaDA, a diffusion-based large language model, uses masking and reverse generation with Transformers to predict tokens, optimizing a likelihood bound. It matches autoregressive baselines on diverse tasks and excels at in-context learning, demonstrating the promise of diffusion models for scalable language modeling.
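A hedged sketch of the general masked-diffusion training recipe described in the summary (the exact loss weighting and schedule used by LLaDA may differ; `model`, `mask_id`, and the 1/t weighting here are illustrative): sample a masking ratio, corrupt the sequence, and train the Transformer to recover the masked tokens.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id, eps=1e-3):
    """One training step of a masked-diffusion language model (generic recipe).

    Sample a masking ratio t, replace that fraction of tokens with [MASK], and
    train the bidirectional Transformer `model` (assumed to return per-token
    logits) to recover the originals at masked positions, weighting the
    cross-entropy by 1/t as in the diffusion likelihood bound.
    """
    B, L = tokens.shape
    t = torch.rand(B, 1, device=tokens.device).clamp(min=eps)   # masking ratio per sequence
    is_masked = torch.rand(B, L, device=tokens.device) < t      # mask each token w.p. t
    corrupted = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted)                                    # (B, L, vocab)
    token_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), tokens.view(-1), reduction="none"
    ).view(B, L)
    weighted = token_loss * is_masked / t                        # only masked positions count
    return weighted.sum() / (B * L)
```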
The Devil in Linear Transformer
Published: 10/19/2022
Tags: Transformer architecture · Long-Context Modeling · Sparse Attention Efficiency · Transformer-Based Efficient Forward Prediction
This paper identifies unbounded gradients and attention dilution as key flaws in kernel-based linear transformers, then introduces TransNormer, which stabilizes training via normalized attention and enhances local focus with diagonal attention, achieving superior accuracy and efficiency.
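A rough sketch of the normalized-attention idea under stated assumptions (non-causal, ReLU feature map; `NormLinearAttention` is a hypothetical name, not the paper's code): the per-query denominator of linear attention is dropped and the output is passed through LayerNorm instead, which is the stabilization the summary refers to; the diagonal (local) attention branch is omitted.

```python
import torch
import torch.nn as nn

class NormLinearAttention(nn.Module):
    """Sketch of normalized linear attention (non-causal, single block).

    Standard linear attention divides by the sum of kernelized keys, which the
    paper links to unbounded gradients; here that division is dropped and the
    output is normalized with LayerNorm instead.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.norm = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, L, dim)
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, L, self.heads, self.dh).transpose(1, 2) for t in (q, k, v))
        q, k = torch.relu(q), torch.relu(k)                       # non-negative feature map
        kv = k.transpose(-2, -1) @ v                              # (B, H, dh, dh) summary
        y = (q @ kv).transpose(1, 2).reshape(B, L, -1)            # no per-query denominator
        return self.out(self.norm(y))                             # LayerNorm replaces scaling
```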