SCALING LARGE LANGUAGE MODELS FOR NEXT-GENERATION SINGLE-CELL ANALYSIS
TL;DR Summary
This study introduces a novel approach using the Cell2Sentence framework to convert single-cell RNA sequencing data into textual 'cell sentences,' training large language models on over a billion tokens. Scaling to 27 billion parameters yielded consistent gains in predictive and generative performance, as well as in downstream tasks that require reasoning across multicellular contexts, such as perturbation response prediction and biological question answering.
Abstract
Single-cell RNA sequencing has transformed our understanding of cellular diversity, yet current single-cell foundation models (scFMs) remain limited in their scalability, flexibility across diverse tasks, and ability to natively integrate textual information. In this work, we build upon the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as textual “cell sentences,” to train Large Language Models (LLMs) on a corpus comprising over one billion tokens of transcriptomic data, biological text, and metadata. By scaling model size to 27 billion parameters, we observe consistent improvements in predictive and generative capabilities, as well as the capacity for advanced downstream tasks requiring synthesis of information across multicellular contexts. Through targeted fine-tuning supported by modern reinforcement learning techniques, our approach excels in tasks such as perturbation response prediction, natural language interpretation, and complex biological reasoning. By unifying transcriptomic and textual data at unprecedented scales, this approach not only surpasses both specialized single-cell models and general-purpose LLMs, but also establishes a powerful platform for next-generation single-cell analysis, paving the way for the development of “virtual cells.”
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
SCALING LARGE LANGUAGE MODELS FOR NEXT-GENERATION SINGLE-CELL ANALYSIS
1.2. Authors
The paper lists numerous authors primarily affiliated with Yale University and Google Research/Google DeepMind.
- Lead/Corresponding Authors: Syed Asad Rizvi (Yale University, Google Research), Shekoofeh Azizi (Google DeepMind), Bryan Perozzi (Google Research), David van Dijk (Yale University).
- Other Key Contributors: Daniel Levine, Aakash Patel, Shiyang Zhang, Eric Wang, Sizhuang He, David Zhang, Cerise Tang, Zhuoyang Lyu, Rayyan Darji, Chang Li, Emily Sun, David Jeong, Lawrence Zhao, Jennifer Kwan, David Braun, Brian Hafler, Jeffrey Ishizuka, Rahul M. Dhodapkar, Hattie Chung.
Their research backgrounds span various fields including computer science, artificial intelligence, bioinformatics, and biomedical research, reflecting the interdisciplinary nature of this work.
1.3. Journal/Conference
This paper is currently a preprint, indicated by "A PREPRINT - APRIL 15, 2025" and the common use of bioRxiv citations (e.g., [13, 14]). Preprints are scientific manuscripts posted online before peer review, allowing for early dissemination of research findings. While not yet peer-reviewed, this indicates active research in the field, often from reputable institutions like Yale University and Google Research, which are highly influential in AI and biomedical research.
1.4. Publication Year
April 15, 2025
1.5. Abstract
Single-cell RNA sequencing (scRNA-seq) has revolutionized the understanding of cellular diversity. However, existing single-cell foundation models (scFMs) face limitations in scalability, flexibility across diverse tasks, and native integration of textual information. This work addresses these challenges by building upon the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as textual "cell sentences." The authors trained Large Language Models (LLMs) on a massive corpus exceeding one billion tokens, comprising transcriptomic data, biological text, and metadata. By scaling model size up to 27 billion parameters, they observed consistent improvements in both predictive and generative capabilities, alongside enhanced capacity for advanced downstream tasks that require synthesizing information across multicellular contexts. Through targeted fine-tuning, supported by modern reinforcement learning (RL) techniques like Group Relative Policy Optimization (GRPO), their approach demonstrates superior performance in tasks such as perturbation response prediction, natural language interpretation, and complex biological reasoning. This unification of transcriptomic and textual data at unprecedented scales not only surpasses both specialized single-cell models and general-purpose LLMs but also establishes a powerful platform for next-generation single-cell analysis, laying the groundwork for the development of "virtual cells."
1.6. Original Source Link
The original source link is /files/papers/6915ae034d6b2ff314a02e95/paper.pdf. This is a local path, and the paper explicitly states it is a preprint.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a crucial problem in single-cell analysis: despite the transformative power of single-cell RNA sequencing (scRNA-seq) in revealing cellular diversity, existing single-cell foundation models (scFMs) are limited. These limitations manifest in three key areas:
- Scalability: Current models struggle to effectively scale to the ever-increasing volume and complexity of single-cell data.
- Flexibility Across Diverse Tasks: They often lack the versatility to perform a broad range of downstream biological analyses without significant architectural modifications or retraining.
- Native Integration of Textual Information: Critically, they generally cannot inherently incorporate rich biological text, metadata, and annotations, which are vital for contextualizing and interpreting genomic data.
This problem is important because scRNA-seq generates vast amounts of data, and extracting meaningful biological insights requires models that can handle this scale and integrate all available information, including the extensive existing biological literature. Prior transcriptomic foundation models like scGPT [4], Geneformer [5], scFoundation [6], and scGeneT [7] have shown promise but still face these challenges.
The paper's innovative idea is to leverage the robust scaling behavior and flexible architecture of Large Language Models (LLMs) by transforming high-dimensional scRNA-seq data into a textual format compatible with LLMs. This is achieved through the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as sequences of gene names ordered by expression level, effectively creating "cell sentences." This approach positions single-cell data within the LLM framework, offering inherent advantages in scalability and flexibility compared to specialized model architectures.
2.2. Main Contributions / Findings
The primary contributions of this work, C2S-Scale, are multifaceted and establish a new paradigm for single-cell analysis:
- Scaling Single-Cell Analysis with LLMs:
  - Larger Model Capacity: Introduction of the C2S-Scale family, with models ranging from 410 million to 27 billion parameters, based on Gemma-2 [15] and Pythia [16] architectures. This significantly increases model capacity to capture complex biological relationships.
  - Increased Performance at Scale: Demonstration of clear scaling laws for LLMs in single-cell analysis, showing consistent performance improvements in both predictive and generative tasks as model size increases.
  - Massive Data Size and Multimodality: Training on a 1-billion-token multimodal corpus from over 50 million human and mouse cells, integrating transcriptomic data with corresponding biological text (e.g., paper abstracts) and metadata. This aligns single-cell data with natural language and biological context.
  - Long-Context, Multi-Cell Capabilities: Support for extended context lengths (up to 8192 tokens) enables processing of comprehensive multimodal and multi-cell inputs simultaneously, facilitating analysis of cellular interactions and complex biological processes.
  - Diverse Downstream Applications: C2S-Scale excels across a significantly broader range of challenging biological reasoning tasks, including perturbation prediction, nuanced natural language interpretation of single-cell data, and complex question answering.
- Reinforcement Learning for Enhanced Performance: The paper leverages Group Relative Policy Optimization (GRPO) [17], a modern reinforcement learning technique, to further refine C2S-Scale for targeted single-cell tasks, demonstrating significant performance improvements in question answering and perturbation response prediction.
- A Novel Metric for Evaluating Single-Cell Generative Models: Introduction of the single-cell Fréchet Inception Distance (scFID), which leverages single-cell foundation model embedding space to assess the quality of generated cells in a biologically meaningful way, unlike traditional expression-level metrics.
- Open-Source Models and Resources: The authors commit to releasing their code and model weights, along with transcriptomic-language integrated datasets and prompts, to foster community research and development.

In summary, C2S-Scale unifies transcriptomic and textual data at unprecedented scales using LLMs, surpassing both specialized single-cell models and general-purpose LLMs. It provides a powerful platform for next-generation single-cell analysis, moving towards the concept of "virtual cells," where complex biological phenomena can be simulated and understood with greater depth.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand C2S-Scale, a foundational grasp of single-cell RNA sequencing, Large Language Models, and Reinforcement Learning is essential.
Single-Cell RNA Sequencing (scRNA-seq)
- Conceptual Definition: scRNA-seq is a high-throughput technology that measures gene expression levels in individual cells. Unlike bulk RNA-seq, which averages gene expression across a population of cells, scRNA-seq provides a snapshot of gene activity within each cell, revealing cellular heterogeneity and enabling the discovery of rare cell types, developmental trajectories, and cell-state transitions.
- Importance: It has revolutionized biology by allowing researchers to explore cell diversity in tissues, understand disease mechanisms, and identify new therapeutic targets.
- Data Characteristics: scRNA-seq data is typically high-dimensional (thousands of genes), sparse (many zero counts per cell), and noisy. The output is often represented as an expression matrix, where rows are genes and columns are cells, with entries representing the expression level of each gene in each cell.
Large Language Models (LLMs)
- Conceptual Definition: LLMs are a class of deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They learn complex patterns, grammar, and semantic relationships within language.
- Transformer Architecture [8]:
  - Attention Mechanism: The core innovation of Transformers is the attention mechanism, which allows the model to weigh the importance of different parts of the input sequence (e.g., words in a sentence) when processing each element.
  - Self-Attention: A specific type of attention where the model relates different positions of a single sequence to compute a representation of that sequence. For example, when translating "The animal didn't cross the street because it was too tired," self-attention helps the model determine that "it" refers to "animal."
  - Multi-Head Attention: Multiple attention mechanisms run in parallel, each focusing on different aspects of the relationships within the sequence, and their outputs are concatenated and linearly transformed.
  - Encoder-Decoder Structure: Original Transformers have an encoder (processes input) and a decoder (generates output). Many LLMs simplify this, often using a decoder-only structure for generative tasks (like generating text or, in this paper, cell sentences).
  - Tokenization: Text is broken down into tokens (words, sub-words, or characters).
  - Embeddings: Tokens are converted into high-dimensional numerical vectors (embeddings) that capture their semantic meaning. These embeddings are learned during training.
  - Positional Encoding: Since Transformers do not inherently understand word order, positional encodings are added to the embeddings to provide information about the relative or absolute position of tokens in the sequence.
- Pre-training: LLMs are typically pre-trained on massive text corpora using objectives like next token prediction (predicting the next word in a sequence given the previous ones). This unsupervised learning phase allows them to acquire a broad understanding of language.
- Fine-tuning: After pre-training, LLMs are fine-tuned on smaller, task-specific datasets to adapt them to particular downstream applications (e.g., sentiment analysis, question answering).
- Scaling Laws [11]: LLMs exhibit predictable scaling laws, meaning that as model size (number of parameters), dataset size, and computational resources increase, their performance on various tasks consistently improves.
Reinforcement Learning (RL)
- Conceptual Definition: Reinforcement Learning is a paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It learns through trial and error, rather than explicit instruction.
- Reward Function: A critical component of RL that defines what constitutes a "good" or "bad" outcome for the agent.
- Policy Optimization: RL algorithms aim to find an optimal policy, a mapping from observed states of the environment to actions, that maximizes the expected reward.
3.2. Previous Works
The paper builds upon and contrasts its C2S-Scale framework with several key prior studies:
Cell2Sentence (C2S) Framework [13, 14]
- Core Idea: The foundational work for C2S-Scale. C2S proposes to represent scRNA-seq profiles as textual "cell sentences." This is achieved by:
  - Taking a cell's gene expression profile (a vector of expression values for thousands of genes).
  - Rank-ordering the genes by their expression levels from highest to lowest.
  - Selecting the top k most highly expressed genes.
  - Concatenating the names of these genes into a sentence (e.g., "geneA geneB geneC ...").
- Purpose: This transformation makes scRNA-seq data compatible with LLMs, treating gene names as tokens and cell sentences as sequences that LLMs can process.
- Why it's important for C2S-Scale: C2S-Scale is explicitly described as "the next generation of C2S models," significantly improving upon the original paradigm in scale, model capacity, and downstream applications. The original C2S demonstrated that this textual transformation retains sufficient biological information (as shown in Figure 9, where original gene expression can be predicted from rank with high R²).
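To make the rank-ordering transformation concrete, here is a minimal Python sketch of the idea described above. It assumes a normalized expression vector and a matching list of gene names; the helper name `cell_to_sentence`, the toy genes, and the top-k cutoff are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def cell_to_sentence(expr: np.ndarray, gene_names: list[str], k: int = 100) -> str:
    """Minimal C2S-style sketch: rank genes by expression (descending) and
    concatenate the names of the top-k expressed genes into one string."""
    order = np.argsort(-expr, kind="stable")                 # indices by descending expression
    top = [gene_names[i] for i in order[:k] if expr[i] > 0]  # keep only expressed genes
    return " ".join(top)

# Toy example: 6 genes, one cell
genes = ["CD3D", "TP53", "GAPDH", "MS4A1", "NKG7", "LYZ"]
expr = np.array([5.0, 0.0, 8.2, 1.1, 0.0, 3.3])
print(cell_to_sentence(expr, genes, k=4))   # -> "GAPDH CD3D LYZ MS4A1"
```

Because only the rank order is kept, the transformation discards exact magnitudes, which is why the paper's Figure 9 analysis (recovering expression from rank) matters.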
Existing Single-Cell Foundation Models (scFMs)
- scGPT [4]: A transcriptomic foundation model that represents genes as tokens and uses a masked gene modeling objective. It has shown promise in modeling single-cell transcriptomic data but, like others, faces limitations in scalability, flexibility, and native textual integration compared to LLMs. scGPT is used in C2S-Scale as an embedding space for the scFID metric.
- Geneformer [5]: Another transcriptomic foundation model that also rank-orders genes but uses a masked modeling objective. It aims to learn generalizable representations of single-cell data.
- scFoundation [6] and scGeneT [7]: Other contemporary scFMs that have shown progress in modeling single-cell transcriptomic data at scale but are still constrained by the inherent architectural limitations that C2S-Scale aims to overcome through LLM integration. scGeneT specifically focuses on perturbation modeling.
General-Purpose LLMs
- Llama [20, 21, 26], GPT-4o [22], Gemini [23], Meditron [27], BioMistral [28]: These are state-of-the-art general LLMs that C2S-Scale uses as benchmarks, particularly for natural language interpretation and question answering tasks. The paper demonstrates that C2S-Scale, while specialized, outperforms these general LLMs in domain-specific biological reasoning, highlighting the benefit of its specialized pre-training and C2S transformation.
- Example: Attention is All You Need [8]: This seminal paper introduced the Transformer architecture, which is the backbone for LLMs like Gemma-2 and Pythia used in C2S-Scale. The attention mechanism's core formula is:
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
Where:
  - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices representing the input sequence after linear transformations.
  - $QK^T$ calculates the dot product similarity between queries and keys.
  - $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the key vectors, used to prevent the dot products from becoming too large, which can push the softmax function into regions with very small gradients.
  - softmax converts the scores into probabilities, determining the weight of each Value vector.
  - The result is a weighted sum of the Value vectors, allowing the model to focus on relevant parts of the input. C2S-Scale leverages this mechanism to identify key genes or biological contexts within cell sentences.
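As a quick illustration of the scaled dot-product attention defined above, here is a small NumPy sketch for a single head; the array shapes and toy values are assumptions for demonstration, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy example: 3 tokens (e.g., gene names), embedding dimension 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```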
Reinforcement Learning in NLP
- Group Relative Policy Optimization (GRPO) [17]: This RL technique is employed by C2S-Scale to align LLM outputs with user preferences or domain-specific criteria. GRPO optimizes the model's policy by comparing multiple candidate outputs and reinforcing those that are ranked higher by a reward model. This is crucial for refining C2S-Scale's generative capabilities in complex biological tasks.
- BERTScore [25]: A metric used to evaluate the quality of generated text by comparing it to reference text using contextual embeddings from BERT models. C2S-Scale uses BioBERTScore (a BERTScore variant adapted for biological text) as a reward function for GRPO to guide the LLM towards biologically relevant and high-quality answers.
Generative Model Evaluation
- Fréchet Inception Distance (FID) [48]: A standard metric for evaluating the quality of images generated by Generative Adversarial Networks (GANs). It measures the distance between the feature distributions of real and generated images in a high-dimensional feature space (typically using Inception-v3 network features [49]).
- single-cell Fréchet Inception Distance (scFID): C2S-Scale introduces this novel metric as an adaptation of FID for single-cell data. Instead of Inception-v3 features, scFID uses features (embeddings) from a single-cell foundation model (like scGPT) to compare real and generated cell embeddings in a biologically meaningful space.
3.3. Technological Evolution
The evolution of single-cell analysis has moved from basic clustering and visualization of scRNA-seq data to sophisticated computational models that aim to capture underlying biological processes. Initially, bioinformatics tools focused on data processing, dimension reduction, and statistical analysis. More recently, deep learning and foundation models have emerged:
- Early Models: Often specialized neural networks or autoencoders for dimension reduction, imputation, or cell type annotation.
- Specialized scFMs: Models like scGPT and Geneformer represented a step forward by attempting to learn generalizable representations from large scRNA-seq datasets, acting as "foundation models" for single-cell biology. However, these were still primarily designed for numerical, transcriptomic data and lacked native language processing capabilities.
- The LLM Paradigm Shift: The success of LLMs in natural language processing, coupled with their robust scaling laws, inspired researchers to adapt them for other domains. The C2S framework was a crucial step in this direction, enabling scRNA-seq data to be framed as a "language."
- C2S-Scale's Position: C2S-Scale represents the cutting edge by fully embracing the LLM paradigm, scaling model capacity, integrating diverse multimodal data (transcriptomics + text + metadata), and employing advanced RL fine-tuning. It attempts to bridge the gap between purely numerical biological data and human-readable biological knowledge, making it possible for LLMs to perform complex biological reasoning directly. This positions C2S-Scale as a leader in next-generation single-cell analysis, aiming for a future with "virtual cells."
3.4. Differentiation Analysis
C2S-Scale differentiates itself from previous works through several key innovations:
- From Specialized Architectures to LLM-Native:
  - Prior scFMs (e.g., scGPT, Geneformer): Often employ specialized Transformer-like architectures or custom neural network designs tailored for scRNA-seq data. While effective, these designs can be less flexible and harder to scale to the massive parameter counts seen in general LLMs. They also typically require significant architectural modifications for different tasks or modalities.
  - C2S-Scale: Leverages general-purpose LLM architectures (Gemma-2, Pythia) directly. The Cell2Sentence transformation is the key; it converts scRNA-seq into a textual format, making it inherently compatible with LLMs without requiring custom architectural modifications. This allows C2S-Scale to benefit from the LLM ecosystem's rapid advancements in scaling, efficiency, and multimodality.
- Native Multimodal Integration (Transcriptomics + Text):
  - Prior scFMs: While some models might incorporate annotations or metadata as numerical features, they generally do not natively integrate free-form biological text (e.g., paper abstracts) with transcriptomic data in a unified language model framework.
  - C2S-Scale: Explicitly trains on a massive corpus that includes both cell sentences (representing transcriptomics) and corresponding biological text and metadata. This allows the LLM to learn the semantic relationships between gene expression patterns and their textual descriptions, enabling natural language interpretation and biological reasoning tasks that are beyond the scope of traditional scFMs.
- Unprecedented Scale:
  - Prior scFMs: Typically operate at scales of hundreds of millions to a few billion parameters.
  - C2S-Scale: Scales up to 27 billion parameters, a significant increase that allows it to capture more complex and subtle biological relationships, aligning with observed scaling laws in LLMs.
- Enhanced Flexibility and Task Versatility:
  - Prior scFMs: Often excel at specific tasks (e.g., cell type annotation, perturbation prediction) but may require re-training or significant fine-tuning for new, diverse tasks.
  - C2S-Scale: Demonstrates versatility across a broad spectrum of tasks, from predictive (cell type annotation, perturbation prediction) and generative (conditional cell generation) to complex natural language interpretation and biological reasoning (question answering, cluster captioning), all within a unified framework.
- Leveraging Reinforcement Learning for Alignment:
  - Prior scFMs: Primarily rely on supervised or unsupervised learning objectives.
  - C2S-Scale: Incorporates Reinforcement Learning (GRPO) to further align the model's outputs with desired biological outcomes and expert preferences, significantly improving performance on complex generative and interpretive tasks like question answering and perturbation response prediction.
- Novel Evaluation Metric:
  - Prior scFMs: Generative models are often evaluated using expression-level metrics (which can be noisy) or metrics adapted from other domains.
  - C2S-Scale: Introduces scFID, a domain-specific metric that leverages scFM embedding space to provide a more biologically meaningful assessment of generated single-cell data.

In essence, C2S-Scale is not just another scFM but rather a re-conceptualization of single-cell analysis using the full power and flexibility of LLMs, specifically designed to integrate the rich context of biological language with high-dimensional transcriptomic data.
4. Methodology
4.1. Principles
The core principle behind C2S-Scale is to transform single-cell RNA sequencing (scRNA-seq) data into a format that Large Language Models (LLMs) can natively understand and process. This "textualization" of scRNA-seq profiles, termed "cell sentences," allows the vast capabilities of LLMs (such as understanding context, generating coherent sequences, and performing complex reasoning) to be applied directly to biological data. By treating gene names as tokens and ordered gene lists as sentences, C2S-Scale leverages LLM architectures to learn complex biological relationships, integrate multimodal information (transcriptomics, text, metadata), and perform a wide range of downstream tasks, overcoming the scalability and flexibility limitations of traditional scFMs.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Data Collection and Preparation
The first step involves assembling a massive and diverse corpus for pre-training.
- Scale: Over 50 million single-cell transcriptomic profiles from human and mouse tissues.
- Sources: Publicly available single-cell atlases such as CellxGene [2] and the Human Cell Atlas [3].
- Multimodal Integration: Crucially, this dataset is enriched with associated annotations, metadata, and paper abstracts from biological publications. This ensures the model learns to align transcriptomic data with natural language and biological context. This multimodal corpus totals over 1 billion tokens.
- Task Generation: This corpus is then curated into over 150 million multi-task training samples, enabling the LLM to learn diverse tasks while simultaneously integrating various forms of information.
4.2.2. Cell Sentence Transformation
This is the pivotal step that enables the integration of scRNA-seq data with LLMs.
- Input: For each cell, the raw input is an expression vector $X = (X_1, \dots, X_G)$, where $X_k$ denotes the normalized expression value of gene $k$ in that cell, and $G$ is the total number of genes.
- Process: The cell sentence for $X$ is constructed by:
  - Rank-ordering the genes within that cell based on their expression levels in descending order.
  - Selecting the top $k$ most highly expressed genes.
- Formula: If $i_1, i_2, \dots, i_G$ is a list of indices from 1 to $G$ sorted in descending order based on expression level in $X$, then the cell sentence transformation is defined as:
$ S(X) = \langle \mathrm{SOS} \rangle \; \mathrm{name}(i_1) \; \mathrm{name}(i_2) \; \cdots \; \mathrm{name}(i_k) \; \langle \mathrm{EOS} \rangle $
Where:
  - $S(X)$ represents the cell sentence generated from the expression vector $X$.
  - $i_1, \dots, i_k$ are the indices of the genes corresponding to the first, second, ..., $k$-th most highly expressed genes in cell $X$.
  - $\mathrm{name}(\cdot)$ is a function that maps a gene index to its official gene name (e.g., "TP53," "CD3D").
  - $\langle \mathrm{SOS} \rangle$ and $\langle \mathrm{EOS} \rangle$ are special start and end tokens for the cell sentence, indicating its boundaries to the LLM.
- The paper notes that this transformation is reversible with minimal information loss, meaning the original gene expression values can be recovered from the rank order with high accuracy (Figure 9 illustrates this with an R² of 85% for a linear model predicting original expression from rank).
- Benefits: This transformation offers two primary benefits:
  - LLM Compatibility: It allows immediate application of any LLM architecture without requiring custom modifications.
  - Biological Meaning Retention: The rank-ordering process naturally captures crucial relative expression information, which LLMs can then learn to represent.
4.2.3. LLM Architecture and Components
C2S-Scale utilizes LLMs based on the Transformer architecture [8], specifically decoder-only Transformer architectures, which are well-suited for causal language modeling and generative tasks. The base architectures used are Gemma-2 [15] and Pythia [16].
- Key Architectural Components:
  - Word Embedding: LLMs represent input sequences as high-dimensional vectors (word embeddings). In C2S-Scale, each gene name in a cell sentence is treated as a token and converted into an embedding. This is achieved by the LLM's tokenizer and an embedding layer trained alongside the model. These embeddings capture the semantic information of genes, informed by both biological context and learned co-occurrence patterns.
  - Attention Mechanism: The attention mechanism (specifically self-attention [8]) is central. It allows the model to dynamically weigh the importance of different genes within a cell sentence or across multiple cell sentences and contextual tokens. This enables the model to identify long-range dependencies in gene expression and how genes relate to specific biological tasks (e.g., key marker genes for cell type prediction, perturbation-associated genes for response prediction).
  - Feedforward Networks: Each attention layer is followed by feedforward networks that apply non-linear transformations to enhance feature extraction.
  - Residual Connections and Layer Normalization: These components are critical for stabilizing training and facilitating gradient flow, enabling effective scaling to large parameter sizes.
4.2.4. Pre-training (Two-Phase Approach)
The training process for C2S-Scale involves two-phase large-scale pre-training followed by additional fine-tuning for specific tasks.
- Pre-training Objective: The primary pre-training objective for LLMs is next token prediction: predicting the subsequent token in a sequence given the preceding tokens. In the context of cell sentences, this means predicting the next gene name in the rank-ordered sequence based on the previously seen genes. This objective forces the model to capture complex dependencies and relationships within the cell sentences.
- Corpus: The multimodal corpus described in Section 4.2.1 is used. It integrates cell sentences with associated metadata and textual annotations.
- Multi-Task Learning: C2S-Scale employs multi-task learning during pre-training. This means the model is exposed to a variety of task instructions and corresponding cell sentences or textual outputs. This allows the LLM to learn general representations and adapt to different scRNA-seq analysis tasks by following prompt instructions. The following table, adapted from the paper, illustrates the multi-task prompt formats used during pre-training to enable diverse learning (a hedged prompt-construction sketch is shown after this list):

| Task name | Type | Input information | Target output | Metric |
| --- | --- | --- | --- | --- |
| Single cell language modeling | Single-cell | — | Single cell sentence | Overlap % |
| Cell type annotation | Single-cell | Single cell sentence | Cell type | BertScore |
| Conditional cell generation | Single-cell | Cell type of one cell | Single cell sentence | Overlap % |
| Multiple cell language modeling | Multi-cell | — | Multiple cell sentences | Overlap % |
| Tissue sample annotation | Multi-cell | Multiple cell sentences | Tissue label | BertScore |
| Sample cell type(s) annotation | Multi-cell | Multiple cell sentences | Cell types of multiple cells | BertScore |
| Conditional sample generation (tissue) | Multi-cell | Tissue annotation | Multiple cell sentences | Overlap % |
| Conditional sample generation (cell type) | Multi-cell | Cell types of multiple cells | Multiple cell sentences | Overlap % |
| Conditional sample generation (abstract) | Multi-cell | Paper abstract | Multiple cell sentences | Overlap % |
| Natural language interpretation | Multi-cell | Multiple cell sentences | Paper abstract | BertScore |
| Gene set enumeration | Gene set | Gene set name | List of genes in gene set | Overlap % |
| Gene set naming | Gene set | List of genes in gene set | Gene set name | BertScore |

- Optimization: The AdamW optimizer and gradient checkpointing are used to efficiently manage computational resources for models ranging from 1 billion to 27 billion parameters.
- Tools: Huggingface [59] and PyTorch [60] are used for models up to 1B parameters, while Jax and TPU-based compute are employed for larger models (1B to 27B parameters).
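To make the multi-task prompt formats above concrete, here is a minimal Python sketch that assembles instruction-style prompts from a cell sentence. The exact instruction wording and the `build_prompt` helper are illustrative assumptions, not the authors' released prompt templates.

```python
def build_prompt(task: str, task_input: str) -> str:
    """Assemble an instruction-style training prompt for one single-cell task.
    Template wording is illustrative; C2S-Scale's actual templates may differ."""
    templates = {
        "cell_type_annotation": (
            "The following cell sentence lists genes from highest to lowest "
            "expression. Predict the cell type.\nCell sentence: {x}\nCell type:"
        ),
        "conditional_cell_generation": (
            "Generate a cell sentence (genes ordered by decreasing expression) "
            "for the following cell type.\nCell type: {x}\nCell sentence:"
        ),
    }
    return templates[task].format(x=task_input)

print(build_prompt("cell_type_annotation", "CD3D CD3E TRAC IL7R LTB ..."))
```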
4.2.5. Post-training Methods (Fine-tuning and Reinforcement Learning)
After pre-training, C2S-Scale models are further refined using Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
- Supervised Fine-Tuning (SFT):
  - Objective: To adapt the pre-trained model to specific downstream tasks or new datasets.
  - Process: The model is fine-tuned using task-specific labeled data, often formatted with natural language prompts (e.g., "Predict the cell type of this cell sentence: [cell sentence]").
  - LoRA (Low-Rank Adaptation) [61]: For parameter-efficient fine-tuning, LoRA is employed. This technique injects low-rank matrices into the Transformer layers, allowing adaptation to new tasks with significantly fewer trainable parameters compared to full fine-tuning. This makes fine-tuning more computationally efficient and reduces storage requirements for task-specific models.
- Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) [17]:
  - Objective: To further enhance performance on generative and interpretative tasks by aligning LLM outputs with human preferences or specific biological criteria.
  - GRPO Process (a schematic sketch follows this subsection):
    - Candidate Generation: The SFT model generates multiple candidate outputs (e.g., answers to a question, perturbation responses) for each training example.
    - Reward Modeling/Ranking: Instead of human ranking, C2S-Scale leverages domain-specific criteria and automated metrics (e.g., BioBERTScore [25] for text quality, scFID for cell generation quality) to assess the quality of these candidate outputs. These metrics implicitly provide a ranking or preference signal.
    - Policy Optimization: GRPO then fine-tunes the model to favor outputs that receive higher scores (i.e., higher-quality and more biologically aligned responses). It optimizes the model's policy to generate better responses.
  - Advantages over other RL methods (e.g., PPO [62]): GRPO offers a more streamlined workflow, as it doesn't require complex reward models or extensive human feedback; it can directly use automated metrics for ranking. This makes it particularly effective for domain-specific LLM applications.
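The following is a minimal, hedged Python sketch of the group-relative reward step described above: sample several candidates per prompt, score them with an automated reward, and compute group-normalized advantages. The `sample_candidates` and `reward_fn` stubs are assumptions for illustration; this does not reproduce the authors' training code or the full GRPO policy update.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: normalize each candidate's reward against the
    mean and standard deviation of its own sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_step(prompt: str, sample_candidates, reward_fn, group_size: int = 4):
    """One schematic GRPO iteration: generate a group, score it, and return
    (candidate, advantage) pairs; positive-advantage candidates are reinforced."""
    candidates = [sample_candidates(prompt) for _ in range(group_size)]
    rewards = np.array([reward_fn(prompt, c) for c in candidates], dtype=float)
    return list(zip(candidates, group_relative_advantages(rewards)))

# Toy usage with stub functions (placeholders for the LLM and a BioBERTScore-like reward)
fake_sampler = lambda p: f"answer-{np.random.randint(1000)}"
fake_reward = lambda p, c: float(len(c))
print(grpo_step("Which cell type is described?", fake_sampler, fake_reward))
```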
4.2.6. Long-Context and Multi-Cell Capabilities
- Context Length: C2S-Scale models support extended context lengths of up to 8192 tokens.
- Benefits: This allows for:
  - Comprehensive Multimodal Input: Integration of diverse contextual information such as biological annotations, manuscript text, perturbation conditions, and detailed task-specific instructions alongside cell sentences.
  - Multi-Cell Analysis: The model can process and generate data for multiple cells simultaneously. This is crucial for analyzing cellular interactions, spatial relationships, and complex biological processes that involve multicellular contexts.
4.2.7. Diverse Downstream Applications
The C2S-Scale framework is evaluated and fine-tuned on a broad range of challenging biological tasks:
- Predictive Tasks:
  - Cell type annotation: Predicting the cell type given a cell sentence.
  - Perturbation response prediction: Predicting how gene expression changes in response to specific perturbations.
- Generative Tasks:
  - Conditional cell generation: Generating cell sentences (representing gene expression profiles) under specific conditions (e.g., a particular cell type).
  - Cluster captioning: Generating natural language descriptions for groups of cells.
- Natural Language Interpretation and Reasoning Tasks:
  - Dataset interpretation: Summarizing the biological findings of an entire scRNA-seq dataset.
  - Question Answering (QA): Answering complex biological questions based on scRNA-seq data.
  - Spatial reasoning: Predicting spatial relationships and niches from multi-cell context.
  - Gene set enumeration and naming: Identifying genes belonging to a set, or naming a set based on its genes.

This wide array of applications showcases the versatility and power of the LLM-based approach for single-cell analysis.
The following figure (Figure 2 from the original paper) summarizes the C2S-Scale framework, showing its multimodal input, cell sentence transformation, and diverse outputs:
(Figure image) Schematic showing the integration of single-cell and bulk RNA sequencing data with biological and natural language annotations, the cell sentence generation workflow for different cell expression profiles, and model outputs for cell type prediction, perturbation response prediction, conditional cell generation, and question answering.
Figure 2: C2S-Scale bridges scRNA-seq data and natural language by training LLMs to perform single-cell analysis tasks on diverse multimodal data. (A) A multimodal corpus of over 50 million human and mouse transcriptomes is paired with biological text, including papers, gene sets, and disease labels from scRNA-seq studies. (B) C2S rank-orders genes by expression and converts them to natural language "cell sentences", leveraging powerful LLM architectures without the need for custom Transformer models. The output models are then fine-tuned for diverse downstream tasks, including cell type prediction, perturbation prediction, generative tasks, and advanced biological reasoning tasks such as question answering.
5. Experimental Setup
5.1. Datasets
C2S-Scale utilizes a massive and diverse collection of datasets for both pre-training and fine-tuning/evaluation.
Pre-training Corpus
- Composition: Over 50 million single-cell transcriptomic profiles from human and mouse tissues.
- Sources: Primarily gathered from publicly available single-cell atlases, including CellxGene [2] and the Human Cell Atlas [3].
- Multimodality: This corpus is rich, encompassing:
  - Transcriptomic data: The scRNA-seq expression profiles themselves.
  - Associated annotations: Cell type labels, tissue labels, experimental conditions.
  - Metadata: Information about the samples, donors, and experimental setup.
  - Biological text: Paper abstracts and other free-text descriptions related to the studies.
- Scale: This multimodal data is tokenized to form a corpus of over 1 billion tokens.
Downstream Task-Specific Datasets
- Cell Type Annotation:
  - An immune tissue dataset [63] and a lung dataset [19].
  - For each, 80% of cells were used for training and 20% for evaluation.
- Multi-Cell Integration:
  - To evaluate multicellular context and spatial reasoning, the CosMx Spatial Molecular Imager Human Liver dataset [35] was used. This dataset provides spatially-resolved single-cell data from both normal and tumor liver tissues across different donors, encompassing over 800,000 single cells. Neighborhoods containing fewer than three cells, and cells expressing fewer than 50 genes, were filtered out. The data was normalized, and spatial coordinates were embedded using UMAP [64] or PHATE [65].
  - External biological knowledge for spatial reasoning: CellPhoneDB [36] (receptor-ligand interactions) and BioGRID [37] (protein-protein interaction data). Data was restricted to interactions involving the 1,000 genes in the CosMx dataset and to extracellular proteins.
- Question Answering (QA):
  - A custom QA dataset was generated using GPT-4.5 [22]. For each scRNA-seq study, GPT-4.5 was prompted to generate meaningful question-answer pairs from three sections of the manuscript (abstract, discussion, results) and sampled data from the study. Each study yielded 100 QA pairs.
- Perturbation Response Prediction:
  - The Dong et al. dataset [47]: Contains immune cells exposed to individual and combinatorial cytokines, with 133 conditions. Gene expression changes in response to these perturbations are the target for prediction.
  - The L1000 dataset [46]: A connectivity map dataset used for gene expression profiling in response to various chemical and genetic perturbations.
- Cluster Captioning and Dataset Interpretation:
  - For cluster captioning, 30 scRNA-seq datasets were used, and natural language captions for cell clusters were generated using GPT-4.5 [22].
  - For dataset interpretation, the model was evaluated on scRNA-seq data and paper abstracts from 613 CellxGene datasets (in-distribution) and completely unseen datasets (out-of-distribution). The goal was to generate high-level summaries of scRNA-seq datasets from cell sentences.

The following figure (Figure 10 from the original paper) provides concrete examples of abstract summaries generated from scRNA-seq datasets, giving an intuitive sense of the textual data used for interpretation tasks:
Figure 10: Example abstract summaries from scRNA-seq datasets collected from CellxGene [2].
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a detailed explanation is provided:
Overlap %
- Conceptual Definition: This metric quantifies the percentage of tokens (gene names) that are common between a generated cell sentence or gene list and the ground-truth cell sentence or gene list. It is used primarily for generative tasks where the output is a sequence of genes (e.g., conditional cell generation, single cell language modeling, gene set enumeration). A higher Overlap % indicates better generative fidelity.
- Mathematical Formula:
$
\text{Overlap \%} = \frac{|\text{Generated Genes} \cap \text{Ground Truth Genes}|}{|\text{Ground Truth Genes}|} \times 100
$
Where:
  - Generated Genes is the set of genes in the model's output.
  - Ground Truth Genes is the set of genes in the reference (actual) output.
  - $|\cdot|$ denotes the cardinality (number of elements) of a set.
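A minimal Python sketch of this computation, assuming gene lists are compared as sets; the function name and toy gene lists are illustrative.

```python
def overlap_percent(generated: list[str], ground_truth: list[str]) -> float:
    """Percentage of ground-truth genes that also appear in the generated list."""
    gen, ref = set(generated), set(ground_truth)
    return 100.0 * len(gen & ref) / len(ref) if ref else 0.0

print(overlap_percent(["CD3D", "CD3E", "LYZ"], ["CD3D", "CD3E", "TRAC", "IL7R"]))  # 50.0
```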
BERTScore [25]
- Conceptual Definition: BERTScore is a metric for evaluating the quality of text generation by comparing candidate sentences to reference sentences. Unlike traditional metrics like BLEU or ROUGE that rely on exact word matches, BERTScore leverages contextual embeddings from BERT models to measure semantic similarity. It calculates precision, recall, and F1 scores based on the cosine similarity between embeddings of tokens in the candidate and reference sentences. A higher BERTScore indicates greater semantic similarity between generated and human-written text. The paper uses BioBERTScore, which is implicitly a version of BERTScore adapted or fine-tuned for biological text.
- Mathematical Formula: BERTScore computes precision (P), recall (R), and F1-score (F1). For a candidate sentence $c$ and a reference sentence $r$, BERTScore is calculated as:
$ P = \frac{1}{\sum_{c_i \in c} 1} \sum_{c_i \in c} \max_{r_j \in r} \left( \text{cosine\_similarity}(\text{embedding}(c_i), \text{embedding}(r_j)) \right) $
$ R = \frac{1}{\sum_{r_j \in r} 1} \sum_{r_j \in r} \max_{c_i \in c} \left( \text{cosine\_similarity}(\text{embedding}(r_j), \text{embedding}(c_i)) \right) $
$ F1 = 2 \times \frac{P \times R}{P + R} $
Where:
  - $c_i$ represents the $i$-th token in the candidate sentence.
  - $r_j$ represents the $j$-th token in the reference sentence.
  - $\text{embedding}(\cdot)$ is a function that maps a token to its contextual BERT embedding.
  - $\text{cosine\_similarity}$ measures the cosine of the angle between two embedding vectors.
  - The max operation indicates that each token in one sentence is compared to all tokens in the other sentence, and the highest similarity score is taken.
Maximum Mean Discrepancy (MMD)
- Conceptual Definition: MMD is a statistical test used to determine if two samples are drawn from the same distribution, often applied to evaluate the similarity between the distributions of real and generated data. In the context of single-cell data, it measures the "distance" between the distribution of real cell embeddings and the distribution of generated cell embeddings. A lower MMD score indicates that the generated data's distribution is closer to the real data's distribution.
- Mathematical Formula: MMD is defined using a kernel function $k(x, y)$. For two samples $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_m\}$ from distributions $P$ and $Q$ respectively, the empirical MMD squared is:
$ \mathrm{MMD}^2(X, Y) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n k(x_i, x_j) + \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^n \sum_{j=1}^m k(x_i, y_j) $
Where:
  - $X$ represents the set of real cell embeddings.
  - $Y$ represents the set of generated cell embeddings.
  - $n$ and $m$ are the number of samples in $X$ and $Y$, respectively.
  - $k(x, y)$ is a positive definite kernel function (e.g., Gaussian kernel) that implicitly maps data into a Reproducing Kernel Hilbert Space (RKHS).
Wasserstein Distance
- Conceptual Definition: Also known as Earth Mover's Distance (EMD), the Wasserstein distance measures the minimum "cost" of transforming one distribution into another. Intuitively, it is the minimum amount of "work" required to move a pile of dirt (one distribution) to match the shape of another pile (the other distribution). In single-cell analysis, it quantifies how much two distributions of cell embeddings (real vs. generated) differ. A smaller Wasserstein distance implies that the generated data distribution is very similar to the real data distribution.
- Mathematical Formula: For two probability distributions $\mu$ and $\nu$ on a metric space (M, d), the Wasserstein distance of order $p$ is defined as:
$ W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y) \right)^{1/p} $
Where:
  - $\mu$ is the distribution of real cell embeddings.
  - $\nu$ is the distribution of generated cell embeddings.
  - $\Gamma(\mu, \nu)$ is the set of all joint distributions whose marginals are $\mu$ and $\nu$ on the first and second components, respectively.
  - $d(x, y)$ is the distance metric between points $x$ and $y$ in the embedding space (e.g., Euclidean distance).
  - The infimum is taken over all possible ways to "transport" mass from $\mu$ to $\nu$.
Single-Cell Fréchet Inception Distance (scFID)
- Conceptual Definition: scFID is a novel metric introduced in this paper to evaluate the generative quality of single-cell models. It is an adaptation of the Fréchet Inception Distance (FID) [48] used in image generation. Instead of using features from Inception-v3 (an image classification model), scFID leverages embeddings from a single-cell foundation model (like scGPT) to compare the distributions of real and generated cell embeddings in a biologically meaningful feature space. A lower scFID score indicates that the generated cells are more biologically realistic and similar to real cells.
- Mathematical Formula: Given two sets of single-cell embeddings—one from real cells and one from generated cells—scFID is defined as:
$ \mathrm{scFID} = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right) $
Where:
  - $\mu_r$ and $\mu_g$ are the mean vectors of the real and generated cell embeddings, respectively. These capture the average position of the data cloud.
  - $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of the real and generated cell embeddings, respectively. These capture the spread and correlation structure of the data.
  - $\| \mu_r - \mu_g \|_2^2$ denotes the squared Euclidean distance (L2 norm) between the mean vectors.
  - $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix (sum of its diagonal elements).
  - $(\Sigma_r \Sigma_g)^{1/2}$ refers to the matrix square root of the product of the covariance matrices.
Kendall's τ (Tau)
- Conceptual Definition: Kendall's τ is a non-parametric statistic that measures the ordinal association between two ranked lists. It quantifies the similarity of the orderings of data when ranked by quantities. In perturbation response prediction, it can be used to compare the rank-ordering of gene expression changes (e.g., genes up- or down-regulated) between predicted and actual responses. A higher value (closer to 1) indicates stronger agreement in rank order (see the combined usage sketch after the Pearson section below).
- Mathematical Formula: For $n$ paired observations $(x_i, y_i)$, Kendall's τ is calculated as:
$
\tau = \frac{N_c - N_d}{\frac{1}{2}n(n-1)}
$
Where:
  - $N_c$ is the number of concordant pairs: pairs where the relative order of $x$ and $y$ is the same. For any two pairs $(x_i, y_i)$ and $(x_j, y_j)$ with $i < j$, if $x_i > x_j$ and $y_i > y_j$, or if $x_i < x_j$ and $y_i < y_j$, they are concordant.
  - $N_d$ is the number of discordant pairs: pairs where the relative order of $x$ and $y$ is different. If $x_i > x_j$ and $y_i < y_j$, or if $x_i < x_j$ and $y_i > y_j$, they are discordant.
  - $\frac{1}{2}n(n-1)$ is the total number of distinct pairs.
Pearson's r (Pearson Correlation Coefficient)
- Conceptual Definition: Pearson's r measures the linear correlation between two sets of data. It is a measure of the strength and direction of a linear relationship between two variables. In perturbation response prediction, it quantifies the linear relationship between predicted gene expression values and actual gene expression values after a perturbation. Values range from -1 (perfect negative linear correlation) to 1 (perfect positive linear correlation), with 0 indicating no linear correlation.
- Mathematical Formula:
$
r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}
$
Where:
  - $x_i$ and $y_i$ are individual data points (e.g., predicted and actual gene expression values for gene $i$).
  - $\bar{x}$ and $\bar{y}$ are the mean values of the respective datasets.
  - $n$ is the number of data points.
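Both correlation metrics are available in SciPy; here is a minimal sketch, assuming predicted and observed expression-change vectors for a handful of genes (the values are purely illustrative).

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr

# Toy predicted vs. observed expression changes for five genes
predicted = np.array([1.2, -0.4, 0.8, 0.1, -1.0])
observed = np.array([1.0, -0.2, 0.9, 0.3, -0.8])

tau, tau_p = kendalltau(predicted, observed)   # rank agreement (Kendall's tau)
r, r_p = pearsonr(predicted, observed)         # linear correlation (Pearson's r)
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")
```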
5.3. Baselines
C2S-Scale is compared against a comprehensive set of baselines, including specialized single-cell models, general-purpose LLMs, and other perturbation prediction models.
Specialized Single-Cell Models
- scGPT [4]: A transcriptomic foundation model that represents genes as tokens and uses a masked gene modeling objective. It is a key benchmark for scFMs.
- Geneformer [5]: Another scFM that rank-orders genes and uses a masked modeling objective.
- scGen [69]: A deep learning model designed for single-cell perturbation prediction and data integration.
- CellOT [47]: A deep learning model that uses optimal transport for modeling perturbation effects in single-cell data.
General-Purpose Large Language Models (LLMs)
- Llama [20, 21, 26]: A family of open-source LLMs from Meta, known for strong performance across various NLP tasks.
- GPT-4o [22] (and GPT-4.5): Advanced LLMs from OpenAI, recognized for their high capabilities in natural language understanding and generation. GPT-4.5 was also used for QA dataset generation.
- Gemini [23]: A family of multimodal LLMs from Google, offering strong performance in reasoning and generation.
- Meditron [27]: An open-source LLM specifically adapted for clinical practice in medical domains.
- BioMistral [28]: A collection of open-source LLMs pre-trained for medical domains, demonstrating adaptation of LLMs to biological/medical text.

These baselines represent the state-of-the-art in both specialized single-cell analysis and general LLM capabilities, allowing for a thorough evaluation of C2S-Scale's performance and generalization across diverse tasks.
6. Results & Analysis
6.1. Core Results Analysis
The results presented demonstrate C2S-Scale's superior performance, scalability, and versatility across a wide array of single-cell analysis tasks, often outperforming both specialized scFMs and general-purpose LLMs.
6.1.1. Performance Scaling Laws
The paper establishes scaling laws for LLMs in single-cell analysis, showing consistent improvements with increasing model size and data.
- Model Capacity: As model size increased from 410 million to 27 billion parameters (using Gemma-2 and Pythia architectures), C2S-Scale demonstrated significant improvements in both predictive and generative tasks. This indicates that more parameters allow the model to capture more complex biological relationships.
- Data Size: Performance also scaled positively with the number of training samples seen by the model.
- Parameter-Efficient Regimes: The improvements were observed even in parameter-efficient regimes (e.g., using LoRA for fine-tuning), highlighting the practical utility of scaling even with limited computational resources.

The following figure (Figure 4 from the original paper) illustrates the performance scaling of C2S-Scale models:
(Figure image) Schematic of the Cell2Sentence (C2S) modeling workflow for scRNA-seq data: the left panels illustrate cell sentence generation and associated tasks (including cell type identification and biological summaries), the right panels show conditional sample generation from tissue conditions and dataset abstracts, and the central panels (C, D, E) plot BERTScore performance for models of different parameter counts.
Figure 4: C2S-Scale demonstrates consistent scaling performance across a variety of single-cell analysis tasks. (A) C2S generates gene expression profiles and natural language prompts and responses. (B) Tasks in (A), colored by expression generation (red), predictive (blue), and language generation (green) tasks. (C) Performance scaling of fine-tuned C2S models on conditional sample generation, cell type annotation, tissue sample annotation, and dataset interpretation. (D) LoRA-finetuned C2S-Scale-2B and 27B models demonstrate performance scaling with increased model capacity in the parameter-efficient regime. (E) Performance scaling by number of training samples seen by C2S-Scale-27B.
6.1.2. State-of-the-Art Predictive and Generative Capabilities
C2S-Scale achieves state-of-the-art results across several fundamental single-cell analysis tasks.
- Cell Embeddings: C2S-Scale generates rich cell embeddings that consistently capture more biologically meaningful representations compared to other models, even without re-training on bulk data. This is attributed to the cell sentence transformation effectively preserving crucial biological information.
- Generative Tasks: Unlike most other transcriptomic foundation models, which require architectural modifications for generative tasks, C2S-Scale excels in generative tasks (e.g., perturbation response prediction, conditional cell generation) without such changes, leveraging its LLM architecture.
- Outperformance of SOTA LLMs: For tasks involving reasoning about scRNA-seq data (e.g., cluster captioning, dataset interpretation, question answering), C2S-Scale consistently outperforms leading general-purpose LLMs like Llama, GPT-4o, and Gemini.
- Generalization to Unseen Data: A key finding is C2S-Scale's ability to generalize effectively to completely unseen scRNA-seq datasets, producing relevant and informative summaries and predictions. This highlights its robust natural language understanding of scRNA-seq data.
6.1.3. Natural Language Interpretation at Multiple Scales of Biology
C2S-Scale demonstrates unique capabilities in natural language interpretation of scRNA-seq data, bridging raw transcriptomics with biological literature.
- Cluster Captioning: The model can automatically generate meaningful natural language captions for groups of cells from the same tissue and batch, achieving high BioBERTScore similarity with ground-truth captions even on unseen data clusters.
- Dataset Interpretation: At a broader scale, C2S-Scale can interpret entire scRNA-seq datasets and generate high-level summaries. It achieves the highest BERTScore among all evaluated models (including LLaMA, Meditron, BioMistral, Gemini, and GPT-4o) for this task, even on entirely unseen datasets. This capability allows researchers to gain biologically meaningful insights in natural language.

The following figure (Figure 5 from the original paper) showcases C2S-Scale's natural language interpretation abilities at multiple scales:
(Figure image) Schematic of the different scales of biology addressed, with experimental results for single-cell analysis tasks including cell type annotation, cluster captioning, and dataset interpretation. Panel B compares ground-truth versus predicted cell types; panels C, D, and E compare large language models on specific tasks using BERTScore.
Figure 5: C2S-Scale enables natural language interpretation of scRNA-seq data at multiple scales of biology. (A) Illustration of natural language interpretation at cell type, cluster, and dataset levels. (B) Ground-truth and predicted cell types for cells extracted from 6 different tissues across multiple human tumors [18]. C2S-Scale achieves high accuracy in this task. (C) Performance of C2S-Scale and baseline models on unseen scRNA-seq data clusters. Models are given multi-cell context from unseen data clusters and asked to caption them. Performance is measured by BioBERTScore. (D) C2S-Scale outperforms baselines for dataset interpretation on unseen abstracts from unseen scRNA-seq datasets. Error bars represent standard deviation across test set samples.
6.1.4. Spatial Reasoning from Multi-cell Context and Interaction Data
C2S-Scale learns spatial organization and cellular interactions without requiring architectural modifications, simply by processing multi-cell context.
- Neighborhood Prediction: By sampling and encoding cells from shared neighborhoods, C2S-Scale can infer spatial relationships. It significantly outperforms scGPT and GPT-4o in neighborhood prediction tasks on spatially-resolved scRNA-seq data (the CosMx dataset).
- Integration of External Biological Knowledge: The model effectively incorporates external biological knowledge, specifically gene interaction networks from CellPhoneDB (receptor-ligand interactions) and BioGRID (protein-protein interaction data). This integration, without predefining rules, further enhances spatial reasoning and niche prediction performance. C2S-Scale applies this knowledge dynamically, demonstrating improved performance when these data sources are added individually or combined.

The following figure (Figure 6 from the original paper) details C2S-Scale's spatial reasoning capabilities:
(Figure image) Schematic of the spatial tasks used for single-cell analysis, including cell niche prediction, neighbor generation, and neighborhood prediction; panel B lists the gene expression and protein-interaction databases used, and panel C compares spatial prediction accuracy.
Figure 6: C2S-Scale can interpret multi-cell spatial context and predict niche neighborhoods. (A) We train C2S-Scale on a variety of single-cell and multi-cellular spatial tasks designed to enable C2S-Scale to perform spatial reasoning. Tasks include niche label prediction, conditional neighbor generation, and spatial neighborhood prediction. (B) We include publicly available receptor-ligand and protein-protein interaction data from CellPhoneDB and BioGRID, restricted to extracellular proteins in the CosMx dataset. (C) C2S outperforms scGPT and GPT-4o in spatial neighborhood identification accuracy. Additionally, integrating gene interactions from BioGRID and CellPhoneDB individually improves performance, and their combination provides the greatest improvement. The results highlight the multi-task transfer potential of C2S-Scale for spatially-aware biological modeling.
6.1.5. Single-Cell Question Answering (QA) through Reinforcement Learning
C2S-Scale sets a new standard for single-cell QA, particularly with the aid of Reinforcement Learning.
- **Outperformance of SOTA LLMs:** `C2S-Scale` demonstrates superior single-cell question answering performance compared to state-of-the-art LLMs like Llama, GPT-4o, and Gemini.
- **GRPO Enhancement:** Group Relative Policy Optimization (GRPO) significantly improves `C2S-Scale`'s QA capabilities. By using BioBERTScore as a reward function, GRPO guides the model to produce higher-quality, biologically aligned answers, substantially outperforming the SFT-based model (a minimal sketch of the group-relative reward idea follows Figure 7 below).

The following figure (Figure 7 from the original paper) illustrates `C2S-Scale`'s QA performance:
This image is a schematic showing the C2S framework for analyzing single-cell RNA sequencing data. Panel A provides background and example cell sentences, panel B describes the Group Relative Policy Optimization process, and panel C compares the performance of C2S-Scale against other language models.
Figure 7: C2S-Scale demonstrates superior single-cell question answering performance compared to state-of-the-art (SOTA) LLMs. (A) Example QA scenario based on scRNA-seq data. (B) Overview of the GRPO framework [17], where the LLM is trained to improve predicted responses based on a reward model. (C) Comparison of C2S-Scale vs. SOTA LLMs on single-cell QA tasks, highlighting C2S-Scale's advantage in domain-specific reasoning. Error bars represent standard deviation across test set QA samples.
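The following minimal sketch shows the group-relative reward idea at the core of GRPO as referenced above: several answers are sampled per prompt, each is scored by a reward (e.g., BioBERTScore against a reference answer), and rewards are standardized within the group to obtain advantages. This is a conceptual illustration with toy reward values, not the paper's training pipeline.

```python
# Sketch: group-relative advantages as used in GRPO-style policy updates.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO advantage: each reward standardized within its sampling group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers for one QA prompt, scored by a reward model
# (e.g., BioBERTScore against a reference answer; values are placeholders).
rewards = np.array([0.62, 0.71, 0.55, 0.80])
advantages = group_relative_advantages(rewards)

# Answers above the group mean get positive advantages and are reinforced in the
# policy update; below-mean answers are discouraged, without a learned value function.
print(advantages)
```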
6.1.6. Perturbation Response Prediction
C2S-Scale accurately predicts cellular responses to perturbations, outperforming existing specialized methods.
- **Superior Accuracy:** `C2S-Scale` accurately predicts conditional responses to perturbations across entire gene expression profiles, surpassing scGen, CellOT, and scGPT in metrics like MMD, Wasserstein distance, and scFID.
- **Generalization to Unseen Perturbations:** The model effectively generalizes to completely unseen combinatorial perturbations and various cell type/cytokine combinations, capturing non-linear synergies.
- **GRPO Refinement:** GRPO further refines perturbation response generation, particularly for specific gene programs of interest (e.g., apoptosis in L1000 data, interferon-related genes in cytokine stimulation). This leads to notable improvements in Kendall's $\tau$ and Pearson's $r$, enhancing biological fidelity and generalization to out-of-distribution settings (a small sketch of these correlation metrics follows Figure 8 below).
The following figure (Figure 8 from the original paper) details the `perturbation prediction` results:

*This image is Figure 8, showing that the C2S-Scale model surpasses existing methods in predicting cellular responses to unseen perturbations. The figure compares several models, including Ground Truth, C2S, scGen, scGPT, and CellOT, with metrics such as scFID and the Wasserstein score used to evaluate model performance.*
Figure 8: Overview of the C2S-Scale perturbation prediction framework, which supports diverse perturbation types. (A) C2S-Scale excels at perturbation prediction, outperforming all existing methods. (B) C2S-Scale aligns predicted responses with ground truth in unseen combinatorial perturbations (interferon-$\beta$ + IL-6). C2S-Scale outperforms scGPT and CellOT in perturbation prediction for CD4 T cells and B cells. (C) Prompt and response example for perturbation prediction. (D) APCs comparing predicted vs. ground-truth responses for unseen perturbations across four models. Rows show: (1) all combinatorial perturbations, (2) CD4 T cells under IFN-$\gamma$, (3) B cells under the held-out IFN-$\beta$ + IL-6 stimulation. C2S-Scale aligns closely with ground truth in all cases. (E) Benchmark metrics show C2S-Scale outperforms scGen, scGPT, and CellOT across several metrics (MMD, Wasserstein, scFID) for perturbation prediction for a large number of held-out perturbations. (F) Overview of the GRPO framework for perturbation prediction, where the LLM is trained to improve predicted responses and receive rewards based on gene program similarity. (G) GRPO improves over SFT on L1000 (apoptosis response) and cytokine stimulation (interferon response) tasks, with gains in Kendall's $\tau$, Pearson's $r$, and scFID.
## 6.2. Data Presentation (Tables)
The following are the results from `Table 1` of the original paper, showing the `MultiTask Prompt Formats` used in `C2S-Scale` `pre-training`:
| Task name | Type | Input information | Target output | Metric |
| --- | --- | --- | --- | --- |
| Single cell language modeling | Single-cell | — | Single cell sentence | Overlap % |
| Cell type annotation | Single-cell | Single cell sentence | Cell type | BertScore |
| Conditional cell generation | Single-cell | Cell type of one cell | Single cell sentence | Overlap % |
| Multiple cell language modeling | Multi-cell | — | Multiple cell sentences | Overlap % |
| Tissue sample annotation | Multi-cell | Multiple cell sentences | Tissue label | BertScore |
| Sample cell type(s) annotation | Multi-cell | Multiple cell sentences | Cell types of multiple cells | BertScore |
| Conditional sample generation (tissue) | Multi-cell | Tissue annotation | Multiple cell sentences | Overlap % |
| Conditional sample generation (cell type) | Multi-cell | Cell types of multiple cells | Multiple cell sentences | Overlap % |
| Conditional sample generation (abstract) | Multi-cell | Paper abstract | Multiple cell sentences | Overlap % |
| Natural language interpretation | Multi-cell | Multiple cell sentences | Paper abstract | BertScore |
| Gene set enumeration | Gene set | Gene set name | List of genes in gene set | Overlap % |
| Gene set naming | Gene set | List of genes in gene set | Gene set name | BertScore |
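To ground the table, here is a hedged sketch of how a few of these prompt formats could be instantiated in code. The template wording is invented for illustration; the paper defines its own exact prompt phrasing for each task.

```python
# Sketch: illustrative (not official) prompt templates for a few Table 1 tasks.
TEMPLATES = {
    "cell_type_annotation": (
        "The following is a cell sentence of the {n} most highly expressed genes, "
        "in descending order: {cell_sentence}. What cell type is this?"
    ),
    "conditional_cell_generation": (
        "Generate a cell sentence for a cell of type {cell_type}."
    ),
    "gene_set_naming": (
        "The following genes form a gene set: {genes}. Name this gene set."
    ),
}

# Example instantiation with a toy T-cell-like cell sentence.
prompt = TEMPLATES["cell_type_annotation"].format(
    n=5, cell_sentence="CD3D CD3E TRAC IL7R CCR7"
)
print(prompt)
```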
## 6.3. Ablation Studies / Parameter Analysis
The paper primarily explores the impact of `model capacity` and `training data size` as forms of `ablation` or `parameter analysis` on `C2S-Scale`'s performance.
* **Model Capacity Scaling (Figure 4C, 4D):**
* `C2S-Scale` models, ranging from 410M to 27B parameters, consistently show improved performance on tasks like `conditional sample generation`, `cell type annotation`, `tissue sample annotation`, and `dataset interpretation` as the number of parameters increases. This validates the `scaling laws` for `LLMs` in this domain.
* Even in the parameter-efficient regime (using LoRA), larger base models (e.g., `C2S-Scale-27B` vs `C2S-Scale-2B`) demonstrate superior performance, indicating that the benefits of increased model capacity persist even with efficient fine-tuning techniques (a minimal LoRA sketch appears after this list).
* **Training Data Size Scaling (Figure 4E):**
* Performance continues to improve as the `C2S-Scale-27B` model sees more training samples. This highlights the importance of the massive 1-billion token multimodal corpus for `C2S-Scale`'s capabilities.
* **Impact of Reinforcement Learning (GRPO) (Figure 7C, 8G):**
* For `single-cell Question Answering`, `GRPO` significantly boosts performance compared to the `SFT` baseline and other `SOTA LLMs`, demonstrating the effectiveness of `RL` in aligning `LLM` outputs with biological preferences.
* In perturbation response prediction, GRPO leads to notable improvements in Kendall's $\tau$, Pearson's $r$, and scFID, particularly for gene programs of interest (e.g., apoptosis, interferon response). This confirms that GRPO refines the model's ability to generate biologically faithful and relevant perturbation responses.
* **External Biological Knowledge for Spatial Reasoning (Figure 6C):**
* The paper shows that incorporating receptor-ligand interactions (CellPhoneDB) or protein-protein interactions (BioGRID) individually improves spatial neighborhood identification accuracy. Combining both datasets yields the greatest performance gain, confirming the hypothesis that external molecular context enhances spatial reasoning in `C2S-Scale`.

These analyses collectively underscore that `C2S-Scale` benefits substantially from increases in model capacity, training data size, and targeted refinement through advanced techniques like GRPO and external knowledge integration.
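As referenced in the parameter-efficient-regime ablation above, the following is a minimal LoRA fine-tuning sketch using Hugging Face `peft`. The base checkpoint name and hyperparameters are placeholders, not the configuration used for `C2S-Scale`.

```python
# Sketch: attaching LoRA adapters to a causal LM for parameter-efficient fine-tuning.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # placeholder base model
lora_cfg = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only a small fraction of weights is trainable
```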
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces C2S-Scale, a pioneering framework that leverages Large Language Models (LLMs) for next-generation single-cell analysis. By building upon the Cell2Sentence (C2S) transformation, which represents scRNA-seq profiles as textual "cell sentences," the authors have successfully integrated transcriptomic data with biological text and metadata at an unprecedented scale. C2S-Scale models, scaled up to 27 billion parameters, demonstrate consistent performance improvements following LLM scaling laws, proving highly effective in both predictive and generative tasks.
The key contributions include:

- **Scalability and Multimodality:** `C2S-Scale` offers superior scalability and flexibility by operating within the LLM paradigm, natively integrating diverse data types (transcriptomics, biological text, metadata) into a unified framework.
- **State-of-the-Art Performance:** It achieves state-of-the-art results across a broad range of challenging tasks, including perturbation response prediction, natural language interpretation, spatial reasoning, and complex biological question answering, consistently outperforming both specialized single-cell models and general-purpose LLMs.
- **Reinforcement Learning for Alignment:** The application of Group Relative Policy Optimization (GRPO) significantly refines model outputs, aligning them with biological insights and expert preferences.
- **Novel Evaluation Metric:** The introduction of single-cell Fréchet Inception Distance (scFID) provides a more biologically meaningful way to evaluate generative single-cell models (a minimal Fréchet-distance sketch follows below).
- **Open Science:** The commitment to open-sourcing models and resources will accelerate future research in the field.

Ultimately, `C2S-Scale` establishes a powerful platform for next-generation single-cell analysis, moving towards the exciting prospect of "virtual cells" capable of complex biological reasoning and simulation.
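To make the scFID idea concrete, the following sketch computes a generic Fréchet distance between two sets of cell embeddings, which is the general form such a metric takes. The choice of embedding model and any preprocessing are assumptions here; the paper defines scFID's exact construction.

```python
# Sketch: Fréchet distance between real and generated cell embeddings (toy data).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """||mu_x - mu_y||^2 + Tr(Cx + Cy - 2 (Cx Cy)^{1/2}) over embedding sets x, y."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(2)
real_emb = rng.normal(size=(200, 16))               # embeddings of real cells (placeholder)
gen_emb = rng.normal(loc=0.1, size=(200, 16))       # embeddings of generated cells (placeholder)
print(frechet_distance(real_emb, gen_emb))
```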
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
7.2.1. Addressing Limitations of Causal Attention in Gene Expression Modeling
- **Causal Attention Constraint:** LLMs typically use causal attention, meaning tokens can only attend to preceding tokens in a sequence. While the cell sentence transformation rank-orders genes by expression (high-to-low), this inherent causal attention bias might theoretically limit the modeling of true causal biological interactions that flow from low- to high-expression genes (e.g., a lowly expressed transcription factor regulating a highly expressed target gene).
- **Authors' Argument:** They contend that this constraint does not significantly impede current utility. They suggest that the LLM's general reasoning capabilities can compensate for causal attention limitations, similar to how LLMs in NLP successfully model statistical relations despite word order not perfectly aligning with causal dependencies in language.
- **Future Architectural Enhancements:** The authors propose three architectural enhancements to mitigate this causal invariance and improve biological fidelity (see the mask sketch after this list):
  - Bidirectional attention: allowing tokens to attend to both preceding and succeeding tokens.
  - Non-causal attention: removing the directional constraint entirely.
  - Hybrid attention architectures: blending causal and non-causal attention to selectively model dependencies.
- **Incorporating Biological Priors:** They also suggest integrating external biological knowledge (e.g., known regulatory pathways) directly into the model to provide additional context beyond what causal attention can infer.
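The sketch below, referenced in the list above, builds a causal mask and a bidirectional (full) mask in PyTorch to illustrate the architectural options being discussed; the hybrid variant noted in the comment is an illustrative assumption, not the authors' concrete design.

```python
# Sketch: causal vs. bidirectional attention masks for a short token sequence.
import torch

seq_len = 5

# Causal mask: position i may attend only to positions <= i (lower-triangular).
causal_mask = torch.ones(seq_len, seq_len).tril().bool()

# Bidirectional mask: every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# A hybrid scheme could, for example, allow bidirectional attention within one cell
# sentence while keeping causal attention across cells (assumption for illustration).
print(causal_mask)
print(bidirectional_mask)
```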
7.2.2. Hallucination and Interpretability
- **Hallucination:** Like all LLMs, `C2S-Scale` is susceptible to hallucinations (generating plausible but incorrect or nonsensical information). While less problematic for highly structured tasks like cell type prediction, more open-ended interpretation tasks (e.g., abstract generation, cluster captioning) are more vulnerable to errors.
- **Interpretability:** The inherent "black-box" nature of LLMs makes it challenging to fully understand why the model makes certain predictions or generates specific text.
- **Mitigation:** The authors suggest RL and prompt engineering as avenues to improve interpretability and reduce hallucinations by aligning the model with desired output characteristics and expert feedback.
7.3. Personal Insights & Critique
The C2S-Scale paper presents a highly innovative and compelling approach to single-cell analysis, effectively bridging the gap between high-dimensional genomic data and natural language.
Strengths and Innovations:
- **Brilliant Transformation:** The Cell2Sentence transformation is a stroke of genius. It is a simple yet powerful abstraction that allows the entire LLM ecosystem to be immediately applicable to a complex biological domain. This sidesteps the need for custom Transformer designs for scRNA-seq, leveraging decades of NLP research and readily available LLM architectures.
- **Leveraging LLM Scaling Laws:** The paper's empirical validation of LLM scaling laws in the single-cell domain is a significant finding. It suggests that continued investment in larger models and datasets will yield predictable performance gains, a crucial insight for future research.
- **Multimodality is Key:** The native integration of transcriptomics, metadata, and biological text is transformative. It allows `C2S-Scale` to operate at a higher level of biological reasoning, moving beyond pure pattern recognition to contextual understanding. This capacity for natural language interpretation is particularly exciting for making complex biological data more accessible to a broader scientific community.
- **Reinforcement Learning for Domain Alignment:** The strategic use of GRPO to fine-tune LLM outputs against domain-specific metrics (like BioBERTScore and scFID) is a robust way to ensure biological fidelity and address potential hallucinations or misinterpretations. This active alignment process is crucial for deploying LLMs in high-stakes scientific applications.
- **The "Virtual Cell" Vision:** The concept of "virtual cells" is ambitious and inspiring. `C2S-Scale` lays a foundational stone for creating models that can not only predict but also reason about and simulate cellular behavior, accelerating drug discovery and personalized medicine.
Potential Issues and Areas for Improvement:
- **Causal Attention Limitation (as acknowledged):** While the authors address the causal attention limitation, its true impact on modeling complex gene regulatory networks (where low-expression regulators often dictate high-expression targets) needs more thorough investigation. The proposed architectural enhancements are promising but require empirical validation. Relying solely on LLM reasoning capabilities to compensate might be an optimistic assumption for precise causal inference.
- **Interpretability Challenge (as acknowledged):** Despite the advances, the "black-box" nature of LLMs remains a challenge. For biologists, understanding why a prediction is made is often as important as the prediction itself. Future work on explainable AI (XAI) for `C2S-Scale` would be invaluable to build trust and facilitate biological discovery.
- **Data Bias and Generalizability:** The training data, while massive, is still a representation of existing biological knowledge. Biases present in public scRNA-seq atlases or biological literature could propagate into `C2S-Scale`'s representations and reasoning. Continuous auditing and expansion with diverse, unbiased datasets will be necessary.
- **Computational Cost:** Training and deploying 27B-parameter models, even with LoRA and gradient checkpointing, is computationally intensive. While scaling laws promise benefits, accessibility for smaller labs might remain a barrier. Open-sourcing helps, but resource requirements are still high.
- **Specificity of "Cell Sentences":** While rank-ordering preserves relative expression, it loses absolute expression values and the quantitative differences between genes at different ranks (see the small illustration after this list). Although Figure 9 suggests minimal information loss for recovery, the impact on subtle biological inferences (e.g., dose-response relationships) should be continuously evaluated.
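The small illustration referenced above makes the information-loss point explicit: two cells whose expression differs only by a scale factor produce identical cell sentences under rank-ordering. Gene names and values are toy placeholders.

```python
# Sketch: rank-ordering discards absolute expression magnitude.
import numpy as np

genes = np.array(["CD3D", "IL7R", "GZMB", "FOXP3"])
cell_a = np.array([120.0, 60.0, 10.0, 1.0])
cell_b = cell_a * 10.0   # same ordering, 10x the absolute counts

to_sentence = lambda expr: " ".join(genes[np.argsort(expr)[::-1]])
assert to_sentence(cell_a) == to_sentence(cell_b)   # "CD3D IL7R GZMB FOXP3" for both
```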
Applicability to Other Domains:
The Cell2Sentence paradigm has immense transfer potential beyond single-cell RNA-seq. Any high-dimensional, sparse data where features have meaningful "names" and where relative importance matters could be similarly transformed into a "language" for LLMs. Examples include:
- **Proteomics/Metabolomics:** Representing protein or metabolite profiles as "sentences" of ranked abundances.
- **Genomics (beyond expression):** Encoding genomic variants or epigenomic marks as ordered sequences of features.
- **Clinical Data:** Transforming patient records (symptoms, diagnoses, treatments) into structured textual narratives for LLM-based reasoning.
- **Materials Science:** Representing material compositions or structural properties as "sentences" of elements or motifs.

The `C2S-Scale` paper is a landmark achievement, demonstrating that the LLM paradigm can profoundly reshape how we analyze and interpret complex scientific data, ushering in an era of AI-driven scientific discovery.