
SCALING LARGE LANGUAGE MODELS FOR NEXT-GENERATION SINGLE-CELL ANALYSIS

Published: 04/17/2025

TL;DR Summary

This study introduces a novel approach using the Cell2Sentence framework to convert single-cell RNA sequencing data into textual 'cell sentences,' training large language models on over a billion tokens. Scaling to 27 billion parameters resulted in enhanced predictive, generative, and multicellular reasoning performance, surpassing both specialized single-cell models and general-purpose LLMs.

Abstract

Single-cell RNA sequencing has transformed our understanding of cellular diversity, yet current single-cell foundation models (scFMs) remain limited in their scalability, flexibility across diverse tasks, and ability to natively integrate textual information. In this work, we build upon the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as textual “cell sentences,” to train Large Language Models (LLMs) on a corpus comprising over one billion tokens of transcriptomic data, biological text, and metadata. By scaling model size to 27 billion parameters, we observe consistent improvements in predictive and generative capabilities, as well as the capacity for advanced downstream tasks requiring synthesis of information across multicellular contexts. Through targeted fine-tuning supported by modern reinforcement learning techniques, our approach excels in tasks such as perturbation response prediction, natural language interpretation, and complex biological reasoning. By unifying transcriptomic and textual data at unprecedented scales, this approach not only surpasses both specialized single-cell models and general-purpose LLMs, but also establishes a powerful platform for next-generation single-cell analysis, paving the way for the development of “virtual cells.”


In-depth Reading


1. Bibliographic Information

1.1. Title

SCALING LARGE LANGUAGE MODELS FOR NEXT-GENERATION SINGLE-CELL ANALYSIS

1.2. Authors

The paper lists numerous authors primarily affiliated with Yale University and Google Research/Google DeepMind.

  • Lead/Corresponding Authors: Syed Asad Rizvi (Yale University, Google Research), Shekoofeh Azizi (Google DeepMind), Bryan Perozzi (Google Research), David van Dijk (Yale University).

  • Other Key Contributors: Daniel Levine, Aakash Patel, Shiyang Zhang, Eric Wang, Sizhuang He, David Zhang, Cerise Tang, Zhuoyang Lyu, Rayyan Darji, Chang Li, Emily Sun, David Jeong, Lawrence Zhao, Jennifer Kwan, David Braun, Brian Hafler, Jeffrey Ishizuka, Rahul M. Dhodapkar, Hattie Chung.

    Their research backgrounds span various fields including computer science, artificial intelligence, bioinformatics, and biomedical research, reflecting the interdisciplinary nature of this work.

1.3. Journal/Conference

This paper is currently a preprint, indicated by "A PREPRINT - APRIL 15, 2025" and the common use of bioRxiv citations (e.g., [13, 14]). Preprints are scientific manuscripts posted online before peer review, allowing for early dissemination of research findings. While not yet peer-reviewed, this indicates active research in the field, often from reputable institutions like Yale University and Google Research, which are highly influential in AI and biomedical research.

1.4. Publication Year

April 15, 2025

1.5. Abstract

Single-cell RNA sequencing (scRNA-seq) has revolutionized the understanding of cellular diversity. However, existing single-cell foundation models (scFMs) face limitations in scalability, flexibility across diverse tasks, and native integration of textual information. This work addresses these challenges by building upon the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as textual "cell sentences." The authors trained Large Language Models (LLMs) on a massive corpus exceeding one billion tokens, comprising transcriptomic data, biological text, and metadata. By scaling model size up to 27 billion parameters, they observed consistent improvements in both predictive and generative capabilities, alongside enhanced capacity for advanced downstream tasks that require synthesizing information across multicellular contexts. Through targeted fine-tuning, supported by modern reinforcement learning (RL) techniques like Group Relative Policy Optimization (GRPO), their approach demonstrates superior performance in tasks such as perturbation response prediction, natural language interpretation, and complex biological reasoning. This unification of transcriptomic and textual data at unprecedented scales not only surpasses both specialized single-cell models and general-purpose LLMs but also establishes a powerful platform for next-generation single-cell analysis, laying the groundwork for the development of "virtual cells."


2. Executive Summary

2.1. Background & Motivation

The paper addresses a crucial problem in single-cell analysis: despite the transformative power of single-cell RNA sequencing (scRNA-seq) in revealing cellular diversity, existing single-cell foundation models (scFMs) are limited. These limitations manifest in three key areas:

  1. Scalability: Current models struggle to effectively scale to the ever-increasing volume and complexity of single-cell data.

  2. Flexibility Across Diverse Tasks: They often lack the versatility to perform a broad range of downstream biological analyses without significant architectural modifications or retraining.

  3. Native Integration of Textual Information: Critically, they generally cannot inherently incorporate rich biological text, metadata, and annotations, which are vital for contextualizing and interpreting genomic data.

    This problem is important because scRNA-seq generates vast amounts of data, and extracting meaningful biological insights requires models that can handle this scale and integrate all available information, including the extensive existing biological literature. Prior transcriptomic foundation models like scGPT [4], Geneformer [5], scFoundation [6], and scGeneT [7] have shown promise but still face these challenges.

The paper's innovative idea is to leverage the robust scaling behavior and flexible architecture of Large Language Models (LLMs) by transforming high-dimensional scRNA-seq data into a textual format compatible with LLMs. This is achieved through the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as sequences of gene names ordered by expression level, effectively creating "cell sentences." This approach positions single-cell data within the LLM framework, offering inherent advantages in scalability and flexibility compared to specialized model architectures.

2.2. Main Contributions / Findings

The primary contributions of this work, C2S-Scale, are multifaceted and establish a new paradigm for single-cell analysis:

  1. Scaling Single-Cell Analysis with LLMs:

    • Larger Model Capacity: Introduction of the C2S-Scale family, with models ranging from 410 million to 27 billion parameters, based on Gemma-2 [15] and Pythia [16] architectures. This significantly increases model capacity to capture complex biological relationships.
    • Increased Performance at Scale: Demonstration of clear scaling laws for LLMs in single-cell analysis, showing consistent performance improvements in both predictive and generative tasks as model size increases.
    • Massive Data Size and Multimodality: Training on a 1-billion token multimodal corpus from over 50 million human and mouse cells, integrating transcriptomic data with corresponding biological text (e.g., paper abstracts) and metadata. This aligns single-cell data with natural language and biological context.
    • Long-Context, Multi-Cell Capabilities: Support for extended context lengths (up to 8192 tokens) enables processing of comprehensive multimodal and multi-cell inputs simultaneously, facilitating analysis of cellular interactions and complex biological processes.
    • Diverse Downstream Applications: C2S-Scale excels across a significantly broader range of challenging biological reasoning tasks, including perturbation prediction, nuanced natural language interpretation of single-cell data, and complex question answering.
  2. Reinforcement Learning for Enhanced Performance: The paper leverages Group Relative Policy Optimization (GRPO) [17], a modern reinforcement learning technique, to further refine C2S-Scale for targeted single-cell tasks, demonstrating significant performance improvements in question answering and perturbation response prediction.

  3. A Novel Metric for Evaluating Single-Cell Generative Models: Introduction of the single-cell Fréchet Inception Distance (scFID), which leverages single-cell foundation model embedding space to assess the quality of generated cells in a biologically meaningful way, unlike traditional expression-level metrics.

  4. Open-Source Models and Resources: The authors commit to releasing their code and model weights, along with transcriptomic-language integrated datasets and prompts, to foster community research and development.

    In summary, C2S-Scale unifies transcriptomic and textual data at unprecedented scales using LLMs, surpassing both specialized single-cell models and general-purpose LLMs. It provides a powerful platform for next-generation single-cell analysis, moving towards the concept of "virtual cells" where complex biological phenomena can be simulated and understood with greater depth.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand C2S-Scale, a foundational grasp of single-cell RNA sequencing, Large Language Models, and Reinforcement Learning is essential.

Single-Cell RNA Sequencing (scRNA-seq)

  • Conceptual Definition: scRNA-seq is a high-throughput technology that measures gene expression levels in individual cells. Unlike bulk RNA-seq, which averages gene expression across a population of cells, scRNA-seq provides a snapshot of gene activity within each cell, revealing cellular heterogeneity and enabling the discovery of rare cell types, developmental trajectories, and cell-state transitions.
  • Importance: It has revolutionized biology by allowing researchers to explore cell diversity in tissues, understand disease mechanisms, and identify new therapeutic targets.
  • Data Characteristics: scRNA-seq data is typically high-dimensional (thousands of genes), sparse (many zero counts per cell), and noisy. The output is often represented as an expression matrix, where rows are genes and columns are cells, with entries representing the expression level of each gene in each cell.

Large Language Models (LLMs)

  • Conceptual Definition: LLMs are a class of deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They learn complex patterns, grammar, and semantic relationships within language.
  • Transformer Architecture [8]:
    • Attention Mechanism: The core innovation of Transformers is the attention mechanism, which allows the model to weigh the importance of different parts of the input sequence (e.g., words in a sentence) when processing each element.
    • Self-Attention: A specific type of attention where the model relates different positions of a single sequence to compute a representation of the sequence. For example, when translating "The animal didn't cross the street because it was too tired," self-attention helps the model determine that "it" refers to "animal."
    • Multi-Head Attention: Multiple attention mechanisms run in parallel, each focusing on different aspects of the relationships within the sequence, and their outputs are concatenated and linearly transformed.
    • Encoder-Decoder Structure: Original Transformers have an encoder (processes input) and a decoder (generates output). Many LLMs simplify this, often using a decoder-only structure for generative tasks (like generating text or, in this paper, cell sentences).
    • Tokenization: Text is broken down into tokens (words, sub-words, or characters).
    • Embeddings: Tokens are converted into high-dimensional numerical vectors (embeddings) that capture their semantic meaning. These embeddings are learned during training.
    • Positional Encoding: Since Transformers do not inherently understand word order, positional encodings are added to the embeddings to provide information about the relative or absolute position of tokens in the sequence.
  • Pre-training: LLMs are typically pre-trained on massive text corpora using objectives like next token prediction (predicting the next word in a sequence given the previous ones). This unsupervised learning phase allows them to acquire a broad understanding of language.
  • Fine-tuning: After pre-training, LLMs are fine-tuned on smaller, task-specific datasets to adapt them to particular downstream applications (e.g., sentiment analysis, question answering).
  • Scaling Laws [11]: LLMs exhibit predictable scaling laws, meaning that as model size (number of parameters), dataset size, and computational resources increase, their performance on various tasks consistently improves.

Reinforcement Learning (RL)

  • Conceptual Definition: Reinforcement Learning is a paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It learns through trial and error, rather than explicit instruction.
  • Reward Function: A critical component of RL that defines what constitutes a "good" or "bad" outcome for the agent.
  • Policy Optimization: RL algorithms aim to find an optimal policy, which is a mapping from observed states of the environment to actions, that maximizes the expected reward.

3.2. Previous Works

The paper builds upon and contrasts its C2S-Scale framework with several key prior studies:

Cell2Sentence (C2S) Framework [13, 14]

  • Core Idea: The foundational work for C2S-Scale. C2S proposes to represent scRNA-seq profiles as textual "cell sentences." This is achieved by:
    1. Taking a cell's gene expression profile (a vector of expression values for thousands of genes).
    2. Rank-ordering the genes by their expression levels from highest to lowest.
    3. Selecting the top $K$ most highly expressed genes.
    4. Concatenating the names of these $K$ genes into a sentence (e.g., "geneA geneB geneC...").
  • Purpose: This transformation makes scRNA-seq data compatible with LLMs, treating gene names as tokens and cell sentences as sequences that LLMs can process.
  • Why it's important for C2S-Scale: C2S-Scale is explicitly described as "the next generation of C2S models," significantly improving upon the original paradigm in scale, model capacity, and downstream applications. The original C2S demonstrated that this textual transformation retains sufficient biological information (as shown in Figure 9, where original gene expression can be predicted from rank with high $R^2$).

Existing Single-Cell Foundation Models (scFMs)

  • scGPT [4]: A transcriptomic foundation model that uses a masked gene modeling objective. It has shown promise in modeling single-cell transcriptomic data but, like others, faces limitations in scalability, flexibility, and native textual integration compared to LLMs. scGPT is used in C2S-Scale as an embedding space for the scFID metric.
  • Geneformer [5]: Another transcriptomic foundation model that also rank-orders genes but uses a masked modeling objective. It aims to learn generalizable representations of single-cell data.
  • scFoundation [6] and scGeneT [7]: Other contemporary scFMs that have shown progress in modeling single-cell transcriptomic data at scale but are still constrained by the inherent architectural limitations that C2S-Scale aims to overcome through LLM integration. scGeneT specifically focuses on perturbation modeling.

General-Purpose LLMs

  • Llama [20, 21, 26], GPT-4o [22], Gemini [23], Meditron [27], BioMistral [28]: These are state-of-the-art general LLMs that C2S-Scale uses as benchmarks, particularly for natural language interpretation and question answering tasks. The paper demonstrates that C2S-Scale, while specialized, outperforms these general LLMs in domain-specific biological reasoning, highlighting the benefit of its specialized pre-training and C2S transformation.
    • Example: Attention is All You Need [8]: This seminal paper introduced the Transformer architecture, which is the backbone for LLMs like Gemma-2 and Pythia used in C2S-Scale. The attention mechanism's core formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input sequence after linear transformations.
      • $QK^T$ calculates the dot product similarity between queries and keys.
      • $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the key vectors, used to prevent the dot products from becoming too large, which can push the softmax function into regions with very small gradients.
      • $\mathrm{softmax}$ converts the scores into probabilities, determining the weight of each Value vector.
      • The result is a weighted sum of the Value vectors, allowing the model to focus on relevant parts of the input. C2S-Scale leverages this mechanism to identify key genes or biological contexts within cell sentences.
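
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. It is illustrative only, not code from the paper or from the Gemma-2/Pythia implementations; the toy shapes and random inputs are assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of the attention formula above.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    Returns the attended values, shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Toy usage: 4 tokens (e.g., gene names in a cell sentence), hidden size 8
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```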

Reinforcement Learning in NLP

  • Group Relative Policy Optimization (GRPO) [17]: This RL technique is employed by C2S-Scale to align LLM outputs with user preferences or domain-specific criteria. GRPO optimizes the model's policy by comparing multiple candidate outputs and reinforcing those that are ranked higher by a reward model. This is crucial for refining C2S-Scale's generative capabilities in complex biological tasks.
  • BERTScore [25]: A metric used to evaluate the quality of generated text by comparing it to reference text using contextual embeddings from BERT models. C2S-Scale uses BioBERTScore (a BERTScore variant adapted for biological text) as a reward function for GRPO to guide the LLM towards biologically relevant and high-quality answers.

Generative Model Evaluation

  • Fréchet Inception Distance (FID) [48]: A standard metric for evaluating the quality of images generated by Generative Adversarial Networks (GANs). It measures the distance between the feature distributions of real and generated images in a high-dimensional feature space (typically using Inception-v3 network features [49]).
  • single-cell Fréchet Inception Distance (scFID): C2S-Scale introduces this novel metric as an adaptation of FID for single-cell data. Instead of Inception-v3 features, scFID uses features (embeddings) from a single-cell foundation model (like scGPT) to compare real and generated cell embeddings in a biologically meaningful space.

3.3. Technological Evolution

The evolution of single-cell analysis has moved from basic clustering and visualization of scRNA-seq data to sophisticated computational models that aim to capture underlying biological processes. Initially, bioinformatics tools focused on data processing, dimension reduction, and statistical analysis. More recently, deep learning and foundation models have emerged:

  • Early Models: Often specialized neural networks or autoencoders for dimension reduction, imputation, or cell type annotation.
  • Specialized scFMs: Models like scGPT and Geneformer represented a step forward by attempting to learn generalizable representations from large scRNA-seq datasets, acting as "foundation models" for single-cell biology. However, these were still primarily designed for numerical, transcriptomic data and lacked native language processing capabilities.
  • The LLM Paradigm Shift: The success of LLMs in natural language processing, coupled with their robust scaling laws, inspired researchers to adapt them for other domains. The C2S framework was a crucial step in this direction, enabling scRNA-seq data to be framed as a "language."
  • C2S-Scale's Position: C2S-Scale represents the cutting edge by fully embracing the LLM paradigm, scaling model capacity, integrating diverse multimodal data (transcriptomics + text + metadata), and employing advanced RL fine-tuning. It attempts to bridge the gap between purely numerical biological data and human-readable biological knowledge, making it possible for LLMs to perform complex biological reasoning directly. This positions C2S-Scale as a leader in next-generation single-cell analysis, aiming for a future with "virtual cells."

3.4. Differentiation Analysis

C2S-Scale differentiates itself from previous works through several key innovations:

  • From Specialized Architectures to LLM-Native:

    • Prior scFMs (e.g., scGPT, Geneformer): Often employ specialized Transformer-like architectures or custom neural network designs tailored for scRNA-seq data. While effective, these designs can be less flexible and harder to scale to the massive parameter counts seen in general LLMs. They also typically require significant architectural modifications for different tasks or modalities.
    • C2S-Scale: Leverages general-purpose LLM architectures (Gemma-2, Pythia) directly. The Cell2Sentence transformation is the key; it converts scRNA-seq into a textual format, making it inherently compatible with LLMs without requiring custom architectural modifications. This allows C2S-Scale to benefit from the LLM ecosystem's rapid advancements in scaling, efficiency, and multimodality.
  • Native Multimodal Integration (Transcriptomics + Text):

    • Prior scFMs: While some models might incorporate annotations or metadata as numerical features, they generally do not natively integrate free-form biological text (e.g., paper abstracts) with transcriptomic data in a unified language model framework.
    • C2S-Scale: Explicitly trains on a massive corpus that includes both cell sentences (representing transcriptomics) and corresponding biological text and metadata. This allows the LLM to learn the semantic relationships between gene expression patterns and their textual descriptions, enabling natural language interpretation and biological reasoning tasks that are beyond the scope of traditional scFMs.
  • Unprecedented Scale:

    • Prior scFMs: Typically operate at scales of hundreds of millions to a few billion parameters.
    • C2S-Scale: Scales up to 27 billion parameters, a significant increase that allows it to capture more complex and subtle biological relationships, aligning with observed scaling laws in LLMs.
  • Enhanced Flexibility and Task Versatility:

    • Prior scFMs: Often excel at specific tasks (e.g., cell type annotation, perturbation prediction) but may require re-training or significant fine-tuning for new, diverse tasks.
    • C2S-Scale: Demonstrates versatility across a broad spectrum of tasks, from predictive (cell type annotation, perturbation prediction) and generative (conditional cell generation) to complex natural language interpretation and biological reasoning (question answering, cluster captioning), all within a unified framework.
  • Leveraging Reinforcement Learning for Alignment:

    • Prior scFMs: Primarily rely on supervised or unsupervised learning objectives.
    • C2S-Scale: Incorporates Reinforcement Learning (GRPO) to further align the model's outputs with desired biological outcomes and expert preferences, significantly improving performance on complex generative and interpretive tasks like question answering and perturbation response prediction.
  • Novel Evaluation Metric:

    • Prior scFMs: Generative models are often evaluated using expression-level metrics (which can be noisy) or metrics adapted from other domains.

    • C2S-Scale: Introduces scFID, a domain-specific metric that leverages scFM embedding space to provide a more biologically meaningful assessment of generated single-cell data.

      In essence, C2S-Scale is not just another scFM but rather a re-conceptualization of single-cell analysis using the full power and flexibility of LLMs, specifically designed to integrate the rich context of biological language with high-dimensional transcriptomic data.

4. Methodology

4.1. Principles

The core principle behind C2S-Scale is to transform single-cell RNA sequencing (scRNA-seq) data into a format that Large Language Models (LLMs) can natively understand and process. This "textualization" of scRNA-seq profiles, termed "cell sentences," allows the vast capabilities of LLMs (such as understanding context, generating coherent sequences, and performing complex reasoning) to be applied directly to biological data. By treating gene names as tokens and ordered gene lists as sentences, C2S-Scale leverages LLM architectures to learn complex biological relationships, integrate multimodal information (transcriptomics, text, metadata), and perform a wide range of downstream tasks, overcoming the scalability and flexibility limitations of traditional scFMs.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Data Collection and Preparation

The first step involves assembling a massive and diverse corpus for pre-training.

  • Scale: Over 50 million single-cell transcriptomic profiles from human and mouse tissues.
  • Sources: Publicly available single-cell atlases such as CellxGene [2] and the Human Cell Atlas [3].
  • Multimodal Integration: Crucially, this dataset is enriched with associated annotations, metadata, and paper abstracts from biological publications. This ensures the model learns to align transcriptomic data with natural language and biological context. This multimodal corpus totals over 1 billion tokens.
  • Task Generation: This corpus is then curated into over 150 million multi-task training samples, enabling the LLM to learn diverse tasks while simultaneously integrating various forms of information.

4.2.2. Cell Sentence Transformation

This is the pivotal step that enables the integration of scRNA-seq data with LLMs.

  • Input: For each cell, the raw input is an expression vector $X \in \mathbb{R}^{D}$, where $X_k$ denotes the normalized expression value of gene $k$ in that cell, and $D$ is the total number of genes.
  • Process: The cell sentence for $X$ is constructed by:
    1. Rank-ordering the genes within that cell based on their expression levels in descending order.
    2. Selecting the top $K$ most highly expressed genes.
  • Formula: If $S$ is a list of indices from 1 to $D$ sorted in descending order of expression level in $X$, then the cell sentence transformation is defined as: $ S(X) := \text{``}\operatorname{gene}(S[1]) \ \operatorname{gene}(S[2]) \ \ldots \ \operatorname{gene}(S[K])\text{''} $ Where:
    • $S(X)$ represents the cell sentence generated from the expression vector $X$.
    • $S[1], S[2], \ldots, S[K]$ are the indices of the genes corresponding to the first, second, ..., $K$-th most highly expressed genes in cell $X$.
    • $\operatorname{gene}(\cdot)$ is a function that maps a gene index to its official gene name (e.g., "TP53," "CD3D").
    • The opening and closing quotation marks delimit the cell sentence, indicating its boundaries to the LLM.
    • The paper notes that this transformation is reversible with minimal information loss, meaning the original gene expression values can be recovered from the rank order with high accuracy (Figure 9 illustrates this with an $R^2$ of 85% for a linear model predicting original expression from rank).
  • Benefits: This transformation offers two primary benefits:
    1. LLM Compatibility: It allows immediate application of any LLM architecture without requiring custom modifications.
    2. Biological Meaning Retention: The rank-ordering process naturally captures crucial relative expression information, which LLMs can then learn to represent.
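
As an illustration of the transformation just described, the following is a minimal Python sketch (not the authors' released code). It assumes a normalized expression vector and an aligned list of gene symbols, and omits the prompt formatting and special delimiters used in practice.

```python
import numpy as np

def cell_to_sentence(expression, gene_names, k=100):
    """Hypothetical sketch of the C2S transformation.

    expression: 1-D array of normalized expression values (length D).
    gene_names: list of D gene symbols, aligned with `expression`.
    k: number of top-expressed genes to keep.
    Returns a space-separated "cell sentence" of the top-k gene names.
    """
    order = np.argsort(expression)[::-1]      # rank genes, highest expression first
    top_k = order[:k]
    return " ".join(gene_names[i] for i in top_k)

# Toy usage with made-up genes and expression values
genes = ["CD3D", "TP53", "GAPDH", "ACTB", "MS4A1"]
expr = np.array([5.0, 0.0, 9.2, 7.1, 1.3])
print(cell_to_sentence(expr, genes, k=3))     # "GAPDH ACTB CD3D"
```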

4.2.3. LLM Architecture and Components

C2S-Scale utilizes LLMs based on the Transformer architecture [8], specifically decoder-only Transformer architectures, which are well-suited for causal language modeling and generative tasks. The base architectures used are Gemma-2 [15] and Pythia [16].

  • Key Architectural Components:
    • Word Embedding: LLMs represent input sequences as high-dimensional vectors (word embeddings). In C2S-Scale, each gene name in a cell sentence is treated as a token and converted into an embedding. This is achieved by the LLM's tokenizer and an embedding layer trained alongside the model. These embeddings capture the semantic information of genes, informed by both biological context and learned co-occurrence patterns.
    • Attention Mechanism: The attention mechanism (specifically self-attention [8]) is central. It allows the model to dynamically weigh the importance of different genes within a cell sentence or across multiple cell sentences and contextual tokens. This enables the model to identify long-range dependencies in gene expression and how genes relate to specific biological tasks (e.g., key marker genes for cell type prediction, perturbation-associated genes for response prediction).
    • Feedforward Networks: Each attention layer is followed by feedforward networks that apply non-linear transformations to enhance feature extraction.
    • Residual Connections and Layer Normalization: These components are critical for stabilizing training and facilitating gradient flow, enabling effective scaling to large parameter sizes.

4.2.4. Pre-training (Two-Phase Approach)

The training process for C2S-Scale involves two-phase large-scale pre-training followed by additional fine-tuning for specific tasks.

  • Pre-training Objective: The primary pre-training objective for LLMs is next token prediction. This involves predicting the subsequent token in a sequence given the preceding tokens. In the context of cell sentences, this means predicting the next gene name in the rank-ordered sequence based on the previously seen genes. This objective forces the model to capture complex dependencies and relationships within the cell sentences.

  • Corpus: The multimodal corpus described in Section 4.2.1 is used. It integrates cell sentences with associated metadata and textual annotations.

  • Multi-Task Learning: C2S-Scale employs multi-task learning during pre-training. This means the model is exposed to a variety of task instructions and corresponding cell sentences or textual outputs. This allows the LLM to learn general representations and adapt to different scRNA-seq analysis tasks by following prompt instructions. The following table, adapted from the paper, illustrates the multi-task prompt formats used during pre-training to enable diverse learning:

    | Task name | Type | Input information | Target output | Metric |
    | --- | --- | --- | --- | --- |
    | Single cell language modeling | Single-cell | — | Single cell sentence | Overlap % |
    | Cell type annotation | Single-cell | Single cell sentence | Cell type | BertScore |
    | Conditional cell generation | Single-cell | Cell type of one cell | Single cell sentence | Overlap % |
    | Multiple cell language modeling | Multi-cell | — | Multiple cell sentences | Overlap % |
    | Tissue sample annotation | Multi-cell | Multiple cell sentences | Tissue label | BertScore |
    | Sample cell type(s) annotation | Multi-cell | Multiple cell sentences | Cell types of multiple cells | BertScore |
    | Conditional sample generation (tissue) | Multi-cell | Tissue annotation | Multiple cell sentences | Overlap % |
    | Conditional sample generation (cell type) | Multi-cell | Cell types of multiple cells | Multiple cell sentences | Overlap % |
    | Conditional sample generation (abstract) | Multi-cell | Paper abstract | Multiple cell sentences | Overlap % |
    | Natural language interpretation | Multi-cell | Multiple cell sentences | Paper abstract | BertScore |
    | Gene set enumeration | Gene set | Gene set name | List of genes in gene set | Overlap % |
    | Gene set naming | Gene set | List of genes in gene set | Gene set name | BertScore |
  • Optimization: AdamW optimizer and gradient checkpointing are used to efficiently manage computational resources for models ranging from 1 billion to 27 billion parameters.

  • Tools: Huggingface [59] and PyTorch [60] are used for models up to 1B parameters, while Jax and TPU-based compute are employed for larger models (1B to 27B parameters).
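
For intuition on the next-token objective applied to cell sentences, the sketch below reproduces the causal-language-modeling loss with the Hugging Face stack mentioned above. It is illustrative only: the small public Pythia checkpoint stands in for a C2S-Scale model, and the prompt text and gene list are made up.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: any decoder-only LM works for this illustration;
# "EleutherAI/pythia-410m" is a small public checkpoint, not a C2S-Scale model.
model_name = "EleutherAI/pythia-410m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A prompt-formatted cell sentence (toy genes, not from the paper's corpus)
prompt = ("Predict the cell type of this cell sentence: "
          "MALAT1 B2M TMSB4X ACTB CD3D CD3E IL7R")
inputs = tokenizer(prompt, return_tensors="pt")

# For causal-LM training, passing labels returns the next-token cross-entropy loss
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```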

4.2.5. Post-training Methods (Fine-tuning and Reinforcement Learning)

After pre-training, C2S-Scale models are further refined using Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).

  • Supervised Fine-Tuning (SFT):

    • Objective: To adapt the pre-trained model to specific downstream tasks or new datasets.
    • Process: The model is fine-tuned using task-specific labeled data, often formatted with natural language prompts (e.g., "Predict the cell type of this cell sentence: [cell sentence]").
    • LoRA (Low-Rank Adaptation) [61]: For parameter-efficient fine-tuning, LoRA is employed. This technique injects low-rank matrices into the Transformer layers, allowing adaptation to new tasks with significantly fewer trainable parameters compared to full fine-tuning. This makes fine-tuning more computationally efficient and reduces storage requirements for task-specific models.
  • Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) [17]:

    • Objective: To further enhance performance on generative and interpretative tasks by aligning LLM outputs with human preferences or specific biological criteria.
    • GRPO Process:
      1. Candidate Generation: The SFT model generates multiple candidate outputs (e.g., answers to a question, perturbation responses) for each training example.
      2. Reward Modeling/Ranking: Instead of human ranking, C2S-Scale leverages domain-specific criteria and automated metrics (e.g., BioBERTScore [25] for text quality, scFID for cell generation quality) to assess the quality of these candidate outputs. These metrics implicitly provide a ranking or preference signal.
      3. Policy Optimization: GRPO then fine-tunes the model to favor outputs that receive higher scores (i.e., higher-quality and more biologically aligned responses). It optimizes the model's policy to generate better responses.
    • Advantages over other RL methods (e.g., PPO [62]): GRPO offers a more streamlined workflow as it doesn't require complex reward models or extensive human feedback; it can directly use automated metrics for ranking. This makes it particularly effective for domain-specific LLM applications.
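
The core of GRPO, scoring each candidate relative to its own sampling group, can be sketched in a few lines. The simplified PyTorch snippet below is illustrative only: the full GRPO objective also includes ratio clipping and a KL penalty, and the reward values here are placeholders for automated scores such as BioBERTScore.

```python
import torch

def group_relative_advantages(rewards):
    """Group-relative advantage: each candidate's reward is normalized
    against the mean and std of its own group of sampled candidates."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)

def grpo_policy_loss(logprobs, advantages):
    """Simplified policy-gradient loss: reinforce candidates with
    above-average rewards, penalize below-average ones."""
    return -(advantages.detach() * logprobs).mean()

# Toy usage: 4 candidate answers to one question, with placeholder reward scores
rewards = torch.tensor([0.62, 0.81, 0.55, 0.74])
logprobs = torch.randn(4, requires_grad=True)   # stand-in for model log-likelihoods
loss = grpo_policy_loss(logprobs, group_relative_advantages(rewards))
loss.backward()
```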

4.2.6. Long-Context and Multi-Cell Capabilities

  • Context Length: C2S-Scale models support extended context lengths up to 8192 tokens.
  • Benefits: This allows for:
    • Comprehensive Multimodal Input: Integration of diverse contextual information such as biological annotations, manuscript text, perturbation conditions, and detailed task-specific instructions alongside cell sentences.
    • Multi-Cell Analysis: The model can process and generate data for multiple cells simultaneously. This is crucial for analyzing cellular interactions, spatial relationships, and complex biological processes that involve multicellular contexts.

4.2.7. Diverse Downstream Applications

The C2S-Scale framework is evaluated and fine-tuned on a broad range of challenging biological tasks:

  • Predictive Tasks:
    • Cell type annotation: Predicting the cell type given a cell sentence.
    • Perturbation response prediction: Predicting how gene expression changes in response to specific perturbations.
  • Generative Tasks:
    • Conditional cell generation: Generating cell sentences (representing gene expression profiles) under specific conditions (e.g., a particular cell type).
    • Cluster captioning: Generating natural language descriptions for groups of cells.
  • Natural Language Interpretation and Reasoning Tasks:
    • Dataset interpretation: Summarizing the biological findings of an entire scRNA-seq dataset.

    • Question Answering (QA): Answering complex biological questions based on scRNA-seq data.

    • Spatial reasoning: Predicting spatial relationships and niches from multi-cell context.

    • Gene set enumeration and naming: Identifying genes belonging to a set or naming a set based on its genes.

      This wide array of applications showcases the versatility and power of the LLM-based approach for single-cell analysis.

The following figure (Figure 2 from the original paper) summarizes the C2S-Scale framework, showing its multimodal input, cell sentence transformation, and diverse outputs:


Figure 2: C2S-Scale bridges scRNA-seq data and natural language by training LLMs to perform single-cell analysis tasks on diverse multimodal data. (A) A multimodal corpus of over 50 million human and mouse transcriptomes is paired with associated papers, gene sets, and disease labels from scRNA-seq studies. (B) C2S rank-orders genes by expression and converts them to natural language "cell sentences", leveraging powerful LLM architectures without the need for custom Transformer models. The output models are then fine-tuned for diverse downstream tasks, including cell type prediction, perturbation prediction, generative tasks, and advanced biological reasoning tasks such as question answering.

5. Experimental Setup

5.1. Datasets

C2S-Scale utilizes a massive and diverse collection of datasets for both pre-training and fine-tuning/evaluation.

Pre-training Corpus

  • Composition: Over 50 million single-cell transcriptomic profiles from human and mouse tissues.
  • Sources: Primarily gathered from publicly available single-cell atlases, including CellxGene [2] and the Human Cell Atlas [3].
  • Multimodality: This corpus is rich, encompassing:
    • Transcriptomic data: The scRNA-seq expression profiles themselves.
    • Associated annotations: Cell type labels, tissue labels, experimental conditions.
    • Metadata: Information about the samples, donors, experimental setup.
    • Biological text: Paper abstracts and other free-text descriptions related to the studies.
  • Scale: This multimodal data is tokenized to form a corpus of over 1 billion tokens.

Downstream Task-Specific Datasets

  • Cell Type Annotation:
    • An immune tissue dataset [63] and a lung dataset [19].
    • For each, 80% of cells were used for training and 20% for evaluation.
  • Multi-Cell Integration:
    • To evaluate multicellular context and spatial reasoning, the CosMx Spatial Molecular Imager Human Liver dataset [35] was used. This dataset provides spatially-resolved single-cell data from both normal and tumor liver tissues across different donors, encompassing over 800,000 single cells. Cells expressing fewer than 50 genes were filtered out, as were neighborhoods containing fewer than three cells. The data was normalized. Spatial coordinates were embedded using UMAP [64] or PHATE [65].
    • External biological knowledge for spatial reasoning: CellPhoneDB [36] (receptor-ligand interactions) and BioGRID [37] (protein-protein interaction data). Data was restricted to interactions involving the 1,000 genes in the CosMx dataset and to extracellular proteins.
  • Question Answering (QA):
    • A custom QA dataset was generated using GPT-4.5 [22]. For each scRNA-seq study, GPT-4.5 was prompted to generate meaningful question-answer pairs from three sections of the manuscript (abstract, discussions, results) and sampled data from the study. Each study yielded 100 QA pairs.
  • Perturbation Response Prediction:
    • The Dong et al. dataset [47]: Contains immune cells exposed to individual and combinatorial cytokines, with 133 conditions. Gene expression changes in response to these perturbations are the target for prediction.
    • The L1000 dataset [46]: A connectivity map dataset used for gene expression profiling in response to various chemical and genetic perturbations.
  • Cluster Captioning and Dataset Interpretation:
    • For cluster captioning, 30 scRNA-seq datasets were used, and natural language captions for cell clusters were generated using GPT-4.5 [22].

    • For dataset interpretation, the model was evaluated on scRNA-seq data and paper abstracts from 613 CellxGene datasets (in-distribution) and completely unseen datasets (out-of-distribution). The goal was to generate high-level summaries of scRNA-seq datasets from cell sentences.

      The following figure (Figure 10 from the original paper) provides concrete examples of abstract summaries generated from scRNA-seq datasets, giving an intuitive sense of the textual data used for interpretation tasks:

Figure 10: Example abstract summaries from scRNA-seq datasets collected from CellxGene [2].

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, a detailed explanation is provided:

Overlap %

  • Conceptual Definition: This metric quantifies the percentage of tokens (gene names) that are common between a generated cell sentence or gene list and the ground-truth cell sentence or gene list. It is used primarily for generative tasks where the output is a sequence of genes (e.g., conditional cell generation, single cell language modeling, gene set enumeration). A higher Overlap % indicates better generative fidelity.
  • Mathematical Formula: $ \text{Overlap \%} = \frac{|\text{Generated Genes} \cap \text{Ground Truth Genes}|}{|\text{Ground Truth Genes}|} \times 100 $ Where:
    • $\text{Generated Genes}$ is the set of genes in the model's output.
    • $\text{Ground Truth Genes}$ is the set of genes in the reference (actual) output.
    • $|\cdot|$ denotes the cardinality (number of elements) of a set.
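A minimal Python implementation of this definition might look like the following (illustrative, not the paper's evaluation code):

```python
def overlap_percent(generated_genes, ground_truth_genes):
    """Overlap %: fraction of ground-truth genes that also appear in
    the generated output, expressed as a percentage."""
    gen, ref = set(generated_genes), set(ground_truth_genes)
    if not ref:
        return 0.0
    return 100.0 * len(gen & ref) / len(ref)

# Toy usage on two short cell sentences (made-up genes)
generated = "MALAT1 B2M ACTB TMSB4X".split()
reference = "MALAT1 ACTB GAPDH TMSB4X".split()
print(overlap_percent(generated, reference))  # 75.0
```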

BERTScore [25]

  • Conceptual Definition: BERTScore is a metric for evaluating the quality of text generation by comparing candidate sentences to reference sentences. Unlike traditional metrics like BLEU or ROUGE that rely on exact word matches, BERTScore leverages contextual embeddings from BERT models to measure semantic similarity. It calculates precision, recall, and F1 scores based on the cosine similarity between embeddings of tokens in the candidate and reference sentences. A higher BERTScore indicates greater semantic similarity between generated and human-written text. The paper uses BioBERTScore, which is implicitly a version of BERTScore adapted or fine-tuned for biological text.
  • Mathematical Formula: BERTScore computes precision (P), recall (R), and F1-score (F1). For a candidate sentence cc and a reference sentence rr, BERTScore is calculated as: $ P = \frac{1}{\sum_{c_i \in c} 1} \sum_{c_i \in c} \max_{r_j \in r} \left( \text{cosine_similarity}(\text{embedding}(c_i), \text{embedding}(r_j)) \right) $ $ R = \frac{1}{\sum_{r_j \in r} 1} \sum_{r_j \in r} \max_{c_i \in c} \left( \text{cosine_similarity}(\text{embedding}(r_j), \text{embedding}(c_i)) \right) $ $ F1 = 2 \times \frac{P \times R}{P + R} $ Where:
    • $c_i$ represents the $i$-th token in the candidate sentence.
    • $r_j$ represents the $j$-th token in the reference sentence.
    • $\text{embedding}(\cdot)$ is a function that maps a token to its contextual BERT embedding.
    • $\text{cosine\_similarity}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \|v_2\|}$ measures the cosine of the angle between two embedding vectors.
    • The max operation indicates that each token in one sentence is compared to all tokens in the other sentence, and the highest similarity score is taken.
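
For intuition, the greedy-matching computation can be sketched as follows, assuming token embeddings have already been extracted from a BERT-family model (illustrative only; real BERTScore also applies IDF weighting and baseline rescaling options).

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Greedy-matching BERTScore-style F1 from precomputed token embeddings.

    cand_emb: (n_cand, d) contextual embeddings of candidate tokens.
    ref_emb:  (n_ref, d) contextual embeddings of reference tokens.
    """
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                   # pairwise cosine similarities
    precision = sim.max(axis=1).mean()   # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()      # each reference token -> best candidate token
    return 2 * precision * recall / (precision + recall)

# Toy usage with random embeddings standing in for BERT outputs
rng = np.random.default_rng(1)
print(bertscore_f1(rng.normal(size=(5, 16)), rng.normal(size=(7, 16))))
```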

Maximum Mean Discrepancy (MMD)

  • Conceptual Definition: MMD is a statistical test used to determine if two samples are drawn from the same distribution, often applied to evaluate the similarity between the distributions of real and generated data. In the context of single-cell data, it measures the "distance" between the distribution of real cell embeddings and the distribution of generated cell embeddings. A lower MMD score indicates that the generated data's distribution is closer to the real data's distribution.
  • Mathematical Formula: MMD is defined using a kernel function $k(x, y)$. For two samples $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_m\}$ from distributions $P$ and $Q$ respectively, the empirical MMD squared is: $ \mathrm{MMD}^2(X, Y) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n k(x_i, x_j) + \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^n \sum_{j=1}^m k(x_i, y_j) $ Where:
    • $X$ represents the set of real cell embeddings.
    • $Y$ represents the set of generated cell embeddings.
    • $n$ and $m$ are the number of samples in $X$ and $Y$, respectively.
    • $k(x, y)$ is a positive definite kernel function (e.g., Gaussian kernel) that implicitly maps data into a Reproducing Kernel Hilbert Space (RKHS).
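
A small NumPy sketch of this estimator with a Gaussian kernel is shown below; the kernel choice and bandwidth are assumptions, not values from the paper.

```python
import numpy as np

def mmd_squared(X, Y, sigma=1.0):
    """Empirical squared MMD with a Gaussian (RBF) kernel.

    X: (n, d) real cell embeddings; Y: (m, d) generated cell embeddings.
    """
    def rbf(A, B):
        # pairwise squared distances, then Gaussian kernel values
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    n, m = len(X), len(Y)
    return rbf(X, X).sum() / n**2 + rbf(Y, Y).sum() / m**2 - 2 * rbf(X, Y).sum() / (n * m)

# Toy usage: two small embedding clouds with a slight mean shift
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
Y = rng.normal(loc=0.5, size=(60, 8))
print(mmd_squared(X, Y))
```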

Wasserstein Distance

  • Conceptual Definition: Also known as Earth Mover's Distance (EMD), the Wasserstein distance measures the minimum "cost" of transforming one distribution into another. Intuitively, it's the minimum amount of "work" required to move a pile of dirt (one distribution) to match the shape of another pile of dirt (the other distribution). In single-cell analysis, it quantifies how much two distributions of cell embeddings (real vs. generated) differ. A smaller Wasserstein distance implies that the generated data distribution is very similar to the real data distribution.
  • Mathematical Formula: For two probability distributions $\mu$ and $\nu$ on a metric space $(M, d)$, the Wasserstein distance of order $p \ge 1$ is defined as: $ W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y) \right)^{1/p} $ Where:
    • $\mu$ is the distribution of real cell embeddings.
    • $\nu$ is the distribution of generated cell embeddings.
    • $\Gamma(\mu, \nu)$ is the set of all joint distributions $\gamma(x, y)$ whose marginals are $\mu$ and $\nu$ on the first and second components, respectively.
    • $d(x, y)$ is the distance metric between points $x$ and $y$ in the embedding space (e.g., Euclidean distance).
    • The infimum is taken over all possible ways to "transport" mass from μ\mu to ν\nu.
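
A one-dimensional version of this distance (for example, comparing a single embedding dimension or a single gene's expression between real and generated cells) is available in SciPy; higher-dimensional optimal transport requires a dedicated solver. The example below is illustrative with random data.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Toy 1-D example: distributions of one feature in real vs. generated cells
rng = np.random.default_rng(3)
real = rng.normal(loc=0.0, scale=1.0, size=1000)
generated = rng.normal(loc=0.3, scale=1.1, size=1000)

# Smaller values indicate more similar distributions
print(wasserstein_distance(real, generated))
```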

Single-Cell Fréchet Inception Distance (scFID)

  • Conceptual Definition: scFID is a novel metric introduced in this paper to evaluate the generative quality of single-cell models. It is an adaptation of the Fréchet Inception Distance (FID) [48] used in image generation. Instead of using features from Inception-v3 (an image classification model), scFID leverages embeddings from a single-cell foundation model (like scGPT) to compare the distributions of real and generated cell embeddings in a biologically meaningful feature space. A lower scFID score indicates that the generated cells are more biologically realistic and similar to real cells.
  • Mathematical Formula: Given two sets of single-cell embeddings—one from real cells ($r$) and one from generated cells ($g$)—scFID is defined as: $ \mathrm{scFID} = \Vert \mu_r - \mu_g \Vert_2^2 + \operatorname{tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{\frac{1}{2}} \right) $ Where:
    • $\mu_r$ and $\mu_g$ are the mean vectors of the real and generated cell embeddings, respectively. These capture the average position of the data cloud.
    • $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of the real and generated cell embeddings, respectively. These capture the spread and correlation structure of the data.
    • $\Vert \cdot \Vert_2^2$ denotes the squared Euclidean distance (L2 norm) between the mean vectors.
    • $\operatorname{tr}(\cdot)$ denotes the trace of a matrix (sum of its diagonal elements).
    • $(\Sigma_r \Sigma_g)^{\frac{1}{2}}$ refers to the matrix square root of the product of the covariance matrices.
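
A minimal NumPy/SciPy sketch of this computation, assuming real and generated cell embeddings from an scFM such as scGPT are already available (illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.linalg import sqrtm

def scfid(real_emb, gen_emb):
    """Fréchet distance between two Gaussian fits of scFM embeddings.

    real_emb, gen_emb: arrays of shape (n_cells, d).
    """
    mu_r, mu_g = real_emb.mean(0), gen_emb.mean(0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(sigma_r + sigma_g - 2 * covmean))

# Toy usage with random embeddings standing in for scGPT features
rng = np.random.default_rng(4)
print(scfid(rng.normal(size=(200, 32)), rng.normal(loc=0.2, size=(200, 32))))
```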

Kendall's $\tau$ (Tau)

  • Conceptual Definition: Kendall's $\tau$ is a non-parametric statistic that measures the ordinal association between two ranked lists. It quantifies the similarity of the orderings of data when ranked by quantities. In perturbation response prediction, it can be used to compare the rank-ordering of gene expression changes (e.g., genes up- or down-regulated) between predicted and actual responses. A higher value (closer to 1) indicates stronger agreement in rank order.
  • Mathematical Formula: For two paired observations $(x_1, y_1), \ldots, (x_n, y_n)$, Kendall's $\tau$ is calculated as: $ \tau = \frac{N_c - N_d}{\frac{1}{2}n(n-1)} $ Where:
    • $N_c$ is the number of concordant pairs: pairs where the relative order of $x$ and $y$ is the same. For any two pairs $(x_i, y_i)$ and $(x_j, y_j)$ with $i \ne j$, if $x_i < x_j$ and $y_i < y_j$, or if $x_i > x_j$ and $y_i > y_j$, they are concordant.
    • $N_d$ is the number of discordant pairs: pairs where the relative order of $x$ and $y$ is different. If $x_i < x_j$ and $y_i > y_j$, or if $x_i > x_j$ and $y_i < y_j$, they are discordant.
    • $\frac{1}{2}n(n-1)$ is the total number of distinct pairs.

Pearson's $r$ (Pearson Correlation Coefficient)

  • Conceptual Definition: Pearson's $r$ measures the linear correlation between two sets of data. It is a measure of the strength and direction of a linear relationship between two variables. In perturbation response prediction, it would quantify the linear relationship between predicted gene expression values and actual gene expression values after a perturbation. Values range from -1 (perfect negative linear correlation) to 1 (perfect positive linear correlation), with 0 indicating no linear correlation.
  • Mathematical Formula: $ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} $ Where:
    • $x_i$ and $y_i$ are individual data points (e.g., predicted and actual gene expression values for gene $i$).
    • $\bar{x}$ and $\bar{y}$ are the mean values of the respective datasets.
    • $n$ is the number of data points.
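
Both correlation metrics (Kendall's $\tau$ above and Pearson's $r$) are readily available in SciPy; the example below uses toy numbers, not data from the paper.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr

# Toy example: predicted vs. actual log-fold changes for 6 genes after a perturbation
predicted = np.array([1.8, -0.4, 0.1, 2.3, -1.1, 0.7])
actual    = np.array([1.5, -0.2, 0.3, 2.0, -0.9, 0.4])

tau, _ = kendalltau(predicted, actual)   # rank agreement of the responses
r, _ = pearsonr(predicted, actual)       # linear agreement of the responses
print(f"Kendall's tau = {tau:.3f}, Pearson's r = {r:.3f}")
```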

5.3. Baselines

C2S-Scale is compared against a comprehensive set of baselines, including specialized single-cell models, general-purpose LLMs, and other perturbation prediction models.

Specialized Single-Cell Models

  • scGPT [4]: A transcriptomic foundation model that represents genes as tokens and uses a masked gene modeling objective. It's a key benchmark for scFMs.
  • Geneformer [5]: Another scFM that rank-orders genes and uses a masked modeling objective.
  • scGen: A deep learning model designed for single-cell perturbation prediction and data integration [69].
  • CellOT: A deep learning model that uses optimal transport for modeling perturbation effects in single-cell data [47].

General-Purpose Large Language Models (LLMs)

  • Llama [20, 21, 26]: A family of open-source LLMs from Meta, known for strong performance across various NLP tasks.

  • GPT-4o [22] (and GPT-4.5): Advanced LLMs from OpenAI, recognized for their high capabilities in natural language understanding and generation. GPT-4.5 was also used for QA dataset generation.

  • Gemini [23]: A family of multimodal LLMs from Google, offering strong performance in reasoning and generation.

  • Meditron [27]: An open-source LLM specifically adapted for clinical practice in medical domains.

  • BioMistral [28]: A collection of open-source LLMs pre-trained for medical domains, demonstrating adaptation of LLMs to biological/medical text.

    These baselines represent the state-of-the-art in both specialized single-cell analysis and general LLM capabilities, allowing for a thorough evaluation of C2S-Scale's performance and generalization across diverse tasks.

6. Results & Analysis

6.1. Core Results Analysis

The results presented demonstrate C2S-Scale's superior performance, scalability, and versatility across a wide array of single-cell analysis tasks, often outperforming both specialized scFMs and general-purpose LLMs.

6.1.1. Performance Scaling Laws

The paper establishes scaling laws for LLMs in single-cell analysis, showing consistent improvements with increasing model size and data.

  • Model Capacity: As model size increased from 410 million to 27 billion parameters (using Gemma-2 and Pythia architectures), C2S-Scale demonstrated significant improvements in both predictive and generative tasks. This indicates that more parameters allow the model to capture more complex biological relationships.

  • Data Size: Performance also scaled positively with the number of training samples seen by the model.

  • Parameter-Efficient Regimes: The improvements were observed even in parameter-efficient regimes (e.g., using LoRA for fine-tuning), highlighting the practical utility of scaling even with limited computational resources.

    The following figure (Figure 4 from the original paper) illustrates the performance scaling of C2S-Scale models:


Figure 4: C2S-Scale demonstrates consistent scaling performance across a variety of single-cell analysis tasks. (A) C2S generates gene expression profiles and natural language prompts and responses for a range of tasks, colored by expression generation (red), predictive (blue), and language generation (green). (C) Performance scaling of fine-tuned C2S models on conditional sample generation, cell type annotation, tissue sample annotation, and dataset interpretation. (D) LoRA-finetuned C2S-Scale-2B and 27B models demonstrate performance scaling with increased model capacity in the parameter-efficient regime. (E) Performance scaling by number of training samples seen by C2S-Scale-27B.

6.1.2. State-of-the-Art Predictive and Generative Capabilities

C2S-Scale achieves state-of-the-art results across several fundamental single-cell analysis tasks.

  • Cell Embeddings: C2S-Scale generates rich cell embeddings that consistently capture more biologically meaningful representations compared to other models, even without re-training on bulk data. This is attributed to the cell sentence transformation effectively preserving crucial biological information.
  • Generative Tasks: Unlike most other transcriptomic foundation models that require architectural modifications for generative tasks, C2S-Scale excels in generative tasks (e.g., perturbation response prediction, conditional cell generation) without such changes, leveraging its LLM architecture.
  • Outperformance of SOTA LLMs: For tasks involving reasoning about scRNA-seq data (e.g., cluster captioning, dataset interpretation, question answering), C2S-Scale consistently outperforms leading general-purpose LLMs like Llama, GPT-4o, and Gemini.
  • Generalization to Unseen Data: A key finding is C2S-Scale's ability to generalize effectively to completely unseen scRNA-seq datasets, producing relevant and informative summaries and predictions. This highlights its robust natural language understanding of scRNA-seq data.

6.1.3. Natural Language Interpretation at Multiple Scales of Biology

C2S-Scale demonstrates unique capabilities in natural language interpretation of scRNA-seq data, bridging raw transcriptomics with biological literature.

  • Cluster Captioning: The model can automatically generate meaningful natural language captions for groups of cells from the same tissue and batch, achieving high BioBERTScore similarity with ground-truth captions even on unseen data clusters.

  • Dataset Interpretation: At a broader scale, C2S-Scale can interpret entire scRNA-seq datasets and generate high-level summaries. It achieves the highest BERTScore among all evaluated models (including LLaMA, Meditron, BioMistral, Gemini, GPT-4o) for this task, even on entirely unseen datasets. This capability allows researchers to gain biologically meaningful insights in natural language.

    The following figure (Figure 5 from the original paper) showcases C2S-Scale's natural language interpretation abilities at multiple scales:

    The image is a schematic showing different scales of biology and a variety of experimental results related to single-cell analysis, including cell type annotation, cluster captioning, and dataset interpretation. Panel B compares the accuracy of predicted versus ground-truth cell types. Panels C, D, and E show the performance of different large language models on specific tasks, compared using BERTScore.

Figure 5: C2S-Scale enables natural language interpretation of scRNA-seq data at multiple scales of biology. (A) Illustration of natural language interpretation at cell type, cluster, and dataset levels. (B) Ground-truth and predicted cell types for cells extracted from 6 different tissues across multiple human tumors [18]. C2S-Scale achieves high accuracy in this task. (C) Performance of C2S-Scale and baseline models on unseen scRNA-seq data clusters. Models are given multi-cell context from unseen data clusters and asked to caption them. Performance is measured by BioBERTScore. (D) C2S-Scale outperforms baselines for dataset interpretation on unseen abstracts from unseen scRNA-seq datasets. Error bars represent standard deviation across test set samples.
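The BERTScore-style comparisons above are straightforward to reproduce with the bert-score package. The snippet below is a minimal sketch; the biomedical scoring model and the example texts are placeholders rather than the paper's exact evaluation pipeline.

```python
# Sketch: scoring a generated dataset summary against a ground-truth abstract with BERTScore.
# Using a biomedical encoder approximates the "BioBERTScore" idea; the model choice here is an assumption.
from bert_score import score

candidates = ["Single-cell profiling of human tumor-infiltrating immune cells across six tissues."]
references = ["We profiled tumor-infiltrating lymphocytes from six human tissues using scRNA-seq."]

P, R, F1 = score(
    candidates,
    references,
    model_type="dmis-lab/biobert-base-cased-v1.1",  # biomedical BERT; swap in the paper's scorer if known
    num_layers=12,
    lang="en",
)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```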

6.1.4. Spatial Reasoning from Multi-cell Context and Interaction Data

C2S-Scale learns spatial organization and cellular interactions without requiring architectural modifications, simply by processing multi-cell context.

  • Neighborhood Prediction: By sampling and encoding cells from shared neighborhoods, C2S-Scale can infer spatial relationships. It significantly outperforms scGPT and GPT-4o in neighborhood prediction tasks on spatially-resolved scRNA-seq data (CosMx dataset). A sketch of how such neighborhood context can be serialized into a prompt follows Figure 6 below.

  • Integration of External Biological Knowledge: The model effectively incorporates external biological knowledge, specifically gene interaction networks from CellPhoneDB (receptor-ligand interactions) and BioGRID (protein-protein interaction data). This integration, without predefining rules, further enhances spatial reasoning and niche prediction performance. C2S-Scale applies this knowledge dynamically, demonstrating improved performance when these data sources are added individually or combined.

    The following figure (Figure 6 from the original paper) details C2S-Scale's spatial reasoning capabilities:

    The image is a schematic showing the different tasks used for single-cell analysis, including cell niche prediction, neighbor generation, and neighborhood prediction. Panel B lists the gene expression and protein-interaction databases used, and panel C compares spatial prediction accuracy.

Figure 6: C2S-Scale can interpret multi-cell spatial context and predict niche neighborhoods. (A) We train C2S-Scale on a variety of single-cell and multi-cellular spatial tasks designed to enable C2S-Scale to perform spatial reasoning. Tasks include niche label prediction, conditional neighbor generation, and spatial neighborhood prediction. (B) We include publicly available receptor-ligand and protein-protein interaction data from CellPhoneDB and BioGRID, restricted to extracellular proteins in the CosMx dataset. (C) C2S outperforms scGPT and GPT-4o in spatial neighborhood identification accuracy. Additionally, integrating gene interactions from BioGRID and CellPhoneDB individually improves performance, and their combination provides the greatest improvement. The results highlight the multi-task transfer potential of C2S-Scale for spatially-aware biological modeling.
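Because spatial reasoning here comes purely from serializing a neighborhood into the prompt, a rough sketch of how such multi-cell context might be assembled is useful. The layout below (k-nearest neighbors from spatial coordinates, each neighbor rendered as a short cell sentence) uses toy data and hypothetical prompt wording; it is not the authors' preprocessing code.

```python
# Sketch: building a multi-cell spatial prompt from coordinates and expression (illustrative only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_cells, n_genes, top_k_genes, k_neighbors = 200, 50, 10, 5
gene_names = np.array([f"GENE{i}" for i in range(n_genes)])    # placeholder gene symbols
coords = rng.uniform(0, 100, size=(n_cells, 2))                # spatial x/y positions
expr = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float) # toy count matrix

def cell_sentence(cell_idx: int) -> str:
    """Rank genes by expression (high to low) and keep the top ones, as in Cell2Sentence."""
    order = np.argsort(-expr[cell_idx])[:top_k_genes]
    return " ".join(gene_names[order])

# Find each cell's spatial neighbors and serialize the neighborhood into a prompt.
nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(coords)
_, idx = nn.kneighbors(coords)

center = 0
neighbors = idx[center][1:]  # drop the cell itself
prompt = "Center cell: " + cell_sentence(center) + "\n" + "\n".join(
    f"Neighbor {j + 1}: {cell_sentence(n)}" for j, n in enumerate(neighbors)
) + "\nQuestion: which niche label best describes this neighborhood?"
print(prompt)
```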

6.1.5. Single-Cell Question Answering (QA) through Reinforcement Learning

C2S-Scale sets a new standard for single-cell QA, particularly with the aid of Reinforcement Learning.

  • Outperformance of SOTA LLMs: C2S-Scale demonstrates superior single-cell question answering performance compared to state-of-the-art LLMs like Llama, GPT-4o, and Gemini.

  • GRPO Enhancement: Group Relative Policy Optimization (GRPO) significantly improves C2S-Scale's QA capabilities. By using BioBERTScore as a reward function, GRPO guides the model to produce higher-quality, biologically aligned answers, substantially outperforming the SFT-based model. The group-relative reward normalization at the heart of GRPO is sketched after Figure 7 below.

    The following figure (Figure 7 from the original paper) illustrates C2S-Scale's QA performance:

    The image is a schematic of the C2S framework applied to scRNA-seq data analysis. Panel A provides background and example cell sentences, panel B describes the Group Relative Policy Optimization process, and panel C compares C2S-Scale with other language models.

Figure 7: C2S-Scale demonstrates superior single-cell question answering performance compared to state-of-the-art (SOTA) LLMs. (A) Example QA scenario based on scRNA-seq data. (B) Overview of the GRPO framework [17], where the LLM is trained to improve predicted responses based on a reward model. (C) Comparison of C2S-Scale vs. SOTA LLMs on single-cell QA tasks, highlighting C2S-Scale's advantage in domain-specific reasoning. Error bars represent standard deviation across test set QA samples.
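GRPO's core step is to sample a group of candidate answers per prompt, score each with a reward (here, a BioBERTScore-style similarity to a reference answer), and normalize the rewards within the group to obtain advantages. The sketch below shows only that advantage computation with toy reward values; it omits the clipped policy-gradient update and the KL penalty to the reference policy.

```python
# Sketch: group-relative advantages as used by GRPO (reward values are toy placeholders).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within a group of sampled responses: A_i = (r_i - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Suppose we sampled 4 candidate answers to one single-cell QA prompt and scored each
# against the reference answer with a BioBERTScore-like reward in [0, 1].
rewards = np.array([0.62, 0.81, 0.55, 0.74])
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for above-average answers, negative for below-average ones

# In full GRPO, each response's token log-probabilities are then reweighted by its advantage
# (with a clipped ratio and a KL penalty to a reference policy) to update the LLM.
```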

6.1.6. Perturbation Response Prediction

C2S-Scale accurately predicts cellular responses to perturbations, outperforming existing specialized methods.

  • Superior Accuracy: C2S-Scale accurately predicts conditional responses to perturbations across entire gene expression profiles, surpassing scGen, CellOT, and scGPT in metrics like MMD, Wasserstein distance, and scFID.
  • Generalization to Unseen Perturbations: The model effectively generalizes to completely unseen combinatorial perturbations and various cell types/cytokine combinations, capturing non-linear synergies.
  • GRPO Refinement: GRPO further refines perturbation response generation, particularly for specific gene programs of interest (e.g., apoptosis in L1000 data, interferon-related genes in cytokine stimulation). This leads to notable improvements in Kendall's τ and Pearson's r, enhancing biological fidelity and generalization to out-of-distribution settings (these metrics are sketched after Figure 8 below).

    The following figure (Figure 8 from the original paper) details the `perturbation prediction` results:

    The image is Figure 8, showing that C2S-Scale surpasses existing methods at predicting cellular responses to unseen perturbations. It compares multiple models, including Ground Truth, C2S, scGen, scGPT, and CellOT, evaluated with metrics such as scFID and the Wasserstein score.

Figure 8: Overview of the C2S-Scale perturbation prediction framework, which supports diverse perturbation types. (A) C2S-Scale excels at perturbation prediction, outperforming all existing methods. (B) C2S-Scale aligns predicted responses with ground truth in unseen combinatorial perturbations (interferon-β + IL-6). C2S-Scale outperforms scGPT and CellOT in perturbation prediction for CD4 T cells and B cells. (C) Prompt and response example for perturbation prediction. (D) APCs comparing predicted vs. ground-truth responses for unseen perturbations across four models. Rows show: (1) all combinatorial perturbations, (2) CD4 T cells under IFN-γ, (3) B cells under the held-out IFN-β + IL-6 stimulation. C2S-Scale aligns closely with ground truth in all cases. (E) Benchmark metrics show C2S-Scale outperforms scGen, scGPT, and CellOT across several metrics (MMD, Wasserstein, scFID) for perturbation prediction over a large number of held-out perturbations. (F) Overview of the GRPO framework for perturbation prediction, where the LLM is trained to improve predicted responses and receives rewards based on gene program similarity. (G) GRPO improves over SFT on L1000 (apoptosis response) and cytokine stimulation (interferon response) tasks, with gains in Kendall's τ, Pearson's r, and scFID.
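The metrics reported for perturbation prediction are standard and can be sketched directly: rank and linear correlation between predicted and observed expression changes, plus a Fréchet distance between embedding distributions in the spirit of scFID. The snippet below uses random placeholder arrays and a generic embedding space; the embedding model actually used for scFID is not assumed here.

```python
# Sketch: perturbation-prediction metrics on placeholder data (not the paper's evaluation code).
import numpy as np
from scipy import linalg
from scipy.stats import kendalltau, pearsonr

rng = np.random.default_rng(0)

# Per-gene mean log-fold-changes: predicted vs. observed (toy vectors).
pred_lfc = rng.normal(size=500)
true_lfc = pred_lfc + rng.normal(scale=0.5, size=500)
tau, _ = kendalltau(pred_lfc, true_lfc)
r, _ = pearsonr(pred_lfc, true_lfc)
print(f"Kendall tau={tau:.3f}, Pearson r={r:.3f}")

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between two Gaussian fits, as in FID/scFID-style scores."""
    mu_x, mu_y = x.mean(0), y.mean(0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = linalg.sqrtm(cov_x @ cov_y).real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x + cov_y - 2 * covmean))

# Embeddings of generated vs. real perturbed cells (random here; scFID would use a
# single-cell foundation model's embedding space instead).
gen_emb = rng.normal(size=(300, 64))
real_emb = rng.normal(loc=0.1, size=(300, 64))
print(f"Frechet (scFID-style) distance: {frechet_distance(gen_emb, real_emb):.3f}")
```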

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper, showing the multi-task prompt formats used in C2S-Scale pre-training:

| Task name | Type | Input information | Target output | Metric |
|---|---|---|---|---|
| Single cell language modeling | Single-cell | — | Single cell sentence | Overlap % |
| Cell type annotation | Single-cell | Single cell sentence | Cell type | BertScore |
| Conditional cell generation | Single-cell | Cell type of one cell | Single cell sentence | Overlap % |
| Multiple cell language modeling | Multi-cell | — | Multiple cell sentences | Overlap % |
| Tissue sample annotation | Multi-cell | Multiple cell sentences | Tissue label | BertScore |
| Sample cell type(s) annotation | Multi-cell | Multiple cell sentences | Cell types of multiple cells | BertScore |
| Conditional sample generation (tissue) | Multi-cell | Tissue annotation | Multiple cell sentences | Overlap % |
| Conditional sample generation (cell type) | Multi-cell | Cell types of multiple cells | Multiple cell sentences | Overlap % |
| Conditional sample generation (abstract) | Multi-cell | Paper abstract | Multiple cell sentences | Overlap % |
| Natural language interpretation | Multi-cell | Multiple cell sentences | Paper abstract | BertScore |
| Gene set enumeration | Gene set | Gene set name | List of genes in gene set | Overlap % |
| Gene set naming | Gene set | List of genes in gene set | Gene set name | BertScore |
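The Overlap % metric in this table compares the gene symbols of a generated cell sentence against the ground-truth sentence. A minimal, assumed implementation is shown below; the paper may normalize differently (e.g., over the top-k genes only).

```python
# Sketch: gene-overlap metric for generated cell sentences (normalization choice is an assumption).
def gene_overlap_percent(generated: str, reference: str) -> float:
    """Percentage of reference genes that also appear in the generated cell sentence."""
    gen_genes = set(generated.upper().split())
    ref_genes = set(reference.upper().split())
    if not ref_genes:
        return 0.0
    return 100.0 * len(gen_genes & ref_genes) / len(ref_genes)

generated = "CD3D CD3E TRAC IL7R GZMB"
reference = "CD3D CD3E TRAC IL7R LTB"
print(f"Overlap: {gene_overlap_percent(generated, reference):.1f}%")  # 80.0%
```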

6.3. Ablation Studies / Parameter Analysis

The paper primarily explores the impact of model capacity and training data size as forms of ablation or parameter analysis on C2S-Scale's performance.
  • Model Capacity Scaling (Figure 4C, 4D):
    • C2S-Scale models, ranging from 410M to 27B parameters, consistently show improved performance on tasks like conditional sample generation, cell type annotation, tissue sample annotation, and dataset interpretation as the number of parameters increases. This validates the scaling laws for LLMs in this domain.
    • Even in the parameter-efficient regime (using LoRA), larger base models (e.g., C2S-Scale-27B vs. C2S-Scale-2B) demonstrate superior performance, indicating that the benefits of increased model capacity persist even with efficient fine-tuning techniques.
  • Training Data Size Scaling (Figure 4E):
    • Performance continues to improve as the C2S-Scale-27B model sees more training samples. This highlights the importance of the massive 1-billion-token multimodal corpus for C2S-Scale's capabilities.
  • Impact of Reinforcement Learning (GRPO) (Figure 7C, 8G):
    • For single-cell Question Answering, GRPO significantly boosts performance compared to the SFT baseline and other SOTA LLMs, demonstrating the effectiveness of RL in aligning LLM outputs with biological preferences.
    • In perturbation response prediction, GRPO leads to notable improvements in Kendall's τ, Pearson's r, and scFID, particularly for gene programs of interest (e.g., apoptosis, interferon response). This confirms that GRPO refines the model's ability to generate biologically faithful and relevant perturbation responses.

  • External Biological Knowledge for Spatial Reasoning (Figure 6C):
    • The paper shows that incorporating receptor-ligand interactions (CellPhoneDB) or protein-protein interactions (BioGRID) individually improves spatial neighborhood identification accuracy. Combining both datasets yields the greatest performance gain, confirming the hypothesis that external molecular context enhances spatial reasoning in C2S-Scale.

      These analyses collectively underscore that C2S-Scale benefits substantially from increases in model capacity, training data size, and targeted refinement through advanced techniques like GRPO and external knowledge integration.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces C2S-Scale, a pioneering framework that leverages Large Language Models (LLMs) for next-generation single-cell analysis. By building upon the Cell2Sentence (C2S) transformation, which represents scRNA-seq profiles as textual "cell sentences," the authors have successfully integrated transcriptomic data with biological text and metadata at an unprecedented scale. C2S-Scale models, scaled up to 27 billion parameters, demonstrate consistent performance improvements following LLM scaling laws, proving highly effective in both predictive and generative tasks.

The key contributions include:

  • Scalability and Multimodality: C2S-Scale offers superior scalability and flexibility by operating within the LLM paradigm, natively integrating diverse data types (transcriptomics, biological text, metadata) into a unified framework.

  • State-of-the-Art Performance: It achieves state-of-the-art results across a broad range of challenging tasks, including perturbation response prediction, natural language interpretation, spatial reasoning, and complex biological question answering, consistently outperforming both specialized single-cell models and general-purpose LLMs.

  • Reinforcement Learning for Alignment: The application of Group Relative Policy Optimization (GRPO) significantly refines model outputs, aligning them with biological insights and expert preferences.

  • Novel Evaluation Metric: The introduction of single-cell Fréchet Inception Distance (scFID) provides a more biologically meaningful way to evaluate generative single-cell models.

  • Open Science: The commitment to open-sourcing models and resources will accelerate future research in the field.

    Ultimately, C2S-Scale establishes a powerful platform for next-generation single-cell analysis, moving towards the exciting prospect of "virtual cells" capable of complex biological reasoning and simulation.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

7.2.1. Addressing Limitations of Causal Attention in Gene Expression Modeling

  • Causal Attention Constraint: LLMs typically use causal attention, meaning tokens can only attend to preceding tokens in a sequence. While the cell sentence transformation rank-orders genes by expression (high-to-low), this inherent causal attention bias might theoretically limit the modeling of true causal biological interactions that flow from low- to high-expression genes (e.g., a lowly expressed transcription factor regulating a highly expressed target gene).
  • Authors' Argument: They contend that this constraint does not significantly impede current utility. They suggest that the LLM's general reasoning capabilities can compensate for causal attention limitations, similar to how LLMs in NLP successfully model statistical relations despite word order not perfectly aligning with causal dependencies in language.
  • Future Architectural Enhancements: The authors propose three architectural enhancements to mitigate this causal invariance and improve biological fidelity:
    1. Bidirectional attention: Allowing tokens to attend to both preceding and succeeding tokens.
    2. Non-causal attention: Removing the directional constraint entirely.
    3. Hybrid attention architectures: Blending causal and non-causal attention to selectively model dependencies.
  • Incorporating Biological Priors: They also suggest integrating external biological knowledge (e.g., known regulatory pathways) directly into the model to provide additional context beyond what causal attention can infer.
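The difference between these proposed attention variants reduces to the attention mask. The sketch below contrasts a standard causal mask with a fully bidirectional one and a toy hybrid that is bidirectional only within an initial span (e.g., the cell sentence); this is a generic illustration, not the authors' implementation.

```python
# Sketch: causal vs. bidirectional attention masks over a short token sequence.
import torch

seq_len = 5
# Causal: token i may attend only to tokens j <= i (standard decoder-only LLM behavior).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Bidirectional / non-causal: every token may attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
# Hybrid (toy): bidirectional within the first 3 tokens (e.g., the cell sentence), causal elsewhere.
hybrid_mask = causal_mask.clone()
hybrid_mask[:3, :3] = True

print(causal_mask.int())
print(bidirectional_mask.int())
print(hybrid_mask.int())
```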

7.2.2. Hallucination and Interpretability

  • Hallucination: Like all LLMs, C2S-Scale is susceptible to hallucinations (generating plausible but incorrect or nonsensical information). While less problematic for highly structured tasks like cell type prediction, more open-ended interpretation tasks (e.g., abstract generation, cluster captioning) are more vulnerable to errors.
  • Interpretability: The inherent "black-box" nature of LLMs makes it challenging to fully understand why the model makes certain predictions or generates specific text.
  • Mitigation: The authors suggest RL and prompt engineering as avenues to improve interpretability and reduce hallucinations by aligning the model with desired output characteristics and expert feedback.

7.3. Personal Insights & Critique

The C2S-Scale paper presents a highly innovative and compelling approach to single-cell analysis, effectively bridging the gap between high-dimensional genomic data and natural language.

Strengths and Innovations:

  • Brilliant Transformation: The Cell2Sentence transformation is a stroke of genius. It's a simple yet powerful abstraction that allows the entire LLM ecosystem to be immediately applicable to a complex biological domain. This sidesteps the need for custom Transformer designs for scRNA-seq, leveraging decades of NLP research and readily available LLM architectures.
  • Leveraging LLM Scaling Laws: The paper's empirical validation of LLM scaling laws in the single-cell domain is a significant finding. It suggests that continued investment in larger models and datasets will yield predictable performance gains, a crucial insight for future research.
  • Multimodality is Key: The native integration of transcriptomics, metadata, and biological text is transformative. It allows C2S-Scale to operate at a higher level of biological reasoning, moving beyond pure pattern recognition to contextual understanding. This capacity for natural language interpretation is particularly exciting for making complex biological data more accessible to a broader scientific community.
  • Reinforcement Learning for Domain Alignment: The strategic use of GRPO to fine-tune LLM outputs against domain-specific metrics (like BioBERTScore and scFID) is a robust way to ensure biological fidelity and address potential hallucinations or misinterpretations. This active alignment process is crucial for deploying LLMs in high-stakes scientific applications.
  • The "Virtual Cell" Vision: The concept of "virtual cells" is ambitious and inspiring. C2S-Scale lays a foundational stone for creating models that can not only predict but also reason about and simulate cellular behavior, accelerating drug discovery and personalized medicine.

Potential Issues and Areas for Improvement:

  • Causal Attention Limitation (as acknowledged): While the authors address the causal attention limitation, its true impact on modeling complex gene regulatory networks (where low-expression regulators often dictate high-expression targets) needs more thorough investigation. The proposed architectural enhancements are promising but require empirical validation. Relying solely on LLM reasoning capabilities to compensate might be an optimistic assumption for precise causal inference.
  • Interpretability Challenge (as acknowledged): Despite the advances, the "black-box" nature of LLMs remains a challenge. For biologists, understanding why a prediction is made is often as important as the prediction itself. Future work on explainable AI (XAI) for C2S-Scale would be invaluable to build trust and facilitate biological discovery.
  • Data Bias and Generalizability: The training data, while massive, is still a representation of existing biological knowledge. Biases present in public scRNA-seq atlases or biological literature could propagate into C2S-Scale's representations and reasoning. Continuous auditing and expansion with diverse, unbiased datasets will be necessary.
  • Computational Cost: Training and deploying 27B parameter models, even with LoRA and gradient checkpointing, is computationally intensive. While scaling laws promise benefits, accessibility for smaller labs might remain a barrier. Open-sourcing helps, but resource requirements are still high.
  • Specificity of "Cell Sentences": While rank-ordering preserves relative expression, it loses absolute expression values and the quantitative differences between genes at different ranks. Although Figure 9 suggests minimal information loss for recovery, the impact on subtle biological inferences (e.g., dose-response relationships) should be continuously evaluated.
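To see concretely what rank-ordering keeps and discards, here is a minimal Cell2Sentence-style transformation: rank genes by expression, drop zeros, and keep only the ordered gene names. The cutoff and the absence of any normalization step are simplifying assumptions for illustration.

```python
# Sketch: a minimal Cell2Sentence-style transformation (cutoff/normalization are assumptions).
import numpy as np

def to_cell_sentence(counts: np.ndarray, gene_names: list[str], top_k: int = 100) -> str:
    """Rank genes by expression (high to low), drop zeros, and keep the ordered gene names."""
    expressed = np.flatnonzero(counts > 0)
    order = expressed[np.argsort(-counts[expressed], kind="stable")][:top_k]
    return " ".join(gene_names[i] for i in order)

gene_names = ["CD3D", "CD3E", "TRAC", "IL7R", "LTB", "GZMB"]
counts = np.array([120, 95, 80, 40, 0, 3])
print(to_cell_sentence(counts, gene_names))
# -> "CD3D CD3E TRAC IL7R GZMB": relative ranks survive, absolute counts (120 vs. 95) do not.
```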

Applicability to Other Domains: The Cell2Sentence paradigm has immense transfer potential beyond single-cell RNA-seq. Any high-dimensional, sparse data where features have meaningful "names" and where relative importance matters could be similarly transformed into a "language" for LLMs. Examples include:

  • Proteomics/Metabolomics: Representing protein or metabolite profiles as "sentences" of ranked abundances.

  • Genomics (beyond expression): Encoding genomic variants or epigenomic marks as ordered sequences of features.

  • Clinical Data: Transforming patient records (symptoms, diagnoses, treatments) into structured textual narratives for LLM-based reasoning.

  • Materials Science: Representing material compositions or structural properties as "sentences" of elements or motifs.

    The C2S-Scale paper is a landmark achievement, demonstrating that the LLM paradigm can profoundly reshape how we analyze and interpret complex scientific data, ushering in an era of AI-driven scientific discovery.
