3 Ambient RNA removal

The cell suspension in single nucleus transcriptome sequencing often contains a low-to-moderate concentration of cell-free mRNA or other capturable features, leading to nonzero molecule counts in cell-free droplets. These ‘ambient’ molecules originate from sources such as ruptured or degraded cells, residual cytoplasmic debris (e.g., in snRNA-seq), or exogenous contaminants. This systematic background noise can introduce batch effects and result in spurious differential gene expression. To address this, we apply CellBender, a deep generative model that accurately distinguishes cell-containing droplets from cell-free ones. By learning the background noise profile, it provides end-to-end, noise-free quantification, enhancing the reliability of downstream analysis.

The primary module in the current version of CellBender is remove-background. This module effectively eliminates counts attributed to ambient RNA molecules and random barcode swapping from raw UMI-based scRNA-seq gene-by-cell count matrices. It is recommended to run remove-background as a pre-processing step before any downstream analysis using tools such as Seurat, scanpy, or custom analysis pipelines. The module supports several file formats for count matrices, including:

Raw .h5 files generated by the cellranger count pipeline
The raw_feature_bc_matrix folder produced by cellranger count pipline

3.1 Usage

In this project, we utilize CellBender (version 0.3.0) to process the raw_feature_bc_matrix, removing empty droplets and ambient RNA molecule counts from the count matrices, and estimating the true cells present. Below is an example from one sample from adipose tissue datasets.

3.1.1 Installation

We installed CellBender (v0.3.0) in a dedicated conda environment by following the installation protocol provided by the developers. You can find the detailed instructions on the documentation: https://cellbender.readthedocs.io/en/v0.3.0/installation/index.html.

3.1.2 Getting started

First, activate the own conda enviroment of CellBender:

(base)$ conda activate cellbender-env

Then, run cellbender remove-backage using the command(leave out the flag –cuda if you are not using a GPU):

(cellbender-env)$ cellbender remove-background \
    --input ./raw_feature_bc_matrix.h5 \
    --output ./output/demo.h5 \
    --cuda

(The output filename “demo.h5” can be replaced with a filename of choice.)

The output of remove-background is a new .h5 count matrix with ambient RNA removed. This matrix can be directly used in downstream analyses with tools like Seurat or scanpy, just as if it were the original raw dataset. You can then proceed with per-cell quality control checks and filter out dead or dying cells, as appropriate for your experiment.

3.2 Caveats and hints

3.2.1 Main output files

output_report.html: An HTML report that includes plots, commentary, and any warnings or suggestions for improving parameter settings.

output.h5: A full count matrix in .h5 format with background RNA removed, retaining all the original droplet barcodes.

output.pdf: A PDF file providing a graphical summary of the inference procedure.

3.2.2 Recommendation for setting arguments

As of v0.3.0, users will typically not need to set values for –expected-cells or –total-droplets-included, as CellBender will choose reasonable values based on your dataset. If something goes wrong with these defaults, then you can try to input these arguments manually.

Considerations for setting parameters(from the official documentation):

–epochs: 150 is typically a good choice. Look for a reasonably-converged ELBO value in the output PDF learning curve (meaning it looks like it has reached some saturating value). Though it may be tempting to train for more epochs, it is not advisable to over-train, since this increases the likelihood of over-fitting. (We regularize to prevent over-fitting, but training for more than 300 epochs is too much.)

–expected-cells: Base this on either the number of cells expected a priori from the experimental design, or if this is not known, base this number on the UMI curve as shown below, where the appropriate number would be 5000. Pick a number where you are reasonably sure that all droplets to the left on the UMI curve are real cells.

–total-droplets-included: Choose a number that goes a few thousand barcodes into the “empty droplet plateau”. Include some droplets that you think are surely empty. But be aware that the larger this number, the longer the algorithm takes to run (linear). See the UMI curve below, where an appropriate choice would be 15,000. Every droplet to the right of this number on the UMI curve should be surely-empty. (This kind of UMI curve can be seen in the web_summary.html output from cellranger count.)

–cuda: Include this flag. The code is meant to be run on a GPU.

–learning-rate: The default value of 1e-4 is typically fine, but this value can be adjusted if problems arise during quality-control checks of the learning curve (as above).

–fpr: A value of 0.01 is the default, and represents a fairly conservative setting, which is appropriate for most analyses. In order to examine a single dataset at a time and remove more noise (at the expense of some signal), choose larger values such as 0.05 or 0.1. Bear in mind that the value 1 represents removal of (nearly) every count in the dataset, signal and noise. You can generate multiple output count matrices in the same run by choosing several values: 0.0 0.01 0.05 0.1.

3.3 More information

If you want to get more information, please refer to the documentation and GitHub: https://cellbender.readthedocs.io/en/v0.3.0/index.html https://github.com/broadinstitute/CellBender/tree/v0.3.0