6 Data integration

After filtering, mitochondrial, ribosomal protein-coding and leukocyte antigen genes were removed from these 5 datasets. Batch effect were corrected by applying following integration tools: CCA and RPCA performed in the ‘IntegrateLayers’ function which is a streamlined integrative analysis from Seurat, Harmony as well as scVI(version 1.1.2). We used all genes in scVI integration, and Seurat objects were transcribed into anndata objects using scfetch (version 0.5.0) in R with reticulate (1.37.0).

The integration of single-cell sequencing datasets—such as those from different experimental batches, donors, or conditions—is often a critical step in scRNA-seq workflows. Integrative analysis helps align shared cell types and states across datasets, enhancing statistical power and, more importantly, enabling accurate comparative analysis. In previous versions of Seurat, we introduced anchor-based integration methods to facilitate this process. Additionally, several labs have developed innovative tools for integration, including Harmony and scVI, which have proven to be powerful methods for scRNA-seq analysis. For more information, please refer to our vignette on integrating scRNA-seq data using multiple tools.

6.1 Library all the packages

library(reticulate)
library(scfetch)
library(Seurat)
library(SeuratData)
library(harmony)
library(dplyr)
library(patchwork)
library(sctransform)
library(ggplot2)

6.2 Load all the objects used for batch-effect correction

cohort_1 <- readRDS('./data/DemoIntegrated/cohort_1_s.rds')
cohort_2_s_1 <- readRDS('./data/DemoIntegrated/cohort_2_s_1_s.rds')
cohort_2_s_2 <- readRDS('./data/DemoIntegrated/cohort_2_s_2_s.rds')
cohort_3 <- readRDS('./data/DemoIntegrated/cohort_3_s.rds')
cohort_4_o_1 <- readRDS('./data/DemoIntegrated/cohort_4_o_1_s.rds')
cohort_4_o_2 <- readRDS('./data/DemoIntegrated/cohort_4_o_2_s.rds')
cohort_4_o_3 <- readRDS('./data/DemoIntegrated/cohort_4_o_3_s.rds')
cohort_4_o_4 <- readRDS('./data/DemoIntegrated/cohort_4_o_4_s.rds')
cohort_4_o_5 <- readRDS('./data/DemoIntegrated/cohort_4_o_5_s.rds')

6.3 Pre-processing before data integration

Add Donor ID as batch information

cohort_1@meta.data$sampleid <- 'ID1'
cohort_3@meta.data$sampleid <- 'ID2'
cohort_2_s_1@meta.data$sampleid <- 'ID3'
cohort_2_s_2@meta.data$sampleid <- 'ID4'
cohort_4_o_1@meta.data$sampleid <- 'ID5'
cohort_4_o_2@meta.data$sampleid <- 'ID6'
cohort_4_o_3@meta.data$sampleid <- 'ID7'
cohort_4_o_4@meta.data$sampleid <- 'ID8'
cohort_4_o_5@meta.data$sampleid <- 'ID9'

Create a object which contains multiple ‘layers’ to help to integrate all samples together, so that we can jointly identify cell subpopulations across datasets, and then explore how each group differs across conditions.

NC_list <- list(cohort_1, cohort_2_s_1, cohort_2_s_2, cohort_3,
                cohort_4_o_1, cohort_4_o_2, cohort_4_o_3, cohort_4_o_4, cohort_4_o_5)

for (i in 1:length(NC_list)) {
  DefaultAssay(NC_list[[i]]) <- 'RNA'
  NC_list[[i]] <- SCTransform(NC_list[[i]], vars.to.regress = "percent.mt", verbose = FALSE)
}

NC_merged <- merge(x = NC_list[[1]],
                     y = NC_list[2:length(NC_list)])

In order to better identify the cell types, we plan to remove MT-/RP-/HLA- related genes before integration

gene_mt <- rownames(NC_merged)[grep("^MT-", rownames(NC_merged))]
gene_rb <- rownames(NC_merged)[grep("^RP[SL]", rownames(NC_merged))]
gene_hla <- rownames(NC_merged)[grep("^HLA-", rownames(NC_merged))]

gene_rb <- as.data.frame(gene_rb)
gene_hla <- as.data.frame(gene_hla)
gene_mt <- as.data.frame(gene_mt)
gene <- rownames(NC_merged)
gene <- as.data.frame(gene)

gene$type <- gene$gene %in% gene_rb$gene_rb
gene <- gene[gene$type == "FALSE",]

gene$type <- gene$gene %in% gene_hla$gene_hla
gene <- gene[gene$type == "FALSE",]

gene$type <- gene$gene %in% gene_mt$gene_mt
gene <- gene[gene$type == "FALSE",]
gene_list <- gene$gene

for (i in 1: length(NC_list)) {
  DefaultAssay(NC_list[[i]]) <- 'RNA'
  counts <- GetAssayData(NC_list[[i]], layer = 'counts')
  counts <- counts[rownames(counts) %in% gene_list, ]
  meta.data <- NC_list[[i]]@meta.data
  NC_list[[i]] <- CreateSeuratObject(counts = counts,
                                       meta.data = meta.data)
  
  NC_list[[i]] <- SCTransform(NC_list[[i]], vars.to.regress = "percent.mt", verbose = FALSE)
}

NC_merged <- merge(NC_list[[1]], 
                     y = NC_list[2:length(NC_list)],
                     merge.data = TRUE)
NC_merged

6.4 Peform analysis without integration

We begin by analyzing the dataset without performing integration to assess whether batch-effect correction is necessary. The resulting clusters are defined by both cell type and stimulation condition, which poses challenges for downstream analysis.

DefaultAssay(NC_merged) <- 'RNA'

NC_merged <- SCTransform(NC_merged, verbose = T) 

NC_merged <- RunPCA(NC_merged, npcs = 100, verbose = T)
ElbowPlot(NC_merged, ndims = 50)

NC_merged <- RunUMAP(NC_merged, dims = 1:30, reduction = "pca", reduction.name = "unintegrated")

DimPlot(NC_merged, reduction = "unintegrated", group.by = "sampleid")

6.5 Perform integration

We now aim to integrate data from all the donors, so that cells from the same cell type/subpopulation will cluster together.For data integration, our goal is not to remove biological differences across conditions or different batches, but to learn shared cell types/states in an initial step - specifically because that will enable us to compare control stimulated and control profiles for these individual cell types.

Seurat v5 supports a more streamlined integrated analysis through the use of the IntegrateLayers function. The current method supports five integration methods. Each of these methods performs integration in a low-dimensional space and returns a dimensionality reduction (i.e. integrated.rpca) designed to embed shared unit types across batches, here we use three of them:

Anchor-based CCA integration (method=CCAIntegration)
Anchor-based RPCA integration (method=RPCAIntegration)
Harmony (method=HarmonyIntegration)

6.5.1 Anchor-based CCA integration

The Seurat v5 integration procedure aims to return a single dimensional reduction that captures the shared sources of variance across multiple layers, so that cells in a similar biological state will cluster. The method returns a dimensional reduction (i.e. integrated.cca) which can be used for visualization and unsupervised clustering analysis. For evaluating performance, we can use cell type labels that are pre-loaded in the seurat_annotations metadata column.

NC_merged <- IntegrateLayers(
  object = NC_merged, method = CCAIntegration,
  normalization.method = "SCT", 
  verbose = FALSE)

NC_merged <- FindNeighbors(NC_merged, reduction = "integrated.dr", dims = 1:30, verbose = FALSE)
NC_merged <- FindClusters(NC_merged, verbose = FALSE)
NC_merged <- RunUMAP(NC_merged, reduction = "integrated.dr", dims = 1:30, reduction.name = "cca", verbose = FALSE)

6.5.2 Anchor-based RPCA integration

NC_merged <- IntegrateLayers(
  object = NC_merged, method = RPCAIntegration,
  normalization.method = "SCT", 
  new.reduction = "integrated.rpca",
  verbose = FALSE)

NC_merged <- FindNeighbors(NC_merged, reduction = "integrated.rpca", dims = 1:30, verbose = FALSE)
NC_merged <- FindClusters(NC_merged, verbose = FALSE)
NC_merged <- RunUMAP(NC_merged, reduction = "integrated.rpca", dims = 1:30, reduction.name = "rpca", verbose = FALSE)

6.5.3 Harmony

NC_merged <- IntegrateLayers(
  object = NC_merged, method = HarmonyIntegration,
  normalization.method = "SCT", 
  new.reduction = "integrated.harmony",
  verbose = FALSE)

NC_merged <- FindNeighbors(NC_merged, reduction = "integrated.harmony", dims = 1:30, verbose = FALSE)
NC_merged <- FindClusters(NC_merged, verbose = FALSE)
NC_merged <- RunUMAP(NC_merged, reduction = "integrated.harmony", dims = 1:30, reduction.name = "harmony", verbose = FALSE)

6.5.4 scVI

scvi-tools (single-cell variational inference tools) is a package for probabilistic modeling of single-cell omics data, built on top of PyTorch and AnnData. The package hosts implementations of several models that perform a wide range of single-cell data analysis tasks, as well as the building blocks to rapidly prototype new probabilistic models.

For using scVI in R, scVI integration requires installing reticulate and scfetch as well as scvi-tools and its depencies in its environment. You can find the detailed instructions on the documentation: https://scvi-tools.org/get_started/

use_condaenv("~/miniconda3/envs/scvi", required=T)
sc <- import("scanpy", convert = FALSE)
scvi <- import("scvi", convert = FALSE)

# prepare an anndata
NC_merged@assays$SCT@data <- NC_merged@assays$SCT@counts
dim(NC_merged@assays$SCT@data)
ExportSeurat(seu.obj = NC_merged, assay = "SCT", to = "AnnData",
             conda.path = "~/miniconda3/envs/scvi",
             anndata.file = "./data/DemoIntegrated/merged.h5ad")

adata <- sc$read('./data/DemoIntegrated/merged.h5ad')

# run setup_anndata
scvi$model$SCVI$setup_anndata(adata, batch_key = 'sampleid') 
model = scvi$model$SCVI(adata)

# train the model
model$train()
#  get the latent represention
latent = model$get_latent_representation()

# put it back in the original Seurat object
latent <- as.matrix(latent)
rownames(latent) = colnames(NC_merged)
NC_merged[["integrated.scvi"]] <- CreateDimReducObject(embeddings = latent, key = "scvi_", assay = DefaultAssay(NC_merged))

NC_merged <- FindNeighbors(NC_merged, reduction = "integrated.scvi", dims = 1:10, verbose = FALSE)
NC_merged <- FindClusters(NC_merged, resolution = 0.6, verbose = FALSE)

NC_merged <- RunUMAP(NC_merged, dims = 1:10, reduction = 'integrated.scvi', reduction.name = 'scvi', verbose = FALSE)

6.5.5 scANVI

# run setup_anndata
model = scvi$model$SCANVI(adata)

# train the model
model$train()
#  get the latent represention
latent = model$get_latent_representation()

# put it back in the original Seurat object
latent <- as.matrix(latent)
rownames(latent) = colnames(NC_merged)
NC_merged[["integrated.scanvi"]] <- CreateDimReducObject(embeddings = latent, key = "scanvi_", assay = DefaultAssay(NC_merged))

NC_merged <- FindNeighbors(NC_merged, reduction = "integrated.scanvi", dims = 1:10, verbose = FALSE)
NC_merged <- FindClusters(NC_merged, resolution = 0.6, verbose = FALSE)

NC_merged <- RunUMAP(NC_merged, dims = 1:10, reduction = 'integrated.scanvi', reduction.name = 'scanvi', verbose = FALSE)

6.6 Visualition

reduction_name <- c('unintegrated','rpca','cca','harmony','scvi','scanvi')
for (i in 1:length(reduction_name)) {
  a <- DimPlot(NC_merged, reduction = reduction_name[i], label = TRUE) + NoLegend()
  b <- DimPlot(NC_merged, reduction = reduction_name[i], group.by = "sampleid")
  c <- DimPlot(NC_merged, reduction = reduction_name[i], group.by = "celltype")
  
  print(a+b+c)
}

## R version 4.3.2 (2023-10-31)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.4
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Shanghai
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##   [1] deldir_2.0-4           pbapply_1.7-2          gridExtra_2.3          rlang_1.1.4            magrittr_2.0.3         RcppAnnoy_0.0.22      
##   [7] spatstat.geom_3.3-3    matrixStats_1.4.1      ggridges_0.5.6         compiler_4.3.2         png_0.1-8              vctrs_0.6.5           
##  [13] reshape2_1.4.4         stringr_1.5.1          pkgconfig_2.0.3        fastmap_1.2.0          utf8_1.2.4             promises_1.3.0        
##  [19] rmarkdown_2.28         purrr_1.0.2            xfun_0.47              cachem_1.1.0           jsonlite_1.8.9         goftest_1.2-3         
##  [25] highr_0.11             later_1.3.2            spatstat.utils_3.1-0   irlba_2.3.5.1          parallel_4.3.2         cluster_2.1.6         
##  [31] R6_2.5.1               ica_1.0-3              bslib_0.8.0            spatstat.data_3.1-2    stringi_1.8.4          RColorBrewer_1.1-3    
##  [37] reticulate_1.39.0      spatstat.univar_3.0-1  parallelly_1.38.0      jquerylib_0.1.4        lmtest_0.9-40          scattermore_1.2       
##  [43] Rcpp_1.0.13            bookdown_0.40          knitr_1.48             tensor_1.5             future.apply_1.11.2    zoo_1.8-12            
##  [49] sctransform_0.4.1      httpuv_1.6.15          Matrix_1.6-5           splines_4.3.2          igraph_2.0.3           tidyselect_1.2.1      
##  [55] viridis_0.6.5          abind_1.4-8            rstudioapi_0.16.0      yaml_2.3.10            spatstat.random_3.3-2  codetools_0.2-20      
##  [61] miniUI_0.1.1.1         spatstat.explore_3.3-2 listenv_0.9.1          lattice_0.22-6         tibble_3.2.1           plyr_1.8.9            
##  [67] shiny_1.9.1            ROCR_1.0-11            evaluate_1.0.0         Rtsne_0.17             future_1.34.0          fastDummies_1.7.4     
##  [73] survival_3.7-0         polyclip_1.10-7        fitdistrplus_1.2-1     pillar_1.9.0           Seurat_5.1.0           KernSmooth_2.23-24    
##  [79] plotly_4.10.4          generics_0.1.3         RcppHNSW_0.6.0         sp_2.1-4               ggplot2_3.5.1          munsell_0.5.1         
##  [85] scales_1.3.0           globals_0.16.3         xtable_1.8-4           glue_1.7.0             lazyeval_0.2.2         tools_4.3.2           
##  [91] data.table_1.16.0      RSpectra_0.16-2        RANN_2.6.2             leiden_0.4.3.1         dotCall64_1.1-1        cowplot_1.1.3         
##  [97] grid_4.3.2             tidyr_1.3.1            colorspace_2.1-1       nlme_3.1-166           patchwork_1.3.0        cli_3.6.3             
## [103] spatstat.sparse_3.1-0  spam_2.10-0            fansi_1.0.6            viridisLite_0.4.2      dplyr_1.1.4            uwot_0.2.2            
## [109] gtable_0.3.5           sass_0.4.9             digest_0.6.37          progressr_0.14.0       ggrepel_0.9.6          htmlwidgets_1.6.4     
## [115] SeuratObject_5.0.2     farver_2.1.2           htmltools_0.5.8.1      lifecycle_1.0.4        httr_1.4.7             mime_0.12             
## [121] MASS_7.3-60.0.1