Introduction

This SCIPAC R package contains the proposed SCIPAC algorithm (function SCIPAC), which is a computational method that identifies cells in single-cell data that are associated with a given phenotype.

In this tutorial (R version: 4.1.2), we use several examples to explain the functions in SCIPAC package. We first load the required package:

library(SCIPAC)

SCIPAC examples

SCIPAC uses input data consisting of three parts: single-cell RNA-seq data that measures the expression of $p$ genes in $m$ cells, bulk RNA-seq data that measures the expression of the same set of $p$ genes in $n$ samples/tissues, and the phenotype of each sample in the bulk RNA-seq data.

Applying SCIPAC to data with a binary phenotype

In this example, we use SCIPAC for simulated scRNA-seq and bulk RNA-seq data with a binary phenotype.

Preparing scRNA-seq and bulk RNA-seq data

We first load the scRNA-seq count data

sc.dat <- readRDS(system.file("extdata", "sim.raw.scRNA.rds", package = "SCIPAC"))
# Check the first few rows and columns.
print(sc.dat[1:4, 1:10])

##       Cell1 Cell2 Cell3 Cell4 Cell5 Cell6 Cell7 Cell8 Cell9 Cell10
## Gene1     0     0     0     0     0     0     0     0     0      0
## Gene2     2     0     0     0     1     4     0     3     4      1
## Gene3     0     3     0     5     6     0     1     5     0      5
## Gene4     0     0     0     0     0     0     1     1     0      2

In the simulated scRNA-seq data, rows stand for genes and columns stand for cells. The dimension of this data is

dim(sc.dat)

## [1] 10000 10000

The corresponding meta data for this scRNA-seq data is

sc.meta <- readRDS(system.file("extdata", "sim.raw.scRNA.meta.rds", package = "SCIPAC"))
# Check the first few rows
head(sc.meta)

We next check the cell types

table(sc.meta$Group)

## 
## Group1 Group2 Group3 
##   7991   1013    996

There are three cell types, which are Group1, Group2, and Group3. The Group2 and Group3 cells are phenotype-associated cells while Group1 are not.

The bulk RNA-seq data is loaded in the following

bulk.dat <- readRDS(system.file("extdata", "bulk.dat.raw.rds", package = "SCIPAC"))
# Check the first few rows and columns
print(bulk.dat[1:4, 1:6])

##         pheno1.1   pheno1.2   pheno1.3   pheno1.4   pheno1.5   pheno1.6
## Gene1 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## Gene2 0.37285248 0.37344484 0.37137879 0.37485306 0.37901400 0.37648699
## Gene3 0.25711978 0.26110376 0.26428917 0.26759723 0.26267130 0.26597588
## Gene4 0.06758342 0.06710196 0.06853341 0.06572419 0.06809027 0.06984578

The bulk RNA-seq data has been normalized. Rows stand for genes and columns stand for cells.

The corresponding meta data for this bulk data is

bulk.meta <- readRDS(system.file("extdata", "bulkRNA.meta.rds", package = "SCIPAC"))
# Check the first few rows
head(bulk.meta)

We next check the phenotypes

table(bulk.meta$phenotype)

## 
## phenotype.1 phenotype.2 
##          50          50

There are 100 bulk samples with two classes, which are phenotype.1 and phenotype.2. This can be simply considered as cancer vs. normal samples.

Since we have already had all the input data, let’s perform the pre-processing steps.

sc.bulk.prec <- preprocess.sc.bulk.dat(sc.dat, bulk.dat, hvg = 1000)

The preprocess.sc.bulk.dat function is embedded in SCIPAC package, and it first finds the overlap genes of the scRNA-seq and bulk RNA-seq data and uses these genes for scRNA-seq and bulk data normalization. The normalization of the scRNA-seq data in SCIPAC uses the normalization steps in Seurat R package and the 1000 highly variable genes are identified. For the bulk RNA-seq data, since it has been normalized, we further take the log-transformation and use the 1000 highly variable genes identified in scRNA-seq data. Users can also use their own pre-processing steps for scRNA-seq and bulk RNA-seq data, as long as they have the identical set of features. The pre-processed scRNA-seq and bulk RNA-seq data are

sc.dat.prep <- sc.bulk.prec$sc.dat.preprocessed
bulk.dat.prep <- sc.bulk.prec$bulk.dat.preprocessed

Let’s check the dimension for the pre-processed data

dim(sc.dat.prep)

## [1]  1000 10000

dim(bulk.dat.prep)

## [1] 1000  100

For both the pre-processed scRNA-seq and bulk RNA-seq data, rows stand for genes and columns stand for cells or bulk samples. The 1000 highly variable genes are used for downstream analysis.

We next perform dimension reduction on sc.dat.prep and bulk.dat.prep.

pca.res <- sc.bulk.pca(sc.dat.prep, bulk.dat.prep, do.pca.sc = FALSE, n.pc = 60)

The sc.bulk.pca function is embedded in SCIPAC package, and it uses PCA to perform the dimension reduction. By setting do.pca.sc = FALSE, sc.bulk.pca first performs PCA on the bulk data and then uses the rotation matrix on scRNA-seq data. If do.pca.sc = TRUE, the procedure is reversed. The number of chosen PCs, by default, is n.pc = 60. Users can also use their own dimension reduction methods. The outputs are

sc.dat.dim.redu <- pca.res$sc.dat.rot
bulk.dat.dim.redu <- pca.res$bulk.dat.rot

Let’s also check the dimension for the dimension-reduced data

dim(sc.dat.dim.redu)

## [1] 10000    60

dim(bulk.dat.dim.redu)

## [1] 100  60

For the dimension-reduced scRNA-seq and bulk RNA-seq data, rows stand for cells or bulk samples and columns stand for the first 60 PCs.

Clustering

After pre-processing, we cluster the scRNA-seq data into K groups and obtain K cluster centroids before using SCIPAC to identify the phenotype-associated cells.

ct.res <- seurat.ct(sc.dat.dim.redu, res = 1.5)

## Computing nearest neighbor graph

## Computing SNN

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10000
## Number of edges: 204925
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.4821
## Number of communities: 9
## Elapsed time: 0 seconds

## 5 singletons identified. 4 final clusters.

The seurat.ct function is embedded in SCIPAC package. It uses the clustering steps in Seurat R package. Users can also use their own clustering methods to annotate each cell. The parameter res controls resolutions. A larger values of res leads to more clusters. We set res = 1.5 in our example.

summary(ct.res)

##               Length Class      Mode   
## k               1    -none-     numeric
## ct.assignment   1    data.frame list   
## centers       240    -none-     numeric

There are three elements in the list ct.res, which are k, ct.assignment, and centers. Let’s look at them one by one.

k <- ct.res$k
print(k)

## [1] 4

The element k is a numeric value, indicating the total number of clusters.

ct.assignment <- ct.res$ct.assignment
head(ct.assignment)

The element ct.assignment is a data frame with one column named cluster_assignment. The cluster assignments should be numeric values. Row names of ct.assignment are cell IDs.

centers <- ct.res$centers
print(centers[1:6, 1:3])

##              [,1]         [,2]         [,3]
## PC1 -0.0685890640 -0.309812316  0.846524159
## PC2  0.5173826752 -1.941603773 -2.002010893
## PC3 -0.0138228728  0.024154444  0.091536070
## PC4 -0.0088310989  0.066406564  0.011236493
## PC5  0.0088485606 -0.043443053 -0.022339515
## PC6  0.0005926077  0.002267678 -0.004783318

The element centers is a data matrix, whose rows are PCs and columns are the centroids of each cluster.

For the clustering step, users do not have to use the function seurat.ct to obtain clusters. As long as users can generate a list with three elements named k, ct.assignment, and centers, as described above, it should work in the downstream analysis. Please note that the column name of ct.assignment should be named cluster_assignment.

Executing SCIPAC to identify phenotype-associated cells

We first create the binary labels for bulk samples

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

y <- c(rep(0, 50), rep(1, 50)) %>% as.factor()

The first 50 bulk samples are treated as class 0 and the remaining 50 samples are treated as class 1. In real application, we can say the first 50 samples are normal samples and the remaining 50 samples are cancer samples.

After obtaining all the necessary input, we can finally use SCIPAC to identify the phenotype-associated cells.

SCIPAC.res <- SCIPAC(bulk.dat = bulk.dat.dim.redu, y = y, 
                     family = "binomial", ct.res = ct.res, 
                     ela.net.alpha = 0.4, bt.size = 50, 
                     numCores = 7, CI.alpha = 0.05, nfold = 10)

In the SCIPAC function, we set family = "binomial" since we use binary input. For other parameters ela.net.alpha, bt.size, numCores, and CI.alpha, we use the default values in SCIPAC. We next have a look at the output results.

head(SCIPAC.res)

The rownames of SCIPAC.res are cell IDs. There are six columns in SCIPAC.res. The cluster_assignment is the cluster indices for each cell. Lambda.est is the estimated strength of the association between cells and a given phenotype. Lambda.upper and Lambda.lower are the upper and lower confidence intervals for Lambda.est. sig contains values Not.sig, Sig.pos and Sig.neg. It roughly describes whether a cell is significantly positive- or negative-associated with a phenotype. log.pval is the $\log_{10}$(p-values) for Lambda.est. Note that the absolute values of log.pval is the absolute values of $\log_{10}$(p-values). The signs of log.pval correspond to postive or negative values of Lambda.est.

Visualization of results

We use UMAP to visually present SCIPAC’s output. We first load necessary R packages and function for creating plots and do some adjustments for our data.

library(uwot)

## Loading required package: Matrix

library(ggplot2)
library(ggrepel)
library(ggthemes)
library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggthemes':
## 
##     theme_map

library(ggnewscale)

do_umap <- function(V) {
  umap(
    X = V,
    n_threads = 6,
    n_neighbors = 30L,
    n_components = 2L,
    metric = 'cosine',
    n_epochs = NULL,
    learning_rate = 1.0,
    min_dist = 0.3,
    spread = 1.0,
    set_op_mix_ratio = 1.0,
    local_connectivity = 1L,
    repulsion_strength = 1,
    negative_sample_rate = 1,
    a = NULL,
    b = NULL,
    fast_sgd = FALSE,
    verbose = FALSE
  )
}

rownames(sc.meta) <- sc.meta$Cell
all(rownames(sc.meta) == rownames(SCIPAC.res))

## [1] TRUE

# Use UMAP to obtain two-dimension representations.
sc.umap <- do_umap(sc.dat.dim.redu)
colnames(sc.umap) <- c("umap.X1", "umap.X2")

# Create the data frame for ggplot
plot.dat <- cbind(sc.meta, SCIPAC.res)
all(rownames(plot.dat) == rownames(sc.umap))

## [1] TRUE

plot.dat <- cbind(plot.dat, sc.umap)

# Assign 0 values of log.pval to insignificant cells
plot.dat$log.pval.adj <- ifelse((plot.dat$sig == "Not.sig"),
                                0, plot.dat$log.pval)

# Adjust the cell type lables and create the cell type plot
plot.dat$phe <- NA
plot.dat[which(plot.dat$Group == "Group1"), ]$phe <- "Cell type 3"
plot.dat[which(plot.dat$Group == "Group2"), ]$phe <- "Cell type 2"
plot.dat[which(plot.dat$Group == "Group3"), ]$phe <- "Cell type 1"

plot.dat$phe <- factor(plot.dat$phe, levels = c("Cell type 1",
                                                "Cell type 2",
                                                "Cell type 3"))
colors.cell <- c("red", "blue", "lightgrey")
p1 <- ggplot(plot.dat) +
  geom_point(aes(x=umap.X1, y=umap.X2, col = phe, fill = phe)) +
  scale_color_manual(name = "Phenotype", values = colors.cell) +
  scale_fill_manual(name = "Phenotype", values = colors.cell) +
  guides(colour = guide_legend(override.aes = list(size=4))) +
  theme(legend.title = element_blank(),
        legend.position= "top",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1),
        legend.key = element_blank(),
        legend.background = element_rect(fill = "white", size = 0.1))+
  labs(x = "UMAP 1", y = "UMAP 2")

# Control the range of the estimate of Lambda.Psi within -2 and 2
plot.dat$Lambda.adj <- plot.dat$Lambda.est
if(sum(plot.dat$Lambda.est <= -2) > 0){
  plot.dat[which(plot.dat$Lambda.est <= -2), ]$Lambda.adj <- -2
}
if(sum(plot.dat$Lambda.est >= 2) > 0){
  plot.dat[which(plot.dat$Lambda.est >= 2), ]$Lambda.adj <- 2
}
p2 <- ggplot(plot.dat) + 
  geom_point(aes(x=umap.X1, y=umap.X2, col=Lambda.adj)) +
  scale_color_gradient2(expression(Lambda), low="darkblue",high='red', mid = "white", midpoint = 0,
                        limits = c(-2, 2), breaks = c(-2, -1, 0, 1, 2),
                        labels = c(expression(phantom(x) <= -2), -1, 0, 1, expression(phantom(x) >= 2))) +
  theme(legend.title = element_text(family = "sans", size = 15),
        legend.position= "top",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1))+ 
  labs(x = "UMAP 1", y = "UMAP 2")

# Control the range of log.pval.adj within -3 and 3
plot.dat$pval.adj.adj <- plot.dat$log.pval.adj
if(sum(plot.dat$log.pval.adj <= -3) > 0){
  plot.dat[which(plot.dat$log.pval.adj <= -3), ]$pval.adj.adj <- -3
}
if(sum(plot.dat$log.pval.adj >= 3) > 0){
  plot.dat[which(plot.dat$log.pval.adj >= 3), ]$pval.adj.adj <- 3
}

plot.dat.positive <- plot.dat[which(plot.dat$pval.adj.adj > 0), ]
plot.dat.positive$pval.adj.adj <- -plot.dat.positive$pval.adj.adj

plot.dat.negative <- plot.dat[which(plot.dat$log.pval.adj <= 0), ]

p3 <- ggplot() + 
  geom_point(data = plot.dat.negative, aes(x=umap.X1, y=umap.X2, col=pval.adj.adj)) +
  scale_color_gradient2('   ', low="darkblue",high='white', mid = "white", midpoint = 0,
                        limits = c(-3, -1.3), breaks = c(-3, -2, -1.3), 
                        labels = c(expression(phantom(x) <= 0.001), 0.01, 0.05),
                        na.value = "white") +
  new_scale_color() + 
  geom_point(data = plot.dat.positive, aes(x=umap.X1, y=umap.X2, col=pval.adj.adj)) +
  scale_color_gradient2("p-values", low="red",high='white', mid = "white", midpoint = 0,
                        limits = c(-3, -1.3), breaks = c(-3, -2, -1.3), 
                        labels = c(expression(phantom(x) <= 0.001), 0.01, 0.05),
                        na.value = "white") +
  theme(legend.title = element_text(family = "sans", size = 15),
        legend.position= "top",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1))+ 
  labs(x = "UMAP 1", y = "UMAP 2")

Let’s view the final plot results. It can be noted that SCIPAC correctly captures the two phenotype-associated cell types most of time in this simulated data with default settings.

cowplot::plot_grid(p1, p2, p3, nrow = 1)

Applying SCIPAC to data with an ordinal phenotype

Next, we use another simulated scRNA-seq data and bulk RNA-seq data with ordinal phenotype information in SCIPAC.

Clustering

The pre-processing steps are the same as those mentioned in the previous section. Therefore, we omit the pre-processing steps and directly load the data.

# Load the pre-processed and dimension-reduced scRNA-seq data
sc.dat.ordinal <- readRDS(system.file("extdata", "sc.dat.ordinal.rds", package = "SCIPAC"))
# Load the meta data for scRNA-seq data
sc.meta.ordinal <- readRDS(system.file("extdata", "sim.raw.scRNA.meta.ordinal.rds", package = "SCIPAC"))
# Load the pre-processed and dimension-reduced bulk RNA-seq data
bulk.dat.ordinal <- readRDS(system.file("extdata", "bulk.dat.ordinal.rds", package = "SCIPAC"))
# Load the meta data for bulk RNA-seq data
bulk.meta.ordinal <- readRDS(system.file("extdata", "bulkRNA.meta.ordinal.rds", package = "SCIPAC"))

Let’s check the first few lines of sc.meta.ordinal and bulk.meta.ordinal

head(sc.meta.ordinal)

table(sc.meta.ordinal$Group)

## 
## Path1 Path2 Path3 Path4 
##  2491  2516  2479  2514

There are four paths in this generated scRNA-seq data and each path has more or less the same number of cells.

head(bulk.meta.ordinal)

table(bulk.meta.ordinal$Stages)

## 
##   Stage I  Stage II Stage III  Stage IV 
##        50        50        50        50

The bulk samples are divided into four stages, representing different developmental stages.

Similar to the steps for the binary information, we first perform the clustering.

ct.res.ordinal <- seurat.ct(sc.dat.ordinal, res = 2.0)

## Computing nearest neighbor graph

## Computing SNN

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10000
## Number of edges: 608841
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7489
## Number of communities: 16
## Elapsed time: 0 seconds

Executing SCIPAC

Before applying SCIPAC, we first obtain the input phenotype labels y.

all(rownames(bulk.dat.ordinal) == rownames(bulk.meta.ordinal)) # TRUE

## [1] TRUE

y <- factor(bulk.meta.ordinal$Stages, levels = c("Stage I", "Stage II", "Stage III", "Stage IV"))

Let’s run SCIPAC for ordinal data.

SCIPAC.res.ordinal <- SCIPAC(bulk.dat = bulk.dat.ordinal, y = y,
                             family = "cumulative", ct.res = ct.res.ordinal,
                             ela.net.alpha = 0.4, bt.size = 50,
                             numCores = 7, CI.alpha = 0.05, nfold = 10)

We set family = "cumulative" for ordinal input y and use the default values for other parameters.

Visualization of results

Let’s view the results generated by SCIPAC. For the cell paths, we use two-dimensional PCA to visualize the results.

# Load the two-dimensional PCA to scRNA-seq data
sc.pca.ordinal <- readRDS(system.file("extdata", "sc.pca.ordinal.rds", package = "SCIPAC"))

We first generate the input for ggplot.

rownames(sc.meta.ordinal) <- sc.meta.ordinal$Cell
sc.meta.ordinal <- sc.meta.ordinal[rownames(SCIPAC.res.ordinal), ]
plot.dat <- cbind(sc.meta.ordinal, SCIPAC.res.ordinal)
plot.dat <- cbind(plot.dat, sc.pca.ordinal)

# Assign 0 values of log.pval to those insignificant cells 
plot.dat$log.pval.adj <- ifelse((plot.dat$sig == "Not.sig"),
                                0, plot.dat$log.pval)

# Plot the paths
colors.cell <- tableau_color_pal("Classic 20",
                                 direction = 1)(length(unique(plot.dat$Group)))
p1 <- ggplot(plot.dat) +
  geom_point(aes(x=PC1, y=PC2, col = Group, fill = Group)) +
  scale_color_manual(name = "Differentiation path", values = colors.cell) +
  scale_fill_manual(name = "Differentiation path", values = colors.cell) +
  guides(colour = guide_legend(override.aes = list(size=4))) +
  theme(legend.position= "top",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1),
        legend.key = element_blank(),
        legend.background = element_rect(fill = "white", size = 0.1))+
  labs(x = "PC 1", y = "PC 2")

# Plot the steps
plt <- plot.dat %>%
  ggplot(aes(x = PC1, y = PC2, col = Step)) +
  theme_tufte(base_size = 12) +
  theme(panel.background = element_rect(fill = NA, color = "black")) +
  scale_color_gradient2(name = "Differentiation step", low = "blue", high = "red", mid = "white", midpoint = 1000) +
  theme(plot.title = element_text(hjust = 0.5, family = "sans"),
        legend.key.size = unit(1, "cm"),
        legend.position= "top",
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  labs(x = "PC 1", y = "PC 2")
p2 <- plt + geom_point()

# Control the range of the estimate of Lambda.Psi within -2 and 2
plot.dat$Lambda.adj <- plot.dat$Lambda.est
if(sum(plot.dat$Lambda.est <= -2) > 0){
  plot.dat[which(plot.dat$Lambda.est <= -2), ]$Lambda.adj <- -2
}
if(sum(plot.dat$Lambda.est >= 2) > 0){
  plot.dat[which(plot.dat$Lambda.est >= 2), ]$Lambda.adj <- 2
}

p3 <- ggplot(plot.dat) + 
  geom_point(aes(x=PC1, y=PC2, col=Lambda.adj)) +
  scale_color_gradient2(expression(Lambda), low="darkblue",high='red', mid = "white", midpoint = 0,
                        limits = c(-2, 2), breaks = c(-2, -1, 0, 1, 2),
                        labels = c(expression(phantom(x) <= -2), -1, 0, 1, expression(phantom(x) >= 2))) +
  theme(legend.title = element_text(family = "sans", size = 27),
        legend.position= "top",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1))+ 
  labs(x = "PC 1", y = "PC 2")

# Control the range of log.pval.adj within -3 and 3
plot.dat$pval.adj.adj <- plot.dat$log.pval.adj
if(sum(plot.dat$log.pval.adj <= -3) > 0){
  plot.dat[which(plot.dat$log.pval.adj <= -3), ]$pval.adj.adj <- -3
}
if(sum(plot.dat$log.pval.adj >= 3) > 0){
  plot.dat[which(plot.dat$log.pval.adj >= 3), ]$pval.adj.adj <- 3
}

plot.dat.positive <- plot.dat[which(plot.dat$pval.adj.adj > 0), ]
plot.dat.positive$pval.adj.adj <- -plot.dat.positive$pval.adj.adj

plot.dat.negative <- plot.dat[which(plot.dat$log.pval.adj <= 0), ]

p4 <- ggplot() + 
  geom_point(data = plot.dat.positive, aes(x=PC1, y=PC2, col=pval.adj.adj)) +
  scale_color_gradient2("p-values", low="red",high='white', mid = "white", midpoint = 0,
                        limits = c(-3, -1.3), breaks = c(-3, -2, -1.3), 
                        labels = c(expression(phantom(x) <= 0.001), 0.01, 0.05),
                        na.value = "white") +
  new_scale_color() + 
  geom_point(data = plot.dat.negative, aes(x=PC1, y=PC2, col=pval.adj.adj)) +
  scale_color_gradient2('   ', low="darkblue",high='white', mid = "white", midpoint = 0,
                        limits = c(-3, -1.3), breaks = c(-3, -2, -1.3), 
                        labels = c(expression(phantom(x) <= 0.001), 0.01, 0.05),
                        na.value = "white") +
  theme(legend.position= "top",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1))+ 
  labs(x = "PC 1", y = "PC 2")

cowplot::plot_grid(p1, p2, p3, p4, nrow = 2)

Applying SCIPAC to data with survival information

In this section, we use the Lung Adenocarcinoma (LUAD) scRNA-seq data to show how to use SCIPAC in real applications with survival information. This LUAD scRNA-seq data is obtained from the following paper. For convenience, we put the pre-processed and dimension-reduced scRNA-seq data in the R package with its corresponding meta data, as loaded below.

# Load the pre-processed and dimension-reduced LUAD scRNA-seq data
LUAD.sc.dat <- readRDS(system.file("extdata", "LUAD.sc.dat.rds", package = "SCIPAC"))
# Load the meta data for LUAD scRNA-seq data
LUAD.sc.meta <- readRDS(system.file("extdata", "LUAD.meta.rds", package = "SCIPAC"))

Let’s check the first few rows of the meta data.

head(LUAD.sc.meta)

Working with TCGA database

To obtain the bulk RNA-seq data and the corresponding clinical information, we use TCGAbiolinks R package. For more information about this package, users can refer to the TCGAbiolinks tutorial. Apart from using TCGA database, users can also use cBioPortal to download bulk RNA-seq data of interests. In the following steps, we provide the steps of obtaining the bulk RNA-seq data and its corresponding clinical information for LUAD patients.

Considering that it may take a while to download the bulk RNA-seq data and the clinical information, we only provide the code without running it.

# library(TCGAbiolinks)
# library(SummarizedExperiment)
### Obtain patient and count data information
# query <- GDCquery(
#   project = "TCGA-LUAD", # The project name used. We choose the TCGA-LUAD data
#   data.category = "Gene expression",
#   data.type = "Gene expression quantification",
#   platform = "Illumina HiSeq", 
#   file.type  = "normalized_results", # We use the normalized bulk RNA-seq data
#   experimental.strategy = "RNA-Seq",
#   legacy = TRUE
# )
# 
### Upload data from GDC and save to "GDCdata_LUAD"
# GDCdownload(
#   query = query,
#   method = "api",
#   files.per.chunk = 10,
#   directory = "GDCdata_LUAD" # Saved path
# )
# 
# 
# bulk.dat <- GDCprepare(query = query, directory = "GDCdata_LUAD")
#
# clinical.info <- as.data.frame(colData(bulk.dat))
# 
# exprs.dat <- assay(bulk.dat)

The downloaded bulk RNA-seq expression data is in exprs.dat and the clinical information is in clinical.info. For convenience, we provide the pre-processed and dimension-reduced bulk RNA-seq data and its meta data, as loaded below.

library(dplyr)
# Load the pre-processed and dimension-reduced LUAD bulk RNA-seq data
LUAD.bulk.dat <- readRDS(system.file("extdata", "LUAD.bulk.dat.rds", package = "SCIPAC"))
# Load the meta data for LUAD bulk RNA-seq data
LUAD.bulk.meta <- readRDS(system.file("extdata", "LUAD.bulk.meta.rds", package = "SCIPAC")) %>% as.data.frame()

Let’s check the first few rows of the meta data.

head(LUAD.bulk.meta)

The survival information is contained in the three columns LUAD.bulk.meta$vital_status, LUAD.bulk.meta$days_to_last_follow_up, and LUAD.bulk.meta$days_to_death. We further make some adjustment to obtain the input y.

LUAD.bulk.meta$S_time <- ifelse(is.na(LUAD.bulk.meta$days_to_death), 
                               LUAD.bulk.meta$days_to_last_follow_up,
                               LUAD.bulk.meta$days_to_death)
LUAD.bulk.meta$S_status <- ifelse(LUAD.bulk.meta$vital_status == "Dead", 1, 0)

all(rownames(LUAD.bulk.meta) == rownames(LUAD.bulk.dat))

## [1] TRUE

# Delete normal cases
# Delete NAs in survival time
# Delete S_time <= 0
normal.idx <- which(LUAD.bulk.meta$sample_type == "Solid Tissue Normal")
na.idx <- which(is.na(LUAD.bulk.meta$S_time))
leq.zero.idx <- which(LUAD.bulk.meta$S_time <= 0)

LUAD.bulk.meta.sub <- LUAD.bulk.meta[-c(normal.idx, na.idx, leq.zero.idx), ]


LUAD.bulk.dat.sub <- LUAD.bulk.dat[rownames(LUAD.bulk.meta.sub), ]
y <- LUAD.bulk.meta.sub[, c("S_time", "S_status")] %>% as.matrix()
colnames(y) <- c("time", "status")
all(rownames(LUAD.bulk.dat.sub) == rownames(y))

## [1] TRUE

Clustering

As before, we first perform clustering on the scRNA-seq data

LUAD.ct.res <- seurat.ct(LUAD.sc.dat, res = 2.0)

## Computing nearest neighbor graph

## Computing SNN

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 29477
## Number of edges: 1332182
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8547
## Number of communities: 43
## Elapsed time: 5 seconds

Executing SCIPAC

After all the preparation, we can finally run SCIPAC.

SCIPAC.res.LUAD <- SCIPAC(bulk.dat = LUAD.bulk.dat.sub, y = y,
                             family = "cox", ct.res = LUAD.ct.res,
                             ela.net.alpha = 0.4, bt.size = 50,
                             numCores = 7, CI.alpha = 0.05, nfold = 10)

We set family = "cox" for survival input y and use the default values for other parameters.

Visualization of results

Let’s view the results generated by SCIPAC. For convenience, we load a two-dimensional UMAP data to visualize the results.

LUAD.sc.umap <- readRDS(system.file("extdata", "LUAD.sc.umap.rds", package = "SCIPAC"))

We first generate the input for ggplot.

LUAD.sc.meta <- LUAD.sc.meta[rownames(SCIPAC.res.LUAD), ]
plot.dat <- cbind(LUAD.sc.meta, SCIPAC.res.LUAD)
all(rownames(plot.dat) == rownames(LUAD.sc.umap))

## [1] TRUE

plot.dat <- cbind(plot.dat, LUAD.sc.umap)
plot.dat$log.pval.adj <- ifelse((plot.dat$sig == "Not.sig"),
                                0, plot.dat$log.pval)

# Plot the cell types
colors.cell <- tableau_color_pal("Classic 20",
                                 direction = 1)(length(unique(plot.dat$celltype)))
plt <- plot.dat %>%
  ggplot(aes(x = umap.X1, y = umap.X2, col = celltype, fill = celltype)) +
  theme_tufte(base_size = 12) +
  theme(panel.background = element_rect(fill = NA, color = "black")) +
  guides(color = guide_legend(override.aes = list(stroke = 1,
                                                  alpha = 1, shape = 16, size = 4)),
         alpha = FALSE) +
  scale_color_manual(" ", values = colors.cell, guide = "none") +
  scale_fill_manual(" ", values = colors.cell, guide = "none") +
  theme(legend.position= "top") +
  labs(x = "UMAP 1", y = "UMAP 2")

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

p1 <- plt + geom_point()

# Control the range of the estimate of Lambda.Psi within -2 and 2
plot.dat$Lambda.adj <- plot.dat$Lambda.est
if(sum(plot.dat$Lambda.est <= -2) > 0){
  plot.dat[which(plot.dat$Lambda.est <= -2), ]$Lambda.adj <- -2
}
if(sum(plot.dat$Lambda.est >= 2) > 0){
  plot.dat[which(plot.dat$Lambda.est >= 2), ]$Lambda.adj <- 2
}

p2 <- ggplot(plot.dat) + 
  geom_point(aes(x=umap.X1, y=umap.X2, col=Lambda.adj)) +
  scale_color_gradient2(expression(Lambda), low="darkblue",high='red', mid = "white", midpoint = 0,
                        limits = c(-2, 2), breaks = c(-2, -1, 0, 1, 2),
                        labels = c(expression(phantom(x) <= -2), -1, 0, 1, expression(phantom(x) >= 2))) +
  theme(legend.title = element_text(family = "sans", size = 27),
        legend.position= "top",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1))+ 
  labs(x = "UMAP 1", y = "UMAP 2")

# Control the range of log.pval.adj within -3 and 3
plot.dat$pval.adj.adj <- plot.dat$log.pval.adj
if(sum(plot.dat$log.pval.adj <= -3) > 0){
  plot.dat[which(plot.dat$log.pval.adj <= -3), ]$pval.adj.adj <- -3
}
if(sum(plot.dat$log.pval.adj >= 3) > 0){
  plot.dat[which(plot.dat$log.pval.adj >= 3), ]$pval.adj.adj <- 3
}

plot.dat.positive <- plot.dat[which(plot.dat$pval.adj.adj > 0), ]
plot.dat.positive$pval.adj.adj <- -plot.dat.positive$pval.adj.adj

plot.dat.negative <- plot.dat[which(plot.dat$log.pval.adj <= 0), ]

p3 <- ggplot() + 
  geom_point(data = plot.dat.negative, aes(x=umap.X1, y=umap.X2, col=pval.adj.adj)) +
  scale_color_gradient2('   ', low="darkblue",high='white', mid = "white", midpoint = 0,
                        limits = c(-3, -1.3), breaks = c(-3, -2, -1.3), 
                        labels = c(expression(phantom(x) <= 0.001), 0.01, 0.05),
                        na.value = "white") +
  new_scale_color() + 
  geom_point(data = plot.dat.positive, aes(x=umap.X1, y=umap.X2, col=pval.adj.adj)) +
  scale_color_gradient2("p-values", low="red",high='white', mid = "white", midpoint = 0,
                        limits = c(-3, -1.3), breaks = c(-3, -2, -1.3), 
                        labels = c(expression(phantom(x) <= 0.001), 0.01, 0.05),
                        na.value = "white") + 
  theme(legend.position= "top",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_line(color = "lightgrey", size = 0.1))+ 
  labs(x = "UMAP 1", y = "UMAP 2")

cowplot::plot_grid(p1, p2, p3, nrow = 1)

Session Info

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggnewscale_0.4.8.9000 cowplot_1.1.1         ggthemes_4.2.4       
## [4] ggrepel_0.9.1         ggplot2_3.3.6         uwot_0.1.14          
## [7] Matrix_1.5-1          dplyr_1.0.10          SCIPAC_0.1.0         
## 
## loaded via a namespace (and not attached):
##   [1] Seurat_4.2.0          Rtsne_0.16            colorspace_2.0-3     
##   [4] deldir_1.0-6          ellipsis_0.3.2        ggridges_0.5.4       
##   [7] rstudioapi_0.13       spatstat.data_3.0-0   farver_2.1.1         
##  [10] leiden_0.4.3          listenv_0.8.0         fansi_1.0.3          
##  [13] codetools_0.2-18      splines_4.1.2         cachem_1.0.6         
##  [16] knitr_1.39            polyclip_1.10-0       jsonlite_1.8.2       
##  [19] ica_1.0-3             cluster_2.1.3         png_0.1-7            
##  [22] rgeos_0.5-9           shiny_1.7.2           sctransform_0.3.5    
##  [25] spatstat.sparse_3.0-0 compiler_4.1.2        httr_1.4.4           
##  [28] assertthat_0.2.1      SeuratObject_4.1.2    fastmap_1.1.0        
##  [31] lazyeval_0.2.2        cli_3.4.1             later_1.3.0          
##  [34] htmltools_0.5.3       tools_4.1.2           igraph_1.3.5         
##  [37] gtable_0.3.1          glue_1.6.2            RANN_2.6.1           
##  [40] reshape2_1.4.4        Rcpp_1.0.9            scattermore_0.8      
##  [43] jquerylib_0.1.4       vctrs_0.4.2           nlme_3.1-158         
##  [46] progressr_0.11.0      lmtest_0.9-40         spatstat.random_3.0-1
##  [49] xfun_0.31             stringr_1.4.1         globals_0.16.1       
##  [52] mime_0.12             miniUI_0.1.1.1        lifecycle_1.0.3      
##  [55] irlba_2.3.5.1         goftest_1.2-3         future_1.28.0        
##  [58] MASS_7.3-58           zoo_1.8-11            scales_1.2.1         
##  [61] spatstat.core_2.4-4   promises_1.2.0.1      spatstat.utils_3.0-1 
##  [64] parallel_4.1.2        RColorBrewer_1.1-3    yaml_2.3.5           
##  [67] reticulate_1.26       pbapply_1.5-0         gridExtra_2.3        
##  [70] sass_0.4.2            rpart_4.1.16          stringi_1.7.8        
##  [73] highr_0.9             rlang_1.0.6           pkgconfig_2.0.3      
##  [76] matrixStats_0.62.0    evaluate_0.15         lattice_0.20-45      
##  [79] tensor_1.5            ROCR_1.0-11           purrr_0.3.5          
##  [82] labeling_0.4.2        patchwork_1.1.2       htmlwidgets_1.5.4    
##  [85] tidyselect_1.2.0      parallelly_1.32.1     RcppAnnoy_0.0.19     
##  [88] plyr_1.8.7            magrittr_2.0.3        R6_2.5.1             
##  [91] generics_0.1.3        DBI_1.1.3             withr_2.5.0          
##  [94] mgcv_1.8-40           pillar_1.8.1          fitdistrplus_1.1-8   
##  [97] survival_3.3-1        abind_1.4-5           sp_1.5-0             
## [100] tibble_3.1.8          future.apply_1.9.1    KernSmooth_2.23-20   
## [103] utf8_1.2.2            spatstat.geom_3.0-3   plotly_4.10.0        
## [106] rmarkdown_2.19.2      grid_4.1.2            data.table_1.14.4    
## [109] digest_0.6.29         xtable_1.8-4          tidyr_1.2.1          
## [112] httpuv_1.6.6          munsell_0.5.0         viridisLite_0.4.1    
## [115] bslib_0.4.0

SCIPAC Tutorial

Introduction

SCIPAC examples

Applying SCIPAC to data with a binary phenotype

Preparing scRNA-seq and bulk RNA-seq data

Clustering

Executing SCIPAC to identify phenotype-associated cells

Visualization of results

Applying SCIPAC to data with an ordinal phenotype

Clustering

Executing SCIPAC

Visualization of results

Applying SCIPAC to data with survival information

Working with TCGA database

Clustering

Executing SCIPAC

Visualization of results

Session Info