library(scater)
library(scran)
library(SingleCellExperiment)
library(ggplot2)
library(ggridges)
library(RColorBrewer)
library(ggthemes)
library(dplyr)
library(Rtsne)
library(dynamicTreeCut)
library(destiny)
library(rgl)
library(ggthemes)
library(ComplexHeatmap)
library(circlize)
col_list = c(brewer.pal(8,"Set1"), brewer.pal(8,"Set2"), brewer.pal(12,"Set3"), brewer.pal(12,"Paired"),
brewer.pal(9,"Pastel1"), brewer.pal(8,"Pastel2"), brewer.pal(8,"Dark2"), brewer.pal(8,"Accent"))
dir <- "/Users/ibarra01/OneDrive - CRUK Cambridge Institute/Conferences/CONTRA-2018/tutorial/"
Today we will continue the data analysis workflow that we started yesterday. Up to now, we have performed quality control and normalised the data. So let’s load the SingleCellExpression object with the normalised counts from yesterday.
sce_norm <- readRDS(paste0(dir,"nestorowa_normcounts_goodQuality.Rds"))
In order to better understand the data’s structure it is useful to focus on the most highly variable genes (HVGs). To do this properly we need to model the technical noise. This can be accomplished by using the ERCC spike-ins.
Spike-in transcripts are only affected by technical noise, thus allowing to estimate its magnitude.
One of the first methods developed (Brennecke et al. 2013) models the mean-cv2 trend based on the spike-ins. Highly variable genes are defined as those that show significantly higher variation than can be accounted by the technical noise.
An alternative method implemented in scran performs a similar analysis but instead models the mean-variance trend of the log counts.
For datasets lacking spike-in transcripts it is still possible to model the mean-variance relationship from the endogenous genes, under the assumption that most genes are not highly variable and thus mostly reflect technical noise.
So let’s apply this method to our data.
fit <- trendVar(sce_norm, use.spikes = TRUE)
var_decomp <- decomposeVar(sce_norm, fit)
plot(var_decomp$mean, var_decomp$total, pch=16, cex=0.6, xlab="Mean log-expression", ylab="Variance of log-expression", bty="l")
curve(fit$trend(x), col="dodgerblue", lwd=2, add=TRUE)
spikes <- isSpike(sce_norm)
points(var_decomp$mean[spikes], var_decomp$total[spikes], col="red", pch=16)
In this case, we identify these many HVGs (FDR < 5%).
hvgs <- rownames(var_decomp[!(is.na(var_decomp$FDR)) & var_decomp$FDR < 0.05,])
length(hvgs)
## [1] 2402
plot(var_decomp$mean, var_decomp$total, pch=16, cex=0.6, xlab="Mean log-expression", ylab="Variance of log-expression", bty="l")
curve(fit$trend(x), col="red", lwd=2, add=TRUE)
spikes <- isSpike(sce_norm)
points(var_decomp$mean[spikes], var_decomp$total[spikes], col="red", pch=16)
points(var_decomp[hvgs,]$mean, var_decomp[hvgs,]$total, col="dodgerblue3", pch=16)
And we can see that their expression values are indeed quite different across cells.
plotExpression(sce_norm, features=rownames(var_decomp[order(var_decomp$bio, decreasing = TRUE),])[1:10])
We can now create a new SingleCellExperiment object containing only these genes.
sce_hvg <- sce_norm[hvgs,]
sce_hvg
## class: SingleCellExperiment
## dim: 2402 1656
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(2402): ENSMUSG00000000001 ENSMUSG00000000154 ...
## ENSMUSG00000107061 ENSMUSG00000107283
## rowData names(14): ENSEMBL SYMBOL ... n_cells_counts
## pct_dropout_counts
## colnames(1656): HSPC_025 HSPC_031 ... Prog_846 Prog_810
## colData names(47): Cell.Name sequencing_lane ...
## log10_total_features_ERCC replicate
## reducedDimNames(0):
## spikeNames(1): ERCC
This set of HVGs should be more informative to characterise the structure of the dataset. Again, we can visualise this via dimensionality reduction techniques.
pcs = prcomp(t(logcounts(sce_hvg)))
df = as.data.frame(cbind(colData(sce_hvg), pcs$x[,1:10]))
perc_var <- round(summary(pcs)$importance[2,]*100, 1)
set.seed(86)
tsne <- Rtsne(t(logcounts(sce_hvg)))
plots <- list()
plots[[1]] <- ggplot(df, aes(x = PC1, y = PC2, colour = plate_preparation_day)) +
geom_point(alpha = 0.6) +
xlab(paste0(perc_var[1],"% variance explained")) +
ylab(paste0(perc_var[2],"% variance explained"))
df <- cbind(data.frame(comp1 = tsne$Y[,1], comp2 = tsne$Y[,2]), as.data.frame(colData(sce_hvg)))
plots[[2]] <- ggplot(df, aes(x = comp1, y = comp2, colour = plate_preparation_day)) + geom_point(alpha = 0.4)
multiplot(plotlist = plots, cols=2)
From the dimensionality reduction plots above, we can observe different populations, even though these are not very well separated. Clustering allows to define the different groups observed in the data.
This is usually done based on the similarity of the transcriptomes as measured by the distance between cells, or by identifying regions of high cell density.
There are many different approaches to do this and each is better suited to different types of data.
Andrews and Hemberg, Mol Aspects Med., 2018 (doi.org/10.1016/j.mam.2017.07.002)
For the data we are working with, we will reproduce the method reported in the paper. The distance between cells is measured based on Spearman’s rank correlation and hierarchical clustering is used to cluster the cells.
cor.hvg <- cor(logcounts(sce_hvg), method="spearman") #compute correlation matrix
dissimilarity <- 0.5*(1-cor.hvg)
distanceMetric <- as.dist(dissimilarity)
h_clustering <- hclust(distanceMetric, method = "average")
plot(h_clustering)
A dynamic tree cut algorithm is used to find the different clusters.
clusters <- cutreeDynamic(h_clustering, distM = as.matrix(distanceMetric), minClusterSize = 10, method = "hybrid", deepSplit = 2, verbose = 0)
names(clusters) <- colnames(sce_hvg)
table(clusters)
## clusters
## 0 1 2 3 4 5 6 7
## 1 490 430 248 246 160 56 25
So now we can colour the dimensionality reduction plots based on these clusters.
plots <- list()
df = as.data.frame(cbind(colData(sce_hvg), pcs$x[,1:10]))
plots[[1]] <- ggplot(df, aes(x = PC1, y = PC2, colour = as.factor(clusters))) +
geom_point(alpha = 0.6) +
xlab(paste0(perc_var[1],"% variance explained")) +
ylab(paste0(perc_var[2],"% variance explained"))
df <- cbind(data.frame(comp1 = tsne$Y[,1], comp2 = tsne$Y[,2]), as.data.frame(colData(sce_hvg)))
plots[[2]] <- ggplot(df, aes(x = comp1, y = comp2, colour = as.factor(clusters))) + geom_point(alpha = 0.4)
multiplot(plotlist = plots, cols=2)
For dataset with discrete clusters, for example different cell types, it is often useful to identify the genes that are differentially expressed between them. Differential expression analysis of RNA-seq data is usually performed using linear models.
The result from such an analysis is an estimation of the difference in expression between conditions of interest (fold-change) and an associated p-value indicating whether such a difference is significant or not.
In this case, we use the function findMarkers from scran to perform pairwise tests between all the clusters we’ve just defined.
## the dynamic tree cut algorithm identified one outlier cell (tagged with 0). We remove it from downstream analyses.
outlier <- which(clusters==0)
sce_hvg <- sce_hvg[,-outlier]
clusters <- clusters[-outlier]
## add the cluster classification to the sce object
sce_hvg$cluster <- as.factor(clusters)
markerGenes <- findMarkers(counts(sce_hvg), cluster=sce_hvg$cluster, direction = "up")
names(markerGenes)
## [1] "1" "2" "3" "4" "5" "6" "7"
head(markerGenes[["1"]])
## DataFrame with 6 rows and 8 columns
## Top FDR logFC.2
## <integer> <numeric> <numeric>
## ENSMUSG00000040274 1 3.84072080839548e-85 2028.505505458
## ENSMUSG00000054428 1 2.19306879260035e-69 2283.22316089226
## ENSMUSG00000037336 1 6.86716296779172e-69 1114.22130991932
## ENSMUSG00000004655 1 1.45259924125632e-53 1405.08115804461
## ENSMUSG00000073002 2 2.49978307631048e-71 995.202040816327
## ENSMUSG00000054717 2 3.15824595565378e-66 769.670526815377
## logFC.3 logFC.4 logFC.5
## <numeric> <numeric> <numeric>
## ENSMUSG00000040274 2849.0116194865 -360.393114318898 1380.09387755102
## ENSMUSG00000054428 2419.16331468071 1644.85433880869 2076.66734693878
## ENSMUSG00000037336 1252.99028966425 1200.97776671644 1238.4306122449
## ENSMUSG00000004655 1431.6159479921 1430.44682263149 1428.94961734694
## ENSMUSG00000073002 1029.49962146149 1031.36301642608 982.920790816327
## ENSMUSG00000054717 828.555579328506 427.196183839389 542.704974489796
## logFC.6 logFC.7
## <numeric> <numeric>
## ENSMUSG00000040274 1321.07959183673 2998.85387755102
## ENSMUSG00000054428 2206.45306122449 2534.14734693878
## ENSMUSG00000037336 1200.47704081633 1254.3506122449
## ENSMUSG00000004655 1432.47193877551 1436.53836734694
## ENSMUSG00000073002 1056.7306122449 1188.66204081633
## ENSMUSG00000054717 55.7969387755103 853.061224489796
A good way of looking at some of the most differentially expressed genes is to create a heatmap of their expression pattern across clusters.
sce_hvg <- sce_hvg[,order(sce_hvg$cluster)]
## we select the top DE genes to plot
genesToShow <- unique(unlist(sapply(names(markerGenes), function(x) rownames(markerGenes[[x]][markerGenes[[x]]$Top <= 3,]))))
## get expression data for those genes
data = logcounts(sce_hvg)
mat = data[rownames(data) %in% unique(genesToShow),]
# change row names to gene symbols
rownames(mat) <- rowData(sce_hvg)$SYMBOL[rownames(data) %in% unique(genesToShow)]
# scale the expression values
mat_scaled = t(apply(mat, 1, scale))
# we will add gene names to the most DE genes
genesToName <- unique(unlist(sapply(names(markerGenes), function(x) rownames(markerGenes[[x]][markerGenes[[x]]$Top <= 1,]))))
genesToName <- rowData(sce_hvg)[rownames(data) %in% unique(genesToName),]$SYMBOL
subset = which(rownames(mat) %in% unique(genesToName))
## annotation of columns
ha_column = HeatmapAnnotation(df = data.frame(cluster = sce_hvg$cluster), # annotation 1
col = list(cluster = c("1" = col_list[1], "2" = col_list[2], "3" = col_list[3], "4" = col_list[4], "5" = col_list[5], "6" = col_list[6], "7" = col_list[7])))
options(repr.plot.width = 8, repr.plot.height = 6)
Heatmap(mat_scaled, show_column_names = FALSE, show_row_names = FALSE, show_row_dend = TRUE,
name = "expression",
col = colorRamp2(c(-3, 0, 3), c("royalblue", "white", "tomato")),
top_annotation = ha_column,
cluster_columns = FALSE) +
rowAnnotation(link = row_anno_link(at = subset, labels = genesToName), width = unit(1, "cm") + max_text_width(labels))
Alternative approaches have been developed to analyse data that captures dynamic process, like differentiation. In these cases, the cells do not form discrete clusters, but instead a continuous path (or several) with cells transitioning from one cell state to another in progressive steps.
Haghverdi et al., Bioinformatics, 2013 (doi.org/10.1093/bioinformatics/btv325)
Diffusion maps have been successfully used to reduce the dimensionality of these datasets, and are capable of preserving the continuous dynamics of the data.
data <- logcounts(sce_hvg)
diffMap <- DiffusionMap(t(data), sigma = "local", distance = "euclidean")
dm <- as.data.frame(diffMap@eigenvectors[,1:8])
row.names(dm) <- colnames(data)
plotids3 <- with(dm, plot3d(DC1, DC3, DC2, type = "s", size = 1, xlab = "DC1",
ylab = "DC3", zlab = "DC2", col = sce_hvg$cluster))
rglwidget(elementId = "plot3drgl3", width = 600, height = 600)
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] circlize_0.4.4 ComplexHeatmap_1.18.1
## [3] rgl_0.99.16 destiny_2.10.2
## [5] dynamicTreeCut_1.63-1 Rtsne_0.13
## [7] dplyr_0.7.6 ggthemes_4.0.1
## [9] RColorBrewer_1.1-2 ggridges_0.5.0
## [11] scran_1.8.4 scater_1.8.4
## [13] SingleCellExperiment_1.2.0 SummarizedExperiment_1.10.1
## [15] DelayedArray_0.6.6 BiocParallel_1.14.2
## [17] matrixStats_0.54.0 GenomicRanges_1.32.6
## [19] GenomeInfoDb_1.16.0 IRanges_2.14.11
## [21] S4Vectors_0.18.3 ggplot2_3.0.0
## [23] Biobase_2.40.0 BiocGenerics_0.26.0
##
## loaded via a namespace (and not attached):
## [1] ggbeeswarm_0.6.0 colorspace_1.3-2
## [3] RcppEigen_0.3.3.4.0 rjson_0.2.20
## [5] class_7.3-14 rio_0.5.10
## [7] rprojroot_1.3-2 XVector_0.20.0
## [9] GlobalOptions_0.1.0 proxy_0.4-22
## [11] rstudioapi_0.7 DT_0.4
## [13] tximport_1.8.0 robustbase_0.93-2
## [15] knitr_1.20 jsonlite_1.5
## [17] shinydashboard_0.7.0 shiny_1.1.0
## [19] compiler_3.5.0 backports_1.1.2
## [21] assertthat_0.2.0 Matrix_1.2-14
## [23] lazyeval_0.2.1 limma_3.36.3
## [25] later_0.7.4 htmltools_0.3.6
## [27] tools_3.5.0 bindrcpp_0.2.2
## [29] igraph_1.2.2 gtable_0.2.0
## [31] glue_1.3.0 GenomeInfoDbData_1.1.0
## [33] reshape2_1.4.3 Rcpp_0.12.18
## [35] carData_3.0-1 cellranger_1.1.0
## [37] crosstalk_1.0.0 DelayedMatrixStats_1.2.0
## [39] lmtest_0.9-36 laeken_0.4.6
## [41] stringr_1.3.1 openxlsx_4.1.0
## [43] miniUI_0.1.1.1 mime_0.5
## [45] statmod_1.4.30 edgeR_3.22.3
## [47] DEoptimR_1.0-8 zoo_1.8-3
## [49] zlibbioc_1.26.0 MASS_7.3-50
## [51] scales_1.0.0 VIM_4.7.0
## [53] hms_0.4.2 promises_1.0.1
## [55] rhdf5_2.24.0 yaml_2.2.0
## [57] curl_3.2 gridExtra_2.3
## [59] stringi_1.2.4 highr_0.7
## [61] e1071_1.7-0 TTR_0.23-3
## [63] manipulateWidget_0.10.0 boot_1.3-20
## [65] zip_1.0.0 shape_1.4.4
## [67] rlang_0.2.2 pkgconfig_2.0.2
## [69] bitops_1.0-6 evaluate_0.11
## [71] lattice_0.20-35 purrr_0.2.5
## [73] Rhdf5lib_1.2.1 bindr_0.1.1
## [75] labeling_0.3 htmlwidgets_1.2
## [77] tidyselect_0.2.4 plyr_1.8.4
## [79] magrittr_1.5 R6_2.2.2
## [81] pillar_1.3.0 haven_1.1.2
## [83] foreign_0.8-71 withr_2.1.2
## [85] xts_0.10-2 scatterplot3d_0.3-41
## [87] abind_1.4-5 RCurl_1.95-4.11
## [89] sp_1.3-1 nnet_7.3-12
## [91] tibble_1.4.2 crayon_1.3.4
## [93] car_3.0-2 rmarkdown_1.10
## [95] GetoptLong_0.1.7 viridis_0.5.1
## [97] locfit_1.5-9.1 readxl_1.1.0
## [99] data.table_1.11.4 FNN_1.1.2.1
## [101] forcats_0.3.0 webshot_0.5.0
## [103] vcd_1.4-4 digest_0.6.16
## [105] xtable_1.8-3 httpuv_1.4.5
## [107] munsell_0.5.0 beeswarm_0.2.3
## [109] viridisLite_0.3.0 smoother_1.1
## [111] vipor_0.4.5