Yesterday we QCed and normalised the data. We will continue the analysis of the same data. So first, load the normalised data that you saved yesterday.
## load QCed and normalised data
sce_norm <- readRDS(paste0(dir, "E-MTAB-6153_normCounts_GQ.Rds"))
As you know, this dataset doesn’t contain ERCC spike-ins. Therefore, we cannot use them to estimate the technical noise.
We have discussed that the trend could be fitted to all endogenous genes assuming that most are not highly variable. If you consider that this dataset contains cells from whole embryos that are at a stage where many different lineages are being specified, do you think this assumption is valid?
A different approach to define the most highly variable genes was proposed by Kolodziejczyk et al., 2015 (doi.org/10.1016/j.stem.2015.09.011). They have defined a statistic referred to as the distance-to-median (DM). They fit a trend to the log cv2 as a function of the log mean using running averages (sliding windows). The DM is a measurement of the variability of the genes after accounting for the mean variance relationship. This method is implemented in the DM function in scran.
As reported in the paper, you will need to obtain the normalised counts and in this case, they should not be on log scale. Hint: remember the normalised counts you obtain from the SingleCellExperiment object are in log base 2.
Then, apply the DM function on all the genes with a mean expression of at least 0.1. Question: why do we remove genes expressed below this threshold?
data <- 2^logcounts(sce_norm)-1
data <- data[rowMeans(data)>0.1,]
means <- rowMeans(data)
cv2 <- apply(data, 1, var)/means^2
hvgs2 <- DM(means, cv2)
Now take the genes with the top 20% DM values, and define these as your set of HVGs.
hvgs2 <- hvgs2[order(hvgs2, decreasing = TRUE)]
hvgs2 <- hvgs2[which(hvgs2 >= quantile(hvgs2, 0.8))]
length(hvgs2)
## [1] 1643
sce_hvg <- sce_norm[names(hvgs2),]
sce_hvg
## class: SingleCellExperiment
## dim: 1643 3990
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(1643): ENSMUSG00000052217 ENSMUSG00000020469 ...
## ENSMUSG00000037211 ENSMUSG00000040652
## rowData names(3): ENSEMBL SYMBOL CHR
## colnames(3990): sample24.1 sample24.8 ... sample28.4725
## sample28.4728
## colData names(7): cell collection ... nGenes mt
## reducedDimNames(0):
## spikeNames(0):
Have a look at some of the genes that are defined as highly variable. Has the method done a good job at identifying variable genes?
plotExpression(sce_hvg, features=rownames(sce_hvg)[1:10])
Now that we have defined the set of HVGs, we can cluster the data into distinct groups. Using the expression data for the HVGs, create a dissimilarity matrix and apply hierarchical clustering.
Have a look at the dendrogram. How would you describe the structure of the dataset?
cor.hvg <- cor(logcounts(sce_hvg), method="spearman") #compute correlation matrix
test.dissim <- sqrt(0.5*((1-cor.hvg)))
test.dist <- as.dist(test.dissim)
h_clustering <- hclust(test.dist, method = "average")
plot(h_clustering)
Use the dynamic tree cut algorithm to define the clusters based on this dendrogram and the distance matrix. Question: during the tutorial we used a minimum cluster size (parameter minClusterSize) of 10. What do you think is an appropriate number for this parameter in this dataset?
clusters <- cutreeDynamic(h_clustering, distM = as.matrix(test.dist), minClusterSize = 40, method = "hybrid", deepSplit = 1, verbose = 0)
names(clusters) <- colnames(sce_hvg)
table(clusters)
## clusters
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 898 544 412 382 361 345 339 215 154 130 87 68 55
Optional challenge: change the minimum cluster size and evaluate how the definition of the clusters changes.
Now, we can use the distance matrix directly as input into the Rtsne function. Have a look at the help page of this function and check whether you should adjust any parameters! Then plot the t-SNE coloured by the clusters we’ve just defined.
set.seed(0)
tsne <- Rtsne(test.dist, is_distance=TRUE)
df <- cbind(data.frame(comp1 = tsne$Y[,1], comp2 = tsne$Y[,2]), as.data.frame(colData(sce_hvg)))
ggplot(df, aes(x = comp1, y = comp2, colour = as.factor(clusters))) + geom_point(alpha = 0.4)
Finally, use differential expression analysis to identify the genes that are expressed differently between clusters.
sce_hvg$cluster <- as.factor(clusters)
markerGenes <- findMarkers(counts(sce_hvg), cluster=sce_hvg$cluster, direction = "up")
names(markerGenes)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
head(markerGenes[["1"]])
## DataFrame with 6 rows and 14 columns
## Top FDR logFC.2
## <integer> <numeric> <numeric>
## ENSMUSG00000032294 1 3.18504196477809e-223 12.9192240927551
## ENSMUSG00000072235 1 7.63923962059206e-209 1.9728317830473
## ENSMUSG00000031432 1 2.45088971406608e-166 0.578614240796541
## ENSMUSG00000029838 1 3.47446448377189e-135 6.2890819468099
## ENSMUSG00000074637 1 5.95050061318295e-108 1.54145241058561
## ENSMUSG00000004885 1 3.72994257569216e-105 8.32626424734705
## logFC.3 logFC.4 logFC.5
## <numeric> <numeric> <numeric>
## ENSMUSG00000032294 30.7659902263931 52.053020674215 15.4141829488738
## ENSMUSG00000072235 12.1426633078902 13.0047458575776 4.12475862026418
## ENSMUSG00000031432 -0.0778320755940922 2.63807297193298 0.52662426197953
## ENSMUSG00000029838 9.84380067896296 10.1391165941767 6.96008365774359
## ENSMUSG00000074637 1.53540229636517 1.51940321132476 1.29000117219552
## ENSMUSG00000004885 8.75380565225853 7.32899753961683 4.0416993133402
## logFC.6 logFC.7 logFC.8
## <numeric> <numeric> <numeric>
## ENSMUSG00000032294 13.7691068719538 21.6704278928593 30.0181126016471
## ENSMUSG00000072235 3.83939188534909 4.27668828139885 -10.5438234837106
## ENSMUSG00000031432 0.764281333720668 1.32636931627806 1.81638783860776
## ENSMUSG00000029838 7.26025628611084 9.67105531137697 8.73150670741182
## ENSMUSG00000074637 0.733326877763791 1.43764905295938 1.45230227378671
## ENSMUSG00000004885 0.0586811271424423 8.44267497092851 8.30256383695033
## logFC.9 logFC.10 logFC.11
## <numeric> <numeric> <numeric>
## ENSMUSG00000032294 30.1085384548636 29.8726743189995 7.38469907585183
## ENSMUSG00000072235 2.51093338732537 -0.980675004283022 -4.24716483629009
## ENSMUSG00000031432 1.67081248373008 2.43964365256125 0.977574687044006
## ENSMUSG00000029838 9.89426720273046 9.10395751242076 9.53861710570105
## ENSMUSG00000074637 1.56033610042519 1.3750214151105 1.46785961139697
## ENSMUSG00000004885 8.72017181037226 8.27841356861402 7.78831630955124
## logFC.12 logFC.13
## <numeric> <numeric>
## ENSMUSG00000032294 25.0552535045199 22.0384085847337
## ENSMUSG00000072235 12.3919494301061 5.88366066005264
## ENSMUSG00000031432 2.01611424079654 1.27600728892488
## ENSMUSG00000029838 10.2265819468099 10.1319295403928
## ENSMUSG00000074637 1.26755535176209 1.53306337315246
## ENSMUSG00000004885 8.81891130617058 8.85393804413849
To plot a heatmap of the top differentially expressed genes, you can use the plotHeatmap function. Have a look at the documentation to find the arguments it requires.
Optional challenge: can you think of a way to better select the genes that are shown in the heatmap, to extract more information about what characterises each cluster?
genesToShow <- c()
for(i in names(markerGenes)){
tmp <- as.data.frame(markerGenes[[i]])
tmp$count <- sapply(1:nrow(tmp), function(x) sum(tmp[x,-c(1:2)]>2))
tmp <- tmp[order(tmp$count, decreasing = TRUE),]
genesToShow <- c(genesToShow, row.names(tmp)[1:5])
}
genesToShow <- unique(genesToShow)
plotHeatmap(sce_hvg, features=genesToShow, columns=order(sce_hvg$cluster),
colour_columns_by=c("cluster"), show_rownames = FALSE, show_colnames = FALSE,
cluster_cols=FALSE, center=TRUE, symmetric=TRUE, zlim=c(-5, 5))
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dynamicTreeCut_1.63-1 Rtsne_0.13
## [3] RColorBrewer_1.1-2 biomaRt_2.36.1
## [5] scater_1.8.4 ggplot2_3.0.0
## [7] scran_1.8.4 SingleCellExperiment_1.2.0
## [9] SummarizedExperiment_1.10.1 DelayedArray_0.6.6
## [11] BiocParallel_1.14.2 matrixStats_0.54.0
## [13] Biobase_2.40.0 GenomicRanges_1.32.6
## [15] GenomeInfoDb_1.16.0 IRanges_2.14.11
## [17] S4Vectors_0.18.3 BiocGenerics_0.26.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7
## [3] httr_1.3.1 progress_1.2.0
## [5] rprojroot_1.3-2 tools_3.5.0
## [7] backports_1.1.2 R6_2.2.2
## [9] DT_0.4 vipor_0.4.5
## [11] DBI_1.0.0 lazyeval_0.2.1
## [13] colorspace_1.3-2 withr_2.1.2
## [15] prettyunits_1.0.2 tidyselect_0.2.4
## [17] gridExtra_2.3 bit_1.1-14
## [19] compiler_3.5.0 labeling_0.3
## [21] scales_1.0.0 stringr_1.3.1
## [23] digest_0.6.16 rmarkdown_1.10
## [25] XVector_0.20.0 pkgconfig_2.0.2
## [27] htmltools_0.3.6 limma_3.36.3
## [29] htmlwidgets_1.2 rlang_0.2.2
## [31] RSQLite_2.1.1 rstudioapi_0.7
## [33] FNN_1.1.2.1 shiny_1.1.0
## [35] DelayedMatrixStats_1.2.0 bindr_0.1.1
## [37] dplyr_0.7.6 RCurl_1.95-4.11
## [39] magrittr_1.5 GenomeInfoDbData_1.1.0
## [41] Matrix_1.2-14 Rcpp_0.12.18
## [43] ggbeeswarm_0.6.0 munsell_0.5.0
## [45] Rhdf5lib_1.2.1 viridis_0.5.1
## [47] stringi_1.2.4 yaml_2.2.0
## [49] edgeR_3.22.3 zlibbioc_1.26.0
## [51] rhdf5_2.24.0 plyr_1.8.4
## [53] blob_1.1.1 grid_3.5.0
## [55] promises_1.0.1 shinydashboard_0.7.0
## [57] crayon_1.3.4 lattice_0.20-35
## [59] hms_0.4.2 locfit_1.5-9.1
## [61] knitr_1.20 pillar_1.3.0
## [63] igraph_1.2.2 rjson_0.2.20
## [65] reshape2_1.4.3 XML_3.98-1.16
## [67] glue_1.3.0 evaluate_0.11
## [69] data.table_1.11.4 httpuv_1.4.5
## [71] gtable_0.2.0 purrr_0.2.5
## [73] assertthat_0.2.0 mime_0.5
## [75] xtable_1.8-3 later_0.7.4
## [77] viridisLite_0.3.0 pheatmap_1.0.10
## [79] tibble_1.4.2 memoise_1.1.0
## [81] AnnotationDbi_1.42.1 beeswarm_0.2.3
## [83] tximport_1.8.0 bindrcpp_0.2.2
## [85] statmod_1.4.30