In the following R markdown document you will have the chance to perform similar analyses to those demonstrated in the tutorial. In this case, however, you will be analysing data generated with a droplet-based platform (10X Genomics).
Question: what differences do you expect in this data compared to the plate-based dataset we’ve been using for the tutorial?
This dataset is from “Defining murine organogenesis at single-cell resolution reveals a role for the leukotriene pathway in regulating blood progenitor formation” from Ibarra-Soria et al. (2018).
The data consists of cells from mouse embryos staged at the beginning of organogenesis. This is the point during embryonic development when all the cell lineages that will give rise to all organs are specified. The raw data have been deposited in the ArrayExpress database under accession E-MTAB-6153 (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6153/) and the raw count table can be downloaded as processed data.
The dataset consists of over 20,000 cells. To keep computing time reasonable we have selected a random subset of cells for you to work with.
First, load the count table and associated metadata. As before, we have provided the count table as a binary compressed RDS file to accelerate loading.
counts <- readRDS(paste0(dir, "E-MTAB-6153_rawCounts.Rds"))
meta <- read.table(paste0(dir, "E-MTAB-6153_metadata.tsv"), header = TRUE, stringsAsFactors = FALSE)
The experimental design is simple. There are two biological replicates, each consisting of independent pools of embryos, each collected on a separate day. Pool1 was then processed across two different channels of a 10X chip (technical replicates) whereas pool2 was run on a single channel.
How many cells and genes are present in this data? Have a look at the row and column annotations.
dim(counts)
## [1] 18853 4000
head(colnames(counts))
## [1] "sample24.1" "sample24.8" "sample24.13" "sample24.15" "sample24.20"
## [6] "sample24.24"
head(row.names(counts))
## [1] "ENSMUSG00000051951" "ENSMUSG00000025902" "ENSMUSG00000033845"
## [4] "ENSMUSG00000025903" "ENSMUSG00000104217" "ENSMUSG00000033813"
Now use both the counts and metadata to create a SingleCellExperiment object.
sce <- SingleCellExperiment(assays = list(counts = counts), colData = meta)
Before starting the analysis, collect the gene names for all rows and their chromosomal location. Store these in the rowData of the object. Hint: The paper reports the data was analysed using Ensembl version 84; that corresponds to database mar2016.archive.ensembl.org.
ensembl <- useMart(host='mar2016.archive.ensembl.org',
biomart='ENSEMBL_MART_ENSEMBL',
dataset='mmusculus_gene_ensembl')
geneInfo <- suppressMessages(getBM(filters = "ensembl_gene_id", values = row.names(sce), attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"), mart = ensembl))
geneInfo <- geneInfo[match(row.names(sce), geneInfo$ensembl_gene_id),]
stopifnot(identical(grep("^ENS", row.names(sce), value = TRUE), geneInfo$ensembl_gene_id))
rowData(sce)$ENSEMBL <- geneInfo$ensembl_gene_id
rowData(sce)$SYMBOL <- geneInfo$external_gene_name
rowData(sce)$CHR <- geneInfo$chromosome_name
head(rowData(sce))
## DataFrame with 6 rows and 3 columns
## ENSEMBL SYMBOL CHR
## <character> <character> <character>
## 1 ENSMUSG00000051951 Xkr4 1
## 2 ENSMUSG00000025902 Sox17 1
## 3 ENSMUSG00000033845 Mrpl15 1
## 4 ENSMUSG00000025903 Lypla1 1
## 5 ENSMUSG00000104217 Gm37988 1
## 6 ENSMUSG00000033813 Tcea1 1
Before starting the analysis of the data it is necessary to remove any samples of poor quality. Question: why do you think it is important to QC the dataset? How can you assess quality?
If necessary, define spike-in controls.
## droplet-based technologies like 10X do not incorporate spike-ins.
And now compute the quality metrics. Challenge: try calculating the QC metrics yourself, instead of using scater’s functions. Then add them to the object’s colData (be careful not to loose the values already stored!).
mito_genes <- grepl("MT",rowData(sce)$CHR)
qc <- data.frame(libSize = colSums(counts(sce)),
nGenes = apply(counts(sce), 2, function(x) sum(x>0)),
mt = colSums(counts(sce)[mito_genes,])/colSums(counts(sce))*100)
colData(sce) <- cbind(colData(sce), qc)
colData(sce)
## DataFrame with 4000 rows and 7 columns
## cell collection embryoPool channel libSize
## <character> <character> <character> <character> <numeric>
## sample24.1 sample24.1 day1 pool1 channel1 18405
## sample24.8 sample24.8 day1 pool1 channel1 5734
## sample24.13 sample24.13 day1 pool1 channel1 8179
## sample24.15 sample24.15 day1 pool1 channel1 18602
## sample24.20 sample24.20 day1 pool1 channel1 16755
## ... ... ... ... ... ...
## sample28.4713 sample28.4713 day2 pool2 channel3 16380
## sample28.4716 sample28.4716 day2 pool2 channel3 24832
## sample28.4720 sample28.4720 day2 pool2 channel3 15639
## sample28.4725 sample28.4725 day2 pool2 channel3 7645
## sample28.4728 sample28.4728 day2 pool2 channel3 19819
## nGenes mt
## <integer> <numeric>
## sample24.1 4087 0.635696821515892
## sample24.8 2184 0.837111963725148
## sample24.13 2677 0.721359579410686
## sample24.15 4033 0.58595849908612
## sample24.20 4164 1.13398985377499
## ... ... ...
## sample28.4713 3506 0.262515262515263
## sample28.4716 4055 0.116784793814433
## sample28.4720 3634 0.294136453737451
## sample28.4725 2495 0.680183126226292
## sample28.4728 3946 0.216963519854685
plots <- list()
plots[[1]] <- plotColData(sce, y = "libSize", x = "channel")
plots[[2]] <- plotColData(sce, y = "mt", x = "channel")
plots[[3]] <- plotColData(sce, y = "nGenes", x = "channel")
multiplot(plotlist = plots, cols=2)
The authors discarded all samples that had less than 1,000 detected genes or more than 3% of the reads mapped to mitochondrial genes. Do these filters seem reasonable?
par(mfrow=c(1,2))
plot(density(colData(sce)$nGenes), main="number of genes detected")
abline(v=1000, col="red", lty=2)
plot(density(colData(sce)$mt), main="% in mitochondrial")
abline(v=3, col="red", lty=2)
Apply these filters and remove bad quality cells; also filter any genes that are now not expressed in any cell. What are the dimensions of the new dataset?
badQual <- which(colData(sce)$nGenes < 1000 | colData(sce)$mt > 3)
names(badQual) <- colnames(sce)[badQual]
badQual
## sample24.429 sample24.1096 sample24.2429 sample24.3298 sample24.3846
## 87 205 464 646 747
## sample24.4551 sample24.7653 sample25.2239 sample25.2781 sample28.2877
## 874 1440 1892 1984 3619
sce.GQ <- sce[,-badQual]
sce.GQ <- sce.GQ[rowMeans(counts(sce.GQ))>0,]
dim(sce.GQ)
## [1] 18852 3990
Finally, save the object containing the good-quality samples for future use.
saveRDS(sce.GQ, file=paste0(dir, "E-MTAB-6153_rawCounts_GQ.Rds"))
The next step is to normalise the data so that the counts are comparable across cells. We will use again the deconvolution method proposed by Lun et al. (2016). Can you remember the steps needed to do this? What is achieved through each step?
Have a look at the documentation of the functions used for normalisation. The parameter min.mean is set to 1 by default. Is this threshold appropriate for this dataset?
meanCounts <- calcAverage(sce.GQ, use_size_factors=FALSE)
hist(log10(meanCounts), breaks = 100, main="", col = "grey", xlab=expression(log[10]~"average count"))
abline(v=log10(c(1, 0.1)), col=c("red", "blue"))
Calculate the size factors adjusting the necessary parameters. Hint: the igraph method is more appropriate for large datasets. Still, this step might take a minute or so.
How can you assess if the size factor estimation was successful?
clusters <- quickCluster(sce.GQ, method = "igraph", min.mean = 0.1)
sce.GQ <- computeSumFactors(sce.GQ, cluster = clusters)
plot(colSums(counts(sce.GQ))/1e03, sizeFactors(sce.GQ), xlab = "total counts (x 1000)", ylab = "size factors", bty="l")
Finally, apply the size factors to the raw counts to generate normalised counts and save the resulting object for future use.
sce.GQ <- normalise(sce.GQ)
sce.GQ
## class: SingleCellExperiment
## dim: 18852 3990
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(18852): ENSMUSG00000051951 ENSMUSG00000025902 ...
## ENSMUSG00000063897 ENSMUSG00000095742
## rowData names(3): ENSEMBL SYMBOL CHR
## colnames(3990): sample24.1 sample24.8 ... sample28.4725
## sample28.4728
## colData names(7): cell collection ... nGenes mt
## reducedDimNames(0):
## spikeNames(0):
saveRDS(sce.GQ, paste0(dir, "E-MTAB-6153_normCounts_GQ.Rds"))
Now that we have normalised data we can have a look at the general structure of the dataset. Use principal component analysis to do this. How would you describe the dataset? Do you think there are distinct subpopulations of cells or, instead, a continuous transition of transcriptional states along a differentiation path?
Question: how would you evaluate the success of the normalisation process using the PCA plot?
var <- apply(logcounts(sce.GQ), 1, var)
pca <- prcomp(t(logcounts(sce.GQ)[order(var, decreasing = TRUE)[1:500],]))
df <- data.frame(PC1=pca$x[,1], PC2=pca$x[,2])
df <- as.data.frame(cbind(df, colData(sce.GQ)))
df <- df[sample(1:nrow(df), size = nrow(df), replace = FALSE),]
plots <- list()
plots[[1]] <- ggplot(df, aes(PC1, PC2, col=embryoPool)) + geom_point(alpha=0.7)
plots[[2]] <- ggplot(df, aes(PC1, PC2)) + geom_point(aes(col=libSize)) + scale_color_gradientn(colours = rev(brewer.pal(n=9, "RdYlBu")))
plots[[3]] <- ggplot(df, aes(PC1, PC2)) + geom_point(aes(col=channel))
multiplot(plotlist = plots, cols=2)
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rtsne_0.13 RColorBrewer_1.1-2
## [3] biomaRt_2.36.1 scater_1.8.4
## [5] ggplot2_3.0.0 scran_1.8.4
## [7] SingleCellExperiment_1.2.0 SummarizedExperiment_1.10.1
## [9] DelayedArray_0.6.5 BiocParallel_1.14.2
## [11] matrixStats_0.54.0 Biobase_2.40.0
## [13] GenomicRanges_1.32.6 GenomeInfoDb_1.16.0
## [15] IRanges_2.14.11 S4Vectors_0.18.3
## [17] BiocGenerics_0.26.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7
## [3] httr_1.3.1 progress_1.2.0
## [5] rprojroot_1.3-2 dynamicTreeCut_1.63-1
## [7] tools_3.5.0 backports_1.1.2
## [9] irlba_2.3.2 R6_2.2.2
## [11] DT_0.4 vipor_0.4.5
## [13] DBI_1.0.0 lazyeval_0.2.1
## [15] colorspace_1.3-2 withr_2.1.2
## [17] prettyunits_1.0.2 tidyselect_0.2.4
## [19] gridExtra_2.3 curl_3.2
## [21] bit_1.1-14 compiler_3.5.0
## [23] labeling_0.3 scales_1.0.0
## [25] stringr_1.3.1 digest_0.6.16
## [27] rmarkdown_1.10 XVector_0.20.0
## [29] pkgconfig_2.0.2 htmltools_0.3.6
## [31] limma_3.36.3 htmlwidgets_1.2
## [33] rlang_0.2.2 RSQLite_2.1.1
## [35] rstudioapi_0.7 FNN_1.1.2.1
## [37] shiny_1.1.0 DelayedMatrixStats_1.2.0
## [39] bindr_0.1.1 dplyr_0.7.6
## [41] RCurl_1.95-4.11 magrittr_1.5
## [43] GenomeInfoDbData_1.1.0 Matrix_1.2-14
## [45] Rcpp_0.12.18 ggbeeswarm_0.6.0
## [47] munsell_0.5.0 Rhdf5lib_1.2.1
## [49] viridis_0.5.1 stringi_1.2.4
## [51] yaml_2.2.0 edgeR_3.22.3
## [53] zlibbioc_1.26.0 rhdf5_2.24.0
## [55] plyr_1.8.4 blob_1.1.1
## [57] grid_3.5.0 promises_1.0.1
## [59] shinydashboard_0.7.0 crayon_1.3.4
## [61] lattice_0.20-35 hms_0.4.2
## [63] locfit_1.5-9.1 knitr_1.20
## [65] pillar_1.3.0 igraph_1.2.2
## [67] rjson_0.2.20 reshape2_1.4.3
## [69] XML_3.98-1.16 glue_1.3.0
## [71] evaluate_0.11 data.table_1.11.4
## [73] httpuv_1.4.5 gtable_0.2.0
## [75] purrr_0.2.5 assertthat_0.2.0
## [77] mime_0.5 xtable_1.8-3
## [79] later_0.7.4 viridisLite_0.3.0
## [81] tibble_1.4.2 memoise_1.1.0
## [83] AnnotationDbi_1.42.1 beeswarm_0.2.3
## [85] tximport_1.8.0 bindrcpp_0.2.2
## [87] statmod_1.4.30