Data description

In the following R markdown document you will have the chance to perform similar analyses to those demonstrated in the tutorial. In this case, however, you will be analysing data generated with a droplet-based platform (10X Genomics).

Question: what differences do you expect in this data compared to the plate-based dataset we’ve been using for the tutorial?

This dataset is from “Defining murine organogenesis at single-cell resolution reveals a role for the leukotriene pathway in regulating blood progenitor formation” from Ibarra-Soria et al. (2018).

The data consists of cells from mouse embryos staged at the beginning of organogenesis. This is the point during embryonic development when all the cell lineages that will give rise to all organs are specified. The raw data have been deposited in the ArrayExpress database under accession E-MTAB-6153 (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6153/) and the raw count table can be downloaded as processed data.

The dataset consists of over 20,000 cells. To keep computing time reasonable we have selected a random subset of cells for you to work with.

First, load the count table and associated metadata. As before, we have provided the count table as a binary compressed RDS file to accelerate loading.

counts <- readRDS(paste0(dir, "E-MTAB-6153_rawCounts.Rds"))
meta <- read.table(paste0(dir, "E-MTAB-6153_metadata.tsv"), header = TRUE, stringsAsFactors = FALSE)

The experimental design is simple. There are two biological replicates, each consisting of independent pools of embryos, each collected on a separate day. Pool1 was then processed across two different channels of a 10X chip (technical replicates) whereas pool2 was run on a single channel.

How many cells and genes are present in this data? Have a look at the row and column annotations.

dim(counts)
## [1] 18853  4000
head(colnames(counts))
## [1] "sample24.1"  "sample24.8"  "sample24.13" "sample24.15" "sample24.20"
## [6] "sample24.24"
head(row.names(counts))
## [1] "ENSMUSG00000051951" "ENSMUSG00000025902" "ENSMUSG00000033845"
## [4] "ENSMUSG00000025903" "ENSMUSG00000104217" "ENSMUSG00000033813"

Now use both the counts and metadata to create a SingleCellExperiment object.

sce <- SingleCellExperiment(assays = list(counts = counts), colData = meta)

Before starting the analysis, collect the gene names for all rows and their chromosomal location. Store these in the rowData of the object. Hint: The paper reports the data was analysed using Ensembl version 84; that corresponds to database mar2016.archive.ensembl.org.

ensembl <- useMart(host='mar2016.archive.ensembl.org', 
                     biomart='ENSEMBL_MART_ENSEMBL', 
                     dataset='mmusculus_gene_ensembl')

geneInfo <- suppressMessages(getBM(filters = "ensembl_gene_id", values = row.names(sce), attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"), mart = ensembl))

geneInfo <- geneInfo[match(row.names(sce), geneInfo$ensembl_gene_id),]

stopifnot(identical(grep("^ENS", row.names(sce), value = TRUE), geneInfo$ensembl_gene_id))

rowData(sce)$ENSEMBL <- geneInfo$ensembl_gene_id
rowData(sce)$SYMBOL <- geneInfo$external_gene_name
rowData(sce)$CHR <- geneInfo$chromosome_name
head(rowData(sce))
## DataFrame with 6 rows and 3 columns
##              ENSEMBL      SYMBOL         CHR
##          <character> <character> <character>
## 1 ENSMUSG00000051951        Xkr4           1
## 2 ENSMUSG00000025902       Sox17           1
## 3 ENSMUSG00000033845      Mrpl15           1
## 4 ENSMUSG00000025903      Lypla1           1
## 5 ENSMUSG00000104217     Gm37988           1
## 6 ENSMUSG00000033813       Tcea1           1

Quality control

Before starting the analysis of the data it is necessary to remove any samples of poor quality. Question: why do you think it is important to QC the dataset? How can you assess quality?

If necessary, define spike-in controls.

## droplet-based technologies like 10X do not incorporate spike-ins.

And now compute the quality metrics. Challenge: try calculating the QC metrics yourself, instead of using scater’s functions. Then add them to the object’s colData (be careful not to loose the values already stored!).

mito_genes <- grepl("MT",rowData(sce)$CHR)
qc <- data.frame(libSize = colSums(counts(sce)), 
                 nGenes = apply(counts(sce), 2, function(x) sum(x>0)),
                 mt = colSums(counts(sce)[mito_genes,])/colSums(counts(sce))*100)
colData(sce) <- cbind(colData(sce), qc)
colData(sce)
## DataFrame with 4000 rows and 7 columns
##                        cell  collection  embryoPool     channel   libSize
##                 <character> <character> <character> <character> <numeric>
## sample24.1       sample24.1        day1       pool1    channel1     18405
## sample24.8       sample24.8        day1       pool1    channel1      5734
## sample24.13     sample24.13        day1       pool1    channel1      8179
## sample24.15     sample24.15        day1       pool1    channel1     18602
## sample24.20     sample24.20        day1       pool1    channel1     16755
## ...                     ...         ...         ...         ...       ...
## sample28.4713 sample28.4713        day2       pool2    channel3     16380
## sample28.4716 sample28.4716        day2       pool2    channel3     24832
## sample28.4720 sample28.4720        day2       pool2    channel3     15639
## sample28.4725 sample28.4725        day2       pool2    channel3      7645
## sample28.4728 sample28.4728        day2       pool2    channel3     19819
##                  nGenes                mt
##               <integer>         <numeric>
## sample24.1         4087 0.635696821515892
## sample24.8         2184 0.837111963725148
## sample24.13        2677 0.721359579410686
## sample24.15        4033  0.58595849908612
## sample24.20        4164  1.13398985377499
## ...                 ...               ...
## sample28.4713      3506 0.262515262515263
## sample28.4716      4055 0.116784793814433
## sample28.4720      3634 0.294136453737451
## sample28.4725      2495 0.680183126226292
## sample28.4728      3946 0.216963519854685
plots <- list()
plots[[1]] <- plotColData(sce, y = "libSize", x = "channel")
plots[[2]] <- plotColData(sce, y = "mt", x = "channel")
plots[[3]] <- plotColData(sce, y = "nGenes", x = "channel")
multiplot(plotlist = plots, cols=2)

The authors discarded all samples that had less than 1,000 detected genes or more than 3% of the reads mapped to mitochondrial genes. Do these filters seem reasonable?

par(mfrow=c(1,2))
plot(density(colData(sce)$nGenes), main="number of genes detected")
abline(v=1000, col="red", lty=2)
plot(density(colData(sce)$mt), main="% in mitochondrial")
abline(v=3, col="red", lty=2)

Apply these filters and remove bad quality cells; also filter any genes that are now not expressed in any cell. What are the dimensions of the new dataset?

badQual <- which(colData(sce)$nGenes < 1000 | colData(sce)$mt > 3)
names(badQual) <- colnames(sce)[badQual]
badQual
##  sample24.429 sample24.1096 sample24.2429 sample24.3298 sample24.3846 
##            87           205           464           646           747 
## sample24.4551 sample24.7653 sample25.2239 sample25.2781 sample28.2877 
##           874          1440          1892          1984          3619
sce.GQ <- sce[,-badQual]
sce.GQ <- sce.GQ[rowMeans(counts(sce.GQ))>0,]
dim(sce.GQ)
## [1] 18852  3990

Finally, save the object containing the good-quality samples for future use.

saveRDS(sce.GQ, file=paste0(dir, "E-MTAB-6153_rawCounts_GQ.Rds"))

Normalisation

The next step is to normalise the data so that the counts are comparable across cells. We will use again the deconvolution method proposed by Lun et al. (2016). Can you remember the steps needed to do this? What is achieved through each step?

Have a look at the documentation of the functions used for normalisation. The parameter min.mean is set to 1 by default. Is this threshold appropriate for this dataset?

meanCounts <- calcAverage(sce.GQ, use_size_factors=FALSE)
hist(log10(meanCounts), breaks = 100, main="", col = "grey", xlab=expression(log[10]~"average count"))
abline(v=log10(c(1, 0.1)), col=c("red", "blue"))

Calculate the size factors adjusting the necessary parameters. Hint: the igraph method is more appropriate for large datasets. Still, this step might take a minute or so.

How can you assess if the size factor estimation was successful?

clusters <- quickCluster(sce.GQ, method = "igraph", min.mean = 0.1)
sce.GQ <- computeSumFactors(sce.GQ, cluster = clusters)

plot(colSums(counts(sce.GQ))/1e03, sizeFactors(sce.GQ), xlab = "total counts (x 1000)", ylab = "size factors", bty="l")

Finally, apply the size factors to the raw counts to generate normalised counts and save the resulting object for future use.

sce.GQ <- normalise(sce.GQ)
sce.GQ
## class: SingleCellExperiment 
## dim: 18852 3990 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(18852): ENSMUSG00000051951 ENSMUSG00000025902 ...
##   ENSMUSG00000063897 ENSMUSG00000095742
## rowData names(3): ENSEMBL SYMBOL CHR
## colnames(3990): sample24.1 sample24.8 ... sample28.4725
##   sample28.4728
## colData names(7): cell collection ... nGenes mt
## reducedDimNames(0):
## spikeNames(0):
saveRDS(sce.GQ, paste0(dir, "E-MTAB-6153_normCounts_GQ.Rds"))

Dimensionality reduction

Now that we have normalised data we can have a look at the general structure of the dataset. Use principal component analysis to do this. How would you describe the dataset? Do you think there are distinct subpopulations of cells or, instead, a continuous transition of transcriptional states along a differentiation path?

Question: how would you evaluate the success of the normalisation process using the PCA plot?

var <- apply(logcounts(sce.GQ), 1, var)
pca <- prcomp(t(logcounts(sce.GQ)[order(var, decreasing = TRUE)[1:500],]))

df <- data.frame(PC1=pca$x[,1], PC2=pca$x[,2])
df <- as.data.frame(cbind(df, colData(sce.GQ)))
df <- df[sample(1:nrow(df), size = nrow(df), replace = FALSE),]

plots <- list()
plots[[1]] <- ggplot(df, aes(PC1, PC2, col=embryoPool)) + geom_point(alpha=0.7)
plots[[2]] <- ggplot(df, aes(PC1, PC2)) + geom_point(aes(col=libSize)) + scale_color_gradientn(colours = rev(brewer.pal(n=9, "RdYlBu")))
plots[[3]] <- ggplot(df, aes(PC1, PC2)) + geom_point(aes(col=channel)) 
multiplot(plotlist = plots, cols=2)

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Rtsne_0.13                  RColorBrewer_1.1-2         
##  [3] biomaRt_2.36.1              scater_1.8.4               
##  [5] ggplot2_3.0.0               scran_1.8.4                
##  [7] SingleCellExperiment_1.2.0  SummarizedExperiment_1.10.1
##  [9] DelayedArray_0.6.5          BiocParallel_1.14.2        
## [11] matrixStats_0.54.0          Biobase_2.40.0             
## [13] GenomicRanges_1.32.6        GenomeInfoDb_1.16.0        
## [15] IRanges_2.14.11             S4Vectors_0.18.3           
## [17] BiocGenerics_0.26.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6             bit64_0.9-7             
##  [3] httr_1.3.1               progress_1.2.0          
##  [5] rprojroot_1.3-2          dynamicTreeCut_1.63-1   
##  [7] tools_3.5.0              backports_1.1.2         
##  [9] irlba_2.3.2              R6_2.2.2                
## [11] DT_0.4                   vipor_0.4.5             
## [13] DBI_1.0.0                lazyeval_0.2.1          
## [15] colorspace_1.3-2         withr_2.1.2             
## [17] prettyunits_1.0.2        tidyselect_0.2.4        
## [19] gridExtra_2.3            curl_3.2                
## [21] bit_1.1-14               compiler_3.5.0          
## [23] labeling_0.3             scales_1.0.0            
## [25] stringr_1.3.1            digest_0.6.16           
## [27] rmarkdown_1.10           XVector_0.20.0          
## [29] pkgconfig_2.0.2          htmltools_0.3.6         
## [31] limma_3.36.3             htmlwidgets_1.2         
## [33] rlang_0.2.2              RSQLite_2.1.1           
## [35] rstudioapi_0.7           FNN_1.1.2.1             
## [37] shiny_1.1.0              DelayedMatrixStats_1.2.0
## [39] bindr_0.1.1              dplyr_0.7.6             
## [41] RCurl_1.95-4.11          magrittr_1.5            
## [43] GenomeInfoDbData_1.1.0   Matrix_1.2-14           
## [45] Rcpp_0.12.18             ggbeeswarm_0.6.0        
## [47] munsell_0.5.0            Rhdf5lib_1.2.1          
## [49] viridis_0.5.1            stringi_1.2.4           
## [51] yaml_2.2.0               edgeR_3.22.3            
## [53] zlibbioc_1.26.0          rhdf5_2.24.0            
## [55] plyr_1.8.4               blob_1.1.1              
## [57] grid_3.5.0               promises_1.0.1          
## [59] shinydashboard_0.7.0     crayon_1.3.4            
## [61] lattice_0.20-35          hms_0.4.2               
## [63] locfit_1.5-9.1           knitr_1.20              
## [65] pillar_1.3.0             igraph_1.2.2            
## [67] rjson_0.2.20             reshape2_1.4.3          
## [69] XML_3.98-1.16            glue_1.3.0              
## [71] evaluate_0.11            data.table_1.11.4       
## [73] httpuv_1.4.5             gtable_0.2.0            
## [75] purrr_0.2.5              assertthat_0.2.0        
## [77] mime_0.5                 xtable_1.8-3            
## [79] later_0.7.4              viridisLite_0.3.0       
## [81] tibble_1.4.2             memoise_1.1.0           
## [83] AnnotationDbi_1.42.1     beeswarm_0.2.3          
## [85] tximport_1.8.0           bindrcpp_0.2.2          
## [87] statmod_1.4.30