In the past few years, single-cell RNA-sequencing (scRNA-seq) technologies have rapidly evolved and now it is possible to produce very good quality data for hundreds to thousands of cells in a single experiment.

Svensson et al, Nat Prot, 2018 (doi.org/10.1038/nprot.2017.149)

Svensson et al, Nat Prot, 2018 (doi.org/10.1038/nprot.2017.149)

The most popular methods to isolate and process single cells are shown below. Each has different pros and cons and are suited for different questions.

Wang and Song et al., Clin Transl Med, 2017 (doi.org/10.1186/s40169-017-0139-4)

Wang and Song et al., Clin Transl Med, 2017 (doi.org/10.1186/s40169-017-0139-4)

scRNA-seq data is characterised by being very sparse and having much higher noise levels compared to bulk RNA-seq.

Brennecke et al., Nat Methods, 2013 (doi: https://)doi.org/10.1038/nmeth.2645)

Brennecke et al., Nat Methods, 2013 (doi: https://)doi.org/10.1038/nmeth.2645)

The observed zeroes in the count data can be biological (the gene is not expressed) or technical (the gene is expressed but was not detected, referred to as drop-outs).

Setup

In this tutorial we will go through a basic workflow of scRNA-seq data analysis.

First, we need to load the packages that we will need.

## note the chunk options message and warning; these suppress printing out the messages from loading the packages to the HTML output.
## you could also set include=FALSE to completely suppress the chunk being shown in the HTML.
library(scater)
library(scran)
library(SingleCellExperiment)
library(biomaRt)
library(ggplot2)
library(ggridges)
library(RColorBrewer)
library(ggthemes)
library(dplyr)
library(Rtsne)

## set up a vector of colours for plotting
col_list = c(brewer.pal(8,"Set1"), brewer.pal(8,"Set2"), brewer.pal(12,"Set3"), brewer.pal(12,"Paired"),
             brewer.pal(9,"Pastel1"), brewer.pal(8,"Pastel2"), brewer.pal(8,"Dark2"), brewer.pal(8,"Accent"))

Now, define the working directory where you have saved this document.

dir <- "/Users/ibarra01/OneDrive - CRUK Cambridge Institute/Conferences/CONTRA-2018/tutorial/"

Finally, download the data from here: And save it in dir.

Dataset

The data comes from the paper “A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation” from Nestorowa et al., 2016 (https://doi.org/10.1182/blood-2016-05-716480). Raw data has been deposited in the Gene Expression Omnibus database under accession GSE81682 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81682). The submission includes the table of raw counts as supplementary information.

We have downloaded this file (ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81682/suppl/GSE81682_HTSeq_counts.txt.gz) and read it into R. Reading large text files is slow; to make things easier we have resaved the count table as an Rds file. Rds files are compressed files containing a single R object.

## do not run!
# all.counts <- read.delim(paste0(dir,"HTSeq_counts.txt"), row.names = 1) ## original file from GEO
## the table includes stats from HTSeq on the number of reads that are ambiguous, not-aligned, multimapped etc. These are not part of the gene counts so we remove them.
# all.counts = all.counts[1:(nrow(all.counts)-5),]
# saveRDS(all.counts, paste0(dir,"GSE81682_HTSeq_counts.Rds"))

So let’s load the count data.

all.counts <- readRDS(paste0(dir,"GSE81682_HTSeq_counts.Rds"))
all.counts[1:5,1:5]
##                    HSPC_007 HSPC_013 HSPC_019 HSPC_025 HSPC_031
## ENSMUSG00000000001        0        7        1      185        2
## ENSMUSG00000000003        0        0        0        0        0
## ENSMUSG00000000028        4        1        2        4        3
## ENSMUSG00000000031        0        0        0        0        0
## ENSMUSG00000000037        0        0        0        0        0

As you can see, the rows correspond to genes and the columns to cells. Each entry in the table is the number of reads mapped to each gene in each cell. The dimensions of the table then tell us there are 1920 cells and 46170 genes in this dataset.

dim(all.counts)
## [1] 46170  1920

We also read in the annotation file, which contains information about each of the cells, related to the experimental design.

meta <- read.csv(paste0(dir,"nestorowa_meta_data.csv"))

## the variables are categorical, so we need to convert numeric entries to factors
meta$sequencing_lane <- as.factor(meta$sequencing_lane)
meta$cell_sorting_day <- as.factor(meta$cell_sorting_day)
head(meta)
##   Cell.Name sequencing_lane cell_sorting_day mouse_sample_information
## 1  HSPC_001            9501                1          pooled_sample_1
## 2  HSPC_002            9501                1          pooled_sample_1
## 3  HSPC_003            9501                1          pooled_sample_1
## 4  HSPC_004            9501                1          pooled_sample_1
## 5  HSPC_005            9501                1          pooled_sample_1
## 6  HSPC_006            9501                1          pooled_sample_1
##   plate_preparation_day
## 1                 day_1
## 2                 day_1
## 3                 day_1
## 4                 day_1
## 5                 day_1
## 6                 day_1

Experimental design

The main question to answer in this study was what are the molecular changes happening during hematopoietic stem and progenitor cell (HSPCs) differentiation. To this end, the authors collected HSPCs “from the bone marrow of 10 female 12-week-old C57BL/6 mice over 2 consecutive days, with cells from 4 mice pooled together and cells from 1 mouse analyzed separately each day”. For the pooled samples, 8 96-well plates worth of cells were collected and 2 for the single-mouse samples. Library prep was done on 5 different days. Each plate was sequenced across one lane of an Illumina HiSeq2500.

All the information is contained in the metadata.

SingleCellExperiment

To work with these data we will use a SingleCellExperiment object, which is able to hold the count table along with the associated metadata. The advantage of creating one object containing both of these, is that whenever we modify the counts -for example, by removing some cells- the metadata is updated accordingly.

The assays slot is used to store the count data. However, this slot can hold several different quantifications (like both raw and normalised counts) simultaneously. Row and column metadata is attached via rowData and colData.

sce <- SingleCellExperiment(assays = list(counts = as.matrix(all.counts)),
                            colData = meta)
sce
## class: SingleCellExperiment 
## dim: 46170 1920 
## metadata(0):
## assays(1): counts
## rownames(46170): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ERCC-00170 ERCC-00171
## rowData names(0):
## colnames(1920): HSPC_007 HSPC_013 ... Prog_852 Prog_810
## colData names(5): Cell.Name sequencing_lane cell_sorting_day
##   mouse_sample_information plate_preparation_day
## reducedDimNames(0):
## spikeNames(0):

We can check the different components stored:

counts(sce)[1:5,1:5] ## retrieve the count table
##                    HSPC_007 HSPC_013 HSPC_019 HSPC_025 HSPC_031
## ENSMUSG00000000001        0        7        1      185        2
## ENSMUSG00000000003        0        0        0        0        0
## ENSMUSG00000000028        4        1        2        4        3
## ENSMUSG00000000031        0        0        0        0        0
## ENSMUSG00000000037        0        0        0        0        0
head(rowData(sce)) ## empty at the moment; can hold info related to the rows (genes)
## DataFrame with 6 rows and 0 columns
head(colData(sce)) ## holds the metadata related to the columns (meta)
## DataFrame with 6 rows and 5 columns
##          Cell.Name sequencing_lane cell_sorting_day
##           <factor>        <factor>         <factor>
## HSPC_007  HSPC_001            9501                1
## HSPC_013  HSPC_002            9501                1
## HSPC_019  HSPC_003            9501                1
## HSPC_025  HSPC_004            9501                1
## HSPC_031  HSPC_005            9501                1
## HSPC_037  HSPC_006            9501                1
##          mouse_sample_information plate_preparation_day
##                          <factor>              <factor>
## HSPC_007          pooled_sample_1                 day_1
## HSPC_013          pooled_sample_1                 day_1
## HSPC_019          pooled_sample_1                 day_1
## HSPC_025          pooled_sample_1                 day_1
## HSPC_031          pooled_sample_1                 day_1
## HSPC_037          pooled_sample_1                 day_1

Gene annotation

The rows of the count data contain Ensembl gene identifiers. The paper specifies that the annotation used was from version 81. It is helpful to collect the gene symbol for each gene as these are much more informative than the Ensembl ID.

To do this we use the biomaRt package to query Ensembl’s BioMart database, which contains all the associated information stored for each ID. A new release of the genome annotation is released every three months and each release contains improved and better curated annotation. However, this means that some genes are removed, others are added and some features are updated, like the gene symbols. Thus, it is important to use the same version of the database to get accurate and complete information about all the gene IDs.

## use version 81 which corresponds to the release from July 2015.
## https://www.ensembl.org/info/website/archives/index.html
ensembl <- useMart(host='jul2015.archive.ensembl.org', 
                     biomart='ENSEMBL_MART_ENSEMBL', 
                     dataset='mmusculus_gene_ensembl')

## retireve the gene symbol and chromosomal location
geneInfo <- suppressMessages(getBM(filters = "ensembl_gene_id", values = row.names(sce), attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"), mart = ensembl))
head(geneInfo)
##      ensembl_gene_id external_gene_name chromosome_name
## 1 ENSMUSG00000000001              Gnai3               3
## 2 ENSMUSG00000000003               Pbsn               X
## 3 ENSMUSG00000000028              Cdc45              16
## 4 ENSMUSG00000000031                H19               7
## 5 ENSMUSG00000000037              Scml2               X
## 6 ENSMUSG00000000049               Apoh              11

We can now add this information as metadata for the rows of our SingleCellExperiment object with rowData.

## ensure the retireved annotation is in the same order as the row.names
stopifnot(identical(grep("ENS", row.names(sce), value = TRUE), geneInfo$ensembl_gene_id))
## create empty annotation for the ERCC spike-ins
spikeInfo <- data.frame(ensembl_gene_id = grep("ERCC", row.names(sce), value = TRUE), external_gene_name = NA, chromosome_name = NA)

geneInfo <- rbind(geneInfo, spikeInfo)

rowData(sce)$ENSEMBL <- geneInfo$ensembl_gene_id
rowData(sce)$SYMBOL <- geneInfo$external_gene_name
rowData(sce)$CHR <- geneInfo$chromosome_name
head(rowData(sce))
## DataFrame with 6 rows and 3 columns
##              ENSEMBL      SYMBOL         CHR
##          <character> <character> <character>
## 1 ENSMUSG00000000001       Gnai3           3
## 2 ENSMUSG00000000003        Pbsn           X
## 3 ENSMUSG00000000028       Cdc45          16
## 4 ENSMUSG00000000031         H19           7
## 5 ENSMUSG00000000037       Scml2           X
## 6 ENSMUSG00000000049        Apoh          11

Finally, we can remove any genes that are not expressed in any cells.

keep_feature <- rowSums(counts(sce)) > 0
summary(keep_feature)
##    Mode   FALSE    TRUE 
## logical    2820   43350
sce <- sce[keep_feature, ]

QC

The first step in any analysis workflow is to assess the quality of the dataset. It is important to ensure that the libraries that were sequenced were of good quality, representing healthy cells.

To check the quality of the data we use several metrics:

  • Library size: bad quality libraries usually don’t produce as many sequencing reads.

  • Number of genes detected: the number of different genes is a measurement of the complexity of the library. If the starting material is very low many fewer different genes will be captured and represented in the final data.

  • Percentage of reads mapping to mitochondrial genes: it has been observed that bad quality libraries tend to have a higher proportion of reads mapped to the mitochondrial genome. These might represent cells undergoing apoptosis.

  • Percentage of reads mapping to exogenous spike-in controls: for experiments that add synthetic spike-in molecules, the ratio of spike-in to endogenous transcripts is a measurement of the amour of RNA present in the cell.

So let’s define which features in the count matrix correspond to mitochondrial genes and spike-in controls.

mt_genes <- grepl("MT", rowData(sce)$CHR)
head(rowData(sce)$SYMBOL[mt_genes])
## [1] "mt-Tf"   "mt-Rnr1" "mt-Tv"   "mt-Rnr2" "mt-Tl1"  "mt-Nd1"
summary(mt_genes)
##    Mode   FALSE    TRUE 
## logical   43313      37
ercc_genes <- grepl("^ERCC", rownames(sce))
isSpike(sce, "ERCC") <- ercc_genes 
head(rownames(sce)[ercc_genes])
## [1] "ERCC-00002" "ERCC-00003" "ERCC-00004" "ERCC-00009" "ERCC-00012"
## [6] "ERCC-00013"
summary(isSpike(sce, "ERCC"))
##    Mode   FALSE    TRUE 
## logical   43258      92

Then we can use scater to calculate the metrics for all cells. The results will be stored in the object’s colData.

sce <- calculateQCMetrics(sce,
    feature_controls = list(MT = mt_genes), percent_top = 100)
colnames(colData(sce))
##  [1] "Cell.Name"                                     
##  [2] "sequencing_lane"                               
##  [3] "cell_sorting_day"                              
##  [4] "mouse_sample_information"                      
##  [5] "plate_preparation_day"                         
##  [6] "is_cell_control"                               
##  [7] "total_features_by_counts"                      
##  [8] "log10_total_features_by_counts"                
##  [9] "total_counts"                                  
## [10] "log10_total_counts"                            
## [11] "pct_counts_in_top_100_features"                
## [12] "total_features"                                
## [13] "log10_total_features"                          
## [14] "pct_counts_top_100_features"                   
## [15] "total_features_by_counts_endogenous"           
## [16] "log10_total_features_by_counts_endogenous"     
## [17] "total_counts_endogenous"                       
## [18] "log10_total_counts_endogenous"                 
## [19] "pct_counts_endogenous"                         
## [20] "pct_counts_in_top_100_features_endogenous"     
## [21] "total_features_endogenous"                     
## [22] "log10_total_features_endogenous"               
## [23] "pct_counts_top_100_features_endogenous"        
## [24] "total_features_by_counts_feature_control"      
## [25] "log10_total_features_by_counts_feature_control"
## [26] "total_counts_feature_control"                  
## [27] "log10_total_counts_feature_control"            
## [28] "pct_counts_feature_control"                    
## [29] "pct_counts_in_top_100_features_feature_control"
## [30] "total_features_feature_control"                
## [31] "log10_total_features_feature_control"          
## [32] "pct_counts_top_100_features_feature_control"   
## [33] "total_features_by_counts_MT"                   
## [34] "log10_total_features_by_counts_MT"             
## [35] "total_counts_MT"                               
## [36] "log10_total_counts_MT"                         
## [37] "pct_counts_MT"                                 
## [38] "total_features_MT"                             
## [39] "log10_total_features_MT"                       
## [40] "total_features_by_counts_ERCC"                 
## [41] "log10_total_features_by_counts_ERCC"           
## [42] "total_counts_ERCC"                             
## [43] "log10_total_counts_ERCC"                       
## [44] "pct_counts_ERCC"                               
## [45] "total_features_ERCC"                           
## [46] "log10_total_features_ERCC"

We can now check the distribution of each metric across the cell population. It is useful to class the samples based on relevant experimental design variables to easily spot if any particular conditions or batches failed or are behaving differently. We can do this with density plots:

# library size
p1 <- ggplot(as.data.frame(colData(sce)), aes(x = log10_total_counts_endogenous, y = plate_preparation_day,
                                        fill = plate_preparation_day)) +
    geom_density_ridges(scale = 4) + 
    geom_vline(xintercept = log10(200000), linetype = 2, colour = "firebrick", lwd=2) +
    scale_fill_manual(values = col_list) +
    ggtitle(expression(log[10]*" library size")) + 
    xlab("") + ylab("") +
    scale_y_discrete(expand = c(0.01, 0)) + # will generally have to set the `expand` option
    theme_ridges() + theme(axis.text.y = element_text(size = 8), legend.position = "none", plot.title = element_text(hjust = 0.5, face=1)) 

# total detected genes
p2 <- ggplot(as.data.frame(colData(sce)), aes(x = total_features, y = plate_preparation_day,
                                        fill = plate_preparation_day)) +
    geom_density_ridges(scale = 4) + 
    geom_vline(xintercept = 4000, linetype = 2, colour = "firebrick", lwd=2) +
    scale_fill_manual(values = col_list) +
    ggtitle("total features detected") +
    xlab("") + ylab("") +
    scale_y_discrete(expand = c(0.01, 0)) +
    theme_ridges() + theme(axis.text.y = element_text(size = 8), legend.position = "none", plot.title = element_text(hjust = 0.5, face=1))

# percentage reads in mitochondrial genes
p3 <- ggplot(as.data.frame(colData(sce)), aes(x = pct_counts_MT, y = plate_preparation_day,
                                        fill = plate_preparation_day)) +
    geom_density_ridges(scale = 4) +
    geom_vline(xintercept = 10, linetype = 2, colour = "firebrick", lwd=2) +
    scale_fill_manual(values = col_list) +
    ggtitle("% reads in mitochondrial genes") +
    xlab("") + ylab("") +
    scale_y_discrete(expand = c(0.01, 0)) +
    theme_ridges() + theme(axis.text.y = element_text(size = 8), legend.position = "none", plot.title = element_text(hjust = 0.5, face=1))

# percentage of counts taken by the top 100 expressed features
p4 <- ggplot(as.data.frame(colData(sce)), aes(x = pct_counts_top_100_features, y = plate_preparation_day,
                                        fill = plate_preparation_day)) +
    geom_density_ridges(scale = 4) + 
    scale_fill_manual(values = col_list) +
    ggtitle("% counts from top 100 expressed genes") +
    xlab("") + ylab("") +
    scale_y_discrete(expand = c(0.01, 0)) +
    theme_ridges() + theme(axis.text.y = element_text(size = 8), legend.position = "none", plot.title = element_text(hjust = 0.5, face=1))

multiplot(plotlist = list(p1, p2, p3, p4), cols = 2)

Or violin plots are a good alternative when there are many variables to stratify the samples.

sce$replicate <- factor(sce$mouse_sample_information, labels = c("P1", "P2", "S1", "S2"))

plotlist = list(plotColData(sce, y = "total_counts", x = "sequencing_lane", colour_by = "replicate") + 
    scale_fill_manual(values = col_list),
    plotColData(sce, y = "pct_counts_ERCC", x = "sequencing_lane", colour_by = "replicate")+ 
    scale_fill_manual(values = col_list),
    plotColData(sce, y = "total_features", x = "sequencing_lane", colour_by = "replicate")+ 
    scale_fill_manual(values = col_list),
    plotColData(sce, y = "pct_counts_MT", x = "sequencing_lane", colour_by = "replicate")+ 
    scale_fill_manual(values = col_list))

multiplot(plotlist = plotlist, cols = 2) 

There are several ways to decide which samples to exclude. One possibility is to examine the plots above and set thresholds that exclude samples that deviate from the general behaviour of the whole dataset. For example, we can use the thresholds reported in the paper and exclude this many cells:

bad_qual <- (sce$total_counts_endogenous < 200000 |
               sce$total_features < 4000 |
               sce$pct_counts_MT > 10 |
               sce$pct_counts_ERCC > 50)
summary(bad_qual)
##    Mode   FALSE    TRUE 
## logical    1676     244

An alternative approach is to use the median absolute deviation (MAD) to quantify the variability of each metric and remove any samples that deviate more than x MADs.

The function isOutlier from scater can perform this type of analysis.

In this case we will use the thresholds reported in the paper. We remove all the bad quality cells and save the new object for future use.

sce_GQ = sce[,!bad_qual]
sce_GQ <- sce_GQ[rowMeans(counts(sce_GQ))>0,]
sce_GQ
## class: SingleCellExperiment 
## dim: 43239 1676 
## metadata(0):
## assays(1): counts
## rownames(43239): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ERCC-00170 ERCC-00171
## rowData names(14): ENSEMBL SYMBOL ... n_cells_counts
##   pct_dropout_counts
## colnames(1676): HSPC_025 HSPC_031 ... Prog_852 Prog_810
## colData names(47): Cell.Name sequencing_lane ...
##   log10_total_features_ERCC replicate
## reducedDimNames(0):
## spikeNames(1): ERCC
saveRDS(sce_GQ, paste0(dir,"nestorowa_rawcounts_goodQuality.Rds"))

Normalisation

Once the dataset has been QCed, it is necessary to normalise the data so that the counts are comparable across cells.

We achieve this by calculating size factors to scale counts.

scRNA-seq data suffers from additional biases.

Vallejos et al., Nat Methods, 2017 (doi.org/10.1038/nmeth.4292)

Vallejos et al., Nat Methods, 2017 (doi.org/10.1038/nmeth.4292)

The methods developed to normalise bulk RNA-seq data usually don’t perform very well with scRNA-seq data because there are too many zeroes, which results in unstable size factor estimation. Lun et al. developed a method that based on pooling cells together to counteract the amount of zeroes.

Lun et al., Genome Biology, 2016 (doi.org/10.1186/s13059-016-0947-7)

Lun et al., Genome Biology, 2016 (doi.org/10.1186/s13059-016-0947-7)

To estimate the size factors we will only use genes with an average abundance above a certain threshold. Such minimum average expression threshold is dependent on the type of data being analysed; droplet-based methods yield much lower counts compared to plate-based approaches, so this parameter should be set accordingly.

Thus, we can check the distribution of mean expression values and filter out the genes with very low abundances.

meanCounts <- calcAverage(sce_GQ, use_size_factors = FALSE)
hist(log10(meanCounts), breaks = 100, main="")
abline(v = log10(1), col = "firebrick", lwd=2, lty=2)

To calculate the size factors, we first roughly cluster the cells (to avoid pooling cells that are very different and contain a large proportion of DE genes) and then use the deconvolution method to estimate the size factors per pool and then per cell.

clusters <- quickCluster(sce_GQ, min.mean = 1, method = "igraph")
sce_GQ <- computeSumFactors(sce_GQ, cluster = clusters, min.mean = 1)

Size factors should be correlated to the library size of each sample.

For this dataset we observe that this is true for the great majority of the samples, with a strong linear relationship between the two. However, a small number of cells have a different behaviour, with size factors that follow a different slope.

df <- data.frame(library_size = colSums(counts(sce_GQ)), size_factor = sizeFactors(sce_GQ))
df <- df[order(df$library_size),]
ggplot(df, aes(x = library_size, y = size_factor)) + geom_point(alpha = 0.8)

We can see that these cells have a much higher ratio of their library size to the corresponding size factors.

flagged <- df$library_size/df$size_factor > 4e06

par(mfrow=c(1,2))
plot(df$library_size, df$size_factor, pch=16, col=ifelse(flagged,  "red", "black"), bty="l", xlab="library size", ylab="size factor")
plot(df$library_size/df$size_factor, pch=16, bty="l", xlab="", ylab="library size / size factors", col=ifelse(flagged,  "red", "black"))
abline(h=4e06, lty=2, col="grey")

summary(flagged)
##    Mode   FALSE    TRUE 
## logical    1656      20

We will remove these cells.

sce_GQ <- sce_GQ[,!flagged]
saveRDS(sce_GQ, paste0(dir,"nestorowa_rawcounts_goodQuality.Rds"))

Finally, it is also good to check that the distribution of the size factors is not skewed for particular batches.

colData(sce_GQ) %>% as.data.frame %>%
    ggplot(aes(x = log2(sizeFactors(sce_GQ)), y = plate_preparation_day, fill = cell_sorting_day)) +
    xlab(expression(log[2]*" size factors")) + ylab("") +
    geom_density_ridges(alpha = 0.5) + geom_vline(xintercept = 0, linetype = 2) +
    theme_ridges() + ggtitle("Distribution of size factors") +
    scale_fill_tableau()

The spike-in transcripts should be normalised separately, since they do not suffer from the same biases as the endogenous genes. Instead, the spike-in size factors are calculated as the total counts across all spike-in transcripts, effectively only normalising differences in sequencing depth.

We set the option general.use = FALSE to ensure that these size factors are only used for the spike-in counts and not the endogenous genes.

sce_GQ <- computeSpikeFactors(sce_GQ, type = "ERCC", general.use = FALSE)

Once the size factors have been computed we use them to normalise the raw counts.

sce_norm <- normalize(sce_GQ)
saveRDS(sce_norm, paste0(dir,"nestorowa_normcounts_goodQuality.Rds"))

Dimensionality reduction

Single-cell RNA-seq technologies allow us to assess the expression of thousands of genes in every cell. We can view every gene as a different dimension. Since we are not able to think in more than three dimensions we use dimensionality reduction techniques that project the high-dimensional space onto two or three dimensions.

One of the most popular techniques to do this is to use Principal Component Analysis (PCA). As a simplified example, imagine we want to obtain the best 1D representation of a 2D scatter plot.

To reduce a 3D dataset to a 2D scatter plot we find the plane that best captures the variation in the data.

A good explanation of PCA (in case you are not familiar with it) is here: https://www.youtube.com/watch?v=FgakZw6K1QQ

So let’s now compute PCA on our dataset, using the top 500 most variable genes.

var_genes = apply(logcounts(sce_norm), 1, var)

pcs = prcomp(t(logcounts(sce_norm)[order(var_genes, decreasing = TRUE)[1:500],]))

df = as.data.frame(cbind(colData(sce_norm), pcs$x[,1:10]))
perc_var <- round(summary(pcs)$importance[2,]*100, 1)

ggplot(df, aes(x = PC1, y = PC2, colour = plate_preparation_day)) +
  geom_point(alpha = 0.6) + 
  xlab(paste0(perc_var[1],"% variance explained")) +
  ylab(paste0(perc_var[2],"% variance explained"))

An alternative is to use t-Distributed Stochastic Neighbor Embedding (t-SNE).

  • It doesn’t assume linearity like PCA so it can handle non-linear relationships better.

  • It is not deterministic so you will get slightly different results every time you run it (set a seed!)

tSNE has one important parameter called perplexity. It reflects the number and density expected in the clusters. The algorithm is reasonably robust to this value but it is always a good idea to check a few different values.

Again, this video provides a good explanation of how t-SNE works (https://www.youtube.com/watch?v=NEaUSP4YerM).

So let’s now compute a t-SNE plot from the data.

set.seed(135)
tsne <- Rtsne(t(logcounts(sce_norm)[order(var_genes, decreasing = TRUE)[1:500],]), pca = TRUE)
head(tsne$Y)
##           [,1]       [,2]
## [1,] -15.43787  -6.980783
## [2,] -17.57317  -2.363815
## [3,] -14.79197 -11.928031
## [4,] -23.76820   2.657789
## [5,] -27.55695  10.519399
## [6,] -22.04607  -2.772922
df <- cbind(data.frame(comp1 = tsne$Y[,1], comp2 = tsne$Y[,2]), as.data.frame(colData(sce_GQ)))
ggplot(df, aes(x = comp1, y = comp2, colour = plate_preparation_day)) + geom_point(alpha = 0.4)

And let’s also check a few different values of perplexity.

set.seed(135)
perplexities <- c(10,30,50,70)
tsne_list <- list()
for (i in 1:length(perplexities)){
  tsne_list[[i]] = Rtsne(t(logcounts(sce_norm)[order(var_genes, decreasing = TRUE)[1:500],]), pca = TRUE, perplexity = perplexities[i])
  }

plot_list <- list()
for (i in 1:length(tsne_list)){
  df <- data.frame(comp1 = tsne_list[[i]]$Y[,1], comp2 = tsne_list[[i]]$Y[,2])
  df <- cbind(df, as.data.frame(colData(sce_norm)))
  plot_list[[i]] <- ggplot(df, aes(x = comp1, y = comp2, colour = plate_preparation_day)) + 
    geom_point(alpha = 0.2) + ggtitle(paste0("perplexity = ",perplexities[i])) + theme(legend.position = "none")
}
multiplot(plotlist = plot_list, cols = 2)

Finally, it is good to colour the cells based on the variables that could induce technical batch effects to ensure that the clustering is not driven by batch.

plot_list <- list()
df <- data.frame(comp1 = tsne$Y[,1], comp2 = tsne$Y[,2])
df <- cbind(df, as.data.frame(colData(sce_GQ)))
plot_list[[1]] <- ggplot(df, aes(x = comp1, y = comp2, colour = plate_preparation_day)) + 
    geom_point(alpha = 0.2)
plot_list[[2]] <- ggplot(df, aes(x = comp1, y = comp2, colour = cell_sorting_day)) + 
    geom_point(alpha = 0.2)
plot_list[[3]] <- ggplot(df, aes(x = comp1, y = comp2, colour = mouse_sample_information)) + 
    geom_point(alpha = 0.2) 
multiplot(plotlist = plot_list, cols = 2)

We will stop the analysis here. Throughout the analysis we have been saving the objects with the analysis results so we will be able to pick up form here tomorrow.

## it is good practice to include the session information to have a record of the version of all the packages used
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Rtsne_0.13                  dplyr_0.7.6                
##  [3] ggthemes_4.0.1              RColorBrewer_1.1-2         
##  [5] ggridges_0.5.0              biomaRt_2.36.1             
##  [7] scran_1.8.4                 scater_1.8.4               
##  [9] SingleCellExperiment_1.2.0  SummarizedExperiment_1.10.1
## [11] DelayedArray_0.6.5          BiocParallel_1.14.2        
## [13] matrixStats_0.54.0          GenomicRanges_1.32.6       
## [15] GenomeInfoDb_1.16.0         IRanges_2.14.11            
## [17] S4Vectors_0.18.3            ggplot2_3.0.0              
## [19] Biobase_2.40.0              BiocGenerics_0.26.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6             bit64_0.9-7             
##  [3] httr_1.3.1               progress_1.2.0          
##  [5] rprojroot_1.3-2          dynamicTreeCut_1.63-1   
##  [7] tools_3.5.0              backports_1.1.2         
##  [9] irlba_2.3.2              R6_2.2.2                
## [11] DT_0.4                   vipor_0.4.5             
## [13] DBI_1.0.0                lazyeval_0.2.1          
## [15] colorspace_1.3-2         withr_2.1.2             
## [17] prettyunits_1.0.2        tidyselect_0.2.4        
## [19] gridExtra_2.3            curl_3.2                
## [21] bit_1.1-14               compiler_3.5.0          
## [23] labeling_0.3             scales_1.0.0            
## [25] stringr_1.3.1            digest_0.6.16           
## [27] rmarkdown_1.10           XVector_0.20.0          
## [29] pkgconfig_2.0.2          htmltools_0.3.6         
## [31] limma_3.36.3             highr_0.7               
## [33] htmlwidgets_1.2          rlang_0.2.2             
## [35] rstudioapi_0.7           RSQLite_2.1.1           
## [37] FNN_1.1.2.1              shiny_1.1.0             
## [39] DelayedMatrixStats_1.2.0 bindr_0.1.1             
## [41] RCurl_1.95-4.11          magrittr_1.5            
## [43] GenomeInfoDbData_1.1.0   Matrix_1.2-14           
## [45] Rcpp_0.12.18             ggbeeswarm_0.6.0        
## [47] munsell_0.5.0            Rhdf5lib_1.2.1          
## [49] viridis_0.5.1            stringi_1.2.4           
## [51] yaml_2.2.0               edgeR_3.22.3            
## [53] zlibbioc_1.26.0          rhdf5_2.24.0            
## [55] plyr_1.8.4               grid_3.5.0              
## [57] blob_1.1.1               promises_1.0.1          
## [59] shinydashboard_0.7.0     crayon_1.3.4            
## [61] lattice_0.20-35          hms_0.4.2               
## [63] locfit_1.5-9.1           knitr_1.20              
## [65] pillar_1.3.0             igraph_1.2.2            
## [67] rjson_0.2.20             reshape2_1.4.3          
## [69] XML_3.98-1.16            glue_1.3.0              
## [71] evaluate_0.11            data.table_1.11.4       
## [73] httpuv_1.4.5             gtable_0.2.0            
## [75] purrr_0.2.5              assertthat_0.2.0        
## [77] mime_0.5                 xtable_1.8-3            
## [79] later_0.7.4              viridisLite_0.3.0       
## [81] tibble_1.4.2             AnnotationDbi_1.42.1    
## [83] beeswarm_0.2.3           memoise_1.1.0           
## [85] tximport_1.8.0           bindrcpp_0.2.2          
## [87] statmod_1.4.30