## Introduction

This notebook will walk you through some of the analyses presented in Cusanovich et al. (bioRxiv 2017) for working with sci-ATAC-seq data from developing Drosophila melanogaster embryos.
The tutorial is broken into the following sections:

1. Using latent semantic indexing (‘LSI’) to identify clades of cells with similar chromatin accessibility profiles
2. Re-clustering cells with t-SNE
3. Identifying differentially accessible sites between clusters of cells
4. Arranging cells along developmental trajectories with “pseudotemporal ordering”

## Installation

### Required R Packages

• Matrix
• proxy
• gplots
• Rtsne
• densityClust (from here https://github.com/Xiaojieqiu/densityClust)
• DDRTree (version 0.1.4)
• monocle (version 2.5.3)
• irlba (version 1.0.3)

It is important that the correct versions of monocle, DDRTree and irlba are installed, as other versions may produce different results. irlba v1.0.3 was used to generate the figures in the manuscript that are relevant to use cases 1 and 2, while monocle v2.5.3 and DDRTree v0.1.4 are the versions that were used for the trajectories presented in the paper. For the first two use cases, we will install the legacy version of irlba (1.0.3), and for the second two we will detach irlba and then load monocle 2.5.3 and DDRTree 0.1.4. To install these packages, open a command line and run:

R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.9 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] irlba_1.0.3      densityClust_0.3 Rtsne_0.13       gplots_3.0.1
[5] proxy_0.4-16     Matrix_1.2-10

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11        magrittr_1.5        uuid_0.1-2
[4] lattice_0.20-35     R6_2.2.2            stringr_1.2.0
[7] caTools_1.17.1      tools_3.2.1         grid_3.2.1
[10] KernSmooth_2.23-15  gtools_3.5.0        digest_0.6.12
[13] crayon_1.3.2        IRdisplay_0.4.4     repr_0.12.0
[16] bitops_1.0-6        IRkernel_0.8.6.9000 evaluate_0.10.1
[19] pbdZMQ_0.2-6        gdata_2.18.0        stringi_1.1.5
[22] jsonlite_1.5
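
Because the results depend on specific package versions, it can help to compare what is installed against the versions listed above before running anything. The following is a minimal sketch, not part of the original notebook: the helper `check_versions` and the example library state are hypothetical, and note that irlba 1.0.3 only matters for use cases 1 and 2.

```r
# Required versions taken from the package list above.
required <- c(irlba = "1.0.3", monocle = "2.5.3", DDRTree = "0.1.4")

# Hypothetical helper: compare installed versions against the required ones.
# `installed` is a named character vector (package name -> version string);
# by default it is read from the local library with installed.packages().
check_versions <- function(required,
                           installed = installed.packages()[, "Version"]) {
  found <- installed[names(required)]
  data.frame(
    package   = names(required),
    required  = unname(required),
    installed = unname(found),
    ok        = !is.na(found) & unname(found) == unname(required),
    stringsAsFactors = FALSE
  )
}

# Example with a made-up library state (not a real installation):
# irlba is too new, monocle matches, DDRTree is missing.
example_installed <- c(irlba = "2.2.1", monocle = "2.5.3")
report <- check_versions(required, example_installed)
print(report)
```

Calling `check_versions(required)` with no second argument would check the local library instead of the made-up example.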


## Use case 1: ID clades with ‘LSI’

We have found it convenient to store the data as a sparse, binary matrix of genomic loci x cells. Below we will walk you through an analysis where individual cells have been scored for insertions in 2kb windows throughout the genome. Note that if you want to go back to the raw sequencing data (FASTQ files), they are available at the Gene Expression Omnibus under accession code GSE101581. We have made tools for manipulating these files available on GitHub. Alternatively, we have a great number of resources for interacting with the data at our companion site.

After filtering out 2kb windows that intersect with ENCODE-defined blacklist regions, we end up with 83,290 distinct 2kb windows in the Drosophila genome and we have data from 7,880 cells at the 6-8 hour time point.

83290
7880

2 x 2 sparse Matrix of class "dgCMatrix"
AGCGATAGAACGAATTCGAGAACCGGAGCCTATCCT
chr2L_0_2000                                       .
chr2L_2000_4000                                    .
AGCGATAGAACGAATTCGAGTCATAGCCGTACTGAC
chr2L_0_2000                                       .
chr2L_2000_4000                                    .


Let’s collect some info about the frequency of insertion in individual windows and the variety of windows observed in individual cells. We can see that sites are observed in as many as 6,322 cells (80%), but the number of cells per site is roughly normally distributed on a log scale (with a long lower tail), with a median of 330 cells having an insertion in each site.

Now let’s only retain the most commonly used sites (top 20,000 here). Looking at the distribution of how many of these top 20,000 sites each cell covers, we can again see that the distribution is roughly log-normal (with a long lower tail) with a median of 3,751 windows covered by each cell.

Before transforming the data, we just filter out the lowest 10% of cells (in terms of site coverage) and ensure that there are no empty sites.
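
The filtering steps above can be sketched on a toy matrix. This is an illustration under assumptions rather than the notebook's original code: the helper names `keep_top_sites` and `drop_low_cells`, the toy data, and the exact cutoff rule (cells strictly above the 10% quantile are kept) are all made up for this example.

```r
library(Matrix)

# Toy binary sites x cells matrix (made-up data, 6 sites by 5 cells).
toy <- Matrix(c(1, 1, 1, 1, 0,
                1, 1, 0, 1, 0,
                1, 0, 0, 1, 0,
                0, 1, 0, 0, 0,
                1, 1, 1, 1, 1,
                0, 0, 0, 0, 0),
              nrow = 6, ncol = 5, byrow = TRUE, sparse = TRUE,
              dimnames = list(paste0("site", 1:6), paste0("cell", 1:5)))

# Keep the n_sites windows that are observed in the most cells.
keep_top_sites <- function(mat, n_sites) {
  site_counts <- Matrix::rowSums(mat)
  n_keep <- min(n_sites, nrow(mat))
  top <- order(site_counts, decreasing = TRUE)[seq_len(n_keep)]
  mat[sort(top), , drop = FALSE]
}

# Drop the lowest `frac` of cells by site coverage, then drop empty sites.
drop_low_cells <- function(mat, frac = 0.1) {
  cell_counts <- Matrix::colSums(mat)
  cutoff <- quantile(cell_counts, probs = frac)
  keep <- unname(cell_counts > cutoff)
  mat <- mat[, keep, drop = FALSE]
  mat[unname(Matrix::rowSums(mat) > 0), , drop = FALSE]
}

top_sites <- keep_top_sites(toy, n_sites = 4)
filtered <- drop_low_cells(top_sites, frac = 0.1)
print(dim(filtered))
```

On the real data, `n_sites` would be 20,000 and the matrix would have 83,290 rows and 7,880 columns.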

We can now transform the data using TF-IDF and then generate a lower dimensional representation of the data with truncated SVD (these are the two primary steps of LSI).
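
The two LSI steps can be sketched on a toy matrix as follows. This is a minimal illustration under assumptions: the TF-IDF weighting shown is one common formulation and may differ in detail from the one used for the manuscript, the helper name `tf_idf` and the toy data are made up, and base R `svd` on a dense copy stands in for a truncated SVD (the package list above indicates that irlba was used for this step).

```r
library(Matrix)

# Toy binary sites x cells matrix (made-up data, 4 sites by 3 cells).
toy <- Matrix(c(1, 1, 0, 0,
                1, 0, 1, 0,
                1, 1, 1, 1),
              nrow = 4, ncol = 3, sparse = TRUE)

# Hypothetical helper: TF-IDF weighting of a binary sites x cells matrix.
tf_idf <- function(mat) {
  # Term frequency: each cell (column) is divided by its total site count.
  tf <- t(t(mat) / Matrix::colSums(mat))
  # Inverse document frequency: sites seen in fewer cells get larger weights.
  idf <- log(1 + ncol(mat) / Matrix::rowSums(mat))
  Diagonal(x = as.numeric(idf)) %*% tf
}

weighted <- tf_idf(toy)

# Lower dimensional representation: keep the first k singular vectors.
k <- 2
decomp <- svd(as.matrix(weighted), nu = k, nv = k)
print(decomp$d)
```

For a matrix the size of the real data, a dense `svd` would be impractical, which is why a truncated solver is the natural choice there.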

Here, we only retain components 2-6 (component 1 is highly correlated with read depth) and truncate the distribution of LSI values at +/-1.5.
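
As a rough sketch of this step, the matrix can be rebuilt from components 2-6 only and then clipped. The toy data are random, and the per-site centering and scaling before clipping is an assumption made for this example rather than something stated in the text above.

```r
set.seed(42)

# Made-up stand-in for the TF-IDF matrix: 20 sites x 10 cells.
toy <- matrix(rnorm(20 * 10), nrow = 20)
decomp <- svd(toy)

# Rebuild the matrix from components 2-6, leaving out component 1.
comps <- 2:6
recon <- decomp$u[, comps] %*%
  diag(decomp$d[comps], nrow = length(comps)) %*%
  t(decomp$v[, comps])

# Assumption: center and scale each site (row) before truncating.
lsi <- t(scale(t(recon)))

# Truncate the distribution of values at +/-1.5.
lsi[lsi > 1.5] <- 1.5
lsi[lsi < -1.5] <- -1.5
print(range(lsi))
```

Clipping keeps a handful of extreme values from dominating the color scale of the heatmap.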

Next, we generate some info about the dendrograms that will be used for plotting the heatmap.

Finally, we can generate a bi-clustered heatmap showing how cells and sites are related.
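
The dendrogram step can be sketched with base R hierarchical clustering. This is only an illustration: the toy matrix is made up, and Euclidean distance with Ward linkage is an assumption here (the proxy package in the list above suggests a different distance measure may have been used for the manuscript).

```r
set.seed(7)

# Made-up LSI-like matrix: 10 sites x 12 cells forming two groups of cells.
group_a <- matrix(rnorm(10 * 6, mean = -1, sd = 0.2), nrow = 10)
group_b <- matrix(rnorm(10 * 6, mean = 1, sd = 0.2), nrow = 10)
lsi <- cbind(group_a, group_b)
colnames(lsi) <- paste0("cell", 1:12)
rownames(lsi) <- paste0("site", 1:10)

# Cluster cells (columns) and sites (rows) separately.
cell_tree <- hclust(dist(t(lsi)), method = "ward.D2")
site_tree <- hclust(dist(lsi), method = "ward.D2")

# Cut the cell tree into clades; k = 2 matches the two toy groups.
clades <- cutree(cell_tree, k = 2)
print(table(clades))

# The trees could then be handed to a heatmap function, for example:
# gplots::heatmap.2(lsi, Rowv = as.dendrogram(site_tree),
#                   Colv = as.dendrogram(cell_tree), trace = "none")
```

The clade assignments from `cutree` are what make the “in silico sorting” in the next use case possible.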

## Use case 2: Re-cluster cells with t-SNE

Having identified large clades of cells that were consistent with the development of germ layers during embryogenesis, we were able to identify peaks of chromatin accessibility in each clade after “in silico sorting” the cells assigned to each cluster. We next sought to robustly identify smaller clusters of cells that might be more consistent with individual cell types so that we could learn about the individual regulatory elements that govern distinct cell states. To do so, we generated a master list of summits of accessibility identified in each of the clades across the three time points and then created a matrix of sites by cells (similar to the last use case).

53133
7880

2 x 2 sparse Matrix of class "dgCMatrix"
AGCGATAGAACGAATTCGAGAACCGGAGCCTATCCT
chr2L_5543_5980                                    .
chr2L_6666_6881                                    .
AGCGATAGAACGAATTCGAGTCATAGCCGTACTGAC
chr2L_5543_5980                                    .
chr2L_6666_6881                                    .


The analysis starts out similarly to the LSI example above. We first filter out sites that are seen in few cells (in this case we only keep sites that are seen in at least 5% of cells) and then cells that have relatively low coverage (again, we filter out the lowest 10% of cells).

The next step is to transform the data and generate a lower dimensional representation again, except that we first filter out sex chromosome counts. We also leave the first component in now, and we use 50 dimensions (rather than 6).

19956
7092

2 x 2 sparse Matrix of class "dgCMatrix"
AGCGATAGAACGAATTCGAGAACCGGAGCCTATCCT
chr2L_5543_5980                                    .
chr2L_7488_8077                                    .
AGCGATAGAACGAATTCGAGTCATAGCCGTACTGAC
chr2L_5543_5980                                    .
chr2L_7488_8077                                    .
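
Since the site names encode the chromosome, filtering out sex chromosome sites can be sketched by parsing the row names. This is an illustration only: apart from the two site names shown in the output above, the coordinates and counts are made up, and treating `chrX` and `chrY` as the sex chromosome labels is an assumption about the naming scheme.

```r
library(Matrix)

# Site names follow the chrom_start_end pattern seen in the output above.
sites <- c("chr2L_5543_5980", "chr2L_7488_8077",
           "chrX_1000_1500", "chrY_200_700", "chr3R_50_400")

# Made-up binary sites x cells matrix with those row names.
toy <- Matrix(c(1, 0, 1,
                0, 1, 1,
                1, 1, 0,
                0, 0, 1,
                1, 0, 0),
              nrow = 5, ncol = 3, byrow = TRUE, sparse = TRUE,
              dimnames = list(sites, paste0("cell", 1:3)))

# Extract the chromosome as everything before the first underscore.
chrom <- sub("_.*$", "", rownames(toy))
autosomal <- !(chrom %in% c("chrX", "chrY"))

toy_autosomal <- toy[autosomal, , drop = FALSE]
print(rownames(toy_autosomal))
```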


Next, we use t-SNE to visualize the data. We feed this lower dimensional representation of the data directly into the Rtsne package.

To identify clusters of cells, we use the density peak algorithm.

Distance cutoff calculated to 3.405709

the length of the distance: 25144686


The density peak algorithm requires you to set two parameters - “delta” and “rho”. For each data point, the algorithm calculates a local density of other points within some set distance and the minimum distance to the next point that has a higher local density. On the basis of these two values, you can choose a set of points that are outliers both in local density and the distance to another point with a higher density, which become cluster “peaks”. Below, we show you the distribution of these two values in our data set and where we decided to draw the cutoff. You can read more about this algorithm here.
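
To make the two quantities concrete, here is a small base R sketch that computes them for a toy set of points. It is not the densityClust implementation: the helper `density_peaks`, the simple cutoff-based density, the handling of ties, and the thresholds used at the end are all assumptions for illustration.

```r
# Hypothetical helper: compute rho and delta for each point.
# coords: numeric matrix, one row per point; dc: distance cutoff.
density_peaks <- function(coords, dc) {
  d <- as.matrix(dist(coords))
  n <- nrow(d)
  # rho: number of other points closer than the cutoff.
  rho <- unname(rowSums(d < dc) - 1)
  # delta: distance to the nearest point with a higher rho. Points with
  # no denser neighbour get their maximum distance to any point instead.
  delta <- numeric(n)
  for (i in seq_len(n)) {
    higher <- which(rho > rho[i])
    delta[i] <- if (length(higher) > 0) min(d[i, higher]) else max(d[i, ])
  }
  data.frame(rho = rho, delta = delta)
}

# Made-up 2D points: two tight groups, each a square with a central point.
square <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(0.05, 0.05))
coords <- rbind(square, square + 5)

stats <- density_peaks(coords, dc = 0.12)

# Peaks are outliers in both rho and delta; these thresholds are arbitrary.
peaks <- which(stats$rho >= 4 & stats$delta > 1)
print(peaks)
```

In this toy example the central point of each group is the only one that is both dense and far from any denser point, so those two points become the cluster peaks.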

Finally, we plot the t-SNE plots and show how the points are related to our original LSI clustering, the assigned sex of each cell (you’ll need to download an additional file for this), and current density peak clusters.

## Use case 3: Differential accessibility

Having defined clusters of cells on the basis of chromatin accessibility profiles, you might like to know which sites in the genome are different between clusters. To figure that out, we use the linear modelling framework implemented in Monocle. This use case will walk you through the steps to define which sites are open in mesodermal cells relative to all other cells in the time point, but the same framework could be applied to any comparison you like.

Before getting into the analysis, we need to take care of a few housekeeping things. First, we want to load Monocle to run the differential accessibility tests. To do this we detach the legacy version of irlba and then load Monocle. Finally, we created a little patch to Monocle that reports beta values from the differential accessibility tests so that we can distinguish sites that are opening from sites that are closing.

NOTE: This code is implemented to run on multiple processors on a computing cluster. You may need special modifications to get this working in your preferred computing environment.

Because the summit matrix includes data for more cells, we first subset it to cells that we have germ layer predictions for. We then filter out any sites that weren’t observed in at least 50 cells.

used     (Mb)     gc trigger     (Mb)     max used     (Mb)
Ncells     1747068     93.4     11554252     617.1     14442815     771.4
Vcells     443259864     3381.9     2642652780     20161.9     3302654927     25197.3

We next want to load our site x cell matrix into a Cell Data Set (‘CDS’), a format that allows us to use some of the convenience functions available in monocle. The first step is to set up a “phenotype” data frame. This is a framework for storing all kinds of information about individual cells.

We next set up a “feature” data frame - a framework for storing information about data features (in this case, sites).

And then we can combine the feature data frame and phenotype data frame with the raw site x cell matrix into a CDS. Because the data are binary calls of insertions in sites, we set the expressionFamily to binomialff.

Warning message in newCellDataSet(da_mat_final, featureData = fda, phenoData = pda, :
“Warning: featureData must contain a column verbatim named 'gene_short_name' for certain functions”
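
The phenotype and feature data frames can be sketched in base R as below. The column names, toy values, and the helper-free layout are assumptions for illustration; only `newCellDataSet`, `fda`, `pda` and `gene_short_name` come from the warning text above. The final monocle call is left as a comment because it requires monocle and VGAM to be loaded.

```r
library(Matrix)

# Made-up binary sites x cells matrix (2 sites by 3 cells).
da_mat <- Matrix(c(1, 0, 1,
                   0, 1, 1),
                 nrow = 2, ncol = 3, byrow = TRUE, sparse = TRUE,
                 dimnames = list(c("chr2L_5543_5980", "chr2L_7488_8077"),
                                 c("cellA", "cellB", "cellC")))

# Phenotype data frame: one row per cell, row names matching the columns.
pheno <- data.frame(
  germ_layer     = c("Mesoderm", "Other", "Mesoderm"),  # made-up labels
  sites_per_cell = unname(Matrix::colSums(da_mat)),
  row.names      = colnames(da_mat),
  stringsAsFactors = FALSE
)

# Feature data frame: one row per site, row names matching the rows.
# Including a gene_short_name column would avoid the warning shown above.
feature <- data.frame(
  Peak            = rownames(da_mat),
  gene_short_name = rownames(da_mat),
  cells_per_site  = unname(Matrix::rowSums(da_mat)),
  row.names       = rownames(da_mat),
  stringsAsFactors = FALSE
)

# With monocle loaded, these could be wrapped and combined, for example:
# pda <- new("AnnotatedDataFrame", data = pheno)
# fda <- new("AnnotatedDataFrame", data = feature)
# cds <- newCellDataSet(da_mat, featureData = fda, phenoData = pda,
#                       expressionFamily = binomialff())
```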


Here we just delete some variables that we won’t need anymore to free up RAM.

used     (Mb)     gc trigger     (Mb)     max used     (Mb)
Ncells     1810190     96.7     9243401     493.7     14442815     771.4
Vcells     443430760     3383.2     2114122224     16129.5     3302654927     25197.3

And now we’re ready to run the likelihood ratio tests to identify differentially accessible sites. Monocle provides a convenient wrapper for doing this with the differentialGeneTest function. This step can take several minutes to run.


status     family     pval     beta     qval     Peak
24326     OK     binomialff     0.000000e+00     2.778820     0.000000e+00     chr3L_10547945_10548368
38943     OK     binomialff     0.000000e+00     3.194512     0.000000e+00     chr3R_18852360_18852769
8709     OK     binomialff     0.000000e+00     3.442576     0.000000e+00     chr2L_20476729_20477328
8716     OK     binomialff     1.388543e-311     3.344137     1.809584e-307     chr2L_20486338_20486627
12015     OK     binomialff     1.716059e-305     3.306066     1.789129e-301     chr2R_5822850_5823121
19994     OK     binomialff     4.726260e-298     2.672527     4.106254e-294     chr3L_589563_589900

With this test, we find 19,535 sites are differentially accessible (1% FDR) in myogenic mesoderm cells relative to all other cells in the embryo. 8,398 of these are more accessible in myogenic mesoderm than other cells, while 11,137 are less accessible in myogenic mesoderm. We can see that the sites at the top of the list are consistent with what you’d expect (for example, the second most significant hit happens to be the promoter of lmd, a known master regulator of “fusion-competent myoblasts”).
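
To illustrate what a likelihood ratio test reports for a single site, here is a small sketch using base R `glm` in place of Monocle. The data are made up, the helper `lrt_site` is hypothetical, and the models are deliberately minimal (cluster membership versus intercept only), so this is not the model specification used for the results above.

```r
# Made-up data for one site: 100 cells in the cluster of interest
# (80 accessible) and 100 other cells (20 accessible).
in_cluster <- rep(c(1, 0), each = 100)
accessible <- c(rep(c(1, 0), times = c(80, 20)),
                rep(c(1, 0), times = c(20, 80)))

# Hypothetical helper: likelihood ratio test of a full model (with the
# cluster term) against a reduced model (intercept only).
lrt_site <- function(y, group) {
  full <- glm(y ~ group, family = binomial())
  reduced <- glm(y ~ 1, family = binomial())
  stat <- deviance(reduced) - deviance(full)
  df <- df.residual(reduced) - df.residual(full)
  c(pval = pchisq(stat, df = df, lower.tail = FALSE),
    beta = unname(coef(full)["group"]))
}

res <- lrt_site(accessible, in_cluster)
print(res)

# Across many sites, p-values would be corrected for multiple testing.
# The two extra p-values here are made up to show the call.
qvals <- p.adjust(c(res[["pval"]], 0.04, 0.5), method = "BH")
```

The sign of `beta` is what separates opening from closing sites: a positive value means the site is more accessible in the cluster of interest, a negative value means it is less accessible.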

## Use case 4: Order cells in development

One very powerful aspect of single-cell technologies is that we can use them to trace the developmental trajectories of cells. In the Cusanovich et al. paper, we showed that sci-ATAC-seq data can be used to identify developmental trajectories. Here we walk you through the basic steps of that analysis.

For this analysis, we turn to the 2-4 hour time point, so we’ll load a CDS for cells from that time point. On the basis of the t-SNE analysis of these cells we were able to classify the cells into the following 7 cell types: ‘Unknown’, ‘Collisions’, ‘Blastoderm’, ‘Neural’, ‘Ectoderm’, ‘Mesoderm’, ‘Endoderm’. However, we really wanted to see how cells transitioned from the blastoderm state to the germ layers, so we used monocle to arrange cells in a progression to learn about how the regulatory landscape of the genome changed through development.

One note: instead of using the counts of reads mapping to individual sites of accessibility for this analysis, we binned all the sites that were near each other in order to reduce the sparsity of the data.

To run through this section, we need several files, so we’ll download a tarball and unpack it before proceeding.

Blastoderm Collisions   Ectoderm   Endoderm   Mesoderm     Neural    Unknown
2662        164       1865        575       1496        103        358


After unpacking all those files, we want to collect the top 100 DA sites for each of the cell types found in this time point (excluding sites from the ‘Collisions’ and ‘Unknown’ categories). After collecting those sites we need to determine which binned sites they overlap.
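
Collecting the top sites per cell type can be sketched as follows. The results table, the helper `top_sites`, and the choice to rank by q-value alone are assumptions made for this example; the real tables and ranking criteria may differ.

```r
# Made-up differential accessibility results for a few cell types.
da_results <- data.frame(
  Peak      = paste0("site", 1:10),
  cell_type = c(rep("Mesoderm", 4), rep("Ectoderm", 4), rep("Unknown", 2)),
  qval      = c(1e-10, 1e-3, 1e-50, 0.2, 1e-5, 1e-8, 0.5, 1e-2, 1e-20, 1e-30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: take the n most significant sites per cell type,
# skipping the excluded categories, and return the union of their names.
top_sites <- function(results, n, exclude = c("Collisions", "Unknown")) {
  results <- results[!(results$cell_type %in% exclude), ]
  per_type <- split(results, results$cell_type)
  picked <- lapply(per_type, function(df) head(df$Peak[order(df$qval)], n))
  unique(unlist(picked, use.names = FALSE))
}

ordering_sites <- top_sites(da_results, n = 2)
print(ordering_sites)
```

With the real data, `n` would be 100, and the resulting site names would then be matched to the binned sites they overlap.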

We then update the 2-4 hour CDS to track what the germ layer assignments are and to establish which sites should be used for ordering cells (the top DA sites we defined above).

Having set up the CDS, now we use DDRTree to learn the developmental trajectory and then arrange cells on that trajectory with the orderCells function.

Now we can plot the cells on the trajectory to evaluate how the trajectory relates to our germ layer assignments. Please note that either of the axes may be flipped for you which will affect where monocle puts the root state. With the set of parameters we’ve chosen, you can see that at the beginning of pseudotime all the cells are following a single trajectory and most of the cells here are from the “Blastoderm” clusters. However, as pseudotime progresses, we get three branches - each primarily made up of cells from one of the three major germ layers observed at this time point.

For more examples of analyses that you can do with monocle, please visit the monocle website. If you’d like to interact with the processed data and check out browser tracks for the clusters of cells we’ve identified, please visit our companion site. Finally, if you’d like, you can download the raw data here, and we’ve also made the code we used to process the raw data available on GitHub.

R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.9 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      splines   stats4    parallel  stats     graphics  grDevices
[8] utils     datasets  methods   base

other attached packages:
[1] monocle_2.5.3       DDRTree_0.1.4       irlba_2.2.1
[4] VGAM_1.0-3          ggplot2_2.2.1       Biobase_2.30.0
[7] BiocGenerics_0.16.1 densityClust_0.3    Rtsne_0.13
[10] gplots_3.0.1        proxy_0.4-16        Matrix_1.2-10

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11           lattice_0.20-35        gtools_3.5.0
[4] assertthat_0.2.0       digest_0.6.12          IRdisplay_0.4.4
[7] slam_0.1-35            R6_2.2.2               plyr_1.8.4
[10] repr_0.12.0            qlcMatrix_0.9.5        evaluate_0.10.1
[13] rlang_0.1.1            lazyeval_0.2.0         uuid_0.1-2
[16] data.table_1.10.4      gdata_2.18.0           combinat_0.0-8
[19] labeling_0.3           stringr_1.2.0          igraph_1.1.2
[22] pheatmap_1.0.8         munsell_0.4.3          pkgconfig_2.0.1
[25] tibble_1.3.3           matrixStats_0.50.1     crayon_1.3.2
[28] dplyr_0.7.1            bitops_1.0-6           jsonlite_1.5
[31] gtable_0.2.0           magrittr_1.5           scales_0.4.1
[34] KernSmooth_2.23-15     stringi_1.1.5          reshape2_1.4.2
[37] bindrcpp_0.2           limma_3.26.9           IRkernel_0.8.6.9000
[40] fastICA_1.2-1          RColorBrewer_1.1-2     tools_3.2.1
[43] Cairo_1.5-9            glue_1.1.1             HSMMSingleCell_0.104.0
[46] colorspace_1.3-2       cluster_2.0.6          caTools_1.17.1
[49] pbdZMQ_0.2-6           bindr_0.1