They obtained single-cell ATAC-seq profiles for ~3000 human hematopoietic cells. All the cells were FACS-sorted, so Buenrostro et al. provide a ground truth for cell identity that can be used to check data processing, cell type clustering and differentiation trajectory analyses.
You can download the data here: https://drive.google.com/open?id=1q-JPHAK8KR44x4bXPNvVt6IdZ5y8wmrk
import anndata as ad
import episcanpy.api as epi
import scanpy as sc
import numpy as np
# settings for the plots
sc.set_figure_params(scanpy=True, dpi=80, dpi_save=250,
                     frameon=True, vector_friendly=True,
                     color_map="YlGnBu", format='pdf', transparent=False,
                     ipython_format='png2x')
# specify the directory
DATADIR = ''
# Load the data
adata = ad.read_h5ad(DATADIR+'GSE96769_anndata.h5ad')
adata
For a more detailed description of the AnnData object, check out: https://anndata.readthedocs.io/en/stable/
adata.obs_names
The cell names are not specified in adata.obs_names. However, you can find them in the metadata.
# current cell annotations
adata.obs
adata.var_names
This is the same scenario as before. The feature names are stored in the variable metadata.
adata.var
# renaming the features using the metadata
adata.var_names = adata.var['region']
adata.var_names
adata.X
If the matrix is not binary:
epi.pp.binarize(adata, copy=False)
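To make the binarization step concrete, here is a minimal NumPy sketch of the idea: every non-zero count becomes 1. The helper `binarize_counts`, its `threshold` parameter and the toy matrix are illustrative assumptions, not episcanpy's implementation.

```python
import numpy as np

def binarize_counts(counts, threshold=0):
    """Return a 0/1 matrix: 1 where the count exceeds `threshold` (hypothetical helper)."""
    counts = np.asarray(counts)
    return (counts > threshold).astype(np.int8)

# Toy count matrix: 3 cells (rows) x 4 features (columns); the values are made up.
toy_counts = np.array([[0, 2, 5, 0],
                       [1, 0, 0, 3],
                       [0, 0, 7, 1]])

toy_binary = binarize_counts(toy_counts)
print(toy_binary)
```

After this step, a row sum counts the open features in a cell and a column sum counts the cells in which a feature is open.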
# Histogram showing the number of cells with a given number of features
epi.pp.coverage_cells(adata, binary=True, bins=100)
# Putting the number of features on a log scale
#epi.pp.filter_cells(adata, min_features=1)
#epi.pp.coverage_cells(adata, binary=True, bins=100, log=True)
# Histogram showing in how many cells you can find the different features
epi.pp.commonness_features(adata, binary=True)
To put the number of cells sharing a feature on a log scale, you first need to make sure that all features are open in at least one cell.
# number of cells open for a feature in log scale
epi.pp.filter_features(adata, min_cells=1)
epi.pp.commonness_features(adata, binary=True, log=True)
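# fraction of features that are open in at least 50 cells
len(adata.var[adata.var["commonness"] >= 50])/len(adata.var)
# number of open features per cell, flattened to 1-D (summing a sparse matrix returns an (n_cells, 1) matrix)
adata.obs["sum_peaks"] = np.asarray(adata.X.sum(axis=1)).ravel()
# number of cells with at most 500 open features
len(adata.obs[adata.obs["sum_peaks"] <= 500])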
# removing features that are open in too few cells
epi.pp.filter_features(adata, min_cells=50)
adata
# removing cells with too little chromatin information
epi.pp.filter_cells(adata, min_features=500)
adata
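The two filtering steps above can be sketched on a toy binary matrix with plain NumPy. The thresholds `min_cells` and `min_features` and the matrix values are made up for illustration; the sketch only mirrors the order used above (features first, then cells) and is not how episcanpy implements the filters.

```python
import numpy as np

# Toy binary matrix: 4 cells (rows) x 5 features (columns); the values are made up.
X = np.array([[1, 0, 1, 1, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 0, 0, 0, 1]])

min_cells = 2      # illustrative threshold: keep features open in >= 2 cells
min_features = 2   # illustrative threshold: keep cells with >= 2 open features

# Commonness: in how many cells is each feature open?
commonness = X.sum(axis=0)
keep_features = commonness >= min_cells
X_f = X[:, keep_features]

# Coverage: how many of the retained features are open in each cell?
coverage = X_f.sum(axis=1)
keep_cells = coverage >= min_features
X_fc = X_f[keep_cells, :]

print("commonness:", commonness)
print("coverage after feature filtering:", coverage)
print("filtered shape:", X_fc.shape)
```

Because the coverage is recomputed after the feature filter, the order of the two steps can change which cells pass the second threshold.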
To limit memory usage and running time, it is useful to select the most variable features for later dimensionality reduction.
We arbitrarily chose the top 50000 most variable features. What happens if a different number of features is chosen?
# selecting the top 50000 most variable features
adata50 = epi.pp.select_var_feature(adata, nb_features=50000, copy=True)
adata50
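To build some intuition for what "most variable" can mean for binary data, the sketch below ranks features by how close their open fraction is to 0.5, since a feature that is open in nearly all or nearly no cells carries little information for separating them. This scoring rule and the helper `top_variable_features` are assumptions made for illustration; it is not confirmed here that `epi.pp.select_var_feature` uses exactly this score.

```python
import numpy as np

def top_variable_features(X, nb_features):
    """Pick the columns whose open fraction is closest to 0.5 (illustrative heuristic).

    Returns the sorted column indices and the per-feature score.
    """
    X = np.asarray(X)
    frac_open = X.mean(axis=0)               # fraction of cells in which each feature is open
    score = 1.0 - np.abs(frac_open - 0.5)    # 1.0 means open in exactly half of the cells
    order = np.argsort(-score, kind="stable")
    return np.sort(order[:nb_features]), score

# Toy binary matrix: 4 cells x 4 features; the values are made up.
X = np.array([[1, 0, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0]])

selected, score = top_variable_features(X, nb_features=2)
print("scores:", score)
print("selected feature indices:", selected)
```

Changing `nb_features` trades memory and running time against the risk of discarding features that separate rare cell types, which is one way to approach the question above.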
# additional filtering - if necessary
epi.pp.filter_features(adata50, min_cells=100)
epi.pp.filter_cells(adata50, min_features=1000)
adata50
# save the temporary filtered matrix
adata50.write(DATADIR+'GSE96769_anndata_top50000features_filtered.h5ad')
# load the filtered data
adata50 = ad.read_h5ad(DATADIR+'GSE96769_anndata_top50000features_filtered.h5ad')
# Principal Component Analysis
epi.pp.pca(adata50, n_comps=100)
epi.pl.pca_overview(adata50)
sc.pl.pca(adata50, color=['celltype11', 'celltype8', 'nb_features'], wspace=0.4)
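As a rough illustration of what the PCA step computes, the sketch below centres a small random binary matrix and projects the cells onto the leading principal components via a singular value decomposition. The matrix, the random seed and `n_comps = 3` are arbitrary choices for the example; the actual solver used by `epi.pp.pca` may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random toy binary matrix: 20 cells x 8 features (not real data).
X = rng.integers(0, 2, size=(20, 8)).astype(float)

n_comps = 3                                   # illustrative number of components
Xc = X - X.mean(axis=0)                       # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

X_pca = U[:, :n_comps] * S[:n_comps]          # coordinates of each cell on the first PCs
var_ratio = S**2 / np.sum(S**2)               # fraction of variance explained per component

print("embedding shape:", X_pca.shape)
print("variance ratio of the first components:", var_ratio[:n_comps])
```

If the leading components mainly track `nb_features` rather than the cell type labels, that points to a coverage effect in the data.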
# computing a neighborhood graph
epi.pp.neighbors(adata50)
# Embed the neighborhood graph using UMAP
epi.tl.umap(adata50)
adata50
sc.pl.umap(adata50, color=['celltype11', 'celltype8', 'nb_features'], wspace=0.4)
# Linear regression on the number of features per cell (remove coverage effect)
epi.pp.regress_out(adata50, "nb_features")
# recompute PCA, neighborhood graph, tSNE and UMAP
epi.pp.lazy(adata50)
sc.pl.pca(adata50, color=['celltype11', 'celltype8', 'nb_features'])
sc.pl.umap(adata50, color=['celltype11', 'celltype8', 'nb_features'])
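The idea behind regressing out the coverage can be sketched with an ordinary least-squares fit per feature: model each feature as a linear function of the covariate and keep only the residuals. The helper `regress_out_covariate` and the toy values are hypothetical; this shows the principle and is not the implementation behind `epi.pp.regress_out`.

```python
import numpy as np

def regress_out_covariate(X, covariate):
    """Return the residuals of each column of X after a linear fit on `covariate` (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    # Design matrix with an intercept column and the covariate.
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ coef

# Toy example: 4 cells, 2 features; the values are made up.
coverage = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([2.0 * coverage + 1.0,             # feature fully explained by coverage
                     np.array([0.0, 1.0, 0.0, 1.0])])  # feature with its own signal

residuals = regress_out_covariate(X, coverage)
print(np.round(residuals, 3))
```

The first feature collapses to zero because it is entirely explained by coverage, while the second keeps the part of its signal that coverage cannot explain. Note that the residuals are no longer binary.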
In scRNA-seq, you cluster the cells and then use marker genes to identify the cell types. To do the same with scATAC-seq, you have to assume that chromatin openness relates to gene expression. For this specific data type, that means you need to know which promoters lie in which window, a table of the closest genes per promoter, and a list of marker genes for each cell type you are interested in. Alternatively, you could construct the count matrix directly for promoters or enhancers.
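The window-to-promoter lookup described above can be sketched in a few lines of plain Python. All window coordinates, promoter coordinates and gene names below are placeholders invented for the example; a real analysis would take them from `adata.var` and a gene annotation file.

```python
# Illustrative placeholders only: these are not real annotations.
windows = [("chr1", 0, 5000), ("chr1", 5000, 10000), ("chr2", 0, 5000)]
promoters = {
    "GENE_A": ("chr1", 4800, 5200),
    "GENE_B": ("chr2", 1000, 1400),
    "GENE_C": ("chr3", 100, 500),
}

def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def genes_per_window(windows, promoters):
    """Map each window to the genes whose promoter overlaps it (hypothetical helper)."""
    mapping = {w: [] for w in windows}
    for gene, (chrom, start, end) in promoters.items():
        for w in windows:
            w_chrom, w_start, w_end = w
            if chrom == w_chrom and overlaps(start, end, w_start, w_end):
                mapping[w].append(gene)
    return mapping

window_genes = genes_per_window(windows, promoters)
for window, genes in window_genes.items():
    print(window, genes)
```

A promoter that spans a window boundary, like `GENE_A` here, is assigned to both windows. The nested loop is fine for a toy example but would be slow genome-wide, where an interval index would be a better choice.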
epi.pp.pca(adata50, n_comps=100, svd_solver='arpack')
epi.pp.neighbors(adata50)
epi.tl.umap(adata50)
# Louvain clustering
epi.tl.louvain(adata50, resolution=0.5)
sc.pl.umap(adata50, color=["louvain","celltype8"], wspace=0.4)
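Because the cells were FACS-sorted, the clusters can be compared against the sorted labels. The sketch below builds a per-cluster count of labels and a simple purity score (the fraction of cells that carry the majority label of their cluster). The cluster and cell type lists are made-up toy values and `cluster_purity` is a hypothetical helper; with the real data you would pass the `louvain` and `celltype8` columns of `adata50.obs`.

```python
from collections import Counter

# Toy values for illustration only.
clusters = ["0", "0", "0", "1", "1", "2", "2", "2"]
celltypes = ["HSC", "HSC", "MPP", "CLP", "CLP", "Mono", "Mono", "HSC"]

def cluster_purity(clusters, labels):
    """Return per-cluster label counts and the overall purity (hypothetical helper)."""
    per_cluster = {}
    for cluster, label in zip(clusters, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    # For each cluster, count the cells that carry its most frequent label.
    majority_total = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return per_cluster, majority_total / len(clusters)

per_cluster, purity = cluster_purity(clusters, celltypes)
for cluster, counts in sorted(per_cluster.items()):
    print(cluster, dict(counts))
print("purity:", purity)
```

Purity rises trivially as the number of clusters grows, so it is best read together with the `resolution` used for Louvain.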
epi.tl.rank_features(adata50, groupby="louvain", n_features=25)
adata50
This step requires loading a GTF file to provide information about the genomic regions used in adata. You will need more information on how to do this using Fastgenomics. Just ask!
sc.pl.umap(adata50, color=["louvain","celltype8"], wspace=0.4)
sc.tl.paga(adata50, groups='louvain')
sc.pl.paga(adata50)
epi.tl.diffmap(adata50)
epi.tl.draw_graph(adata50)
sc.pl.diffmap(adata50, wspace=0.4, color=["louvain","celltype8"])
sc.pl.draw_graph(adata50, wspace=0.4,color=["louvain", 'celltype8'])
# root on stem cell and progenitors
adata50.uns['iroot'] = np.flatnonzero(adata50.obs['louvain'] == '1')[0]
sc.tl.dpt(adata50)
sc.pl.diffmap(adata50, color=["dpt_pseudotime", 'celltype8'])
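The root selection above picks the first cell that belongs to the chosen Louvain cluster. The sketch below shows what `np.flatnonzero` returns on a toy label array and adds a guard for an empty selection. The labels and the choice of cluster `"1"` are illustrative; cluster numbering depends on the clustering run, so check on the UMAP which cluster holds the stem and progenitor cells before fixing the root.

```python
import numpy as np

# Toy cluster labels; Louvain labels are stored as strings, so compare with "1", not 1.
louvain = np.array(["0", "2", "1", "1", "0", "1"])
root_cluster = "1"   # illustrative choice of the stem/progenitor cluster

candidates = np.flatnonzero(louvain == root_cluster)   # positions of all cells in that cluster
if candidates.size == 0:
    raise ValueError(f"no cell found in cluster {root_cluster!r}")

iroot = int(candidates[0])   # the first matching cell becomes the root
print("candidate root cells:", candidates)
print("chosen root index:", iroot)
```

Taking the first candidate is an arbitrary choice; any cell of the root cluster could serve, and the resulting pseudotime can shift slightly depending on which one is used.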
Among the cells in the Buenrostro data, are there cell types that might be contaminants? Could they affect your attempt to find lineage specification?