The data are available on the 10x Genomics website: https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_pbmc_5k_nextgem
# downloading the fragment file --> atac_pbmc_5k_nextgem_fragments.tsv.gz
#!wget https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_fragments.tsv.gz
# downloading the index of the fragment file --> atac_pbmc_5k_nextgem_fragments.tsv.gz.tbi
#!wget https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_fragments.tsv.gz.tbi
# downloading per barcode metrics --> atac_pbmc_5k_nextgem_singlecell.csv
#!wget https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
# to install the latest development version of episcanpy
#!pip install git+https://github.com/colomemaria/epiScanpy@update_load_features
# to install the latest stable version of episcanpy
#!pip install episcanpy
import episcanpy as epi
import anndata as ad
To build a count matrix from scATAC-seq data, you first need to choose the feature space on which the matrix is built.
epiScanpy can load custom feature sets (BED, GTF and GFF input files are accepted) or peaks called with MACS2 (see below for how to call peaks). Alternatively, epiScanpy can generate windows of a given size.
test_bed_file = "tmp_file_merged_peaks.bed"
features = epi.ct.load_features(test_bed_file, path="", input_file_format=None, sort=True)
test_gff_file = "HGAP3_Tb427v10.gff"
features = epi.ct.load_features_gff(test_gff_file,
filter_per_source=["AUGUSTUS", "Pfam"],
filter_per_feature_type='gene')
test_gtf_file = "gencode.vM17.basic.annotation.gtf"
features = epi.ct.load_features_gtf(test_gtf_file)
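To make the `filter_per_source` and `filter_per_feature_type` options above more concrete, here is a minimal sketch of what filtering a GFF/GTF annotation by source and feature type amounts to. It is not epiScanpy's parser: `filter_annotation` and the toy records are hypothetical, and the only thing assumed is the standard 9-column tab-separated GFF/GTF layout (seqname, source, feature, start, end, score, strand, frame, attributes).

```python
# Illustrative only: a tiny GFF/GTF-style filter, NOT epiScanpy's implementation.
# Assumed column order: seqname, source, feature, start, end, score, strand, frame, attributes

def filter_annotation(lines, sources=None, feature_type=None):
    """Return (chrom, start, end) tuples for records matching the filters."""
    kept = []
    for line in lines:
        # skip blank lines and header/comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        chrom, source, feature, start, end = fields[:5]
        if sources is not None and source not in sources:
            continue
        if feature_type is not None and feature != feature_type:
            continue
        kept.append((chrom, int(start), int(end)))
    return kept

# Made-up records (hypothetical coordinates and attributes)
toy_lines = [
    "##gff-version 3",
    "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tAUGUSTUS\texon\t100\t300\t.\t+\t.\tParent=g1",
    "chr1\tOTHER\tgene\t2000\t2500\t.\t-\t.\tID=g2",
    "chr2\tPfam\tgene\t50\t400\t.\t+\t.\tID=g3",
]

toy_features = filter_annotation(toy_lines,
                                 sources=["AUGUSTUS", "Pfam"],
                                 feature_type="gene")
print(toy_features)
```

Only the two `gene` records whose source is `AUGUSTUS` or `Pfam` survive the filter; the exon and the record from the other source are dropped.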
# calling peaks --> this will create files in the local directory
!macs2 callpeak --nomodel --keep-dup all --extsize 200 --shift -100 -t ./atac_pbmc_5k_nextgem_fragments.tsv.gz
# load the peaks and normalise their size to 500 (250*2)
peaks = epi.ct.load_peaks("NA_peaks.narrowPeak")
epi.ct.norm_peaks(peaks, extension=250)
INFO @ Wed, 24 Mar 2021 11:13:57:
# Command line: callpeak --nomodel --keep-dup all --extsize 200 --shift -100 -t ./atac_pbmc_5k_nextgem_fragments.tsv.gz
# ARGUMENTS LIST:
# name = NA
# format = AUTO
# ChIP-seq file = ['./atac_pbmc_5k_nextgem_fragments.tsv.gz']
# control file = None
# effective genome size = 2.70e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 5.00e-02
# The maximum gap between significant sites is assigned as the read length/tag size.
# The minimum length of peaks is assigned as the predicted fragment length "d".
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 10000 bps
# Broad region calling is off
# Paired-End mode is off
INFO @ Wed, 24 Mar 2021 11:13:57: #1 read tag files...
INFO @ Wed, 24 Mar 2021 11:13:57: #1 read treatment tags...
INFO @ Wed, 24 Mar 2021 11:13:57: Detected format is: BED
INFO @ Wed, 24 Mar 2021 11:13:57: * Input file is gzipped.
INFO @ Wed, 24 Mar 2021 11:13:59: 1000000
[... progress lines 2000000 to 89000000 omitted ...]
INFO @ Wed, 24 Mar 2021 11:16:51: 90000000
INFO @ Wed, 24 Mar 2021 11:16:53: #1 tag size is determined as 150 bps
INFO @ Wed, 24 Mar 2021 11:16:53: #1 tag size = 150.0
INFO @ Wed, 24 Mar 2021 11:16:53: #1 total tags in treatment: 90498340
INFO @ Wed, 24 Mar 2021 11:16:53: #1 finished!
INFO @ Wed, 24 Mar 2021 11:16:53: #2 Build Peak Model...
INFO @ Wed, 24 Mar 2021 11:16:53: #2 Skipped...
INFO @ Wed, 24 Mar 2021 11:16:53: #2 Use 200 as fragment length
INFO @ Wed, 24 Mar 2021 11:16:53: #2 Sequencing ends will be shifted towards 5' by 100 bp(s)
INFO @ Wed, 24 Mar 2021 11:16:53: #3 Call peaks...
INFO @ Wed, 24 Mar 2021 11:16:53: #3 Pre-compute pvalue-qvalue table...
INFO @ Wed, 24 Mar 2021 11:19:53: #3 Call peaks for each chromosome...
INFO @ Wed, 24 Mar 2021 11:21:22: #4 Write output xls file... NA_peaks.xls
INFO @ Wed, 24 Mar 2021 11:21:22: #4 Write peak in narrowPeak format file... NA_peaks.narrowPeak
INFO @ Wed, 24 Mar 2021 11:21:23: #4 Write summits bed file... NA_summits.bed
INFO @ Wed, 24 Mar 2021 11:21:23: Done!
# quick look at the file
!head NA_peaks.narrowPeak
chr1	237619	237838	NA_peak_1	230	.	6.36068	25.1441	23.0551	44
chr1	565193	565410	NA_peak_2	388	.	8.56744	41.0653	38.8709	132
chr1	569269	569499	NA_peak_3	1794	.	23.2359	182.034	179.409	61
chr1	713651	714402	NA_peak_4	13223	.	25.0938	1326.22	1322.39	403
chr1	752254	752832	NA_peak_5	2531	.	21.7928	255.88	253.125	392
chr1	762357	763179	NA_peak_6	10143	.	25.9794	1017.91	1014.39	515
chr1	779439	780172	NA_peak_7	378	.	8.43763	40.0713	37.8825	586
chr1	793385	793640	NA_peak_8	180	.	5.58182	20.0898	18.0455	81
chr1	804943	805512	NA_peak_9	7684	.	30.203	771.717	768.419	318
chr1	832576	832964	NA_peak_10	172	.	5.45201	19.2808	17.2444	197
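The peaks called by MACS2 have very different widths, which is why they are normalised to 500 bp (250*2) with `epi.ct.norm_peaks`. The sketch below shows one way such a normalisation can work, assuming each peak is recentred on its midpoint and extended by `extension` bp on each side. `normalise_peak` is a hypothetical helper, and the exact behaviour of `epi.ct.norm_peaks` (for example whether it uses the midpoint or the summit) may differ.

```python
# Illustrative only: midpoint-based peak resizing, an ASSUMED model of what
# normalising peaks to a fixed width means. Not epiScanpy's implementation.

def normalise_peak(start, end, extension=250):
    """Recentre a peak on its midpoint and extend by `extension` bp each side."""
    midpoint = (start + end) // 2
    return max(0, midpoint - extension), midpoint + extension

# Two peaks taken from the narrowPeak file (NA_peak_1 and NA_peak_4)
toy_peaks = [(237619, 237838), (713651, 714402)]

resized = [normalise_peak(start, end, extension=250) for start, end in toy_peaks]
for (start, end), (new_start, new_end) in zip(toy_peaks, resized):
    print(f"{end - start:>4} bp -> {new_end - new_start} bp  ({new_start}-{new_end})")
```

Both the 219 bp peak and the 751 bp peak end up 500 bp wide, so every feature contributes a comparable genomic span to the count matrix.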
windows = epi.ct.make_windows(5000, chromosomes='human') # generate 5000 bp windows across the human genome
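As an illustration of a window-based feature space, the following sketch tiles chromosomes with fixed-size windows. It assumes non-overlapping windows with a shorter final window at the chromosome end. `make_fixed_windows` and the chromosome lengths are hypothetical, and `epi.ct.make_windows` may handle these details differently.

```python
# Illustrative only: non-overlapping fixed-size windows over toy chromosomes.
# The chromosome names and lengths below are made up for the example.

def make_fixed_windows(chrom_sizes, window_size):
    """Return {chrom: [(start, end), ...]} tiling each chromosome."""
    windows = {}
    for chrom, length in chrom_sizes.items():
        windows[chrom] = [
            (start, min(start + window_size, length))  # clip the last window
            for start in range(0, length, window_size)
        ]
    return windows

toy_sizes = {"chrA": 12000, "chrB": 5000}
toy_windows = make_fixed_windows(toy_sizes, window_size=5000)
print(toy_windows)
```

Smaller windows give finer resolution but many more features, so the resulting count matrix grows accordingly.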
In this example we use the peak features and only 2 threads. For a faster result, consider using more threads.
adata = epi.ct.bld_mtx_bed(fragment_file="atac_pbmc_5k_nextgem_fragments.tsv.gz",
feature_region=peaks,
thread=2,
save='atac_pbmc_5k_nextgem_fragments_macs2_peaks.h5ad')
Chromosome 1: 684.0747020244598 sec
Chromosome 2: 503.49666714668274 sec
Chromosome 3: 420.7269206047058 sec
Chromosome 4: 267.5270185470581 sec
Chromosome 5: 335.92078971862793 sec
Chromosome 6: 483.4750199317932 sec
Chromosome 7: 381.748735666275 sec
Chromosome 8: 412.0700845718384 sec
Chromosome 9: 281.22125816345215 sec
Chromosome 10: 415.1694209575653 sec
Chromosome 11: 479.1664354801178 sec
Chromosome 12: 608.7532119750977 sec
Chromosome 13: 169.2280580997467 sec
Chromosome 14: 213.66552996635437 sec
Chromosome 15: 276.6346945762634 sec
Chromosome 16: 424.7526617050171 sec
Chromosome 17: 550.9005522727966 sec
Chromosome 18: 147.62225031852722 sec
Chromosome 19: 571.224552154541 sec
Chromosome 20: 233.46129488945007 sec
Chromosome 21: 92.89729404449463 sec
Chromosome 22: 178.4060080051422 sec
Chromosome X: 196.7464315891266 sec
Chromosome Y: 66.34237909317017 sec
All data contains:
AnnData object with n_obs × n_vars = 400798 × 107529
Total time is 8447.611531734467 sec
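Conceptually, building the count matrix means counting, for every barcode, how many fragments overlap each feature. The sketch below shows this on toy data. It is a naive illustration and not how `epi.ct.bld_mtx_bed` is implemented: `count_fragments`, the barcodes and the coordinates are hypothetical. It only relies on the first four columns of a 10x fragment file (chromosome, start, end, barcode).

```python
# Illustrative only: naive fragment-by-feature overlap counting on toy data.
# NOT epiScanpy's implementation, which processes a tabix-indexed file.
from collections import defaultdict

def count_fragments(fragments, features):
    """Count fragments overlapping each feature, per barcode.

    fragments: iterable of (chrom, start, end, barcode)
    features:  list of (chrom, start, end)
    Returns {barcode: [count per feature]}.
    """
    counts = defaultdict(lambda: [0] * len(features))
    for chrom, f_start, f_end, barcode in fragments:
        row = counts[barcode]  # ensures every barcode seen gets a row
        for j, (p_chrom, p_start, p_end) in enumerate(features):
            # half-open interval overlap test
            if chrom == p_chrom and f_start < p_end and f_end > p_start:
                row[j] += 1
    return dict(counts)

toy_features = [("chr1", 100, 600), ("chr1", 1000, 1500)]
toy_fragments = [
    ("chr1", 50, 150, "AAAC"),     # overlaps feature 0
    ("chr1", 550, 1050, "AAAC"),   # overlaps features 0 and 1
    ("chr1", 1200, 1300, "TTTG"),  # overlaps feature 1
    ("chr2", 100, 200, "TTTG"),    # overlaps nothing
]

toy_matrix = count_fragments(toy_fragments, toy_features)
print(toy_matrix)
```

The rows are barcodes and the columns are features, which matches the `n_obs × n_vars` layout of the AnnData object. Every barcode in the fragment file gets a row, which is why the unfiltered matrix contains far more barcodes than real cells.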
adata = ad.read_h5ad("atac_pbmc_5k_nextgem_fragments_macs2_peaks.h5ad")
adata
AnnData object with n_obs × n_vars = 400798 × 107529
### filtering low quality barcodes
epi.pp.filter_cells(adata, min_features=100)
adata
AnnData object with n_obs × n_vars = 5876 × 107529
    obs: 'nb_features'
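The filtering step can be pictured as follows, assuming `min_features` is a threshold on the number of features with at least one count in a barcode (which is consistent with the `nb_features` column added to `obs`). `filter_barcodes` and the toy matrix are hypothetical and only illustrate the idea.

```python
# Illustrative only: keep barcodes with at least `min_features` non-zero features.
# ASSUMPTION: the threshold counts features with a non-zero value, not total counts.

def filter_barcodes(matrix, min_features):
    """matrix: {barcode: [count per feature]}. Returns the filtered dict."""
    kept = {}
    for barcode, row in matrix.items():
        nb_features = sum(1 for value in row if value > 0)
        if nb_features >= min_features:
            kept[barcode] = row
    return kept

toy_counts = {
    "AAAC": [2, 1, 0, 3],  # 3 open features
    "TTTG": [0, 1, 0, 0],  # 1 open feature
    "GGGA": [0, 0, 0, 0],  # empty barcode
}

filtered = filter_barcodes(toy_counts, min_features=2)
print(sorted(filtered))
```

In the real data this step reduces 400798 barcodes to 5876, since most barcodes in the fragment file carry only a handful of fragments.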
# building the gene activity matrix
activity = epi.tl.geneactivity(adata,
gtf_file="gencode.v28.annotation.gtf",
annotation='HAVANA')
activity
activity.write("atac_pbmc_5k_nextgem_fragments_gene_activity.h5ad")
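A gene activity score summarises the accessibility around each gene. The sketch below shows one common way to compute such a score for a single cell: summing the counts of all peaks that overlap the gene body extended upstream of the transcription start site. This is an assumed model, not the documented behaviour of `epi.tl.geneactivity`. `gene_activity`, the 2000 bp `upstream` value, the gene names and all coordinates are hypothetical.

```python
# Illustrative only: an ASSUMED gene activity model (gene body + upstream region).
# NOT epiScanpy's implementation; names, coordinates and the 2000 bp value are made up.

def gene_activity(peak_counts, peaks, genes, upstream=2000):
    """Sum peak counts overlapping each gene body extended upstream.

    peak_counts: list of counts for one cell, aligned with `peaks`
    peaks:       list of (chrom, start, end)
    genes:       {name: (chrom, start, end, strand)}
    """
    scores = {}
    for name, (chrom, start, end, strand) in genes.items():
        # upstream is before the start on '+' and after the end on '-'
        if strand == "+":
            lo, hi = max(0, start - upstream), end
        else:
            lo, hi = start, end + upstream
        scores[name] = sum(
            count
            for (p_chrom, p_start, p_end), count in zip(peaks, peak_counts)
            if p_chrom == chrom and p_start < hi and p_end > lo
        )
    return scores

toy_peak_list = [("chr1", 500, 1000), ("chr1", 5000, 5500), ("chr1", 9000, 9500)]
toy_cell_counts = [3, 1, 2]
toy_genes = {
    "geneA": ("chr1", 2000, 4000, "+"),
    "geneB": ("chr1", 6000, 8000, "-"),
}

toy_scores = gene_activity(toy_cell_counts, toy_peak_list, toy_genes)
print(toy_scores)
```

The first peak lies within 2000 bp upstream of `geneA` and the third within 2000 bp upstream of `geneB` on the minus strand, so each contributes to its gene, while the middle peak falls outside both extended regions.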