Welcome to cgr_view’s documentation!

cgr

A module for creating, saving and drawing k-mer matrices and Chaos Game Representations (CGRs) of nucleotide sequences

Prerequisites

  • Jellyfish

An external program for counting k-mers. It must be accessible on your PATH. You can install it from conda as follows:

conda install -c bioconda jellyfish

Quickstart

  • Input fasta file, get cgr

    • one cgr for each entry in the fasta file

    cgr.from_fasta("my_seqs.fa", outfile = "my_cgrs", k = 7)
    
    • just one cgr with all entries in the fasta file (eg for genomes and contigs)

    cgr.from_fasta("my_genome.fa", outfile = "genome_cgr", k = 7, as_single = True)
    

Workflow:

  1. make kmer count db in Jellyfish from fasta -> generate cgr from db.

  2. optionally merge cgrs into single cgr as separate channels

  3. stack all composed cgrs into an array of cgrs

  4. save as numpy binary (.npy) files

Usage:

  1. Import module

    import cgr
    
  2. Make kmer count db

    cgr.run_jellyfish("test_data/NC_012920.fasta", 11, "11mer.jf")
    cgr.run_jellyfish("test_data/NC_012920.fasta", 10, "10_mer.jf")
    
  3. Load CGRs from kmer count db

    cgr1 = cgr.cgr_matrix("/Users/macleand/Desktop/athal-5-mers.jf")
    cgr2 = cgr.cgr_matrix("test_data/five_mer.jf")
    
  4. Draw a cgr and save to file

    • just one cgr; you can choose the colour (value of ‘h’) and which channel to put the cgr in

    cgr.draw_cgr(cgr1, h = 0.64, v = 1.0, out = "my_cgr.png", resize = 1000, main = "s" )
    
    • two cgrs; the first in the tuple goes in ‘h’, the second goes in ‘s’. You can set ‘v’

    cgr.draw_cgr( (cgr1, cgr1), v = 1.0, out = "two_cgrs.png")
    
    • three cgrs; ‘h’, ‘s’ and ‘v’ are assigned in tuple order

    cgr.draw_cgr( (cgr1, cgr1, cgr1) )
    
  5. Save a single cgr into a text file

    cgr.save_as_csv(cgr1, file = "out.csv")
    
  6. Join n cgrs into one, extending the number of channels …

    merged_cgr = cgr.join_cgr( (cgr1, cgr2, ... ) )
    
  7. Write to file (numpy binary)

    cgr.save_cgr(merged_cgr, outfile = "my_cgr")
    
  8. Input fasta file, get cgr

    • one cgr for each entry in the fasta file

    cgr.from_fasta("my_seqs.fa", outfile = "my_cgrs", k = 7)
    
    • just one cgr with all entries in the fasta file (eg for genomes and contigs)

    cgr.from_fasta("my_genome.fa", outfile = "genome_cgr", k = 7, as_single = True)
    
cgr.blocky_scale(im: numpy.ndarray, nR: int, nC: int) → numpy.ndarray

Upscales an array in preparation for drawing. By default the array is a square sqrt(4 ** k) cells wide and high (that is, 2 ** k), one cell per possible k-mer. For many values of k this is too small to view well on a monitor. This function performs a scale operation that enlarges the image by simply increasing the number of pixels in each square.

Param

im numpy.ndarray – the image to be scaled

Param

nR int – the number of height pixels to be in the final image

Param

nC int – the number of width pixels to be in the final image

Returns

numpy.ndarray – upscaled image
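
As a rough illustration of this kind of block upscaling, the sketch below maps every output pixel back to the source cell it falls in, so each cell becomes a solid rectangle. It uses plain nested lists so that it runs on its own; the name `blocky_scale_sketch` is hypothetical and this is not the module's implementation, which works on numpy arrays.

```python
def blocky_scale_sketch(im, n_rows, n_cols):
    """Nearest-neighbour upscale of a 2-D list to n_rows x n_cols pixels."""
    src_rows = len(im)
    src_cols = len(im[0])
    # Integer division maps each output pixel to the source cell covering it.
    return [
        [im[r * src_rows // n_rows][c * src_cols // n_cols] for c in range(n_cols)]
        for r in range(n_rows)
    ]


small = [[1, 2],
         [3, 4]]
big = blocky_scale_sketch(small, 4, 4)
for row in big:
    print(row)
```

Each source value ends up occupying a 2 x 2 block of the 4 x 4 result, which is why the enlarged image looks "blocky" instead of smoothed.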

cgr.cgr_matrix(jellyfish: str) → scipy.sparse.dok.dok_matrix

Main function. Creates the cgr matrix, a sparse matrix of type scipy.sparse.dok_matrix.

Runs the cgr process on a jellyfish file and returns a scipy.sparse.dok_matrix object of the CGR with dtype int32. Only observed kmers are represented; an absent coordinate means a count of 0 for the kmer at that coordinate.

Param

jellyfish str – jellyfish DB file

Returns

scipy.sparse.dok_matrix – sparse matrix of kmer counts

cgr.draw(rgb: numpy.ndarray) → None

Renders an RGB array on the screen.

Param

rgb numpy.ndarray – RGB channel image

cgr.draw_cgr(cgr_matrices: scipy.sparse.dok.dok_matrix, h: float = 0.8, s: float = 0.5, v: float = 1.0, main: str = 's', show: bool = True, write: bool = True, out: str = 'cgr.png', resize: bool = False) → None

Draws cgrs to a file. Allows the user to choose which of up to 3 provided cgr matrices goes into which of the H, S or V image channels. Typically, for a single cgr, set h to choose the image colour and place the cgr in the s channel (main = 's') so that the colour varies with the counts in the cgr. Set v to 1.0 for maximum brightness.

Param

cgr_matrices scipy.sparse.dok_matrix or tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn. Tuple provides order for HSV channels of image.

Param

h float – (0..1) value for h channel if not used for cgr data

Param

s float – (0..1) value for s channel if not used for cgr data

Param

v float – (0..1) value for v channel if not used for cgr data

Param

main str – the channel to place the cgr matrix in if a single cgr matrix is passed

Param

show bool – render CGR picture to screen

Param

write bool – write CGR picture to file

Param

out str – filename to write to

Param

resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height

Returns

None
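
To make the single-cgr case concrete, here is a minimal sketch of the HSV idea described above: a fixed hue, the scaled counts as saturation, and a fixed value, converted to RGB with the standard library's `colorsys`. The function name and the tiny 2 x 2 input are made up for illustration, and the module's own drawing code may differ.

```python
import colorsys


def single_cgr_to_rgb_sketch(scaled, h=0.8, v=1.0):
    """Turn a 2-D list of counts scaled to 0..1 into rows of (r, g, b) tuples.

    The scaled count is used as the saturation channel, as with main = 's'.
    """
    return [[colorsys.hsv_to_rgb(h, s, v) for s in row] for row in scaled]


# Hypothetical scaled counts: 0.0 means the kmer was never seen.
scaled = [[0.0, 1.0],
          [0.5, 0.0]]
rgb = single_cgr_to_rgb_sketch(scaled, h=0.0, v=1.0)
print(rgb[0][0])  # unseen kmer: no saturation, so white at full brightness
print(rgb[0][1])  # most frequent kmer: fully saturated hue
```

With this arrangement absent kmers render as white and frequent kmers as the strongest version of the chosen colour.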

cgr.draw_single_cgr(cgr_matrix, h=0.8, s=0.5, v=1.0, main='s', show=True, write=True, out='cgr.png', resize=False)

draws a single cgr image, selecting channels and resizing as appropriate

Param

cgr_matrix scipy.sparse.dok_matrix to be drawn.

Param

h float – (0..1) value for h channel if not used for cgr data

Param

s float – (0..1) value for s channel if not used for cgr data

Param

v float – (0..1) value for v channel if not used for cgr data

Param

main str – the channel to place the cgr matrix in

Param

show bool – render CGR picture to screen

Param

write bool – write CGR picture to file

Param

out str – filename to write to

Param

resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height

Returns

None

cgr.draw_three_cgrs(cgr_matrices, show=True, write=True, out='cgr.png', resize=False)

Draws a tuple of 3 cgr matrices as an image

Param

cgr_matrices tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn. Tuple provides order for HSV channels of image

Param

show bool – render CGR picture to screen

Param

write bool – write CGR picture to file

Param

out str – filename to write to

Param

resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height

Returns

None

cgr.draw_two_cgrs(cgr_matrices, v=1.0, show=True, write=True, out='cgr.png', resize=False)

Draws two cgr matrices into a single image. The first matrix of the tuple becomes the h channel and the second becomes the s channel; the v channel is set with the v parameter.

Param

cgr_matrices tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn.

Param

v float – (0..1) value for v channel

Param

show bool – render CGR picture to screen

Param

write bool – write CGR picture to file

Param

out str – filename to write to

Param

resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height

Returns

None

cgr.estimate_genome_size(fasta: str) → int

Guesses genome size from fasta file size, assumes 1 byte ~= 1 base

Param

fasta str – a fasta file

Returns

int – approximate genome size in nucleotides
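
The "1 byte ~= 1 base" assumption can be sketched with nothing more than a file-size lookup. The helper name below is hypothetical and the temporary FASTA file is invented for the example.

```python
import os
import tempfile


def estimate_genome_size_sketch(fasta):
    """Use the file size in bytes as a stand-in for the number of bases."""
    return os.path.getsize(fasta)


# Write a tiny FASTA file in binary mode so the byte count is platform independent.
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as handle:
    handle.write(b">seq1\nACGTACGTAC\n")
    path = handle.name

estimate = estimate_genome_size_sketch(path)
os.remove(path)
print(estimate)
```

The file holds 10 bases but is 17 bytes long, because header lines and newlines are counted too. The estimate is therefore an upper bound that gets relatively tighter as sequences get longer.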

cgr.from_fasta(fasta_file: str, outfile: str = 'my_cgrs', as_single: bool = False, k: int = 7) → None

Factory function to load in a FASTA file and generate a binary .npy of CGRs

Parameters
  • fasta_file – str FASTA file to load

  • outfile – str outfile to save

  • as_single – bool If True, treats all entries as a single sequence and produces one CGR. If False, treats each entry individually and produces many CGRs

  • k – int length of kmer to use

Returns

None

cgr.get_coord(kmer: str) → List[int]

given a kmer gets the coordinates of the box position in the cgr grid, returns as list [x,y] of coordinates

Param

kmer str – a string of nucleotides

Returns

coords [x,y] – the x,y positions of the nucleotides in the cgr
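
One way to picture the kmer-to-cell mapping is as repeated quadrant selection: each base picks one of four quadrants and the next base subdivides it. The sketch below is an assumption-laden illustration only. The corner assignment in `CORNERS`, the choice to treat the first base as the coarsest split, and the function name are all hypothetical and may not match the layout used by cgr.get_coord.

```python
# Assumed corner layout as (x_bit, y_bit); the module's real layout may differ.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def get_coord_sketch(kmer):
    """Return [x, y] for a kmer on a 2**k by 2**k grid under the assumed layout."""
    x = y = 0
    for base in kmer:
        x_bit, y_bit = CORNERS[base]
        # Doubling moves into the chosen quadrant; the bit picks the half.
        x = 2 * x + x_bit
        y = 2 * y + y_bit
    return [x, y]


print(get_coord_sketch("A"))
print(get_coord_sketch("AT"))
print(get_coord_sketch("TTT"))
```

Whatever layout is used, every kmer of length k lands in a distinct cell, because each base contributes exactly one bit to x and one bit to y.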

cgr.get_grid_size(k: int) → int

Returns the grid size (total number of elements) for a cgr of kmers of length k.

Param

k int – the value of k to be used

Returns

int – the total number of elements in the grid
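
As a quick worked example, and assuming the grid holds one cell per possible kmer over the four-letter nucleotide alphabet, the sizes can be computed directly. Both helper names are hypothetical.

```python
def grid_size_sketch(k):
    """Total number of cells: one per possible kmer of length k."""
    return 4 ** k


def grid_side_sketch(k):
    """Width (and height) of the square grid: the square root of 4 ** k."""
    return 2 ** k


for k in (5, 7, 11):
    print(k, grid_size_sketch(k), grid_side_sketch(k))
```

For k = 7, as used in the Quickstart, this gives 16384 cells on a 128 x 128 grid, which is why small values of k produce images that need upscaling before viewing.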

cgr.get_k(jellyfish: str) → int

asks the jellyfish file what value was used for k

Param

jellyfish str – jellyfish DB file

Returns

int – length of k used

cgr.get_kmer_list(jellyfish: str) → Generator[List, str, None]

runs jellyfish dump on a Jellyfish DB. Captures output as a generator stream. Each item returned is a list [kmer: str, count: str]

Param

jellyfish str – a Jellyfish DB file

Returns

Generator – a list of [kmer string, times_kmer_seen]
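
The generator pattern can be sketched without Jellyfish installed by parsing lines of text. The sketch assumes whitespace-separated "kmer count" lines, as produced by Jellyfish's column dump format (`jellyfish dump -c`); the function name and the example lines are made up.

```python
def parse_dump_lines(lines):
    """Lazily yield [kmer, count] pairs, both as strings, from dump output lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        kmer, count = line.split()
        yield [kmer, count]


# Hypothetical dump output; in practice the lines would come from a subprocess pipe.
example_output = ["AAAAC 12\n", "ACGTA 3\n", "\n", "TTTTT 1\n"]
pairs = list(parse_dump_lines(example_output))
print(pairs)
```

Because the counts arrive as strings, a caller needs to convert them with `int()` before storing them in a matrix.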

cgr.is_cgr_matrix(obj) → bool

returns true if obj is a scipy.sparse.dok.dok_matrix object

cgr.join_cgr(cgrs: tuple) → numpy.ndarray

Takes tuple of cgrs of shape (n,n) and returns one stacked array of size (n,n, len(cgrs) )

Param

cgrs tuple – tuple of cgrs to be joined

Returns

numpy.ndarray
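
The shape change from several (n,n) grids to one (n,n,len(cgrs)) array can be illustrated with nested lists. This is only a sketch with a hypothetical function name; the module itself returns a numpy.ndarray.

```python
def join_sketch(cgrs):
    """Stack equally sized 2-D lists so that each cell holds one value per cgr."""
    n_rows = len(cgrs[0])
    n_cols = len(cgrs[0][0])
    return [
        [[matrix[r][c] for matrix in cgrs] for c in range(n_cols)]
        for r in range(n_rows)
    ]


a = [[1, 2],
     [3, 4]]
b = [[5, 6],
     [7, 8]]
joined = join_sketch((a, b))
print(joined[0][1])  # the values of both cgrs at row 0, column 1
```

The last axis acts as the channel axis: `joined[r][c][i]` is the count at cell (r, c) in the i-th input cgr.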

cgr.load_npy(file: str) → numpy.ndarray

Loads a numpy .npy file as an ndarray. Useful for restoring collections of cgrs, but the resulting array is not directly compatible with the drawing methods here.

Param

file str – numpy .npy file to load

Returns

numpy.ndarray

cgr.make_blanks_like(a: scipy.sparse.dok.dok_matrix, h: float = 1.0, s: float = 1.0, v: float = 1.0) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

returns tuple of numpy.ndarrays with default values of h,s and v of shape of a

Param

a scipy.sparse.dok_matrix – a cgr matrix to make blanks like

Param

h float – the values with which to fill the first numpy.ndarray

Param

s float – the values with which to fill the second numpy.ndarray

Param

v float – the values with which to fill the third numpy.ndarray

Returns

Tuple of numpy.ndarray

cgr.many_seq_record_to_many_cgr(seq_record: Bio.SeqIO.FastaIO, k: int) → scipy.sparse.dok.dok_matrix
Parameters
  • seq_record – Bio.SeqIO FASTA record

  • k – int size of k to use

Returns

scipy.sparse.dok_matrix

cgr.many_seq_record_to_one_cgr(fa_file: str, k: int) → scipy.sparse.dok.dok_matrix

Reads many sequence records in a FASTA file into a single CGR matrix, treating all sequence records as if they were one sequence, e.g. a genome sequence split into chromosomes.

Parameters
  • fa_file – str FASTA file name

  • k – int length of k to use

Returns

scipy.sparse.dok_matrix

cgr.resize_rgb_out(rgb: numpy.ndarray, resize: int) → numpy.ndarray

given an rgb image in one pixel per kmer size, increases size so that the resulting image is resize * resize pixels

Param

rgb numpy.ndarray – an RGB image array

Param

resize int – pixel width (and therefore height) of resulting image

Returns

numpy.ndarray – resized image with shape (resize, resize)

cgr.run_jellyfish(fasta: str, k: int, out: str) → int

runs Jellyfish on fasta file using k kmer size, produces Jellyfish db file as side effect.

Param

fasta str – a fasta file

Param

k int – size of kmers to use

Param

out str – file in which to save kmer db

Returns

int – return code of Jellyfish subprocess
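
For orientation, a Jellyfish counting run is usually launched as a subprocess. The sketch below only builds the command line and does not execute it. `-m` (kmer length), `-s` (initial hash size) and `-o` (output file) are standard `jellyfish count` options, but the hash size of "100M" and both function names are assumptions, and the exact command used by cgr.run_jellyfish is not shown in this documentation.

```python
import subprocess


def jellyfish_count_command(fasta, k, out, hash_size="100M"):
    """Build the argument list for a `jellyfish count` call (hash_size is assumed)."""
    return ["jellyfish", "count", "-m", str(k), "-s", str(hash_size), "-o", out, fasta]


def run_command(cmd):
    """Run the command and return its exit code; needs Jellyfish on the PATH."""
    return subprocess.run(cmd).returncode


cmd = jellyfish_count_command("test_data/NC_012920.fasta", 11, "11mer.jf")
print(" ".join(cmd))
```

A return code of 0 from `run_command(cmd)` would indicate that Jellyfish finished successfully and wrote the kmer database to the file named by `out`.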

cgr.save_as_csv(cgr_matrix: scipy.sparse.dok.dok_matrix, file: str = 'cgr_matrix.csv', delimiter: str = ', ', fmt: str = '%d')

Writes simple 1 channel cgr matrix to CSV file.

See also numpy.savetxt

Param

cgr_matrix scipy.sparse.dok_matrix – cgr_matrix to save

Param

file str – filename to write to

Param

delimiter str – column separator character

Param

fmt str – text format string

Returns

None
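
The sketch below shows how a sparse set of counts might be expanded into delimited text, with absent coordinates written as 0. It stands in a plain dictionary keyed by (row, column) for the dok_matrix and builds the text by hand; the function name is hypothetical, and the real function delegates to numpy.savetxt.

```python
def sparse_to_csv_text(counts, n, delimiter=", ", fmt="%d"):
    """Expand {(row, col): count} into n rows of delimited text, filling gaps with 0."""
    rows = []
    for r in range(n):
        cells = (fmt % counts.get((r, c), 0) for c in range(n))
        rows.append(delimiter.join(cells))
    return "\n".join(rows) + "\n"


# Only two of the four cells were observed.
text = sparse_to_csv_text({(0, 1): 5, (1, 0): 2}, n=2)
print(text, end="")
```

The dense text grows with the full grid size regardless of how many kmers were observed, so CSV output is best suited to small values of k.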

cgr.save_cgr(cgr_obj: numpy.ndarray, outfile: str = 'cgr') → None

Saves cgr_obj as a numpy .npy file. cgr_obj is a numpy.ndarray of one or more dimensions. It is saved as an ndarray, not a dok_matrix, so it can be loaded in regular numpy as a collection of cgrs.

Parameters
  • cgr_obj – numpy.ndarray constructed cgr_object to save

  • outfile – str file

Returns

None

cgr.scale_cgr(cgr_matrix: scipy.sparse.dok.dok_matrix) → scipy.sparse.dok.dok_matrix

returns scaled version of cgr_matrix in range 0..1

Param

cgr_matrix scipy.sparse.dok_matrix – matrix to scale

Returns

scaled scipy.sparse.dok_matrix
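
A simple way to bring counts into the range 0..1 is to divide by the largest count. The sketch below does that on a dictionary keyed by (row, column), standing in for the sparse matrix. Whether the module divides by the maximum or applies a different normalisation is not stated here, so treat this as one plausible reading; the function name is hypothetical.

```python
def scale_counts_sketch(counts):
    """Divide every stored count by the largest one, giving values in 0..1."""
    if not counts:
        return {}
    largest = max(counts.values())
    return {coord: value / largest for coord, value in counts.items()}


counts = {(0, 0): 2, (1, 3): 8, (2, 2): 4}
scaled = scale_counts_sketch(counts)
print(scaled)
```

Unobserved cells stay absent, so they remain implicit zeros and the result is as sparse as the input.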

cgr.stack_cgrs(cgr_matrices: Tuple) → numpy.ndarray

Stacks a tuple of N numpy.ndarrays, each of shape (w,h), and returns a single ndarray of shape (w,h,N).

Parameters

cgr_matrices – tuple of cgr_matrices

Returns

numpy.ndarray

cgr.write_out(rgb: numpy.ndarray, out: str, resize: int) → None

writes RGB array as image

Parameters
  • rgb – numpy.ndarray – RGB channel image

  • out – str file to write to

  • resize – bool or int. If False will not resize, if int will resize image up to that size

Returns

None