wtbox Documentation

 

Modules and Functions

 

Module checks_and_settings

This module contains functions for predefined table settings that check whether a table fulfills the minimum requirements for an operation. For example, if a workflow (function) requires an integration step, you can check whether the table object fulfills the minimal requirements: it has the columns rtmin, rtmax, mzmin, mzmax and peakmap (+ postfix), the rt and mz columns contain float values, and the peakmap column contains objects of type PeakMap. This is ongoing work and will be extended.

·         is_ff_metabo_table(table): verifies whether the column names and column types of table correspond to a featureFinderMetabo output table

·         is_integratable_table(t): checks whether the emzed function utils.integrate can be applied

·         is_integrated_table(t): checks whether the emzed function utils.integrate was applied

·         is_ms_peaks_table(t): checks whether the table is suited for targeted peak extraction

·          is_isotopologue_distribution_table(t, fid='feature_id', isotope_id='mi',   fraction_col='isotope_fraction')

·         colname_type_checker(t, colnames, name2types=None): checks whether the columns [colnames] of table [t] have the required types. Optional: [name2types -> dictionary {name : type}]. If None, the colnames are checked against the standard dictionary colname_type_settings (e.g. mz, rt).
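
The kind of check described above can be sketched without emzed. The following is a minimal illustration only: it assumes a table is represented as a plain dictionary mapping column names to lists of values, and the helper name check_columns is hypothetical and not part of wtbox.

```python
def check_columns(columns, name2types):
    """Return True if every required column exists and holds only allowed types.

    columns    -- dict {column name: list of values} (stand-in for a table)
    name2types -- dict {column name: type or tuple of types}
    None entries are treated as missing values and are skipped.
    """
    for name, types in name2types.items():
        if name not in columns:
            return False
        for value in columns[name]:
            if value is not None and not isinstance(value, types):
                return False
    return True


# Hypothetical minimal requirements for an integration step.
required = {"rtmin": float, "rtmax": float, "mzmin": float, "mzmax": float}
table = {
    "rtmin": [10.0, 55.2],
    "rtmax": [20.0, 70.1],
    "mzmin": [180.05, 260.01],
    "mzmax": [180.07, 260.03],
}
ok = check_columns(table, required)                  # all columns present, all floats
missing = check_columns({"rtmin": [10.0]}, required)  # three columns are absent
```

Returning a plain boolean keeps such a check usable as a guard at the start of a workflow function.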

 

Module default_guis

·         adapt_ff_metabo_config(config=None, advanced=[0]): provides a GUI that lets the user adapt ffMetabo settings and returns the modified kwargs. Optional arguments: 1) config: dictionary with ffMetabo parameters as keys; if config is None, a default setup is provided. 2) advanced: bool; if True, all parameters can be edited, otherwise only a subset that is sufficient for most users.

 

Module default_plottings

                The module covers two frequently required plots of LC-MS data: heatmaps and fitting curve plots.

·         plot_from_table_columns(t, id_cols, plot_fun, plot_fun_args2colnames, save_dir,      plot_colname='plot', kwargs=None): description of function arguments:

o      t: table

o      id_cols: tuple of columns that define the sub-tables holding the plot values. Example: to plot isotopic patterns of different compounds and samples stored in the same table, you can assign the plots correctly via the column name tuple (sample_name, compound_name).

o      plot_fun: function to create the plot.

o      plot_fun_args2colnames: dictionary with plot function arguments as keys and tuples (colname, uniqueValue) as values, e.g. {plotarg : (columnname, False)}. If uniqueValue is True and the column contains only one distinct value, that single value is extracted.

o      save_dir: directory where the figure files are saved

o      kwargs: plot-function-dependent plotting parameters, e.g. dpi, plot type_, colors, styles, ....

 

·         plot_heatmap(data, xlabels, ylabels, label_right=False, colorbar= True, pad_colorbar=0.1, binsize=None, title=None, save_dir=None, cmap="Greens", none_color="#777777", fig_type='png', dpi=None):

Plots a heatmap including axis labels, colorbar and title. Parameters:

o      data : 2d numpy array

o      xlabels : list of strings, len is number of columns in data

o      ylabels : list of strings, len is number of rows in data

o      label_right : boolean, indicates if labels at right of heatmap should be plotted

o      colorbar : show colorbar, default = True

o      pad_colorbar: float in range 0 .. 1, distance of colorbar to heatmap

o      binsize : None or float in range 0..1; if this value is not None, the heatmap and the colorbar are discretised according to this value.

o      title : None or string

o      cmap : string with name of colormap, see help(pylab.colormaps) for alternatives

o      none_color : RGB string for plotting missing values.

 

·         plot_heatmaps_from_isotope_table(t, id_col='feature_id', columns_col='time', rows_col='mi', value_col='mi_frac_corr', add_missing_values=0.0, sort_col='mi', cmap='binary', save_dir=r'P:\tmp'):

       Plots heatmaps from table columns using the function plot_heatmap and returns a dictionary with the id_col values as keys and the corresponding plot_path as values. To add the plots to the table see -> wtbox.table_operations.add_plots_to_table

o      id_col: defines the subtable for each heatmap

o      columns_col: x-axis of heatmap plot

o      rows_col: y-axis of heatmap plot

o      value_col: value assigned to (x,y)

o      add_missing_values: replaces None values by user defined value

o      cmap : string with name of colormap, see help(pylab.colormaps) for alternatives

o      save_dir: directory where the heatmap plots are saved
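
How columns_col, rows_col, value_col and add_missing_values interact can be illustrated with a small pivot step. This is a sketch under assumptions: the rows of one sub-table are represented as a list of dictionaries, and the helper name build_heatmap_grid is hypothetical. The resulting nested list could be converted into the 2d array expected by plot_heatmap.

```python
def build_heatmap_grid(rows, columns_col, rows_col, value_col, fill_value=0.0):
    """Pivot long-format rows into a 2-D grid for a heatmap.

    rows -- list of dicts, one per table row of a single sub-table
    Returns (xlabels, ylabels, grid) where grid[i][j] belongs to
    ylabels[i] (heatmap row) and xlabels[j] (heatmap column).
    """
    xlabels = sorted(set(r[columns_col] for r in rows))
    ylabels = sorted(set(r[rows_col] for r in rows))
    lookup = {(r[rows_col], r[columns_col]): r[value_col] for r in rows}
    grid = []
    for y in ylabels:
        line = []
        for x in xlabels:
            value = lookup.get((y, x))
            # Missing or None entries are replaced by the user-defined value.
            line.append(fill_value if value is None else value)
        grid.append(line)
    return xlabels, ylabels, grid


rows = [
    {"time": 0, "mi": 0, "mi_frac_corr": 1.0},
    {"time": 10, "mi": 0, "mi_frac_corr": 0.6},
    {"time": 10, "mi": 1, "mi_frac_corr": 0.4},
]
xlabels, ylabels, grid = build_heatmap_grid(rows, "time", "mi", "mi_frac_corr")
```

In this example the combination (mi=1, time=0) has no measurement, so its cell is filled with 0.0.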

 

 

 

·         plot_fitting_curve(x, y, x_fit, y_fit, x_tick_labels='', y_tick_labels='', x_ticks=None, unit='', y_ticks=None, ylabel=None, title=None, outlier_col=None, outlier_values = [(None,None)]):

 Default scatter plot of measured values x, y combined with a line plot of fitted values x_fit, y_fit. Optional attributes:

o      x_tick_labels: list of string (in general numbers), labeling x-ticks

o      y_tick_labels: list of string (in general numbers), labeling y-ticks

o      x_ticks: list of float values, positioning x ticks

o      y_ticks: list of float values, positioning y ticks

o      ylabel: label of the y axis

o      title: plot title

o      outlier_values: list of (x, y) tuples; allows depicting outlier values separately in a different color.

 

·         plot_fitting_curves_from_isotope_table(t, id_cols=('feature_id',), time_col='time', value_col='no_C13', fun_col='no_C13_fitting_fun', params_col='no_C13_fit_params',  add_missing_tp_as_zero=True, outlier_col=None, num_points=50, save_dir=r'P:\tmp', fig_type='png', dpi=None):

Plots the fitting curves of (isotope) table t. The function returns a dictionary with the id_cols values as keys and the corresponding plot_path as values. To add the plots to the table see -> wtbox.table_operations.add_plots_to_table. Function attributes:

o      id_cols: tuple defining the sub-table for each fitting plot. It allows combining different columns to define a subgroup. Example: to plot all isotopologues of a feature, you can combine the two columns as identifier via id_cols=('feature_id', 'mi')

o      time_col: Defines column with x- values (time)

o      value_col: defines y values for plot

o      fun_col: Fitting function that was applied

o      params_col: parameters determined with fitting function to calculate y_fit

o      add_missing_tp_as_zero: None values will be shown as zero

o      outlier_col: a column can be provided that contains (x,y) pairs which have been excluded from fitting process.

o      num_points: Number of points calculated with fitting_function to draw line plot

 

Module feature_analysis

                The module covers the most common calculations for features, including the most common normalization methods.

·         calculate_mid(t, id_col='feature_id', quantity_col='area', result_col='mi_fraction')

Determines the mass isotopologue distribution of each feature identified by id_col and writes the result to column result_col. By default, quantity_col is area.
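
The underlying calculation can be sketched as follows, assuming that the mass isotopologue fraction of a peak is its quantity divided by the summed quantity of all peaks sharing the same feature id. The rows are represented as a list of dictionaries and the helper name mass_isotopologue_fractions is hypothetical; this is not the wtbox implementation.

```python
from collections import defaultdict


def mass_isotopologue_fractions(rows, id_col="feature_id", quantity_col="area"):
    """Return one fraction per row: quantity / total quantity of its feature."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[id_col]] += row[quantity_col]
    # A feature with a total quantity of zero would raise ZeroDivisionError here.
    return [row[quantity_col] / totals[row[id_col]] for row in rows]


rows = [
    {"feature_id": 1, "mi": 0, "area": 600.0},
    {"feature_id": 1, "mi": 1, "area": 300.0},
    {"feature_id": 1, "mi": 2, "area": 100.0},
    {"feature_id": 2, "mi": 0, "area": 50.0},
    {"feature_id": 2, "mi": 1, "area": 150.0},
]
fractions = mass_isotopologue_fractions(rows)
```

By construction, the fractions of each feature sum to one.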

 

·         normalize_peaks_by_idms(t, id_col='feature_id', mi_col='mi', quantity_col='area',  result_col='idms_ratio')

Calculates the ratio of unlabeled (minimally labeled) and fully labeled mass isotopologues of each feature. Attributes:

o            id_col: groups the mass isotopologues of a compound adduct ion

o            mi_col: nominal isotopologue mass shift

o            quantity_col: peak area (intensity)

o           result_col: name of the result column
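
A minimal sketch of such a ratio is given below. It assumes that, within each feature, the peak with the smallest mass shift is the unlabeled one and the peak with the largest mass shift is the fully labeled one; the list-of-dictionaries representation and the helper name idms_ratios are hypothetical.

```python
def idms_ratios(rows, id_col="feature_id", mi_col="mi", quantity_col="area"):
    """Return {feature id: quantity of lightest / quantity of heaviest isotopologue}."""
    groups = {}
    for row in rows:
        groups.setdefault(row[id_col], []).append(row)
    ratios = {}
    for feature_id, members in groups.items():
        lightest = min(members, key=lambda r: r[mi_col])
        heaviest = max(members, key=lambda r: r[mi_col])
        ratios[feature_id] = lightest[quantity_col] / float(heaviest[quantity_col])
    return ratios


rows = [
    {"feature_id": "glc", "mi": 0, "area": 2000.0},
    {"feature_id": "glc", "mi": 6, "area": 1000.0},
    {"feature_id": "pyr", "mi": 0, "area": 300.0},
    {"feature_id": "pyr", "mi": 3, "area": 600.0},
]
ratios = idms_ratios(rows)
```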

 

·         normalize_peaks_with_internal_standards(t, norm_id_col='norm_with', is_id_col='internal_standard', quantity_col='area', result_col='norm_area')

Normalizes peak area (intensity) to the area (intensity) of a user-defined internal standard. Attributes:

o        norm_id_col: contains id of internal standard used for normalization

o        is_id_col: defines peak as internal standard and assigns id to the peak.

o        quantity_col: peak area (intensity)

o        result_col: name of the result column
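
The interplay of norm_id_col and is_id_col can be illustrated with a short sketch. It assumes that a row is an internal standard if its is_id_col entry is not None, and that each other peak is divided by the area of the standard named in its norm_id_col entry. The list-of-dictionaries representation and the helper name are hypothetical.

```python
def normalize_with_internal_standards(rows, norm_id_col="norm_with",
                                      is_id_col="internal_standard",
                                      quantity_col="area"):
    """Return one normalized quantity per row (None if no standard is assigned)."""
    # Collect the quantity of every peak that is declared an internal standard.
    standard_areas = {
        row[is_id_col]: row[quantity_col]
        for row in rows
        if row[is_id_col] is not None
    }
    result = []
    for row in rows:
        reference = standard_areas.get(row[norm_id_col])
        if reference is None:
            result.append(None)
        else:
            result.append(row[quantity_col] / float(reference))
    return result


rows = [
    {"norm_with": "IS1", "internal_standard": None, "area": 500.0},
    {"norm_with": "IS1", "internal_standard": "IS1", "area": 1000.0},
    {"norm_with": None, "internal_standard": None, "area": 200.0},
]
normalized = normalize_with_internal_standards(rows)
```

Peaks without an assigned standard are left as None rather than being divided by an arbitrary value.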

 

·         normalize_peaks_with_TIC(t, quantity_col='area', result_col='norm_area')

Normalizes the area of each peak to the total area of the table. Attributes:

o          quantity_col: peak area (intensity)

o          result_col: name of the result column

 

·         calculate_feature_labeled_fraction(t, id_col= 'feature_id', mi_col='mi', mi_fraction_col='mi_fraction_corr', result_col='labeled_fraction'):

Calculates the labeled isotope fraction from the mass isotopologue distribution. Attributes:

o            id_col: groups the mass isotopologues of a compound adduct ion

o            mi_col: nominal isotopologue mass shift

o            mi_fraction_col: column with the calculated mass isotopologue distribution

o            result_col: name of the result_col (optional)

 

 

·         filter_top_n_by_fragment_ion(top_n_table, ions):

Filters a top_n table (see -> feature_extraction.top_n_to_table) for specific fragment ions. Attributes:

o   ions: list of fragment ion m/z values

o   mztol: m/z tolerance in units

o   min_int: minimal peak intensity

 

·         correct_for_natural_C13(t, id_='feature_id',  isotope_id='mi', fraction_col='isotope_fraction',  mf_col='mf'):

*In-place function: subtracts the carbon labeling originating from natural C13 abundance from the calculated mass isotopologue distribution and writes the result to column fraction_col with postfix '_corr'. The function is best suited for high mass resolution data since it does not take into account heavy stable isotopes of other elements. Attributes:

o            id_col: groups the mass isotopologues of a compound adduct ion

o            isotope_id: nominal isotopologue mass shift

o            fraction_col: column with the calculated mass isotopologue distribution

o             mf_col: column with corresponding molecular formula

 

Module feature_extraction

·         targeted_peaks_ms(peakmap, peaks_table, fwhm=None, max_dev_percent=20, min_area=100, n_cpus=None):

Function targeted_peaks_ms extracts the MS level 1 LC-MS peaks of peakmap that are defined in table peaks_table, using the function -> wtbox.utils.enhanced_integrate. The optional parameters relate to the latter function. peaks_table requires the columns 'mzmin', 'mzmax', 'rtmin', 'rtmax'. You can also provide an additional column 'fwhm' (see enhanced_integrate).

 

·         targeted_peaks_ms2(peakmap, peaks_table, fwhm=None, max_dev_percent=20, min_area=100, n_cpus=None, step=1):

Function targeted_peaks_ms2 extracts the MS level 2 LC-MS peaks of peakmap that are defined in table peaks_table, using the function -> wtbox.utils.enhanced_integrate. The optional parameters relate to the latter function. peaks_table requires the columns 'mzmin', 'mzmax', 'rtmin', 'rtmax'.

 

·         metaboff_ms2(peakmap, ff_metabo_config=None): untargeted extraction of all MS level 2 peaks by applying feature_finder_metabo. If ff_metabo_config is not provided, a dialog window will open.

 

·         top_n_to_table(pm_, rttol=5, mztol=0.003):

Function extracts all precursor ion peaks on MS level 1 with corresponding MS level 2 spectra. Attributes:

o      pm_: PeakMap with mslevels [1, 2]

o      rttol: retention time tolerance in seconds

o      mztol: m/z tolerance in units

 

·         adapt_rt_windows(peaks_table, pm, split_by='feature_id', mslevel=None): allows manually adapting retention time windows to peaks by integration for mslevels 1 and 2. Required columns of peaks_table: mzmin, mzmax, rtmin, rtmax, and the column defined by split_by. mslevel is ignored if peaks_table provides a column 'mslevel'. By default, mslevel is 1. If mslevel == 2, the column precursor_ion is required. If more than one peak per group is integrated, the one with the largest area is selected to modify the retention time.

 

Module fitting

1)      provided fitting functions: pt1, decay, logistic, double_logistic_model, double_pt1_model, weibull, linear

2)      functions

a.       main_curve_fitting(x, y, fun, params=None, sigma=None, max_nrmse=1e-2, max_iterations=10,  remove_outlier=True): determines the fitting parameters for float iterables x and y, with y=fun(x). Optional arguments:

    - params: iterable of initial values for the fitting function fun. If None, the parameters are provided by a generator function if the fitting function is `pt1`, `logistic`, `dbl_logistic_model` or `double_pt1_model`; otherwise an AssertionError is raised.

    - max_nrmse: the fitting routine stops as soon as the nrmse of the fit < max_nrmse

    - max_iterations: maximum number of fitting operations, global abort criterion. If it is reached before the max_nrmse criterion is fulfilled, the best fitting result is returned.
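
The two abort criteria can be sketched independently of any particular fitting routine. The sketch below rests on two assumptions that are not taken from wtbox: the nrmse is the root mean square error divided by the range of the measured values, and a single fitting attempt is represented by a hypothetical callable fit_once that maps parameters to (new parameters, error).

```python
import math


def nrmse(y, y_fit):
    """Root mean square error normalized by the range of the measured values."""
    squared = [(a - b) ** 2 for a, b in zip(y, y_fit)]
    rmse = math.sqrt(sum(squared) / len(squared))
    return rmse / (max(y) - min(y))


def fit_until_accepted(fit_once, start_params, max_nrmse=1e-2, max_iterations=10):
    """Repeat fit_once until the error criterion is met or iterations run out.

    Returns the best (params, error) pair seen so far.
    """
    best = None
    params = start_params
    for _ in range(max_iterations):
        params, error = fit_once(params)
        if best is None or error < best[1]:
            best = (params, error)
        if error < max_nrmse:
            break
    return best


# Toy data y = 2 * x; the toy "fit" moves the slope halfway to 2 on each call.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]


def toy_fit_once(slope):
    new_slope = slope + 0.5 * (2.0 - slope)
    return new_slope, nrmse(y, [new_slope * xi for xi in x])


best_slope, best_error = fit_until_accepted(toy_fit_once, 0.0)
```

Keeping track of the best result means that a useful answer is returned even when max_iterations is reached first.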

b.      curve_fitting_from_table(t, funs, fun_params=None, id_cols=('feature_id',), time_col='time',  value_col='mi_fraction', max_nrmse=0.05, max_iterations=10,  missing_tp_as_0= False, remove_outlier=False):

*In-place function that applies the fitting functions funs to the table and selects the function with the best fitting result based on nrmse. The columns 'fit_params' (calculated parameters of the fitting function), 'fit_stds' (standard deviation of the fitting function), 'fit_nrmse' (normalized root mean square error), and 'fit_fun' (fitting function) are added. If remove_outlier == True, a column outlier containing the removed outlier (x, y) pairs is added. Attributes:

    - funs: list of functions for fitting

    - fun_params: initial fitting function parameters. If None, the parameters are provided by generator functions if the fitting function is defined in the fitting module (e.g. `pt1`, `logistic`, `dbl_logistic_model`, ...); otherwise an AssertionError is raised.

    - id_cols: tuple defining the sub-table for fitting. It allows combining different columns to define a subgroup. Example: to fit all isotopologues of a feature, you can combine the two columns as identifier via id_cols=('feature_id', 'mi')

    - time_col: defines the column with x values (time)

    - value_col: defines the y values for the fit

    - max_nrmse: maximal allowed nrmse value to accept a fit

    - max_iterations: number of allowed iterations for nrmse minimization

    - missing_tp_as_0: replaces None values by zero

    - remove_outlier: removes an (x, y) pair from the fit if the deviation between the fitted and the measured value exceeds 3 times the standard deviation
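
The outlier criterion can be sketched as follows. The sketch assumes that the standard deviation is that of the residuals (measured minus fitted values) and that deviations are measured relative to the mean residual; the helper name split_outliers is hypothetical.

```python
import math


def split_outliers(x, y, y_fit, n_std=3.0):
    """Separate (x, y) pairs whose residual deviates by more than n_std * std."""
    residuals = [a - b for a, b in zip(y, y_fit)]
    mean = sum(residuals) / len(residuals)
    std = math.sqrt(sum((r - mean) ** 2 for r in residuals) / len(residuals))
    kept, outliers = [], []
    for xi, yi, ri in zip(x, y, residuals):
        if std > 0 and abs(ri - mean) > n_std * std:
            outliers.append((xi, yi))
        else:
            kept.append((xi, yi))
    return kept, outliers


# Twenty points on the line y = 2 * x, one of them shifted by +10.
x = list(range(20))
y_fit = [2.0 * xi for xi in x]
y = list(y_fit)
y[7] += 10.0
kept, outliers = split_outliers(x, y, y_fit)
```

Note that with very few data points a single deviating value can never exceed three standard deviations, so such a criterion only becomes effective for longer series.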

Module in_out

·         save_dict(dic, path=None, overwrite=False, startAt=None): saves a Python dictionary as a json file. Note that all keys are converted into strings by the Python json module. If no path is provided, the argument startAt allows assigning an initial directory for the path dialog.

·         load_dict(path=None, startAt=None): loads a dictionary saved in .json format if path exists; otherwise a default dialog is opened, and the argument startAt allows assigning an initial directory for the path dialog.

·         load_tables(pathes=[], startAt=None, exclude_blanks=False): loads multiple emzed tables of type table, CSV and json from a list of paths (for details about tables in json see table_operations.table_to_dict). If pathes is empty, a dialog opens and the argument startAt allows assigning an initial directory for the path dialog. If exclude_blanks is True, only files not labeled with 'blank' in the file name are loaded. For details see -> filter_blanks function.

·         save_list_of_tables(tables, path=None, force_overwrite=True, startAt=None): merges a list of tables by applying the emzed stackTables function; the merged table is saved as '.tables'.

·         load_list_of_tables(path=None, startAt=None): loads a merged list of tables of type '.tables' that was saved with the function -> save_list_of_tables and splits it into the original list of tables.

·         load_peakmaps(pathes=[], startAt=None, exclude_blanks=False): loads multiple peakmaps of type mzML and mzXML from a list of paths. If pathes is empty, a dialog opens. If exclude_blanks is True, only files not labeled with 'blank' in the file name are loaded. For details see -> filter_blanks function.

·         enhanced_save_table(t, path=None, force_overwrite=True, startAt=None): automatically adds a time label to the filename if force_overwrite is set to `False` and path exists. Allowed filename characters are restricted, and unaccepted strings open a dialog that lets the user correct the filename. The filename type is automatically set to `table`. If no path is provided, the argument startAt allows assigning an initial directory for the path GUI dialog.

·         load_table_item(path=None): determines the data type from the file ending. Can load emzed tables saved as json, table, and csv; an assertion error occurs if the file type does not fit. The output is a Table. If path is not specified, a dialog box opens.

·         save_item_as_pickle(value, path=None, overwrite=True, startAt=None): saves any Python object as pickle. It automatically adds a time label to the filename if overwrite is set to `False` and the path exists. Allowed filename characters are restricted, and unaccepted strings open a dialog that lets the user correct the filename. The filename type is automatically set to `pickled`. If no path is provided, the argument startAt allows assigning an initial directory when starting the GUI dialog.

·         load_pickled_item(path=None): loads pickled Python objects. If no path is provided, the argument startAt allows assigning an initial directory when starting the GUI dialog.

·         filter_blanks(pathes): filters a list of paths for file names that contain the word blank separated by white space, - or _; if blank is placed at the end of the file name, it must be followed by `.`. Examples:   '…fgdfg_blank_gfdh…', 'gfdgdfg -BlaNK.mzML', '121 BLANK_dfsfsdf.mzML'
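
The naming rule described for filter_blanks can be approximated with a regular expression. The sketch below is an illustration, not the wtbox implementation: the pattern, the helper names is_blank and exclude_blanks, and the decision to also accept the word blank at the very start of a file name are assumptions.

```python
import os
import re

# "blank" (any case), preceded by the start of the name, white space, "-" or "_",
# and followed by white space, "-", "_" or ".".
BLANK_PATTERN = re.compile(r"(?:^|[\s_-])blank[\s_.-]", re.IGNORECASE)


def is_blank(path):
    """Return True if the file name of path is labeled as a blank."""
    return BLANK_PATTERN.search(os.path.basename(path)) is not None


def exclude_blanks(paths):
    """Return only those paths whose file name is not labeled as a blank."""
    return [p for p in paths if not is_blank(p)]


paths = [
    "sample_01.mzML",
    "gfdgdfg -BlaNK.mzML",
    "121 BLANK_dfsfsdf.mzML",
    "blanket_sample.mzML",
]
samples = exclude_blanks(paths)
```

Requiring a separator on both sides prevents names such as blanket_sample.mzML from being treated as blanks.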

 

Module table_operations

·         split_table_by_columns(t, colnames, remove_split_id=True): splits a table by a split id composed of several columns defined in the iterable colnames. This avoids nested application of splitBy. Example code:

import emzed

t = emzed.utils.toTable('id1', (0, 0, 0, 1, 1, 1), type_=int)

t.addColumn('id2', list(range(3)) * 2, type_=int)

t.addColumn('a', range(6), type_=int)

results in:

            id1      id2      a      

            int      int      int    

            ------   ------   ------ 

            0        0        0      

            0        1        1      

            0        2        2      

            1        0        3      

            1        1        4      

            1        2        5      

 

       split_table_by_columns(t, ['id1', 'id2']) results in 6 tables:

          

           id1      id2      a      

           int      int      int    

           ------   ------   ------ 

           0        0        0      

          

           id1      id2      a      

           int      int      int    

           ------   ------   ------ 

           0        1        1      

          

           id1      id2      a      

           int      int      int    

           ------   ------   ------ 

           0        2        2      

 

           id1      id2      a      

           int      int      int    

           ------   ------   ------ 

           1        0        3      

 

           id1      id2      a      

           int      int      int    

           ------   ------   ------ 

           1        1        4      

         

           id1      id2      a      

           int      int      int    

           ------   ------   ------ 

           1        2        5      

      

       and the code is equivalent to:

           subsets=[]

           for subset1 in t.splitBy('id1'):

               subsets.extend(subset1.splitBy('id2'))
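
The idea of replacing nested splits by a single grouping step can be sketched in plain Python. This sketch does not use emzed: rows are represented as dictionaries, and the helper name split_rows_by_columns is hypothetical. Grouping by the tuple of all split column values yields the same subsets as the nested loop above for any number of columns.

```python
from collections import OrderedDict


def split_rows_by_columns(rows, colnames):
    """Group rows by the tuple of their values in colnames, keeping first-seen order."""
    groups = OrderedDict()
    for row in rows:
        key = tuple(row[name] for name in colnames)
        groups.setdefault(key, []).append(row)
    return list(groups.values())


# Same content as the example table: id1 = 0,0,0,1,1,1; id2 = 0,1,2,0,1,2; a = 0..5
rows = [{"id1": i // 3, "id2": i % 3, "a": i} for i in range(6)]
subsets = split_rows_by_columns(rows, ["id1", "id2"])
by_id1 = split_rows_by_columns(rows, ["id1"])
```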

 

·         table_to_dict(t, as_json=False): converts a table into a dictionary {colname: values}. An additional key layout contains information about column types and formats. If as_json is True, PeakMaps, Blobs and sub-tables are lost during conversion!

·         dict_to_table(dic): inverse function of table_to_dict. Converts a dictionary into a table with column names dict.keys(); the column values must be lists or tuples of the same length, or unique values. Additionally, a key layout is required containing a list of tuples (column name, column type_, column format).

·         transfer_column_between_tables(t_source, t_sink, data_col, ref_col, insert_before=None): adds the values of column data_col from Table `t_source` to Table `t_sink` via the common reference column `ref_col`. If column `data_col` already exists in t_sink, an assertion error occurs.

·         peakmap_as_table(pm): converts peakmap pm into a table with a single row and with the columns peakmap, unique_id, full_source, and source.

·         reduce_peakmap_size_in_table(peaks_table, ref_col='id', tol=(-60.0, +60.0, -10.0, +10.0), mslevel=1): in-place function that reduces the data size, e.g. for targeted data analysis. Peakmaps are cut to fit the targeted extraction window defined by argument ref_col. Argument tol is a tuple with the values (lower rttol, upper rttol, lower mztol, upper mztol) and defines the tolerance for peakmap cutting.

·         single_spec_to_table(prec, spec): converts an ms2 spectrum into a table. Suitable for visualizing MS spectra from top n or similar approaches.

·         add_plots_to_table(t, id2plots, id_cols, plot_colname='plot'): adds plots to table t. Functions such as `plot_heatmaps_from_isotope_table` and `plot_fitting_curves_from_isotope_table` return a dictionary id2plots with the id_cols values as keys and the plot paths as values.

·         find_common_postfix(t, colnames=None): finds any common postfix of all colnames of table t. Argument `colnames` is a list of strings; if it is set and all listed colnames are columns of t, the common postfix of the selected colnames is returned, otherwise an empty string is returned.

·         update_rt_by_integration(t): returns the table with the rt value updated to that of the spectrum with maximal intensity in the selected m/z-rt window. Required columns: rtmin, rtmax, mzmin, mzmax, peakmap
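
Finding a common postfix of column names can be sketched with the standard library. The helper name common_postfix is hypothetical, and the sketch simply returns the longest common suffix of all names, which is an assumption about what counts as a postfix: emzed postfixes usually look like '__0', whereas a plain suffix comparison would also report 'min' for the names rtmin and mzmin.

```python
import os


def common_postfix(names):
    """Return the longest string that all names end with ('' if there is none)."""
    reversed_names = [name[::-1] for name in names]
    # os.path.commonprefix compares character by character, not path-wise.
    return os.path.commonprefix(reversed_names)[::-1]


postfix = common_postfix(["mz__0", "rt__0", "area__0"])
no_postfix = common_postfix(["mz", "rt", "area"])
```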

 

Module mf_utils

·         count_element(el, mf): counts number of element el in molecular formula mf.

·         restrict_db(db, C = (2,None), H=(0,None), N= (0,None), O=(0,None),    P=(0,None), S= (0,None), el2range=None, mass_range=None):

Filters the database db for compounds composed of C, H, N, O, P, S. Further elements can be included via el2range, where the allowed elements are keys and the tuple values define the lower and upper limit for each element. If the upper limit is None, no upper limit applies. If a formula contains elements not present in the dictionary, or the count of an element is outside the predefined range, the formula is removed. Example: if you set el2range={'Fe': (1,2)} and keep the default values for the other elements, only formulas containing 1-2 Fe and at least 2 C are selected. Optionally, a global mass range for the database can be defined with mass_range.
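
The element counting and the range restriction can be sketched as follows. The sketch assumes simple molecular formulas without parentheses, charges or isotope labels; the helper names count_elements and formula_in_range are hypothetical and do not reproduce the wtbox code.

```python
import re


def count_elements(mf):
    """Return {element: count} for a simple formula such as 'C6H12O6'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", mf):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def formula_in_range(mf, el2range):
    """Check a formula against {element: (lower limit, upper limit or None)}."""
    counts = count_elements(mf)
    # Formulas containing an element that is not listed are rejected.
    if any(element not in el2range for element in counts):
        return False
    for element, (low, high) in el2range.items():
        n = counts.get(element, 0)
        if n < low or (high is not None and n > high):
            return False
    return True


default_ranges = {"C": (2, None), "H": (0, None), "N": (0, None),
                  "O": (0, None), "P": (0, None), "S": (0, None)}
with_iron = dict(default_ranges, Fe=(1, 2))

glucose_ok = formula_in_range("C6H12O6", default_ranges)
methane_ok = formula_in_range("CH4", default_ranges)
ferrocene_ok = formula_in_range("C10H10Fe", with_iron)
```

With the additional entry for Fe, glucose is rejected because it contains no iron, which mirrors the example given above.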

·         build_mf_restriction_lib(mass, dm=70, db=None, el2range=None):

Function returns the count ranges of the elements C, H, N, O, P, and S for all formulas found in database db within the mass range (mass - dm; mass + dm), for heuristic restriction of the solution space. The output is a dictionary with the element as key and the observed element distribution tuple (min, mean, std, max) as value, for example {'C': (2, 14, 11, 24), 'O': (0, 3, 6, 12)}. If no database is provided, the emzed pubchem database is used. Prior to the analysis, the applied database is filtered by the dictionary el2range, default el2range={'C': (2,None), 'H':(0,None), 'N': (0,None), 'O':(0,None), 'P':(0,None), 'S': (0,None)}, where the allowed elements are keys and the tuple defines the lower and upper limit for each element; if the upper limit is None, no upper limit applies. If a formula contains elements not present in the dictionary, the formula is removed.

·         restricted_formula_table(min_mass, max_mass, C=(0, None), H=(0, None), N=(0, None), O=(0, None), P=(0, None), S=(0, None), prune=True, level='extended'):

This is the enlarged version of emzed.utils.formulaTable. The function generates a table containing molecular formulas consisting of the elements C, H, N, O, P and S with a mass in the range [**min_mass**, **max_mass**]. If **prune** is *True*, mass ratio rules (from the "seven golden rules") and valence bond checks are used to avoid unrealistic compounds in the table. Moreover, rule #4 (hydrogen/carbon element ratio check), rule #5 (heteroatom ratio check) and rule #6 (element probability check) are applied to each molecular formula mf. For rules #4 and #5, two different levels of inclusion can be applied: common (99.7 % of all formulas are covered) and extended (99.99 % of all formulas are covered). To cover molecules like BPG or FBP, 'extended' mode is required.

For each element one can provide a given count or an inclusive range of atom counts considered in this process. Putting some restrictions on atom counts, e.g. **C=(0, 100)**, can speed up the process tremendously.

·         x2c_check(mf, el='H', level='extended'): rule #4 (hydrogen/carbon element ratio check) and rule #5 (heteroatom ratio check) from the seven golden rules paper. level == common: 99.7 % of all formulas are covered; level == extended: 99.99 % of all formulas are covered. Additional filter function for the function emzed.utils.formulaTable.

·         el_comb_check(mf): rule #6 (element probability check of molecular formula mf) from the seven golden rules paper. Additional filter function for the function emzed.utils.formulaTable.

 

 

Module utils

·         remove_files_from_path(path, type_='*.*'):

·         enhanced_integrate(t, step=1, fwhm=None, max_dev_percent=20, min_area=100, mslevel=1, n_cpus=None): the function requires only 1 EIC peak per row and a common postfix for all colnames! Performs emg_exact based peak integration for the integration intervals step*fwhm, 0.5/step*fwhm and 2*step*fwhm. If the column fwhm is not in the table, fwhm can be provided and must be a float > 0. Alternatively, fwhm is calculated as rtmax-rtmin. The emg_exact integration is accepted if the peak area difference between the integrations < max_dev_percent.

·         cache_result(fun, params, path, foldername= 'cache'): creates the folder `cache` in path and saves the function output in that folder. The function arguments params must be a list or tuple with parameter values, or a dictionary with function argument keywords as keys. 'In place' operations are not cached.

·         remove_cache(path, foldername='cache'): removes subfolder foldername (`cache` by default) and its content in directory  specified in argument `path`.

·         process_time(fun, args=None, kwargs=None, in_place=False): returns the process time of function fun called with the arguments args and kwargs. If in_place == True, no value will be returned

·         compare_peaks(t1, t2, mztol=0.003, rttol=5.0, keep_t1=False): returns a joined table containing the peaks present in both t1 and t2. If keep_t1=True, all peaks of t1 are kept, and of t2 only those peaks that are also present in t1. Required colnames: mz, rt
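
The caching idea behind cache_result can be sketched with the standard library. Everything in this sketch is an assumption made for illustration: the helper name cached_call, the file naming by a hash of the function name and parameters, and the use of pickle for storage are not taken from wtbox.

```python
import hashlib
import os
import pickle
import tempfile


def cached_call(fun, params, path, foldername="cache"):
    """Call fun with params, or load the stored result of an identical earlier call.

    params -- list/tuple of positional values or dict of keyword arguments
    """
    folder = os.path.join(path, foldername)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    args = () if isinstance(params, dict) else tuple(params)
    kwargs = params if isinstance(params, dict) else {}
    # The cache key is derived from the function name and the parameter values.
    key_source = repr((fun.__name__, args, sorted(kwargs.items())))
    key = hashlib.sha1(key_source.encode("utf-8")).hexdigest()
    target = os.path.join(folder, key + ".pickled")
    if os.path.exists(target):
        with open(target, "rb") as handle:
            return pickle.load(handle)
    result = fun(*args, **kwargs)
    with open(target, "wb") as handle:
        pickle.dump(result, handle)
    return result


def slow_square(x):
    return x * x


workdir = tempfile.mkdtemp()
first = cached_call(slow_square, [4], workdir)   # computed and stored
second = cached_call(slow_square, [4], workdir)  # read back from the cache folder
```

Because only the return value is stored, such a scheme cannot reproduce side effects, which is consistent with the remark that 'in place' operations are not cached.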