wtbox
Documentation
The main functions of wtbox do not operate in place: even object manipulations that are in place by default, e.g. t.addColumn('x', ..), are suppressed by copying the objects. Functions that do modify their input are marked as in place below.
Module checks_and_settings
This module contains functions for predefined table settings that check whether a table fulfills the minimum requirements for an operation. For example, if a workflow (function) requires an integration step, you can check whether the table object fulfills the minimal requirements, i.e. it has the columns rtmin, rtmax, mzmin, mzmax, peakmap (+ postfix), the rt and mz columns contain float values, and the column peakmap contains objects of type PeakMap. This is ongoing work and will be further completed.
·
is_ff_metabo_table(table):
verifies whether table column names and column types correspond to
featureFinderMetabo output table
·
is_integratable_table(t): Emzed function utils.integrate can be applied
·
is_integrated_table(t): Emzed function utils.integrate was applied
·
is_ms_peaks_table(t): Table is
suited for targeted peaks extraction
·
is_isotopologue_distribution_table(t, fid='feature_id', isotope_id='mi', fraction_col='isotope_fraction')
·
colname_type_checker(t, colnames, name2types=None): checks
whether the columns [colnames] of table
[t] are of the required type. Optional [name2types -> dictionary {name : type}].
If None, the colnames are checked against the standard dictionary
colname_type_settings (e.g. mz, rt).
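The idea behind colname_type_checker can be sketched without emzed. The snippet below is a minimal stand-in, not the wtbox implementation: the table is modelled as a plain dictionary mapping column names to value lists, and the names check_column_types and DEFAULT_NAME2TYPES are hypothetical.

```python
# Hypothetical default settings, loosely mirroring the idea of
# colname_type_settings (the real dictionary may differ).
DEFAULT_NAME2TYPES = {"mz": float, "rt": float}


def check_column_types(columns, colnames, name2types=None):
    """Return True if every column in colnames exists and holds the required type.

    columns: dict {column name: list of values}, a stand-in for a table.
    None entries are treated as missing values and skipped (assumption).
    """
    if name2types is None:
        name2types = DEFAULT_NAME2TYPES
    for name in colnames:
        if name not in columns or name not in name2types:
            return False
        required = name2types[name]
        values = [v for v in columns[name] if v is not None]
        if not all(isinstance(v, required) for v in values):
            return False
    return True


table = {"mz": [181.0707, 203.0526], "rt": [65.2, None], "id": [0, 1]}
ok = check_column_types(table, ["mz", "rt"])
bad = check_column_types(table, ["id"], name2types={"id": str})
```

Here ok is True because both columns hold floats, while bad is False because the id column holds integers where strings were required.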
Module default_guis
· adapt_ff_metabo_config(config=None, advanced=[0]): provides a GUI that lets the user adapt ffMetabo settings and returns the modified kwargs. Optional arguments: 1) config: dictionary with ffMetabo parameters as keys; if config is None, a default setup is provided. 2) advanced: bool; if True all parameters can be edited, else only a subset that is sufficient for most users.
Module default_plottings
The module covers two frequently required plots from LC-MS data: heatmaps and fitting curve plots.
·
plot_from_table_columns(t, id_cols, plot_fun,
plot_fun_args2colnames, save_dir,
plot_colname='plot', kwargs=None):
description of function arguments:
o
t: table
o
id_cols:
tuple with columns that define the sub-tables with plot values. Example: to plot the isotopic patterns of different
compounds and samples in the same table, you can assign the plots correctly via
the column name tuple (sample_name, compound_name).
o
plot_fun:
function to create the plot.
o
plot_fun_args2colnames:
dictionary with plot function arguments as keys and tuple (colname,
uniqueValue) as value, e.g. {plotarg : (columnname, False)}. If uniqueValue is True and the column contains
only one distinct value, that single value is extracted.
o
save_dir:
directory where the figure files are saved
o
kwargs: plotting parameters
that depend on the plot function, e.g. dpi, plot type_, colors, styles, ...
·
plot_heatmap(data, xlabels, ylabels,
label_right=False, colorbar= True, pad_colorbar=0.1, binsize=None, title=None, save_dir=None,
cmap="Greens", none_color="#777777", fig_type='png',
dpi=None):
Plots a heatmap including axis labels, colorbar
and title. Parameters:
o
data : 2d numpy array
o
xlabels : list of strings, len is
number of columns in data
o
ylabels : list of strings, len is
number of rows in data
o
label_right : boolean, indicates
if labels at right of heatmap should be plotted
o
colorbar : show colorbar, default
= True
o
pad_colorbar: float in range 0 ..
1, distance of colorbar to heatmap
o
binsize : None or float in range
0..1, if this value is not None the heatmap and
the colorbar are discretised
according to this value.
o
title : None or string
o
cmap : string with name of
colormap, see help(pylab.colormaps) for alternatives
o
none_color : RGB string for
plotting missing values.
·
plot_heatmaps_from_isotope_table(t, id_col='feature_id',
columns_col='time', rows_col='mi', value_col='mi_frac_corr',
add_missing_values=0.0, sort_col='mi', cmap='binary', save_dir=r'P:\tmp'):
Plots
heatmaps from table columns using function plot_heatmap and returns a
dictionary containing id_col values as keys and the corresponding plot_path as values. To
add plots to table see -> wtbox.table_operations.add_plots_to_table
o
id_col: defines the subtable for
each heatmap
o
columns_col: x-axis of heatmap
plot
o
rows_col: y-axis of heatmap plot
o
value_col: value assigned to
(x,y)
o
add_missing_values: replaces None
values by user defined value
o
cmap : string with name of
colormap, see help(pylab.colormaps) for alternatives
o
save_dir: directory in which the
heatmap plots are saved
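To make the roles of columns_col, rows_col, value_col and add_missing_values concrete, the following sketch arranges long-format rows into the 2D data layout that plot_heatmap expects. It is an illustration only: the rows are plain dictionaries rather than an emzed table, and pivot_for_heatmap is a hypothetical helper, not part of wtbox.

```python
def pivot_for_heatmap(rows, columns_col="time", rows_col="mi",
                      value_col="mi_frac_corr", add_missing_values=0.0):
    """Arrange long-format rows into a 2D list (heatmap rows x heatmap columns).

    rows: list of dicts, a stand-in for the sub-table of one id_col value.
    Returns (data, xlabels, ylabels); cells without a value are filled
    with add_missing_values.
    """
    xvalues = sorted(set(r[columns_col] for r in rows))
    yvalues = sorted(set(r[rows_col] for r in rows))
    lookup = {(r[rows_col], r[columns_col]): r[value_col] for r in rows}
    data = []
    for y in yvalues:
        line = []
        for x in xvalues:
            value = lookup.get((y, x))
            line.append(add_missing_values if value is None else value)
        data.append(line)
    return data, [str(x) for x in xvalues], [str(y) for y in yvalues]


rows = [
    {"time": 0, "mi": 0, "mi_frac_corr": 1.0},
    {"time": 10, "mi": 0, "mi_frac_corr": 0.6},
    {"time": 10, "mi": 1, "mi_frac_corr": 0.4},
]
data, xlabels, ylabels = pivot_for_heatmap(rows)
```

The combination (mi=1, time=0) has no measurement, so the corresponding cell is filled with the add_missing_values default of 0.0.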
·
plot_fitting_curve(x, y, x_fit, y_fit,
x_tick_labels='', y_tick_labels='', x_ticks=None, unit='', y_ticks=None,
ylabel=None, title=None, outlier_col=None, outlier_values = [(None,None)]):
Default scatter plot of measured values x, y
combined with a line plot of fitted values x_fit, y_fit.
Optional attributes:
o
x_tick_labels: list of string (in
general numbers), labeling x-ticks
o
y_tick_labels: list of string (in
general numbers), labeling y-ticks
o
x_ticks: list of float values,
positioning x ticks
o
y_ticks: list of float values,
positioning y ticks
o
ylabel: label of the y axis
o
title: plot title
o
outlier_values: list of (x, y)
tuples that allows depicting outlier values separately in a different color.
·
plot_fitting_curves_from_isotope_table(t, id_cols=('feature_id',),
time_col='time', value_col='no_C13', fun_col='no_C13_fitting_fun',
params_col='no_C13_fit_params',
add_missing_tp_as_zero=True, outlier_col=None, num_points=50,
save_dir=r'P:\tmp', fig_type='png', dpi=None):
Plots the fitting curves of (isotope) table t.
The function returns a dictionary containing id_col values as keys and the corresponding
plot_path as values. To add plots to
table see -> wtbox.table_operations.add_plots_to_table. Function attributes:
o
id_cols: Tuple
defining the subtable for each fitting plot. It allows combining different columns to
define a subgroup. Example: To plot all isotopologues of a feature you can combine
the two columns as identifier by id_cols=('feature_id', 'mi')
o
time_col: Defines
column with x- values (time)
o
value_col: defines y
values for plot
o
fun_col: Fitting
function that was applied
o
params_col: parameters
determined with fitting function to calculate y_fit
o
add_missing_tp_as_zero:
None values will be shown as zero
o
outlier_col:
a column can be provided that contains (x,y) pairs which have been
excluded from fitting process.
o
num_points: Number of
points calculated with fitting_function to draw line plot
Module feature_analysis
The module covers the most common calculations for features, including most normalization methods.
·
calculate_mid(t, id_col='feature_id',
quantity_col='area', result_col='mi_fraction')
Determines the mass isotopologue distribution of feature with id id_col and writes the result in column result_col. By default quantity_col is area.
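As an illustration of what a mass isotopologue distribution is, the sketch below divides each isotopologue quantity by the summed quantity of its feature. This is the usual definition and an assumption about calculate_mid, not a copy of its code; rows are plain dictionaries and mass_isotopologue_fractions is a hypothetical name.

```python
from collections import defaultdict


def mass_isotopologue_fractions(rows, id_col="feature_id",
                                quantity_col="area", result_col="mi_fraction"):
    """Return copies of rows with each quantity divided by its feature total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[id_col]] += row[quantity_col]
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row[result_col] = row[quantity_col] / totals[row[id_col]]
        result.append(new_row)
    return result


rows = [
    {"feature_id": 1, "mi": 0, "area": 750.0},
    {"feature_id": 1, "mi": 1, "area": 250.0},
    {"feature_id": 2, "mi": 0, "area": 40.0},
]
fractions = [r["mi_fraction"] for r in mass_isotopologue_fractions(rows)]
```

The fractions of each feature sum to 1; a feature with a single isotopologue gets the fraction 1.0.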
· normalize_peaks_by_idms(t, id_col='feature_id', mi_col='mi', quantity_col='area', result_col='idms_ratio')
Calculates the ratio of unlabeled (minimally labeled) and fully labeled mass isotopologues of each feature. Attributes:
o id_col: groups mass isotopologues of a compound adduct ion
o mi_col: nominal isotopologue mass shift
o quantity_col: peak area (intensity)
o result_col: name of the result column
· normalize_peaks_with_internal_standard(t, id_col='feature_id', is_id_col='internal_standards', quantity_col='area', result_col='norm_area')
·
normalize_peaks_with_internal_standards(t, norm_id_col='norm_with',
is_id_col='internal_standard', quantity_col='area', result_col='norm_area')
Normalizes peak area (intensity) to the area (intensity) of a user defined internal standard. Attributes:
o norm_id_col: contains id of internal standard used for normalization
o is_id_col: defines peak as internal standard and assigns id to the peak.
o quantity_col: peak area (intensity)
o result_col: name of the result column
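The interplay of norm_id_col and is_id_col in normalize_peaks_with_internal_standards can be sketched as follows. This is a simplified illustration under assumptions: rows are plain dictionaries, each internal standard id occurs once, and peaks without a usable standard receive None. The helper name normalize_with_internal_standards is hypothetical.

```python
def normalize_with_internal_standards(rows, norm_id_col="norm_with",
                                      is_id_col="internal_standard",
                                      quantity_col="area",
                                      result_col="norm_area"):
    """Divide each peak quantity by the quantity of its assigned standard."""
    # Map each internal standard id to the quantity of the peak carrying it.
    is_areas = {r[is_id_col]: r[quantity_col]
                for r in rows if r[is_id_col] is not None}
    result = []
    for row in rows:
        new_row = dict(row)  # copy, the input rows stay unchanged
        reference = is_areas.get(row[norm_id_col])
        # No assigned standard, or a standard with zero area: no normalization.
        if not reference:
            new_row[result_col] = None
        else:
            new_row[result_col] = row[quantity_col] / float(reference)
        result.append(new_row)
    return result


rows = [
    {"name": "IS_A", "internal_standard": "A", "norm_with": "A", "area": 200.0},
    {"name": "glucose", "internal_standard": None, "norm_with": "A", "area": 50.0},
    {"name": "unknown", "internal_standard": None, "norm_with": None, "area": 10.0},
]
norm = [r["norm_area"] for r in normalize_with_internal_standards(rows)]
```

The standard itself normalizes to 1.0, and the peak assigned to standard A is expressed relative to the area of A.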
·
normalize_peaks_with_TIC(t, quantity_col='area',
result_col='norm_area')
Normalizes peak area of each peak to total table area. Attributes:
o quantity_col: peak area (intensity)
o result_col: name of the result column
·
calculate_feature_labeled_fraction(t, id_col= 'feature_id',
mi_col='mi', mi_fraction_col='mi_fraction_corr', result_col='labeled_fraction'):
Calculates the labeled isotope fraction from the mass isotopologue
distribution. Attributes:
o id_col: groups mass isotopologues of a compound adduct ion
o mi_col: nominal isotopologue mass shift
o mi_fraction_col: column with the calculated mass isotopologue distribution
o result_col: name of the result column (optional)
·
filter_top_n_by_fragment_ion(top_n_table, ions):
Filters a top_n table for specific fragment ions see -> feature_extraction.top_n_to_table. Attributes:
o ions: list of fragment ion m/z values
o mztol: m/z tolerance in units
o min_int: minimal peak intensity
· correct_for_natural_C13(t, id_='feature_id', isotope_id='mi', fraction_col='isotope_fraction', mf_col='mf'):
*In place function. Subtracts carbon labeling originating from natural C13 abundance from the calculated mass isotopologue distribution and writes the result to column fraction_col with postfix '_corr'. The function is best suited for high mass resolution data since it does not take into account heavy stable isotopes of other elements. Attributes:
o id_col: groups mass isotopologues of a compound adduct ion
o isotope_id: nominal isotopologue mass shift
o fraction_col: column with the calculated mass isotopologue distribution
o mf_col: column with the corresponding molecular formula
Module feature_extraction
·
targeted_peaks_ms(peakmap, peaks_table,
fwhm=None, max_dev_percent=20, min_area=100, n_cpus=None):
Function targeted_peaks_ms extracts the MS level 1 LC-MS peaks of peakmap defined in table peaks_table using function -> wtbox.utils.enhanced_integrate. Optional parameters are related to the latter function. peaks_table requires columns 'mzmin', 'mzmax', 'rtmin', 'rtmax'. You can also provide an additional column 'fwhm' (see enhanced_integrate).
·
targeted_peaks_ms2(peakmap, peaks_table,
fwhm=None, max_dev_percent=20, min_area=100, n_cpus=None, step=1):
Function targeted_peaks_ms2 extracts the MS level 2 LC-MS peaks of peakmap defined in table peaks_table using function -> wtbox.utils.enhanced_integrate. Optional parameters are related to the latter function. peaks_table requires columns 'mzmin', 'mzmax', 'rtmin', 'rtmax'.
·
metaboff_ms2(peakmap,
ff_metabo_config=None): untargeted
extraction of all MS level 2 peaks applying feature_finder_metabo. If
ff_metabo_config is not provided a dialog window will open.
·
top_n_to_table(pm_, rttol=5, mztol=0.003):
Function extracts all precursor ion peaks on MS level 1 with corresponding MS level 2 spectra. Attributes:
pm_: PeakMap with mslevels [1, 2];
rttol: retention time tolerance in seconds,
mztol: m/z tolerance in units.
·
adapt_rt_windows(peaks_table, pm,
split_by='feature_id', mslevel=None): allows manually adapting retention time windows to peaks by
integration for mslevels 1 and 2. Required columns of peaks_table: mzmin, mzmax,
rtmin, rtmax, and the column defined by split_by. mslevel will be ignored if
peaks_table provides column 'mslevel'. By default, mslevel is 1. If mslevel == 2,
column precursor_ion is required. If more than one peak per group is integrated,
the one with the largest area will be selected to modify the retention time.
Module fitting
1) provided fitting functions: pt1, decay, logistic, double_logistic_model, double_pt1_model, weibull, linear
2) functions
a. main_curve_fitting(x, y, fun, params=None, sigma=None, max_nrmse=1e-2, max_iterations=10, remove_outlier=True): determines fitting parameters for float iterables x and y, with y=fun(x). Keyword arguments:
- params: iterable of initial values for fitting function fun. If None, parameters are provided by a generator function if the fitting function is `pt1`, `logistic`, `dbl_logistic_model` or `double_pt1_model`; otherwise an AssertionError is raised.
- max_nrmse: the fitting routine stops as soon as the nrmse of the fit < max_nrmse
- max_iterations: maximum number of fitting operations, global termination criterion. If it is reached before the max_nrmse criterion is fulfilled, the best fitting results are returned.
b. curve_fitting_from_table(t, funs, fun_params=None, id_cols=('feature_id',),
time_col='time',
value_col='mi_fraction', max_nrmse=0.05, max_iterations=10, missing_tp_as_0= False, remove_outlier=False):
*In place function that applies the fitting functions funs to the table and selects the function with the best fitting result based on nrmse. Columns 'fit_params' (calculated parameters of fitting function), 'fit_stds' (standard deviation of fitting
function), 'fit_nrmse' (normalized root mean square error), and 'fit_fun' (fitting function) are added. If remove_outlier == True, a column outlier
containing the removed outliers (x, y) will be added. Attributes:
- funs: list of functions for fitting
- fun_params: initial fitting function parameters. If None, parameters are provided by
generator functions in case the fitting function is defined in the fitting module
(e.g. `pt1`, `logistic`, `dbl_logistic_model`, ...); otherwise an AssertionError is raised.
- id_cols: Tuple defining the subtable for fitting. It allows combining different columns
to define a subgroup. Example: To fit all isotopologues of a feature you can combine
the two columns as identifier by id_cols=('feature_id', 'mi')
- time_col: Defines column with x- values (time)
- value_col: Defines y values for plot
- max_nrmse: Maximal allowed nrmse value to accept a fit
- max_iterations: number of allowed iterations for nrmse minimization
- missing_tp_as_0: replaces None value by zero
- remove_outlier: Removes an (x, y) pair from the fit if the deviation between the fitted and the measured value exceeds 3 times the standard deviation
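The two quality criteria used in this module, the nrmse threshold and the 3 standard deviation outlier rule, can be illustrated in plain Python. The sketch rests on several assumptions that are not confirmed by the text above: pt1 is taken to be the first-order lag response k * (1 - exp(-t / tau)), the rmse is normalized by the range of the measured values, and outliers are judged on the residuals between measured and fitted values. The helper names nrmse and split_outliers are hypothetical.

```python
import math


def pt1(t, k, tau):
    """First-order lag (PT1) step response (assumed form of the model)."""
    return k * (1.0 - math.exp(-t / tau))


def nrmse(y, y_fit):
    """Root mean square error divided by the range of the measured values."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_fit)) / len(y)
    return math.sqrt(mse) / (max(y) - min(y))


def split_outliers(x, y, y_fit, n_std=3.0):
    """Separate (x, y) pairs whose residual deviates by more than n_std stds."""
    residuals = [a - b for a, b in zip(y, y_fit)]
    mean = sum(residuals) / len(residuals)
    std = math.sqrt(sum((r - mean) ** 2 for r in residuals) / len(residuals))
    kept, outliers = [], []
    for xi, yi, ri in zip(x, y, residuals):
        is_outlier = std > 0 and abs(ri - mean) > n_std * std
        (outliers if is_outlier else kept).append((xi, yi))
    return kept, outliers


x = list(range(20))
y_fit = [pt1(t, 1.0, 5.0) for t in x]
y = list(y_fit)
y[10] += 0.5  # one artificial outlier
kept, outliers = split_outliers(x, y, y_fit)
quality_before = nrmse(y, y_fit)
quality_after = nrmse([p[1] for p in kept],
                      [pt1(p[0], 1.0, 5.0) for p in kept])
accepted = quality_after < 0.05  # comparison against a max_nrmse of 0.05
```

With the single distorted point removed, the remaining values match the model exactly and the fit passes the max_nrmse criterion.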
Module in_out
·
save_dict(dic, path=None,
overwrite=False, startAt=None): Saves python dictionary as json file. Note that all keys are
converted into string by the Python json module. If no path is provided
argument startAt allows assigning an initial directory for the path dialog.
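The key conversion mentioned for save_dict is a property of the Python json module and can be reproduced with the standard library alone. The dictionary below is made-up example data.

```python
import json

original = {1: "glucose", "name": "sample_A"}

# A round trip through JSON turns the integer key 1 into the string "1".
restored = json.loads(json.dumps(original))

# Keys that are known to be integers have to be converted back explicitly.
repaired = {(int(k) if k.isdigit() else k): v for k, v in restored.items()}
```

After the round trip, restored no longer equals original, whereas repaired does.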
· load_dict(path=None, startAt=None): loads a dictionary saved in .json format if path exists; else a default dialog is opened, and argument startAt allows assigning an initial directory for the path dialog.
· load_tables(pathes=[], startAt=None, exclude_blanks=False): loads multiple emzed tables of type table, CSV and json from a list of paths (for details about tables in json see table_operations.table_to_dict). If pathes is empty a dialog opens and argument startAt allows assigning an initial directory for the path dialog. If exclude_blanks is True only files not labeled with 'blank' in the file name are loaded. For details see -> filter_blanks function.
· save_list_of_tables(tables, path=None, force_overwrite=True, startAt=None) merges a list
of tables applying emzed stackTables function, merged table will be saved as '.tables'.
·
save_tables_as_excel(tables, path=None,
force_overwrite=True, startAt=None):
Saves a list of tables in excel file where each table is written to a separate sheet. Sheet names are set to t.title if t.title exists. Else sheet names are enumerated. If path is None a dialog opens and argument startAt allows assigning an initial directory for the path dialog. The function automatically adds time label to filename if force_overwrite is set to `False` and path exists.
· load_list_of_tables(path=None, startAt=None): loads a merged list of tables of type '.tables' which was saved with function -> save_list_of_tables and splits it into the original list of tables.
· load_peakmaps(pathes=[], startAt=None, exclude_blanks=False): loads multiple peakmaps of type mzML and mzXML from a list of paths. If pathes is empty a dialog opens. If exclude_blanks is True only files not labeled with 'blank' in the file name are loaded. For details see -> filter_blanks function.
· enhanced_save_table(t, path=None, force_overwrite=True, startAt=None): automatically adds a time label to the filename if force_overwrite is set to `False` and path exists. Allowed filename characters are restricted, and unaccepted strings will open a dialog that lets the user correct the filename. Filename type is automatically set to `table`. If no path is provided, argument startAt allows assigning an initial directory for the path GUI dialog.
· load_table_item(path=None): determines the data type from the file ending. Can load emzed tables saved as json, table, and csv; an assertion error occurs if the file type does not fit. Output is a Table. If path is not specified a dialog box opens.
· save_item_as_pickle(value, path=None, overwrite=True, startAt=None): Saves any python object as pickle. It automatically adds a time label to the filename if overwrite is set to `False` and the path exists. Allowed filename characters are restricted, and unaccepted strings will open a dialog that lets the user correct the filename. Filename type is automatically set to `pickled`. If no path is provided, argument startAt allows assigning an initial directory when starting the GUI dialog.
· load_pickled_item(path=None): loads pickled python objects. If no path is provided, argument startAt allows assigning an initial directory when starting the GUI dialog.
· filter_blanks(pathes): filters out paths whose file name contains the word blank (case-insensitive) separated by white space, -, _ or, if placed at the end of the file name, followed by `.`. Examples: '…fgdfg_blank_gfdh…', 'gfdgdfg -BlaNK.mzML', '121 BLANK_dfsfsdf.mzML'
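The naming rule described for filter_blanks can be expressed as a regular expression. The pattern below is a reconstruction from the description and its examples, not the pattern used in wtbox; accepting the word at the very start of a file name is an added assumption, and drop_blanks is a hypothetical name.

```python
import os
import re

# "blank" delimited on the left by whitespace, "-" or "_" (or the start of
# the name) and on the right by whitespace, "-", "_", "." (or the end).
BLANK_PATTERN = re.compile(r"(^|[\s\-_])blank([\s\-_.]|$)", re.IGNORECASE)


def drop_blanks(paths):
    """Return only those paths whose file name is not labeled as blank."""
    return [p for p in paths
            if not BLANK_PATTERN.search(os.path.basename(p))]


paths = [
    "run1_blank_01.mzML",
    "gfdgdfg -BlaNK.mzML",
    "121 BLANK_dfsfsdf.mzML",
    "sample_01.mzML",
    "blanket_test.mzML",
]
remaining = drop_blanks(paths)
```

Because the word must be delimited on both sides, a name such as blanket_test.mzML is not mistaken for a blank.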
Module table_operations
· split_table_by_columns(t, colnames, remove_split_id=True): splits a table by a split id composed of several columns defined in the iterable colnames. This avoids nested applications of splitBy. Example code:
import emzed
t = emzed.utils.toTable('id1', (0,0,0,1,1,1), type_=int)
t.addColumn('id2', list(range(3)) * 2, type_=int)
t.addColumn('a', list(range(6)), type_=int)
results in
id1 id2 a
int int int
------ ------ ------
0 0 0
0 1 1
0 2 2
1 0 3
1 1 4
1 2 5
split_table_by_columns(t, ['id1', 'id2']) results in 6 tables:
id1 id2 a
int int int
------ ------ ------
0 0 0

id1 id2 a
int int int
------ ------ ------
0 1 1

id1 id2 a
int int int
------ ------ ------
0 2 2

id1 id2 a
int int int
------ ------ ------
1 0 3

id1 id2 a
int int int
------ ------ ------
1 1 4

id1 id2 a
int int int
------ ------ ------
1 2 5
and the code is equivalent to:
subsets = []
for subset1 in t.splitBy('id1'):
    subsets.extend(subset1.splitBy('id2'))
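The same grouping logic can be tried out without emzed. The sketch below groups plain dictionaries by the tuple of values in several columns, which is the idea behind split_table_by_columns; it is not the wtbox code, and split_rows_by_columns is a hypothetical name.

```python
from collections import OrderedDict


def split_rows_by_columns(rows, colnames):
    """Group rows (dicts) by the tuple of values found in colnames."""
    groups = OrderedDict()  # keeps groups in order of first appearance
    for row in rows:
        key = tuple(row[name] for name in colnames)
        groups.setdefault(key, []).append(row)
    return list(groups.values())


# Same content as the example table: id1 = 0,0,0,1,1,1 and id2 = 0,1,2,0,1,2
rows = [{"id1": i // 3, "id2": i % 3, "a": i} for i in range(6)]
subsets = split_rows_by_columns(rows, ["id1", "id2"])
by_id1 = split_rows_by_columns(rows, ["id1"])
```

Splitting by both columns yields six groups of one row each, while splitting by id1 alone yields two groups of three rows.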
· table_to_dict(t, as_json=False): converts a table into a dictionary {colname: values}. An additional key layout contains information about column types and formats. If as_json is True, PeakMaps, Blobs and subtables are lost during conversion!
· dict_to_table(dic): Inverse function of table_to_dict. Converts a dictionary into a table with column names dic.keys(); column values must be lists or tuples of the same length, or unique values. Additionally, a key layout is required containing a list of tuples (column name, column type_, column format).
· transfer_column_between_tables(t_source, t_sink, data_col, ref_col, insert_before=None): adds the values of column data_col from Table `t_source` to Table `t_sink` via the common reference column `ref_col`. If column `data_col` already exists in t_sink an assertion error occurs.
· peakmap_as_table(pm): converts peakmap pm into a table with a single row and with columns peakmap, unique_id, full_source, and source.
· reduce_peakmap_size_in_table(peaks_table, ref_col='id', tol=(-60.0, +60.0, -10.0, +10.0), mslevel=1): In place function that reduces data size, e.g. for targeted data analysis. Peakmaps are cut to fit the targeted extraction window defined by argument ref_col. Argument tol is a tuple with values (lower rttol, upper rttol, lower mztol, upper mztol) and defines the tolerance for peakmap cutting.
· single_spec_to_table(prec, spec): converts an MS2 spectrum into a table. Suitable for visualizing MS spectra from top n or similar approaches.
· add_plots_to_table(t, id2plots, id_cols, plot_colname='plot'): adds plots to table t. Functions such as `plot_heatmaps_from_isotope_table` and `plot_fitting_curves_from_isotope_table` return a dictionary id2plots with id_col values as keys and plot paths as values.
· find_common_postfix(t, colnames=None): finds any common postfix for all colnames of table t. Argument `colnames` is a list of strings; if set, the common postfix of the selected colnames is returned, provided the listed colnames are all columns of t; else an empty string is returned.
· update_rt_by_integration(t): returns a table with the rt value updated to that of the spectrum with maximal intensity in the selected m/z – rt window. Required columns: rtmin, rtmax, mzmin, mzmax, peakmap
Module mf_utils
· count_element(el, mf): counts number of element el in molecular formula mf.
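Counting an element in a molecular formula can be sketched with a regular expression. The version below is a simplified illustration and not the mf_utils implementation: it assumes flat formulas without brackets, charges or isotope labels, and the name count_element_sketch is hypothetical.

```python
import re


def count_element_sketch(el, mf):
    """Count atoms of element el in a flat formula such as 'C6H12O6'."""
    total = 0
    # An element symbol is one capital letter plus optional lowercase
    # letters, followed by an optional count (a missing count means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]*)(\d*)", mf):
        if symbol == el:
            total += int(number) if number else 1
    return total


carbons = count_element_sketch("C", "C6H12O6")
chlorines = count_element_sketch("Cl", "CCl4")
oxygens = count_element_sketch("O", "CH3COOH")
```

Matching whole symbols keeps C and Cl apart, and repeated occurrences of an element, as in CH3COOH, are summed.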
·
restrict_db(db, C = (2,None),
H=(0,None), N= (0,None), O=(0,None),
P=(0,None), S= (0,None), el2range=None, mass_range=None):
Filters database db for compounds composed of C, H, N, O, P, S. Further elements can be included via el2range, where allowed elements are keys and
tuple values define the lower and upper limit for each element. If the upper limit equals None no upper limit is given. If a formula contains elements not present in the dictionary or the number of atoms of an element is not in the predefined range, the formula will be removed. Example: if you set el2range={'Fe': (1,2)} and keep the default values for the other elements, only formulas containing 1-2 Fe and at least 2 C will be selected. Optionally, a global mass range for the database can be defined with mass_range.
· build_mf_restriction_lib(mass, dm=70, db=None, el2range=None):
Function returns the number range of elements C, H, N, O, P, and S for all formulas found in database db within the mass range (mass - dm; mass + dm) for
heuristic restriction of the solution space. The output is a dictionary with element as key and the observed element distribution tuple as value with entries (min, mean, std, max), for example {'C': (2, 14, 11, 24), 'O': (0, 3, 6, 12)}. If no database is provided the emzed pubchem database will be used. Prior to the analysis the applied database will be filtered
by dictionary el2range, default el2range={'C': (2,None), 'H':(0,None), 'N': (0,None), 'O':(0,None), 'P':(0,None), 'S': (0,None)}, where allowed elements are keys and
the tuple defines the lower and upper limit for each element. If the upper limit equals None no upper limit is given. If a formula contains elements not present in the dictionary the formula will be removed.
· restricted_formula_table(min_mass, max_mass, C=(0, None), H=(0, None), N=(0, None), O=(0, None), P=(0, None), S=(0, None), prune=True, level='extended'):
This is the enlarged version of emzed.utils.formulaTable. This function generates a table containing molecular formulas consisting of elements C, H, N, O, P and S having a mass in range [**min_mass**, **max_mass**]. If **prune** is *True*, mass ratio rules (from "seven golden rules") and valence bond checks are used to avoid unrealistic compounds in the table. Moreover, rule #4 (hydrogen/carbon element ratio check), rule #5 (heteroatom ratio check) and rule #6 (element probability check of molecular formula mf) will be applied. For rules #4 and #5 two different levels of inclusion can be applied: common: 99.7 % of all formulas are covered, extended: 99.99 % of all formulas are covered. To cover molecules like BPG or FBP, 'extended' mode is required.
For each element one can provide a given count or an inclusive range of atom counts considered in this process. Putting some restrictions on atom counts, e.g. **C=(0, 100)**, can speed up the process tremendously.
· x2c_check(mf, el='H', level='extended'): Rule #4 (hydrogen/carbon element ratio check) and rule #5 (heteroatom ratio check) from the seven golden rules paper. level == common: 99.7 % of all formulas are covered, level == extended: 99.99 % of all formulas are covered. Additional filter function for function emzed.utils.formulaTable.
· el_comb_check(mf): Rule #6: element probability check of molecular formula mf from the seven golden rules paper. Additional filter function for function emzed.utils.formulaTable.
Module utils
· remove_files_from_path(path, type_='*.*'):
· enhanced_integrate(t, step=1, fwhm=None, max_dev_percent=20, min_area=100, mslevel=1, n_cpus=None): Function requires only 1 EIC peak per row and a common postfix for all colnames! Performs emg_exact based peak integration for integration intervals step*fwhm, 0.5/step*fwhm and 2*step*fwhm. If column fwhm is not in the table, fwhm can be provided and must be a float > 0. Alternatively, fwhm is calculated as rtmax-rtmin. emg_exact integration is accepted if the peak area difference between both integrations < max_dev_percent.
· cache_result(fun, params, path, foldername= 'cache'): cache_result creates folder `cache` in path and saves function output in folder. Function arguments params must be list or tuple with parameter values or a dictionary with function argument keywords as keys. 'In place' operations are not cached.
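A file based result cache of this kind can be sketched with the standard library. The code below only illustrates the general technique and is not the wtbox implementation: identifying a call by the function name and the repr of its parameters, the md5 based file name and the helper name cached_call are all assumptions.

```python
import hashlib
import os
import pickle
import tempfile


def cached_call(fun, params, path, foldername="cache"):
    """Call fun(*params) or fun(**params), reusing a pickled result if present."""
    folder = os.path.join(path, foldername)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    # Identify the call by function name and parameter representation.
    if isinstance(params, dict):
        key_source = repr((fun.__name__, sorted(params.items())))
    else:
        key_source = repr((fun.__name__, tuple(params)))
    key = hashlib.md5(key_source.encode("utf-8")).hexdigest()
    target = os.path.join(folder, key + ".pickled")
    if os.path.exists(target):
        with open(target, "rb") as handle:
            return pickle.load(handle)
    result = fun(**params) if isinstance(params, dict) else fun(*params)
    with open(target, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def slow_square(x):
    calls.append(x)  # records how often the function body really runs
    return x * x


workdir = tempfile.mkdtemp()
first = cached_call(slow_square, [4], workdir)
second = cached_call(slow_square, (4,), workdir)
```

The second call is served from the cache folder, so the function body runs only once. Since only the return value is stored, a function that works purely by modifying its arguments has nothing to cache, which matches the remark about in place operations.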
· remove_cache(path, foldername='cache'): removes subfolder foldername (`cache` by default) and its content in directory specified in argument `path`.
· process_time(fun, args=None, kwargs=None, in_place=False): returns the process time of function fun called with arguments args and kwargs. If in_place == True no value will be returned.
· compare_peaks(t1, t2, mztol=0.003, rttol=5.0, keep_t1=False): returns a joined table containing the peaks present in both t1 and t2. If keep_t1=True, all peaks of t1 are kept, while peaks of t2 are kept only if they are also present in t1. Required colnames: mz, rt
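The matching rule behind compare_peaks can be illustrated with plain dictionaries: two peaks are considered the same if they agree within both tolerances. The sketch makes assumptions that are not stated above, namely that the tolerance bounds are inclusive and that unmatched t1 peaks are paired with None when keep_t1 is True. The name match_peaks is hypothetical.

```python
def match_peaks(peaks1, peaks2, mztol=0.003, rttol=5.0, keep_t1=False):
    """Pair peaks (dicts with 'mz' and 'rt') that agree within both tolerances."""
    pairs = []
    for p1 in peaks1:
        partners = [p2 for p2 in peaks2
                    if abs(p1["mz"] - p2["mz"]) <= mztol
                    and abs(p1["rt"] - p2["rt"]) <= rttol]
        if partners:
            pairs.extend((p1, p2) for p2 in partners)
        elif keep_t1:
            pairs.append((p1, None))  # keep the unmatched peak of the first list
    return pairs


peaks1 = [{"mz": 181.0707, "rt": 60.0}, {"mz": 203.0526, "rt": 120.0}]
peaks2 = [{"mz": 181.0720, "rt": 62.5}, {"mz": 203.0526, "rt": 200.0}]
common = match_peaks(peaks1, peaks2)
with_all_t1 = match_peaks(peaks1, peaks2, keep_t1=True)
```

The second peak has an identical m/z in both lists but differs by 80 s in retention time, so it is only retained, without a partner, when keep_t1 is True.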