INTRODUCTION

What is the motivation for g3mclass development?

Numerous studies show that random variation may occur in the results of biomedical tests sampled from populations diagnosed with the same disease. The best-known example of a random variable in oncology is human epidermal growth factor receptor 2 (HER2), reported to have higher-than-normal levels of expression in breast (15-30%), gastric and gastroesophageal (10-30%), ovarian (20-25%), endometrial (14-80%), bladder (23-80%), and lung (up to 20%) cancers. Statistical models incorporating the probability distribution of a random biological variable could be useful for understanding the nature of disease progression, developing targeted therapies, and improving patient outcomes.

What methodology is behind g3mclass?

Real-world readouts of laboratory tests rarely fit a single normal (Gaussian) distribution, and for many analytes a Gaussian distribution is not achieved even after data transformation. Furthermore, when comparing results from test (with disease) and reference (without disease) samples, the two distributions of measured values are rarely completely separated. This creates a methodological dilemma in choosing a diagnostic cutoff value, a choice that affects clinical decisions. A paper published in Cancer Research, 2019; 79:3492-502 offered a potential solution: unmixing quantitative assay data using a Bayesian approach and a Gaussian Mixture Model (GMM). The performance of the proposed probabilistic classifier was validated on datasets of more than 300 clinical samples and was shown to improve on the rule-based binary classification of tumor markers. This inspired the development of g3mclass, software that automates this method and adds further capabilities in a graphical user interface (GUI).

How is g3mclass unique?

Unlike other statistical programs aimed at statistical hypothesis testing, g3mclass performs probabilistic statistical classification of each dataset variable into as many categories as are probable. As an advanced analytical tool, it is more informed than rule-based classification and thus may improve statistical analysis of data from quantitative molecular assays. Focused on Bayesian statistics, g3mclass offers three classification approaches.

Who are potential users of g3mclass?

g3mclass aims to ease adoption of the probabilistic classifier in research, biomedical pharma, companion diagnostics, and ultimately healthcare. Currently, it is intended for basic and translational biomedical research, to help scientists accelerate the evaluation workflow for candidate biomarkers and therapeutic targets.

What does g3mclass do?

g3mclass is classification and visualization software purpose-built for modeling molecular assay data sampled from healthy (reference) and diseased (test) populations. Additional query samples (e.g. suspected disease) obtained by the same assays may also be classified.

How does g3mclass work?

After uploading a file with the reference, test and optional query datasets, the user may set the model parameters (defaults or user-chosen) and immediately learn the test GMM, which is depicted in a plot. Model learning by the expectation-maximization (EM) algorithm is initialized with classes that correspond to peaks in the histogram calculated on the test sample.

The class labeled '0' is the one that has the same mean value and standard deviation as the coupled reference sample. It is imposed on the test GMM; however, its weight in the model is not fixed. If the mean values of other modeled classes are lower or higher than that of class '0', then they are labeled with negative (e.g. -1, -2, etc.) or positive (e.g. 1, 2, etc.) integers, respectively. The larger the absolute value of the integer, the further the class lies from class '0'.

The g3mclass GUI lets the user set parameters for model learning: choose either a fixed or a variable number of bins in the histogram, dismiss classes that have too few samples, and fuse classes that are too close. The preferred model is selected automatically based on the lowest Bayesian information criterion (BIC). Upon model selection, the classification results are presented in spreadsheets and heatmaps for test, reference, and query (if present). The classification of the reference and all queries is based upon the corresponding test model.
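The fitting procedure described above can be illustrated with a minimal sketch. It is not the g3mclass implementation. It assumes a plain one-dimensional EM loop written with NumPy, in which class '0' keeps the mean and standard deviation of the reference sample while every class weight stays free. The function names (fit_gmm_with_ref, bic, signed_labels), the hand-picked starting means (g3mclass derives them from histogram peaks), and the count of three free parameters per non-reference class in the BIC are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def fit_gmm_with_ref(test, ref, init_means, n_iter=300):
    """EM for a 1-D GMM whose component 0 is pinned to the reference sample.

    Hypothetical helper: the mean and SD of component 0 stay fixed at the
    reference values, while its weight is re-estimated like any other weight.
    """
    test = np.asarray(test, dtype=float)
    ref = np.asarray(ref, dtype=float)
    means = np.concatenate(([ref.mean()], np.asarray(init_means, dtype=float)))
    sds = np.full(means.size, test.std(ddof=1))
    sds[0] = ref.std(ddof=1)
    weights = np.full(means.size, 1.0 / means.size)
    for _ in range(n_iter):
        # E-step: responsibility of each class for each test value.
        dens = weights[:, None] * normal_pdf(test[None, :], means[:, None], sds[:, None])
        resp = dens / np.maximum(dens.sum(axis=0, keepdims=True), 1e-300)
        # M-step: all weights are free; mean and SD only for classes 1..K-1.
        nk = np.maximum(resp.sum(axis=1), 1e-12)
        weights = nk / nk.sum()
        for k in range(1, means.size):
            means[k] = (resp[k] * test).sum() / nk[k]
            var = (resp[k] * (test - means[k]) ** 2).sum() / nk[k]
            sds[k] = np.sqrt(max(var, 1e-12))
    dens = weights[:, None] * normal_pdf(test[None, :], means[:, None], sds[:, None])
    loglik = np.log(np.maximum(dens.sum(axis=0), 1e-300)).sum()
    return weights, means, sds, loglik

def bic(loglik, n_classes, n_samples):
    """BIC, counting 3 free parameters per non-reference class (an assumption)."""
    n_free = 3 * (n_classes - 1)
    return n_free * np.log(n_samples) - 2.0 * loglik

def signed_labels(means):
    """Label class 0 as 0; others as -1, -2, ... or 1, 2, ... by distance from it."""
    labels = np.zeros(len(means), dtype=int)
    lower = [k for k in range(1, len(means)) if means[k] < means[0]]
    upper = [k for k in range(1, len(means)) if means[k] >= means[0]]
    for rank, k in enumerate(sorted(lower, key=lambda k: -means[k]), start=1):
        labels[k] = -rank
    for rank, k in enumerate(sorted(upper, key=lambda k: means[k]), start=1):
        labels[k] = rank
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ref = rng.normal(0.0, 1.0, 500)  # synthetic reference sample
    test = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 300)])
    # Candidate starting means are supplied by hand in this sketch.
    candidates = [[], [5.0], [-4.0, 5.0]]
    fits = [fit_gmm_with_ref(test, ref, c) for c in candidates]
    scores = [bic(f[3], len(f[1]), test.size) for f in fits]
    best = fits[int(np.argmin(scores))]  # lowest BIC wins
    print(signed_labels(best[1]), np.round(best[1], 2), np.round(best[0], 2))
```

Pinning only the location and spread of class '0' lets the data decide how much of the test sample resembles the reference, which is why its weight is updated in the M-step along with the others.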

g3mclass performs classification in 3 consecutive steps:

How stable are g3mclass's parameter estimates?

Data in the 'ref', 'test' and 'query' samples are noisy. If the same measurements are repeated, the results may differ to varying extents, so the answer to this question depends on the signal-to-noise ratio of the given experiment. To help the user assess parameter stability in a given case, g3mclass implements a resampling technique. On request, the software generates new 'ref' or 'test' samples (or both) by random resampling. For each resample, it refits the GMM as if it were an independent experiment. Summary statistics calculated over the resamples are presented for some key parameters, including class means and thresholds.
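The idea of resampling can be sketched as follows. This is an illustration under stated assumptions, not the g3mclass code: it assumes bootstrap resampling with replacement (the text above says only "random resampling"), and the function bootstrap_summary and its estimator argument are hypothetical names. To keep the example short, the estimator here returns just the mean and standard deviation of a 'ref' sample, the two quantities that define class '0'; a full GMM refit could be plugged in at the same place.

```python
import numpy as np

def bootstrap_summary(sample, estimator, n_resamples=200, seed=0):
    """Re-estimate parameters on bootstrap resamples and summarize their spread.

    `estimator` is any function mapping a 1-D array to an array of parameters.
    """
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    estimates = np.array([
        # Draw a resample of the same size, with replacement, and refit.
        estimator(rng.choice(sample, size=sample.size, replace=True))
        for _ in range(n_resamples)
    ])
    return {
        "mean": estimates.mean(axis=0),
        "sd": estimates.std(axis=0, ddof=1),
        "ci_low": np.percentile(estimates, 2.5, axis=0),
        "ci_high": np.percentile(estimates, 97.5, axis=0),
    }

def ref_mean_sd(x):
    """Parameters that define class '0': mean and SD of the reference sample."""
    return np.array([x.mean(), x.std(ddof=1)])

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    ref = rng.normal(10.0, 2.0, 400)  # synthetic 'ref' sample
    summary = bootstrap_summary(ref, ref_mean_sd)
    for name, values in summary.items():
        print(name, np.round(values, 3))
```

A narrow spread across resamples suggests that the estimate is stable for the given signal-to-noise ratio; a wide spread suggests that the parameter is poorly determined by the data.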