Suppose you want to know whether the analyte values obtained in a laboratory assay from test and reference samples can be assigned to the same or to different classes, and to which class each assayed value belongs. You may also want to know how analyte values obtained in the same laboratory assay, but in an independent set of samples, would be assigned. Let’s see how g3mclass can help classify laboratory assay data.
To classify test and reference data obtained from the same assay, organize a data file in columns with the analyte identification numbers (id) and analyte name(s):
id (ref)    analyte name (ref)    id (test)    analyte name (test)
If the analyte was measured in the same assay but in one or more independent sets of samples, you may add query columns:
id (query1)    analyte name (query1)    id (query2)    analyte name (query2)    etc.
The data file must be saved in TSV (tab-separated values) format in a directory of your choice.
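The column layout above can be produced with a short script instead of a spreadsheet. The following is a minimal sketch, not part of g3mclass: the sample ids, the values, and the helper name `to_tsv` are hypothetical, and the header labels simply follow the layout shown above (the exact header syntax the software expects is described in Input data format).

```python
import csv
import io

# Hypothetical measurements: (id, value) pairs for each sample group.
ref = [("R1", 5.1), ("R2", 4.8)]
test = [("T1", 5.3), ("T2", 19.6), ("T3", 42.0)]

def to_tsv(groups, analyte):
    """Lay out one id column and one value column per group.

    Groups of unequal length are padded with empty cells.
    """
    header = []
    for label in groups:
        header += [f"id ({label})", f"{analyte} ({label})"]
    n_rows = max(len(rows) for rows in groups.values())
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for i in range(n_rows):
        row = []
        for rows in groups.values():
            row += list(rows[i]) if i < len(rows) else ["", ""]
        writer.writerow(row)
    return out.getvalue()

tsv_text = to_tsv({"ref": ref, "test": test}, "ERBB2")
# To save it, write tsv_text to a file in the directory of your choice.
```

Query columns could be added in the same way by passing further groups (for example a `"query1"` entry) to the helper.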
In the example below, we analyze the expression of ERBB2, a gene encoding a known tumor marker, human epidermal growth factor receptor 2 (HER2). The ERBB2 mRNA data were collected from test (disease), reference (non-diseased), and query (suspected disease) samples. The data file, named ‘ERBB2 ID ref, test, qry.txt’, was saved as a tab-delimited text (.txt) file using Microsoft Excel (for more information, see Input data format).
Open g3mclass software. The Welcome page with several tabs will launch:
Click the ‘Data’ tab (next to the ‘Welcome’ tab) to select and open the data file you intend to analyze. Alternatively, a data file may be opened from the menu: File > Open.
To start modeling test data, go to the ‘Parameters’ tab. You may first use the ‘Defaults’ modeling settings (recommended) by clicking the ‘Learn model’ button. The software learns a set of Gaussian mixture models (GMMs) on histograms with a varying number of bins (vector of k: 10, 15, 20, 25, 30, 35, 40) and selects the model with the lowest Bayesian information criterion (BIC) value.
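The selection rule can be illustrated with the textbook definition of BIC. This is a sketch under stated assumptions rather than the g3mclass implementation: how the software computes the likelihood and counts free parameters is not described here, and the log-likelihoods, parameter counts, and sample size below are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Textbook BIC: n_params * ln(n_obs) - 2 * log-likelihood (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def select_by_bic(scores):
    """Return the (label, BIC) pair with the lowest BIC."""
    return min(scores.items(), key=lambda item: item[1])

# Hypothetical fits: number of bins k -> (log-likelihood, free parameters).
fits = {10: (-820.0, 6), 20: (-800.0, 9), 30: (-799.0, 12)}
n_obs = 200  # hypothetical number of test values

scores = {k: bic(ll, p, n_obs) for k, (ll, p) in fits.items()}
best_k, best_bic = select_by_bic(scores)
print(f"selected k={best_k}, BIC={best_bic:.3f}")
```

In this made-up example the k=30 model fits slightly better than k=20, but its three extra parameters cost more than the gain in likelihood, so its BIC is higher and k=20 is selected.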
‘Parameters’
The parameters and plots of the selected model appear automatically in the ‘Model’ and ‘Model Plots’ tabs, respectively.
‘Model’
‘Model plots’
Based on the GMM learned with default parameters, the ERBB2 mRNA test values fit a 4-class GMM with the following estimated weights per class: 0 (46%), 1 (38%), 2 (14%), and 3 (2%). Class 0 resembles the reference in mean value and standard deviation; classes 1, 2, and 3 have higher mean ERBB2 mRNA values than the reference.
To modify a test model, you may customize the modeling parameters using the interactive sliders in the ‘Parameters’ tab and then click the ‘Learn model’ button again to learn a new model. You may always return to the default settings by clicking the ‘Defaults’ button (for more information, see Parameters).
For example, when a fixed number of histogram bins is chosen (k=20), the software learns a new model with a higher BIC value than the default model (1660.772 vs. 1650.558). In this new model, the ERBB2 mRNA test values fit a 5-class GMM with the following weights per class: 0 (43%), 1 (36%), 2 (17%), 3 (2%), and 4 (2%). Class 0 is similar to the reference (as above); classes 1, 2, 3, and 4 have higher mean ERBB2 mRNA values than the reference. This model allows a more detailed classification of the ERBB2 mRNA test values and may be preferred in some cases.
‘Parameters’
‘Model’
‘Model plots’
Once a satisfactory model is selected, the software uses it to classify the test, reference, and query values automatically, applying three consecutive classifications (proba, cutoff, s.cutoff). The results appear in the ‘Test class’, ‘Ref class’, and ‘Query class’ tabs. For each individual value (with its id), these outputs report the class to which it belongs with the highest probability (‘proba’), the class assigned by multi-cutoff (‘cutoff’), and the class assigned by a more stringent multi-cutoff (‘s.cutoff’). Summary statistics are provided for each classification.
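The idea behind the ‘proba’ classification, assigning each value to the class with the highest probability under the learned mixture, can be sketched as follows. This is an illustration only, not the g3mclass code: the 3-class model parameters and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posteriors(x, components):
    """Map each class label to P(class | x).

    components: {class_label: (weight, mean, sd)}
    """
    joint = {c: w * normal_pdf(x, m, s) for c, (w, m, s) in components.items()}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

def classify_proba(x, components):
    """Return the class label with the highest posterior probability."""
    post = posteriors(x, components)
    return max(post, key=post.get)

# Hypothetical 3-class model: class -> (weight, mean, sd).
gmm = {0: (0.5, 5.0, 1.0), 1: (0.4, 10.0, 2.0), 2: (0.1, 20.0, 3.0)}

for value in (5.0, 11.0, 22.0):
    print(value, "->", classify_proba(value, gmm))
```

Because the posterior weighs each component by both its weight and its density at the value, a wide component can win far from its own mean. The cutoff-based classifications are the software's safeguard against that effect.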
In this example, under the default modeling settings, the ERBB2 mRNA values are assigned to the following classes, with the corresponding proportions per class for each type of classification:
The heatmaps illustrating the classification of individual values (with id) can be found in the ‘Heatmaps’ tab. By default, this tab shows heatmaps for the classification with a stringent cutoff. However, you may view heatmaps for all three classifications by checking the corresponding boxes and clicking the ‘Draw heatmaps’ button in the ‘Parameters’ tab (scroll all the way down the page to view them).
The following heatmaps show the class to which each ERBB2 mRNA reference value is assigned by each classification. The software also builds heatmaps for the test and query samples (not shown here).
To customize the color palette of heatmaps and plots, use the plot settings controls, also located in the ‘Parameters’ tab (for more information, see Parameters).
Once classifications are completed, you may save the original data file and all the results from the menu tab File > Save. All files are archived in a zipped folder in the directory of the original data file.
Note that under File > Save parameters, you may save all the parameters used for the current analysis. If you wish to reuse them with a new data file, you may restore them automatically by clicking File > Open parameters.
Now, suppose you want to know how the values of multiple analytes measured in the same assay in test and reference samples are classified, and to which class each assayed value belongs. The software does not limit the number of analytes to be processed; however, larger files increase processing time, and drawing the heatmaps may become technically challenging.
In the example below, we consider 3 analytes. Let’s say one of the analytes of interest is the ERBB2 gene (the same as in EXAMPLE 1) and the two others are candidate gene markers: gene A and gene B. The mRNA expression levels of all three genes were measured in test (disease) and reference (healthy control) samples using a molecular assay with multiplexing capabilities. Let’s see how g3mclass can help classify the three genes in each sample.
In this example, we save a data file named ‘ERBB2, gene A, gene B, ID ref, test.txt’ as a tab-delimited text (.txt) file using Microsoft Excel (for more information, see Input data format).
After this data file is opened, the software learns a model for each analyte and draws each model plot (see below). Under default settings, the ERBB2 mRNA test values fit a 4-class GMM with the following weights per class: 0 (46%), 1 (38%), 2 (14%), and 3 (2%). The GMM fitted to the gene A mRNA test values has 4 Gaussian components with the following estimated weights per class: -1 (39%), 0 (27%), 1 (29%), and 2 (6%). The GMM fitted to the gene B mRNA test values has 2 classes: 0 (34%) and 1 (66%).
‘Model’
‘Model Plots’
Assuming these models are satisfactory, the software uses them to classify individual test and reference values automatically with the three classifications (proba, cutoff, s.cutoff). The classification results for all analytes appear in the ‘Test class’ tab (examples shown below) and the ‘Ref class’ tab (examples not shown here).
Under default settings, the class outputs with proportions per class can be summarized as follows:
The heatmaps illustrate all three classifications for the ERBB2, gene A and gene B values in reference (above) and test (below).
Some analyte values may initially be assigned incorrectly by the ‘proba’ classification. This may occur when a GMM component has a wide dispersion, so that its tails pick up values that otherwise belong to a different class. The software calculates cutoffs to autocorrect the classification of such values.
Let’s see how it works for gene B in EXAMPLE 2.
As we have seen above, the GMM for gene B has two components, class ‘0’ and class ‘1’, with the latter having a wide dispersion. As a result, some low values of gene B are assigned to class ‘1’ with high probability.
To control for potential misclassification, the software calculates one or more cutoff values. These include a minimal misclassification cutoff, computed with equal weights for the adjacent classes, and a tolerable interval of the misclassification error rate around it (a tradeoff between one type of misclassification and the other). The values are recorded in the ‘Model’ tab. For the 2-class GMM of gene B, the software calculates one minimal misclassification positive cutoff between class ‘0’ and class ‘1’ (up-1=11.78), as well as the left and right interval values for this cutoff (up-left=10.376; up-right=13.757).
Following the ‘proba’ classification, the software performs the ‘cutoff’ classification using the ‘up-1’ cutoff and then the ‘s.cutoff’ classification using the ‘up-right’ cutoff, which maximizes the specificity of detecting gene B test values in class ‘1’ that are dissimilar to those in class ‘0’. The example below shows that the value -95.729, initially assigned to class ‘1’ by ‘proba’, is subsequently re-assigned to class ‘0’ by both ‘cutoff’ and ‘s.cutoff’.
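The two cutoff rules for gene B can be sketched as a simple threshold comparison. This is a simplified illustration under assumptions, not the software's implementation: the cutoff values are those quoted above, but the strict inequality, the function name, and the test values 12.5 and 20.0 are hypothetical, and the way g3mclass labels values that fall between ‘up-1’ and ‘up-right’ may differ.

```python
UP_1 = 11.78       # minimal misclassification cutoff, from the 'Model' tab
UP_RIGHT = 13.757  # right interval value, used by the stringent cutoff

def classify_by_cutoff(x, cutoff):
    """Two-class rule: class 1 only if the value exceeds the cutoff, else class 0."""
    return 1 if x > cutoff else 0

# -95.729 is the value discussed above; 12.5 and 20.0 are hypothetical.
for value in (-95.729, 12.5, 20.0):
    by_cutoff = classify_by_cutoff(value, UP_1)
    by_s_cutoff = classify_by_cutoff(value, UP_RIGHT)
    print(f"{value}: cutoff -> {by_cutoff}, s.cutoff -> {by_s_cutoff}")
```

Under this rule a very low value such as -95.729 lands in class ‘0’ no matter how wide the class ‘1’ component is, which is the correction described above. A value between the two cutoffs (12.5 here) is class ‘1’ under ‘cutoff’ but not under the more stringent ‘s.cutoff’.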
g3mclass was developed to help address the uncertainty in classifying test samples that might otherwise be misclassified or classified as ‘equivocal’ (a.k.a. ‘grey zone’). The software expects at least two kinds of entries for each analyte: reference (ref) and test (test).
g3mclass assumes that the probability density function of the test sample:
can be represented by a finite number of Gaussian distributions,
can incorporate the probability density function of the reference sample, which in turn is described by a single Gaussian distribution.
In g3mclass, the test sample is used to learn a semi-constrained Gaussian mixture model (GMM). In this model, the position (i.e., the mean value) and the spread (SD value) of the class numbered ‘0’ are constrained to be equal to the reference mean and SD values for the given analyte. However, the weight, or proportion, of this constrained class is an adjustable parameter. If the weight of the semi-constrained class ‘0’ in the test GMM is close to zero, the ‘proba’ classification of ref samples will be invalid. The ‘cutoff’ classification results, however, can be valid for both test and ref samples, depending on the degree of overlap between those samples: the lower the overlap, the higher the validity of the classification results.
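The semi-constrained model and the effect of a near-zero class ‘0’ weight can be illustrated with a small sketch. It is not the g3mclass fitting code: it only evaluates a mixture whose class ‘0’ is pinned to the reference mean and SD, all numeric values and function names are hypothetical, and the use of the sample SD (n-1 denominator) is an assumption.

```python
import math
import statistics

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def class0_joint(x, ref_values, w0):
    """Weighted density of class 0, whose mean and SD are fixed by the reference."""
    ref_mean = statistics.mean(ref_values)
    ref_sd = statistics.stdev(ref_values)  # assumption: sample SD
    return w0 * normal_pdf(x, ref_mean, ref_sd)

def semi_constrained_density(x, ref_values, w0, free_components):
    """Mixture density; free_components is a list of (weight, mean, sd)."""
    total_weight = w0 + sum(w for w, _, _ in free_components)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    free = sum(w * normal_pdf(x, m, s) for w, m, s in free_components)
    return class0_joint(x, ref_values, w0) + free

def class0_posterior(x, ref_values, w0, free_components):
    """P(class 0 | x) under the semi-constrained mixture."""
    density = semi_constrained_density(x, ref_values, w0, free_components)
    return class0_joint(x, ref_values, w0) / density

ref_values = [4.0, 5.0, 6.0]  # hypothetical reference sample: mean 5, SD 1

# Class 0 carries half of the weight: a value at the reference mean is class 0.
print(class0_posterior(5.0, ref_values, 0.5, [(0.5, 15.0, 2.0)]))
# Class 0 weight close to zero: even the reference mean is not assigned to class 0.
print(class0_posterior(5.0, ref_values, 0.001, [(0.999, 3.0, 2.0)]))
```

In the second case the posterior of class ‘0’ stays small even at the reference mean itself, which is why the ‘proba’ classification of ref samples cannot be trusted when the class ‘0’ weight is close to zero.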
Let's review the following example.
The model for gene Z has two components: class ‘0’ and class ‘-1’. The weight of class ‘0’ is close to 0%, whereas the weight of class ‘-1’ is near 100%.
The model plot shows that the test model is virtually a single Gaussian rather than a mixture of Gaussians. Because the weight of class ‘0’ is negligible, the g3mclass ‘proba’ classifications are expected to be invalid for the ref sample.