The taste of chemical compounds present in food stimulates us to take in nutrients and avoid poisons.
Many active ingredients in drugs taste bitter and are therefore aversive to children as well as to many adults.
The bitterness of medicines presents compliance problems, and early flagging of the potential bitterness of a drug candidate may help its further development.
Similarly, predicting both the sweetness and the bitterness of a compound is of great interest to the food industry.
In this work, we have built 24 different machine learning models to predict three different taste endpoints: bitterness, sweetness, and sourness of compounds.
The methods used in this work were previously published (BitterSweetForest 2018).
The constructed model yielded an accuracy of 95% and an AUC of 0.98 in cross-validation.
In an independent test set, BitterSweetForest achieved an accuracy of 96% and an AUC of 0.98 for bitter and sweet taste prediction.
To the best of our knowledge, VirtualTaste is the first freely available web-based platform for the prediction of organoleptic properties of compounds.
We hope this platform will be a useful resource for basic taste chemistry research as well as for the discovery of new sweet and bitter tasting compounds in industry, enabling researchers to appropriately adjust or reinvent research and development strategies.
It is freely available for academic and non-commercial users.
To start a taste prediction, please go to the VirtualTaste Prediction page. Here, you can draw your input compound, paste the content of a molfile in text form, or search for a compound name online:
To draw a chemical structure, use the buttons in the second row (as shown above). You can change atom types by clicking on the arrow next to "C", change bond types, or draw ring structures.
To open a molfile, please click on the yellow folder button in the first row (as shown above). You can paste the contents of a molfile (text form) here. You can also search for a known compound online. To do that, click on the binoculars icon in the first row of buttons (as shown above). You can search for a compound name in the PubChem database.
An example compound, Aspartame, is suggested. To use the example compound, simply type the name and click on name search. To clear the drawing area, press the button with the blue bottle in the first row of buttons.
Once you have drawn or inserted an input molecule (either name or SMILES), you can select individual models or all of the models (Bitter, Sweet and Sour) for prediction. Future versions of VirtualTaste will also allow the selection of different molecular fingerprints; the current version uses the same molecular fingerprint for all predictions. After making your selections, you can start the taste prediction by clicking on the Submit button.
Please note that the prediction of multiple models for a single compound can be time-consuming.
An estimation of the calculation time is given at the top of the prediction page.
On the result page, information on the input compound is displayed first, including name or SMILES (as chosen initially by the user), molecular weight, hydrogen bond acceptors/donors, number of atoms, number of bonds, number of rings and molecular polar surface area (as shown in the picture below).
The second part shows the three compounds from the training data set that are most similar to the input compound, together with their activity class for each organoleptic property. A similarity threshold of 60% is applied so that only the most relevant compounds are shown.
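The similar-compound lookup described above can be sketched as follows. This is a minimal illustration, not the server's implementation: it assumes that similarity is measured with the Tanimoto coefficient on binary fingerprints (the measure named elsewhere in this documentation), and the function names, compound identifiers and toy fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def most_similar(query_fp, training_fps, threshold=0.6, top_n=3):
    """Return up to top_n (compound_id, similarity) pairs at or above the threshold."""
    scored = [(cid, tanimoto(query_fp, fp)) for cid, fp in training_fps.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy fingerprints (sets of on-bit positions), not real training data.
training = {
    "cmpd_1": {1, 2, 3, 4},
    "cmpd_2": {1, 2, 3, 4, 5},
    "cmpd_3": {1, 2, 9},
    "cmpd_4": {1, 2, 3, 5},
}
query = {1, 2, 3, 4, 5}

for compound_id, score in most_similar(query, training):
    print(compound_id, round(score, 2))
```

In this toy example, cmpd_3 shares too few bits with the query and is filtered out by the 0.6 threshold, so only the remaining three compounds are reported.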
In the third section, the prediction results (predicted class, with the respective class probability etc.) are given in a tabular format. The colour indicates the strength of the probability for the respective class: red (strong) and pink (low) for the active class, and dark green (strong) and light green (low) for the inactive class, as indicated in the picture below.
A radar plot (example below) is provided to compare the predicted probabilities of the input compound with the average probability of the active compounds in the training set for each taste model.
The plot can be accessed by clicking the thumbnail that appears on the page once the computation is complete; this opens the chart in a new tab. The profile of the input compound is shown as orange lines/dots, which represent the predicted probabilities of the input compound for the respective taste models.
The data displayed as blue dots/lines are the average probabilities of the active class, computed from the training set data for each model (see model info).
The previous picture shows the radar chart for the example compound for which only one endpoint (sweet) was predicted to be active.
If multiple taste endpoints are predicted to be active for an input compound, a radar chart might look like this:
This chart helps the user to understand how strong the overall prediction for the input compound is, considering its activity across multiple model endpoints.
The last section on the result page is the target prediction, which only appears when the compound is predicted to be bitter. It uses pair-wise ligand similarity to predict which bitter receptors the input compound is likely to bind to. The resulting table displays the predicted receptor and the similarity score to a compound that is known to interact with this receptor.
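As a rough illustration of target prediction by pair-wise ligand similarity, the sketch below reports, for each receptor, the highest similarity between the query and any ligand known to interact with it. The Tanimoto measure, the `min_similarity` cut-off, the receptor labels and the fingerprints are all assumptions made for this example; the cut-off used by the server is not stated here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def predict_receptors(query_fp, known_ligands, min_similarity=0.5):
    """Rank receptors by the best similarity of the query to any known ligand.

    min_similarity is an arbitrary placeholder value for this sketch.
    """
    results = []
    for receptor, ligand_fps in known_ligands.items():
        best = max((tanimoto(query_fp, fp) for fp in ligand_fps), default=0.0)
        if best >= min_similarity:
            results.append((receptor, best))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical receptor labels and ligand fingerprints, for illustration only.
known_ligands = {
    "receptor_A": [{1, 2, 3}, {7, 8}],
    "receptor_B": [{1, 9}],
}
print(predict_receptors({1, 2, 3, 4}, known_ligands))
```

Each row of the returned list corresponds to one line of the result table: the receptor and the similarity score to its closest known ligand.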
The information about the input compound as well as the results for the similar compounds, the taste activity prediction and the target prediction can be downloaded separately as CSV files by clicking on the download button in the desired section.
Taste prediction is of specific interest when it comes to drugs. On the Drug Taste page, a number of pre-predicted approved drugs can be browsed. The table displays the DrugBank ID, the name of the drug, the taste it was predicted to have (S for Sweet and B for Bitter) and the confidence of that prediction. The table can be sorted by each of these parameters in ascending or descending order by clicking the sorting symbols in the table header. Clicking on the DrugBank ID of a specific drug opens the DrugBank entry for this compound, where more information can be found.
Furthermore, a set of natural compounds available in the Supernatural II database can be accessed through the Natural Compounds page. Similar to the Drug Taste table, this table gives an overview of a number of pre-predicted compounds and can be sorted by the individual columns, which include the Supernatural ID, the predicted taste and the prediction confidence. Additionally, the SMILES for each chemical is available. Clicking on the Supernatural ID of a specific compound links to the corresponding Supernatural II entry, where more information on the compound is available.
To see a 2D image of a specific natural compound, hover over the corresponding SMILES string to see it pop up.
For advanced users, data can be queried using a simple POST interface with a suitable language of your choice. Below, a short introduction and sample code in Python (Version 3.6) are provided.
Please note that for single queries, the script is slower than the website, as it is set up to give several users a chance to queue their requests. The more models you request, the longer the query intervals become due to computation time.
A source IP is allowed a maximum of 250 API queries a day.
You can download this script to your local computer and use it, or write your own with the script as a reference: VirtualTaste API Script
To run the script, you need to install Python on your system and open a command line (either via cmd on Windows, or by opening a terminal on Linux or macOS). If you are using the script we provided, make sure the following Python packages are available: requests, time, argparse, json, sys, urllib. Of these, only requests is not part of the Python standard library and has to be installed separately.
The interface allows you to query by name (fulfilled via PubChem search) or canonical SMILES string. As a minimum, you need only enter one or more such identifiers (separated by commas).
If you prefer no status output other than errors, use the -q command line switch.
Additional data can be supplied using command line switches, from specifying the input type (if you want to input canonical SMILES) to selecting the models (you can see a full list of models either in the model information, or in the script itself, in the ALL_MODELS declaration).
By default, the API returns data as machine-readable comma-separated values (CSV) files.
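When writing your own client, two of the constraints above can be handled before any request is sent: splitting the comma-separated identifiers and staying within the limit of 250 queries per day. The sketch below is a hypothetical client-side helper and is not part of the provided script. It assumes that each identifier counts as one query, and it does not contact the server, so no endpoint or parameter names are implied.

```python
import datetime

DAILY_LIMIT = 250  # stated above: maximum API queries per source IP per day


def parse_identifiers(raw):
    """Split a comma-separated string of names or SMILES into a clean list."""
    # Assumption: identifiers themselves contain no commas.
    return [part.strip() for part in raw.split(",") if part.strip()]


class DailyQuota:
    """Hypothetical client-side counter; the server enforces the real limit."""

    def __init__(self, limit=DAILY_LIMIT):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_consume(self, count=1, today=None):
        """Reserve `count` queries for today; return False if the limit would be exceeded."""
        today = today or datetime.date.today()
        if today != self.day:  # first call, or a new day: reset the counter
            self.day, self.used = today, 0
        if self.used + count > self.limit:
            return False
        self.used += count
        return True


identifiers = parse_identifiers("aspartame, caffeine ,sucrose")
quota = DailyQuota()
if quota.try_consume(len(identifiers)):
    print("ready to submit:", identifiers)
```

Checking the quota locally avoids sending requests that the server would reject anyway; the counter here lives only in memory, so a real client would need to persist it between runs.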
In this study, we have used previously published cheminformatics-based machine learning models to predict chemical activities.
The current models are based on previously published and highly cited machine learning methods (BitterSweetForest and ProTox-II).
For more details on the methodology, please refer to our published work.
The following data sampling methods were used in this study to handle the imbalanced datasets:
1) No Sampling: All the data were used without any manipulation (the so-called ‘original dataset’).
2) Random Under Sampling (RandUS): The data points from the majority class are removed randomly.
3) Augmented Random Under Sampling (AugRandomUS): Plain random under sampling removes instances from the data set purely at random. In this modified version, the randomness is reduced by using a specifically calculated fingerprint, called most common features (MCF), that incorporates the common features of the data set. The features in this fingerprint are derived from MACCS fingerprints and Morgan fingerprints, respectively. To produce this fingerprint, the overall average frequency of all the features in the majority class is computed. Then, for each bit position of the fingerprint, the relative frequency of ones in the complete data set is computed. If the relative frequency of a bit position is higher than the average frequency, the bit position and its frequency are saved. Next, the average number of features per fingerprint in the majority class determines the number of features in the MCF fingerprint, and the features themselves are the saved features with the highest relative frequencies. An iteration is then performed until the majority data set is reduced to the size of the minority data set. In each step, a number of samples that are most similar to the MCF fingerprint are collected in a list. A number of instances are then randomly chosen from the list and removed from the data set. Thereafter, a new MCF fingerprint is computed and the iteration continues. Because the samples most similar to the MCF fingerprint are removed, the loss of variance in the majority set is reduced. In addition, the loss of information is reduced by removing only a limited number of samples per calculated MCF fingerprint.
4) Random Over Sampling (RandOS): Data points from the minority class are randomly chosen and added to the existing minority class.
5) Augmented Random Over Sampling (AugRandOS): This method follows the same principle as the augmented random under sampling described above. The only difference is that, in each iteration step, a list of the samples most dissimilar to the MCF fingerprint is created. A part of the list is chosen randomly, duplicated and added to the original data set. Since the samples most dissimilar to the MCF fingerprint are duplicated, the loss of variance is relatively low. Both steps are repeated until the minority class consists of as many samples as the majority class.
6) K-Medoids Under Sampling (kMedoids1): K-medoids is a clustering algorithm that is used to under sample the original majority class. A medoid is itself an instance of the majority class, used as a cluster center, that has the minimum average dissimilarity to all majority data points in its cluster. The number of medoids is equal to the number of minority class instances. A sample is assigned to the cluster whose center it is most similar to, based on the Tanimoto coefficient (Willett, 2003). For each of the medoids, the sum of the similarities between itself and all samples belonging to its cluster is calculated. The algorithm tries to maximize the combination of these sums iteratively. The iteration is limited to 100 steps; in each step, new medoids are randomly chosen and the overall sum of Tanimoto similarities is calculated. The set of medoids producing the highest sum is used as the under sampled majority class. By clustering by similarity, this approach creates a subset in which each individual data point represents a group of structurally related molecules, in turn reducing the information lost by under sampling.
7) K-Medoids Under Sampling (kMedoids2): Similarly to kMedoids1, this method starts by randomly choosing n samples as medoids, where n is equal to the number of data points in the minority class. For each of the chosen medoids, a total of 30 iterations is performed. In each iterative step, a medoid is exchanged with a random majority class sample, new clusters are computed and the cost is calculated using the Tanimoto coefficient. The final set of medoids is chosen based on the maximum sum of similarities.
8) Synthetic Minority Over-Sampling Technique using Tanimoto Coefficient (SMOTETC): The SMOTE method creates synthetic samples of the minority class to balance the overall data set. Depending on the amount of oversampling, a number of samples of the minority class are chosen. For each of those, the k-nearest neighbors are identified, using the Tanimoto coefficient as similarity measure (Willett, 2003). The feature values of the new synthetic data points are set to the value occurring in the majority of the chosen sample and two of its k-nearest neighbors.
9) Synthetic Minority Over-Sampling Technique using Value Difference Metric (SMOTEVDM): This method is also based on SMOTE, but the k-nearest neighbors are chosen using the Value Difference Metric (VDM) as similarity measure. The VDM defines the distance between analogous feature values over all input feature vectors. More detailed information on the algorithm for computing VDM can be found here (Sugimura et al., 2008).
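Three of the sampling methods listed above (RandUS, RandOS and the majority vote used in SMOTETC) are simple enough to sketch in a few lines. The code below is an illustrative reading of those descriptions under stated assumptions, not the published implementation: fingerprints are represented as sets of on-bit indices, and the function names, the default `k` and the toy data are hypothetical.

```python
import random


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def random_under_sample(majority, minority, seed=0):
    """RandUS: keep a random subset of the majority class, as large as the minority class."""
    return random.Random(seed).sample(majority, len(minority))


def random_over_sample(majority, minority, seed=0):
    """RandOS: add randomly chosen minority duplicates until both classes are equal in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra


def smote_tc_sample(sample, minority, k=5, seed=0):
    """SMOTETC-style synthetic fingerprint: each bit is set by majority vote over
    the chosen sample and two of its k nearest neighbours (Tanimoto similarity).
    Requires at least two other minority samples."""
    rng = random.Random(seed)
    others = [fp for fp in minority if fp is not sample]
    neighbours = sorted(others, key=lambda fp: tanimoto(sample, fp), reverse=True)[:k]
    first, second = rng.sample(neighbours, 2)
    candidates = sample | first | second
    return {bit for bit in candidates
            if (bit in sample) + (bit in first) + (bit in second) >= 2}


# Hypothetical toy fingerprints, not real training data.
majority = [{1, 2}, {1, 3}, {2, 3}, {1, 2, 3}, {4}]
minority = [{1, 2, 3}, {1, 2, 4}, {2, 3, 5}]

print(len(random_under_sample(majority, minority)))  # 3, the size of the minority class
print(len(random_over_sample(majority, minority)))   # 5, the size of the majority class
print(smote_tc_sample(minority[0], minority, k=2))
```

The augmented and k-medoids variants follow the same input and output shape but replace the purely random choice with the MCF-guided or cluster-based selection described in the list.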
The VirtualTaste web server implements three different in silico models for the prediction of sweet, bitter and sour compounds.
1) VirtualSweet: The VirtualSweet classifier, based on the Random Forest algorithm, predicts the sweetness of the input compound.
The discovery of sweeteners using human taste panels or cell-based high-throughput screening is an expensive, laborious and extremely time-consuming process.
Using computational methods for sweetness prediction can provide a good alternative and enable rapid identification of sweet tasting compounds.
2) VirtualBitter: The VirtualBitter classifier, based on the Random Forest algorithm, predicts whether the given input compound is bitter or not. The bitter taste of medicines presents a major compliance problem for paediatric drugs.
Additionally, for the food industry, knowing the bitterness of certain substances present in food components is important. Thus, efficient computational models for the prediction and masking of bitterness of active substances in both drugs and food are an important consideration for pharmaceutical and food chemistry research and industry.
3) VirtualSour: Sour taste is influenced by pH and the acids present in foods. Here, a data-driven, ligand-based machine learning method is employed to distinguish sour from non-sour compounds.
The applicability domain of each model was assessed on the criterion that the confidence in the correctness of predictions should be greater inside the domain than outside it. Using two different methods, a domain was defined and used to assess whether the majority of correct positive predictions for each model fell within the domain and a large portion of the incorrect positive predictions fell outside it.
It was observed that compounds predicted as active and found within the applicability domain are more likely to be correctly predicted than those found outside of the domain. However, this was not the case for compounds classified as inactive.
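One way to make this criterion concrete is to compare the fraction of correct positive predictions inside and outside the domain. The sketch below is only an illustration of that comparison; the record layout, field names and toy values are hypothetical and do not reproduce the evaluation data of this study.

```python
def positive_precision(records):
    """Fraction of positive (active) predictions that were correct, or None if there are none."""
    positives = [r for r in records if r["predicted_active"]]
    if not positives:
        return None
    return sum(r["truly_active"] for r in positives) / len(positives)


def compare_domain(records):
    """Return (precision inside the domain, precision outside the domain)."""
    inside = [r for r in records if r["in_domain"]]
    outside = [r for r in records if not r["in_domain"]]
    return positive_precision(inside), positive_precision(outside)


# Hypothetical toy predictions, for illustration only.
records = [
    {"in_domain": True, "predicted_active": True, "truly_active": True},
    {"in_domain": True, "predicted_active": True, "truly_active": True},
    {"in_domain": True, "predicted_active": True, "truly_active": False},
    {"in_domain": False, "predicted_active": True, "truly_active": True},
    {"in_domain": False, "predicted_active": True, "truly_active": False},
    {"in_domain": False, "predicted_active": False, "truly_active": False},
]

inside, outside = compare_domain(records)
print(f"inside: {inside:.2f}, outside: {outside:.2f}")
```

A domain definition satisfies the criterion when the first value is clearly higher than the second.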
The methods applied for defining the applicability domain of the VirtualTaste models are discussed below:
1) Fragment based sampling method: Using the fragment-based sampling method, the active and inactive properties of the chemicals were mapped onto the fragment space. The fragment space allowed a straightforward similarity measure between fragments of the same class.
Often an entire molecule may not be responsible for the activity, but a local feature (such as a substructure or a fragment) in a molecule may be responsible for the desired response. Chemical fragments are local parts of chemical structures, representing molecular features useful in the modelling of biological or physicochemical properties of chemicals.
Fragment propensities were mainly used to detect meaningful features and to capture the continuous relationships that exist between fragments of the same class.
Fragments offer an intuitive interpretation of the model performance and are easy to generate and handle. Both the training and test sets were fragmented using the RECAP and ROTBONDS fragmentation methods.
The fragments from the training set were then used in a substructure search to check whether the test set contains the same fragments. A test compound was considered within the applicability domain of the model only when the fragments from the test compound were found in the training set and vice versa.
2) Structural similarity method: Using the structural similarity method, a global search determines whether the query compound is similar to the compounds in the training set (class specific), in which case the prediction is considered to be within the domain of applicability.
This is based on the hypothesis that 'structurally similar compounds tend to exhibit similar activity'. Structural similarities between the test and training set compounds were compared using the Tanimoto score. A test compound was considered within the applicability domain if it showed a similarity value of 0.6 or above to one or more of the training set compounds (see the section 'how to interpret the results').
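The fragment-based check from the first method can be illustrated with plain set operations. This sketch makes several simplifying assumptions: fragments are assumed to be generated beforehand (for example with RECAP) and are represented as strings, exact string matching stands in for the substructure search, and a compound is treated as inside the domain only if all of its fragments occur in the training set. The function name and the fragment strings are hypothetical.

```python
def in_fragment_domain(compound_fragments, training_fragments):
    """True if the compound has fragments and every one of them occurs in the training set."""
    return bool(compound_fragments) and compound_fragments <= training_fragments


# Hypothetical fragment strings, for illustration only.
training_fragments = {"c1ccccc1", "C(=O)O", "CN"}

print(in_fragment_domain({"c1ccccc1", "CN"}, training_fragments))  # True
print(in_fragment_domain({"c1ccccc1", "CS"}, training_fragments))  # False
```

Requiring all fragments to match is the strictest reading of the rule; a looser variant could accept a compound when a chosen share of its fragments is covered.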
To cite this web server:
Fritz F, Preissner R, Banerjee P. VirtualTaste: a web server for the prediction of organoleptic properties of chemical compounds. Nucleic Acids Res. 2021 Apr 27:gkab292. doi: 10.1093/nar/gkab292. PMID: 33905509.
Banerjee P, Preissner R. BitterSweetForest: A Random Forest Based Binary Classifier to Predict Bitterness and Sweetness of Chemical Compounds. Front Chem. 2018 Apr 11;6:93. doi: 10.3389/fchem.2018.00093. PMID: 29696137; PMCID: PMC5905275.