Cytochrome P450 (CYP) enzyme-mediated drug metabolism influences drug pharmacokinetics and can result in adverse outcomes in patients through drug-drug interactions (DDIs). The primary causes of DDIs among drugs that undergo CYP metabolism are CYP inhibition and induction.
The SuperCYPsPred web server includes machine learning models based on the Random Forest algorithm and different types of data sampling methods.
The models presented here discriminate between inhibitors and non-inhibitors for five major cytochrome (CYP) isoforms. The statistical significance of the predictive models was assessed by 10-fold fragment-based cluster cross-validation on the training sets, and their predictive ability was evaluated on external test sets.
The computational models focus on the first step of the safety assessment. We hope that the understanding of DDIs enabled by the SuperCYPsPred web server will help to appropriately adjust or reinvent research and development strategies.
The web server is freely available for academic and non-commercial users.
For advanced users, data can be queried using a simple POST interface with a suitable language of your choice. Below, a short introduction and sample code in Python (version 3.6) are provided.
Please note that for single queries, the script is slower than the website, as it is set to allow several users a chance to queue their requests. The more models you require, the longer the query intervals take due to computation time.
A source IP is allowed a maximum of 250 API queries a day.
You can download this script to your local computer and use it, or write your own with the script as a reference: SuperCYPsPred API Script
To run the script, you need to install Python on your system and open a command line (either via cmd on Windows, or by opening a terminal on Linux or macOS).
The interface allows you to query by name (fulfilled via PubChem search) or by canonical SMILES string. As a minimum, you need only enter one or more such identifiers (separated by commas).
If you prefer no status outputs save errors, use the -q command line switch.
Additional options can be supplied using command line switches, from specifying the input type (if you want to input canonical SMILES) to selecting the models (you can see a full list of models either in the model information or in the script itself, in the ALL_MODELS declaration).
The API by default returns data in the form of JSON strings. These objects can easily be unpacked in most languages, and allow for convenient transfer of nested arrays. The following hierarchical components are provided as keys in each response written to your outfile:
id : The request id that was used to retrieve this dataset, marking each individual request
name : if using name input type, the compound name requested, otherwise empty string
smiles : if using canonical SMILES input type, the input SMILES, otherwise empty string
cyp_models : If selected, data for all computable models with name, prediction and prediction confidence
[model name] : Shorthand name of the model
Prediction : Boolean value if activity or inactivity is predicted (1=Active, 0=Inactive)
Probability : Float value from 0 to 1 giving confidence of the above result
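As an illustration of how such a response might be unpacked, here is a minimal Python sketch. The sample JSON is hand-written for this example and is not real server output: the request id, the model names (CYP3A4_MACCS, CYP2D6_MACCS), the probability values and the assumption that cyp_models is an object keyed by model name are all hypothetical, so check them against an actual response before relying on this.

```python
import json

# Hypothetical response, hand-written for illustration only. The id, the
# model names and all values are assumptions, not real server output.
SAMPLE = """
{
  "id": "example-request-1",
  "name": "sertraline",
  "smiles": "",
  "cyp_models": {
    "CYP3A4_MACCS": {"Prediction": 1, "Probability": 0.71},
    "CYP2D6_MACCS": {"Prediction": 0, "Probability": 0.64}
  }
}
"""


def summarize(response_text):
    """Return a list of (model, label, probability) tuples for one response."""
    record = json.loads(response_text)
    rows = []
    # Assumption: cyp_models maps each model name to its prediction data.
    for model, result in record.get("cyp_models", {}).items():
        label = "Active" if int(result["Prediction"]) == 1 else "Inactive"
        rows.append((model, label, float(result["Probability"])))
    return rows


if __name__ == "__main__":
    for model, label, prob in summarize(SAMPLE):
        print(f"{model}: {label} (probability {prob:.2f})")
```

If the real responses nest the model data differently (for example as a list of objects), only the loop inside summarize would need to change.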
Please note that the output is intended to be machine-readable. To inspect it manually, a JSON viewer such as the Stack.hu JSON viewer or the Code Beautify JSON Viewer is recommended.
Otherwise, the website itself is better suited to such visualization.
In this study, we have used previously published, cheminformatics-based machine learning models to predict chemical activities. Previous models (chemical toxicity predictions) based on a similar methodology are available via our ProTox-II web server and are highly cited. For more details on the methodology, please refer to our published work.
The following are the different data sampling methods as used in this study to handle the imbalanced datasets:
1) No Sampling: All the data were used without any manipulation (the so-called ‘original dataset’).
2) Random Under Sampling (RandUS): The data points from the majority class are removed randomly.
3) Augmented Random Under Sampling (AugRandomUS): Plain random under sampling removes instances from the data set purely at random. In this modified version, the randomness is reduced by using a specifically calculated fingerprint, called the most common features (MCF) fingerprint, that incorporates the common features of the data set. The features in this fingerprint are derived from MACCS fingerprints and Morgan fingerprints, respectively. To produce this fingerprint, the overall average frequency of all features in the majority class is computed. Then, for each bit position of the fingerprint, the relative frequency of ones in the complete data set is computed. If the relative frequency of a bit position is higher than the average frequency, the bit position and its frequency are saved. Next, the average number of features per fingerprint in the majority class determines the number of features in the MCF fingerprint, and the features themselves are the saved features with the highest relative frequencies. An iteration is then performed, which ends as soon as the majority data set has been reduced to the size of the minority data set. In each step, a number of the samples most similar to the MCF fingerprint are collected in a list. A number of instances are then chosen randomly from this list and removed from the data set. Thereafter, a new MCF fingerprint is computed and the iteration continues. Because the samples most similar to the MCF fingerprint are removed, the loss of variance in the majority set is reduced. In addition, the loss of information is limited by removing only a limited number of samples per calculated MCF fingerprint.
4) Random over sampling (RandOS): Data points from the minority class are randomly chosen and added to the existing minority class.
5) Augmented Random Over Sampling (AugRandOS): Random over sampling in this case follows the same principle as described for augmented random under sampling above. The only difference is that, in each iteration step, a list of the samples most dissimilar to the MCF fingerprint is created. A part of this list is chosen randomly, duplicated and added to the original data set. Since the samples most dissimilar to the MCF fingerprint are duplicated, the loss of variance is relatively low. Both steps are repeated until the minority class consists of as many samples as the majority class.
6) K-Medoids Under Sampling (kMedoids1): K-medoids is a clustering algorithm that is used here to under sample the original majority class. A medoid is itself an instance of the majority class, used as a cluster center, that has the minimum average dissimilarity to all majority data points in its cluster. The number of medoids is equal to the number of minority class instances. Each sample is assigned to the cluster whose center it is most similar to, based on the Tanimoto coefficient (Willett, 2003). For each medoid, the sum of the similarities between itself and all samples belonging to its cluster is calculated. The algorithm tries to maximize the total of these sums iteratively. The iteration is limited to 100 steps; in each step, new medoids are chosen randomly and the overall sum of Tanimoto similarities is calculated. The set of medoids producing the highest sum is used as the under-sampled majority class. By clustering by similarity, this approach creates a subset in which each data point represents a group of structurally related molecules, which in turn reduces the information lost by under sampling.
7) K-Medoids Under Sampling (kMedoids2): Similarly to kMedoids1, this method starts by randomly choosing n samples as medoids, where n is equal to the number of data points in the minority class. For each of the chosen medoids, a total of 30 iterations is assigned. In each iterative step, a medoid is exchanged with a random majority class sample, new clusters are computed and the cost is calculated using the Tanimoto coefficient. The final set of medoids is chosen based on the maximum sum of similarities.
8) Synthetic Minority Over-Sampling Technique-using Tanimoto Coefficient (SMOTETC): The SMOTE method creates synthetic samples of the minority class to balance the overall data set. Depending on the amount of oversampling a number of samples of the minority class are chosen. For each of those, the k-nearest neighbors are identified, utilizing the Tanimoto coefficient as similarity measure (Willett, 2003). The feature values of the new synthetic data points are set to the value occurring in the majority of the chosen sample and two of its k-nearest neighbors.
9) Synthetic Minority Over-Sampling Technique-using Value Difference Metric (SMOTEVDM): This method is also based on SMOTE, but the k-nearest neighbors are chosen using the Value Difference Metric (VDM) as the similarity measure. The VDM defines the distance between analogous feature values over all input feature vectors. More detailed information on the algorithm for computing VDM can be found here (Sugimura et al., 2008).
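The two simplest strategies in the list above, RandUS and RandOS, can be sketched in a few lines of Python. This is only an illustration and not the code used to build the SuperCYPsPred models: the function names are hypothetical, and the sketch assumes that both methods stop when the two classes are the same size, which the descriptions state explicitly only for the augmented and k-medoids variants.

```python
import random


def random_under_sample(majority, minority, seed=0):
    """RandUS sketch: randomly drop majority samples.

    Assumption: sampling stops when both classes have the same size.
    """
    rng = random.Random(seed)
    keep = min(len(majority), len(minority))
    # rng.sample draws without replacement, so no sample is kept twice.
    return rng.sample(list(majority), k=keep), list(minority)


def random_over_sample(majority, minority, seed=0):
    """RandOS sketch: randomly duplicate minority samples.

    Assumption: duplication stops when both classes have the same size.
    """
    rng = random.Random(seed)
    minority = list(minority)
    missing = max(len(majority) - len(minority), 0)
    # rng.choice draws with replacement, so a sample may be duplicated repeatedly.
    duplicates = [rng.choice(minority) for _ in range(missing)]
    return list(majority), minority + duplicates


if __name__ == "__main__":
    inactive = [f"inactive_{i}" for i in range(10)]  # majority class
    active = ["active_0", "active_1", "active_2"]    # minority class
    under_major, under_minor = random_under_sample(inactive, active)
    over_major, over_minor = random_over_sample(inactive, active)
    print(len(under_major), len(under_minor))
    print(len(over_major), len(over_minor))
```

The augmented and k-medoids variants replace the purely random choice with a similarity-guided one, but the balancing goal is the same.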
The SuperCYPsPred web server implements 10 different in silico models for CYP inhibition (each CYP isoform has two different models based on the molecular fingerprints).
Definition of the CYP isoforms and their relative significance in the drug metabolism process
1) CYP1A2: CYP1A2 is expressed in the liver and accounts for approximately 13%-15% of the total CYP content, contributing to the metabolism of approximately 4% of marketed drugs. CYP1A2 preferentially oxidizes aromatic hydrocarbons as well as heterocyclic and aromatic amines and plays an important role in the metabolism of several clinical drugs, including analgesic, antipyretic, antipsychotic, antidepressant, anti-inflammatory, and cardiovascular drugs. CYP1A2 has been reported to catalyze N-hydroxylation of pre-carcinogenic heterocyclic amines to carcinogenic compounds. Therefore, in addition to predicting DDIs, it is important to understand CYP1A2 inhibition while researching on
2) CYP2C9: The CYP2C9 family accounts for approximately 20% of hepatic P450s, and CYP2C9 is responsible for the hepatic clearance of 15% of clinically relevant drugs (including phenytoin, tolbutamide, and warfarin) as the first step in drug clearance, limiting drug oral bioavailability. CYP2C9 inhibitors include fluvastatin, fluvoxamine, zafirlukast, and antifungal imidazole compounds (miconazole, fluconazole, and sulconazole).
3) CYP2C19: CYP2C19 is an essential member of the CYP450 superfamily and accounts for about 16% of the total hepatic CYP content. CYP2C19 is the principal enzyme involved in the hepatic metabolism of drugs such as antimalarials (proguanil), oral anticoagulants (R-warfarin), chemotherapeutic agents (cyclophosphamide), anti-epileptics (S-mephenytoin, diazepam, phenobarbitone), antiplatelets (clopidogrel), proton pump inhibitors (omeprazole, pantoprazole, lansoprazole, rabeprazole), antivirals (nelfinavir), and antidepressants (amitriptyline, clomipramine).
4) CYP2D6: CYP2D6 metabolizes approximately 30% of all marketed drugs, including antiarrhythmics, antidepressants, antipsychotics, beta-blockers, and analgesics, although it accounts for only 2%-4% of all human hepatic CYPs. CYP2D6 is a polymorphic P450 isoform, in which the active enzyme is absent in 5%-10% of Caucasians and 1% of Asians. Therefore, much emphasis is placed on CYP2D6 metabolism and its potential for clinically relevant drug interactions early in the drug discovery process.
5) CYP3A4: CYP3A4 is the most abundant human hepatic CYP isoform and is responsible for the metabolism of approximately 50% of known drugs, including cyclosporine, testosterone, dextromethorphan, diazepam, and midazolam. The inhibition of CYP3A4 by co-administered drugs is shown to result in clinically adverse DDIs owing to the decreased systemic clearance of CYP3A4 substrates and rapid and unexpected increases in plasma concentrations. Indeed, most DDIs that result in the withdrawal of drugs that are already available in the market are caused by CYP3A4 inhibition. Therefore, the early identification of potential CYP3A4 inhibitors is required to minimize the risk of clinically relevant
The applicability domain of each model was assessed on the criterion that the confidence in the correctness of predictions should be greater inside the domain than outside it. Using two different methods, a domain was defined and used to assess whether the majority of correct positive predictions for each model were within the domain and a large portion of the incorrect positive predictions outside it.
It was observed that compounds predicted as active that fall within the applicability domain are more likely to be correctly predicted than those outside the domain. However, this was not the case for compounds classified as inactive.
The methods applied for defining the applicability domain of the SuperCYPsPred models are discussed below:
1) Fragment-based sampling method: Using the fragment-based sampling method, the active and inactive properties of the chemicals were mapped onto the fragment space. The fragment space allows a straightforward similarity measure between fragments of the same class.
Often an entire molecule may not be responsible for the activity, but a local feature (such as a substructure or a fragment) in a molecule may be responsible for the desired response. Chemical fragments are local parts of chemical structures, representing molecular features useful in the modelling of biological or physicochemical properties of chemicals.
Fragment propensities were mainly used to detect meaningful features and to capture the continuous relationships that exist between fragments of the same class.
They offer an intuitive interpretation of model performance and are easy to generate and handle. Both the training and test sets were fragmented using the RECAP and ROTBONDS fragmentation methods.
The fragments from the training set were then used in a substructure search to check whether the test set contains the same fragments. The test compounds were considered within the applicability domain of the model only when the fragments from the test compound were found in the training set and vice versa.
2) Structural similarity method: Using the structural similarity method, a global search is performed to check whether the query compound is similar to the compounds in the training set (class specific), in order to consider the prediction within the domain of applicability.
This is based on the hypothesis that 'structurally similar compounds tend to exhibit similar activity'. Structural similarities between the test and training set compounds were compared using the Tanimoto score. Test compounds were considered within the applicability domain if they showed a similarity value of 0.6 or above to one or more of the training set compounds (see the section 'how to interpret the results').
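The structural similarity check can be illustrated with a short Python sketch. It is an assumption-laden example rather than the server's implementation: fingerprints are represented here as plain sets of on-bit indices instead of real MACCS or Morgan fingerprints, the function names are hypothetical, and only the 0.6 threshold and the "one or more training compounds" rule are taken from the description above.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    fp_a, fp_b = set(fp_a), set(fp_b)
    union = len(fp_a | fp_b)
    # Two empty fingerprints share nothing; treat their similarity as 0.
    return len(fp_a & fp_b) / union if union else 0.0


def in_applicability_domain(query_fp, training_fps, threshold=0.6):
    """True if the query reaches the threshold against at least one training compound."""
    return any(tanimoto(query_fp, fp) >= threshold for fp in training_fps)


if __name__ == "__main__":
    # Toy fingerprints; the bit indices are made up for illustration.
    training = [{1, 2, 3, 4, 5}, {10, 11, 12}]
    print(in_applicability_domain({1, 2, 3, 4}, training))  # similarity 4/5 to the first
    print(in_applicability_domain({1, 20, 21}, training))   # at most 1/7 to any compound
```

In practice the on-bit sets would come from a cheminformatics toolkit, and the comparison would be restricted to training compounds of the predicted class, as the description notes.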
To start a CYPs activity prediction, please go to SuperCYPsPred Prediction. Here, you can either draw your input compound, paste the content of a molfile in text form or search for a compound name online:
To draw a chemical structure, use the buttons in the second row (as shown above). You can change atom types by clicking on the arrow next to "C" or change bond types or draw ring structures.
To open a molfile, please click on the yellow folder button in the first row (as shown above). You can paste the contents of a molfile (text form) here. You can also search for a known compound online. To do that, click on the binoculars in the first row of buttons (as shown above). You can search for a compound name in the PubChem database.
An example compound, sertraline, is already provided. To use the example compound, simply type the name and click on name search. To clear the drawing area, press the button with the blue bottle in the first row of buttons.
Once you have drawn or inserted an input molecule (either name or SMILES), you can start the CYPs prediction by clicking on the Submit button below the drawing area. Additionally, you can select individual models or all models for prediction, as well as the fingerprints of your choice (using both fingerprints is suggested in order to see the difference between the predictions as well as the overall prediction performance).
Please note that the prediction of multiple models for a single compound can be time-consuming.
An estimation of the calculation time is given at the top of the results page.
On the results page, information on the input compound is displayed first, including name or SMILES (as chosen initially by the user), molecular weight, hydrogen bond acceptors/donors, number of atoms, number of bonds, number of rings and molecular polar surface area (as shown in the picture below).
The second part includes the three compounds from the training data set that are most similar to the input compound (with their respective activity class for each CYP). A similarity threshold of 60% is used in order to show only the most relevant compounds.
In the third section, the prediction results (predicted class, with the respective class probability etc.) are given in a tabular format. The colours indicate the strength of the probability for the respective classes. Red (strong) and pink (low) indicate the strength of the prediction for the active class, whereas dark green (strong) and light green (low) indicate the strength of the prediction for the inactive class, as shown in the picture below.
A radar plot (example below) compares the predicted probabilities of the input compound with the average probability of the active compounds in the training set for each cytochrome model.
If two fingerprints are selected as a prediction parameter, two different plots will be shown; otherwise, a single plot based on the initially chosen fingerprint is shown.
The plot can be accessed by clicking the 'MACCS or MORGAN' link that appears on the page once the computation is complete, which opens the chart in a new tab. The profile of the input compound is shown using blue lines/dots, which represent the predicted probabilities of the input compound for the respective CYPs models.
The data displayed as orange dots/lines are the average probabilities of the active class, computed from the training set data for each model (see model info).
For the example case sertraline, the predicted probabilities for the models using MACCS fingerprints are shown in the first picture and those using Morgan fingerprints in the second picture.
This chart helps the user to understand how strong the overall prediction for the input compound is, considering its activity across multiple model endpoints.
Additionally, information on known CYPs interactions for the input compound, if available, is shown with references from the literature or the respective sources. This information is only shown when the input compound is a drug and a drug name is used as the search input.
This information is taken from our previously published and manually curated database SuperCYP. Note that the information from literature sources includes strong, moderate and weak interactions with CYPs. Furthermore, the type of interaction, that is, whether the drug is a substrate, inhibitor or inducer, is also indicated.
This information is provided as an additional reference along with the prediction results. The prediction results may not agree with the known CYPs interaction information in all cases. Users are strongly advised to consult the respective references to learn more about the results.
Furthermore, information on individual CYPs can be obtained by clicking on their names in the above table. The page below shows the structure with the UniProt ID for each CYP, along with the names of the drugs involved in the interactions. Each type of interaction is defined as inhibitor, substrate and/or inducer, along with its references.
This will help the user to understand the importance of the respective CYPs and their individual drug interaction profiles.
Drug-drug interactions (DDIs) can trigger unexpected pharmacological effects, including several adverse drug reactions (ADRs), with no known causal mechanism.
A DDI occurs when the effects of one drug are modified by the concomitant use of a second drug.
The interaction between the two example drugs theophylline and ciprofloxacin can result in toxic increases in theophylline. This problem occurs because the hepatic metabolism of theophylline is inhibited by ciprofloxacin via the cytochrome P-450 enzyme system.
Theophylline is metabolized by CYP1A2 and to a lesser extent by CYP3A4 (as shown in the table below). Ciprofloxacin and other drugs, including clarithromycin, erythromycin, fluvoxamine, and cimetidine, are all potent inhibitors of CYP1A2. Because they have no known effect or less effect on CYP1A2, fleroxacin or ofloxacin should be considered as an alternative to ciprofloxacin (as shown in green in the table below).
Theophylline toxicity is a serious condition; several deaths have been linked with serum concentrations as low as 25 µg/mL. Signs of theophylline toxicity include headache, dizziness, hypotension, hallucinations, tachycardia, and seizures.
Currently, the DDI table contains only a small portion of the predicted inhibitor information; we will update it soon.