Discovery 2: Cytotoxicity Prediction Models
Pre-trained models for predicting drug cytotoxicity based on promiscuity and molecular structure.
Models Overview
This repository contains trained models from the Discovery 2 study on selectivity-safety coupling. The models predict cytotoxicity risk based on:
- Promiscuity (number of biological targets a compound hits)
- Molecular structure (Morgan fingerprints)
Model Files
1. Cubic Logistic Regression Models
Main Model: cubic_logistic_model.pkl
- Predicts cytotoxicity probability from overall promiscuity score
- Uses cubic polynomial features (promiscuity, promiscuity², promiscuity³)
- 50% threshold: 77 hits
- Performance: Significantly better than linear model (p < 0.001)
Class-Specific Models:
- kinase_cubic_model.pkl - For kinase promiscuity (50% threshold: 25 hits)
- nr_cubic_model.pkl - For nuclear receptor promiscuity (50% threshold: 31 hits)
- 7tm_cubic_model.pkl - For GPCR/7TM promiscuity (50% threshold: 63 hits)
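The 50% thresholds quoted for these models correspond to the promiscuity value at which the fitted cubic log-odds cross zero. A minimal sketch of locating that point by bisection follows. The coefficients are hypothetical placeholders (the fitted values live in the pickled models), the function names are illustrative, and the search assumes the predicted probability is increasing over the chosen interval.

```python
import math

def cubic_logit(x, params):
    """Log-odds b0 + b1*x + b2*x^2 + b3*x^3 for coefficients (b0, b1, b2, b3)."""
    b0, b1, b2, b3 = params
    return b0 + b1 * x + b2 * x**2 + b3 * x**3

def probability(x, params):
    """Logistic transform of the cubic log-odds."""
    return 1.0 / (1.0 + math.exp(-cubic_logit(x, params)))

def find_threshold(params, target=0.5, lo=0.0, hi=300.0, tol=1e-6):
    """Bisection for the hit count where probability(x) equals target.

    Assumes probability is increasing on [lo, hi] and crosses target there.
    """
    if not (probability(lo, params) < target < probability(hi, params)):
        raise ValueError("target probability is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if probability(mid, params) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical coefficients for illustration only, not the fitted values.
example_params = (-3.0, 0.03, 0.0, 1e-6)
threshold = find_threshold(example_params)
print(f"50% threshold for the example coefficients: {threshold:.1f} hits")
```

With a loaded statsmodels result, the coefficients could presumably be read from its `params` attribute and passed in place of `example_params`.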
2. LightGBM Structural Model
File: lightgbm_model.txt
- Predicts cytotoxicity from molecular fingerprints (2048-bit Morgan fingerprints, radius=2)
- Cross-validation AUC: 0.781 ± 0.045
- Identifies structural alerts (toxicophores)
- Can be used for compounds without promiscuity data
3. Metadata Files
- cubic_model_metadata.json - Performance metrics for main cubic model
- class_models_metadata.json - Thresholds for class-specific models
- lgb_model_metadata.json - LightGBM model performance
- feature_stats.json - Feature statistics for normalization
Usage
Installation
pip install joblib lightgbm rdkit numpy pandas statsmodels
Loading Models
import joblib
import lightgbm as lgb
import json
# Load cubic logistic regression model
cubic_model = joblib.load('cubic_logistic_model.pkl')
# Load LightGBM model
lgb_model = lgb.Booster(model_file='lightgbm_model.txt')
# Load metadata
with open('cubic_model_metadata.json', 'r') as f:
    cubic_metadata = json.load(f)
with open('feature_stats.json', 'r') as f:
    feature_stats = json.load(f)
Predicting from Promiscuity Score
import numpy as np
from statsmodels.tools import add_constant
def predict_cytotoxicity_from_promiscuity(promiscuity_score, model):
    """
    Predict cytotoxicity probability from promiscuity score
    Args:
        promiscuity_score: Number of active assays (hits)
        model: Loaded cubic logistic regression model
    Returns:
        Probability of cytotoxicity (0-1)
    """
    # Create cubic features
    X = np.array([[promiscuity_score,
                   promiscuity_score**2,
                   promiscuity_score**3]])
    # has_constant='add' forces the intercept column. The default ('skip')
    # treats every nonzero column of a single-row input as constant and
    # returns X unchanged, which breaks the prediction.
    X_with_const = add_constant(X, has_constant='add')
    # Predict probability
    prob = model.predict(X_with_const)[0]
    return prob
# Example usage
promiscuity = 50
prob = predict_cytotoxicity_from_promiscuity(promiscuity, cubic_model)
print(f"Promiscuity: {promiscuity} hits")
print(f"Cytotoxicity probability: {prob:.2%}")
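To score several compounds in one call, the same design matrix can be built column-wise. The sketch below assumes the column order [1, x, x², x³], which is what `add_constant` produces with its default `prepend=True`. It uses hypothetical coefficients in place of the fitted parameters, so the resulting probabilities are illustrative only. Writing the column of ones explicitly avoids depending on how `add_constant` treats inputs whose columns already look constant.

```python
import numpy as np

def cubic_design_matrix(scores):
    """Return an (n, 4) matrix with columns [1, x, x^2, x^3]."""
    x = np.asarray(scores, dtype=float).ravel()
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def logistic(z):
    """Map log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

X = cubic_design_matrix([10, 50, 100])

# With a loaded statsmodels result this step would be:
#     probs = cubic_model.predict(X)
# Hypothetical coefficients stand in for the fitted parameters here.
hypothetical_params = np.array([-3.0, 0.03, 0.0, 1e-6])
probs = logistic(X @ hypothetical_params)

for hits, p in zip([10, 50, 100], probs):
    print(f"{hits} hits -> {p:.2%}")
```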
Predicting from Molecular Structure
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
def predict_cytotoxicity_from_smiles(smiles, model):
    """
    Predict cytotoxicity from SMILES string
    Args:
        smiles: SMILES representation of molecule
        model: Loaded LightGBM model
    Returns:
        Probability of cytotoxicity (0-1)
    """
    # Generate Morgan fingerprint
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Invalid SMILES")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    fp_array = np.array(fp).reshape(1, -1)
    # Predict
    prob = model.predict(fp_array)[0]
    return prob
# Example usage
smiles = "CC(C)Cc1ccc(cc1)C(C)C(O)=O" # Ibuprofen
prob = predict_cytotoxicity_from_smiles(smiles, lgb_model)
print(f"SMILES: {smiles}")
print(f"Cytotoxicity probability: {prob:.2%}")
Class-Specific Predictions
# Load class-specific model
kinase_model = joblib.load('kinase_cubic_model.pkl')
# Predict from kinase-specific promiscuity
kinase_hits = 20
prob = predict_cytotoxicity_from_promiscuity(kinase_hits, kinase_model)
print(f"Kinase promiscuity: {kinase_hits} hits")
print(f"Cytotoxicity probability: {prob:.2%}")
Risk Interpretation
Based on the cubic model thresholds:
| Promiscuity Range | Risk Level | Cytotoxicity Probability |
|---|---|---|
| < 43 hits | Low | < 25% |
| 43-102 hits | Moderate | 25-75% |
| > 102 hits | High | > 75% |
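The bands in this table can be applied programmatically. A minimal sketch, with a hypothetical helper name and the boundaries taken directly from the table (43 and 102 hits both fall in the Moderate band):

```python
def risk_level(hits):
    """Map an overall promiscuity count to the risk band from the table."""
    if hits < 0:
        raise ValueError("hits must be non-negative")
    if hits < 43:
        return "Low"
    if hits <= 102:
        return "Moderate"
    return "High"

for hits in (10, 77, 150):
    print(hits, risk_level(hits))
```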
Class-Specific 50% Thresholds:
- Kinase: 25 hits (most sensitive)
- Nuclear Receptor: 31 hits
- 7TM/GPCR: 63 hits (least sensitive)
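A minimal sketch of flagging the target classes in which a compound reaches its class-specific 50% threshold. The dictionary keys and the function name are hypothetical; only the threshold values come from the list above.

```python
# 50% thresholds (hits) per target class, as listed above.
CLASS_THRESHOLDS_50 = {"kinase": 25, "nuclear_receptor": 31, "7tm": 63}

def classes_over_threshold(hits_by_class, thresholds=CLASS_THRESHOLDS_50):
    """Return the classes whose hit count is at or above the 50% threshold."""
    unknown = set(hits_by_class) - set(thresholds)
    if unknown:
        raise KeyError(f"no threshold for: {sorted(unknown)}")
    return sorted(c for c, h in hits_by_class.items() if h >= thresholds[c])

flagged = classes_over_threshold({"kinase": 30, "nuclear_receptor": 12, "7tm": 40})
print(flagged)  # ['kinase']
```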
Model Performance
Cubic Logistic Regression
- Log-likelihood: -313.25
- AIC: 634.50
- Likelihood ratio test: p = 0.001 (significantly better than linear)
- 50% threshold: 77.4 hits
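The reported AIC can be checked against the log-likelihood with the standard definition AIC = 2k - 2 ln(L). The parameter count k = 4 (intercept plus three polynomial terms) is an assumption here. It is consistent with the figures above.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Assumed: intercept + promiscuity + promiscuity^2 + promiscuity^3 = 4 parameters.
cubic_aic = aic(-313.25, 4)
print(cubic_aic)  # 634.5
```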
LightGBM Classifier
- Cross-validation AUC: 0.781 ± 0.045
- Features: 2048-bit Morgan fingerprints (radius=2)
- Training set: 1,382 compounds (13.2% cytotoxic)
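A figure such as "0.781 ± 0.045" is typically the mean and standard deviation of per-fold ROC-AUC values. The sketch below shows that summary in plain Python. The fold values are hypothetical, and using the population standard deviation (what NumPy's `std` returns by default) is an assumption, since the card does not say which was used.

```python
from statistics import mean, pstdev

def roc_auc(labels, scores):
    """ROC-AUC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize_folds(fold_aucs):
    """Return (mean, population standard deviation) across folds."""
    return mean(fold_aucs), pstdev(fold_aucs)

# Hypothetical per-fold AUCs, not the values from the study.
fold_aucs = [0.72, 0.80, 0.78, 0.83, 0.76]
auc_mean, auc_std = summarize_folds(fold_aucs)
print(f"AUC: {auc_mean:.3f} ± {auc_std:.3f}")
```

With only 13.2% of the training compounds labeled cytotoxic, stratified folds are the usual way to keep the positive rate comparable across splits.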
Key Findings
- Non-linear relationship: Cytotoxicity risk accelerates rapidly above ~50 hits
- Strong predictive power: Compounds with >50 hits are 29.4× more likely to be cytotoxic
- Target class matters: Kinase promiscuity reaches the 50% cytotoxicity threshold at 25 hits, versus 63 hits for GPCR/7TM promiscuity
- Structural alerts: Specific molecular substructures (toxicophores) are highly predictive
Use Cases
- Early drug discovery: Screen compounds for cytotoxicity risk
- Lead optimization: Prioritize compounds with lower predicted risk
- Polypharmacology assessment: Evaluate safety implications of multi-target drugs
- Structure-activity relationships: Identify problematic structural features
Limitations
- Models were trained on a specific compound library (1,397 FDA-approved small molecules)
- Cytotoxicity measured by cell viability assays (may not capture all toxicity mechanisms)
- Promiscuity-based models require activity data across multiple targets
- Structure-based model limited to compounds within chemical space of training data
Citation
If you use these models in your research, please cite:
Discovery 2: Cytotoxicity Prediction Models
Models: https://huggingface.co/pageman/discovery2-cytotoxicity-models
Dataset: https://huggingface.co/datasets/pageman/discovery2-results
Related Resources
- Dataset Repository: pageman/discovery2-results - Full analysis, code, and visualizations
- Source Data: eve-bio/drug-target-activity - Raw drug-target activity data
License
These models are provided for research purposes under the CC-BY-NC-SA-4.0 license. Check the original data sources for their licensing terms.
Contact
For questions or issues, please open a discussion on this repository.