Discovery 2: Cytotoxicity Prediction Models

Pre-trained models for predicting drug cytotoxicity based on promiscuity and molecular structure.

Models Overview

This repository contains trained models from the Discovery 2 study on selectivity-safety coupling. The models predict cytotoxicity risk based on:

  1. Promiscuity (number of biological targets a compound hits)
  2. Molecular structure (Morgan fingerprints)

Model Files

1. Cubic Logistic Regression Models

Main Model: cubic_logistic_model.pkl

  • Predicts cytotoxicity probability from overall promiscuity score
  • Uses cubic polynomial features (promiscuity, promiscuity², promiscuity³)
  • 50% threshold: 77 hits
  • Performance: Significantly better than linear model (p < 0.001)

Class-Specific Models:

  • kinase_cubic_model.pkl - For kinase promiscuity (50% threshold: 25 hits)
  • nr_cubic_model.pkl - For nuclear receptor promiscuity (50% threshold: 31 hits)
  • 7tm_cubic_model.pkl - For GPCR/7TM promiscuity (50% threshold: 63 hits)
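
The 50% thresholds above follow from the model form: the predicted probability is a logistic function of a cubic polynomial in the hit count, so the 50% point is where that polynomial crosses zero. The sketch below illustrates the idea with toy coefficients chosen so that the crossing sits at 40 hits. They are hypothetical and are not the fitted values stored in the .pkl files.

```python
import numpy as np

# Hypothetical coefficients [b0, b1, b2, b3]; NOT the fitted model parameters.
# Chosen so that b0 + b1*x + b2*x^2 + b3*x^3 = 5e-5 * (x - 40) * (x^2 + 10x + 2000).
TOY_BETA = np.array([-4.0, 0.08, -0.0015, 0.00005])

def cubic_logit(hits, beta=TOY_BETA):
    """Linear predictor b0 + b1*x + b2*x^2 + b3*x^3."""
    x = np.asarray(hits, dtype=float)
    return beta[0] + beta[1] * x + beta[2] * x**2 + beta[3] * x**3

def cytotoxicity_probability(hits, beta=TOY_BETA):
    """Logistic transform of the cubic linear predictor."""
    return 1.0 / (1.0 + np.exp(-cubic_logit(hits, beta)))

def fifty_percent_threshold(beta=TOY_BETA, max_hits=500.0):
    """Smallest real root of the cubic in [0, max_hits], or None if there is none."""
    roots = np.roots(beta[::-1])  # np.roots expects the highest degree first
    real = roots[np.abs(roots.imag) < 1e-9].real
    valid = real[(real >= 0) & (real <= max_hits)]
    return float(valid.min()) if valid.size else None

threshold = fifty_percent_threshold()
print(f"Toy 50% threshold: {threshold:.1f} hits")
```

With the real coefficients (available from the loaded model if it is a statsmodels results object, which this README's use of add_constant suggests but does not state), the same root-finding step should reproduce the thresholds listed above.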

2. LightGBM Structural Model

File: lightgbm_model.txt

  • Predicts cytotoxicity from molecular fingerprints (2048-bit Morgan fingerprints, radius=2)
  • Cross-validation AUC: 0.781 ± 0.045
  • Identifies structural alerts (toxicophores)
  • Can be used for compounds without promiscuity data

3. Metadata Files

  • cubic_model_metadata.json - Performance metrics for main cubic model
  • class_models_metadata.json - Thresholds for class-specific models
  • lgb_model_metadata.json - LightGBM model performance
  • feature_stats.json - Feature statistics for normalization

Usage

Installation

pip install joblib lightgbm rdkit numpy pandas statsmodels

Loading Models

import joblib
import lightgbm as lgb
import json

# Load cubic logistic regression model
cubic_model = joblib.load('cubic_logistic_model.pkl')

# Load LightGBM model
lgb_model = lgb.Booster(model_file='lightgbm_model.txt')

# Load metadata
with open('cubic_model_metadata.json', 'r') as f:
    cubic_metadata = json.load(f)
    
with open('feature_stats.json', 'r') as f:
    feature_stats = json.load(f)

Predicting from Promiscuity Score

import numpy as np
from statsmodels.tools import add_constant

def predict_cytotoxicity_from_promiscuity(promiscuity_score, model):
    """
    Predict cytotoxicity probability from promiscuity score
    
    Args:
        promiscuity_score: Number of active assays (hits)
        model: Loaded cubic logistic regression model
        
    Returns:
        Probability of cytotoxicity (0-1)
    """
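    # Create cubic features
    X = np.array([[promiscuity_score, 
                   promiscuity_score**2, 
                   promiscuity_score**3]])
    # has_constant='add' forces the intercept column. With the default
    # ('skip'), add_constant sees every nonzero column of a single-row
    # array as already constant and returns X without an intercept.
    X_with_const = add_constant(X, has_constant='add')
    
    # Predict probability
    prob = model.predict(X_with_const)[0]
    return prob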

# Example usage
promiscuity = 50
prob = predict_cytotoxicity_from_promiscuity(promiscuity, cubic_model)
print(f"Promiscuity: {promiscuity} hits")
print(f"Cytotoxicity probability: {prob:.2%}")
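
To score many compounds at once, it can be simpler to build the design matrix directly instead of calling add_constant per compound. The helper below is a sketch, not part of the released code. It assumes the model expects the columns in the order [1, x, x², x³], which matches what add_constant produces with its default prepend=True. The name cubic_design_matrix is hypothetical.

```python
import numpy as np

def cubic_design_matrix(scores):
    """Return an (n, 4) matrix with columns [1, x, x^2, x^3] for n scores."""
    x = np.asarray(scores, dtype=float).reshape(-1)
    # Writing the intercept column explicitly avoids relying on
    # add_constant's detection of existing constant columns.
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

# Hypothetical usage with the model loaded earlier:
# probs = cubic_model.predict(cubic_design_matrix([10, 50, 100]))
print(cubic_design_matrix([2, 3]))
```

A scalar input such as cubic_design_matrix(50) yields a single-row matrix, so the same helper covers both the single-compound and the batch case.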

Predicting from Molecular Structure

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def predict_cytotoxicity_from_smiles(smiles, model):
    """
    Predict cytotoxicity from SMILES string
    
    Args:
        smiles: SMILES representation of molecule
        model: Loaded LightGBM model
        
    Returns:
        Probability of cytotoxicity (0-1)
    """
    # Generate Morgan fingerprint
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Invalid SMILES")
    
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    fp_array = np.array(fp).reshape(1, -1)
    
    # Predict
    prob = model.predict(fp_array)[0]
    return prob

# Example usage
smiles = "CC(C)Cc1ccc(cc1)C(C)C(O)=O"  # Ibuprofen
prob = predict_cytotoxicity_from_smiles(smiles, lgb_model)
print(f"SMILES: {smiles}")
print(f"Cytotoxicity probability: {prob:.2%}")

Class-Specific Predictions

# Load class-specific model
kinase_model = joblib.load('kinase_cubic_model.pkl')

# Predict from kinase-specific promiscuity
kinase_hits = 20
prob = predict_cytotoxicity_from_promiscuity(kinase_hits, kinase_model)
print(f"Kinase promiscuity: {kinase_hits} hits")
print(f"Cytotoxicity probability: {prob:.2%}")

Risk Interpretation

Based on the cubic model thresholds:

Promiscuity Range    Risk Level    Cytotoxicity Probability
< 43 hits            Low           < 25%
43-102 hits          Moderate      25-75%
> 102 hits           High          > 75%
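
The bands in this table can be applied without loading a model. The function below is a minimal sketch that hard-codes the 43 and 102 hit cut-offs from the table. The name risk_level is hypothetical.

```python
def risk_level(hits):
    """Map an overall promiscuity score (hit count) to the risk band in the table."""
    if hits < 0:
        raise ValueError("hit count cannot be negative")
    if hits < 43:
        return "Low"       # predicted probability below 25%
    if hits <= 102:
        return "Moderate"  # predicted probability between 25% and 75%
    return "High"          # predicted probability above 75%

for n in (10, 77, 150):
    print(n, risk_level(n))
```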

Class-Specific 50% Thresholds:

  • Kinase: 25 hits (most sensitive)
  • Nuclear Receptor: 31 hits
  • 7TM/GPCR: 63 hits (least sensitive)
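
When per-class hit counts are available, the class-specific thresholds can be used to flag which target classes push a compound to or past the 50% point. The sketch below assumes that predicted risk does not decrease as the hit count rises. The dictionary keys are illustrative labels, not identifiers taken from the metadata files.

```python
# 50% thresholds from the list above; the key names are hypothetical.
CLASS_THRESHOLDS = {"kinase": 25, "nuclear_receptor": 31, "7tm": 63}

def classes_over_threshold(class_hits):
    """Return the target classes whose hit count is at or above the class threshold."""
    return sorted(
        name
        for name, hits in class_hits.items()
        if hits >= CLASS_THRESHOLDS[name]
    )

flags = classes_over_threshold({"kinase": 26, "nuclear_receptor": 10, "7tm": 63})
print(flags)  # ['7tm', 'kinase']
```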

Model Performance

Cubic Logistic Regression

  • Log-likelihood: -313.25
  • AIC: 634.50
  • Likelihood ratio test: p = 0.001 (significantly better than linear)
  • 50% threshold: 77.4 hits
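
As a consistency check, the reported AIC matches the reported log-likelihood for a model with four parameters (an intercept plus the three polynomial terms), using the standard definition AIC = 2k − 2 ln L:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

print(aic(-313.25, 4))  # 634.5
```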

LightGBM Classifier

  • Cross-validation AUC: 0.781 ± 0.045
  • Features: 2048-bit Morgan fingerprints (radius=2)
  • Training set: 1,382 compounds (13.2% cytotoxic)

Key Findings

  1. Non-linear relationship: Cytotoxicity risk accelerates rapidly above ~50 hits
  2. Strong predictive power: Compounds with >50 hits are 29.4× more likely to be cytotoxic than compounds with fewer hits
  3. Target class matters: Kinase promiscuity is more dangerous than GPCR promiscuity
  4. Structural alerts: Specific molecular substructures (toxicophores) are highly predictive

Use Cases

  • Early drug discovery: Screen compounds for cytotoxicity risk
  • Lead optimization: Prioritize compounds with lower predicted risk
  • Polypharmacology assessment: Evaluate safety implications of multi-target drugs
  • Structure-activity relationships: Identify problematic structural features

Limitations

  • Models trained on specific compound library (1,397 FDA-approved small molecules)
  • Cytotoxicity measured by cell viability assays (may not capture all toxicity mechanisms)
  • Promiscuity-based models require activity data across multiple targets
  • Structure-based model limited to compounds within chemical space of training data

Citation

If you use these models in your research, please cite:

Discovery 2: Cytotoxicity Prediction Models
Models: https://huggingface.co/pageman/discovery2-cytotoxicity-models
Dataset: https://huggingface.co/datasets/pageman/discovery2-results

License

These models are provided for research purposes under the CC-BY-NC-SA-4.0 license. Please check with the original data sources for their licensing terms.

Contact

For questions or issues, please open a discussion on this repository.
