---
license: cc-by-nc-nd-4.0
---

This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction.

- `embeddings` folder contains processed Hugging Face datasets with PeptideCLM embeddings. The `.csv` is the pre-processed data.
- `metrics` folder contains the model performance on the validation data.
- `models` hosts all trained model weights.
- `training_data` hosts all **raw data** used to train the classifiers.
- `functions` contains files to utilize the trained weights and classifiers.
- `train` contains the script to train classifiers on the pre-processed embeddings, using either XGBoost or MLPs.
- `scoring_functions.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications.

# PeptiVerse 🧬🌌

A collection of machine learning predictors for canonical and non-canonical peptide property prediction from SMILES representations.

🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

## Predictors 🧫

PeptiVerse includes the following property predictors:

| Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
|-----------|-------------|----------------|----------------------|--------------|------------|
| **Non-Hemolysis** | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
| **Solubility** | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
| **Non-Fouling** | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
| **Permeability** | Cell membrane permeability (PAMPA lipophilicity score log P scale, range -10 to 0) | Values ≥ −6.0 indicate strong permeability and values < −6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7,451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
| **Binding Affinity** | Peptide-protein binding strength (-log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0 − 7.5), and high binding (≥ 7.5) | PepLand | 1,806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |

## Model Performance 🌟

#### Binary Classification Predictors

| Predictor | Val AUC | Val F1 |
|-----------|---------|--------|
| **Non-Hemolysis** | 0.7902 | 0.8260 |
| **Solubility** | 0.6016 | 0.5767 |
| **Non-Fouling** | 0.9327 | 0.8774 |

#### Regression Predictors

| Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
|-----------|------------------------------|----------------------------|
| **Permeability** | 0.958 | 0.710 |
| **Binding Affinity** | 0.805 | 0.611 |

## Setup 🌟

1. Clone the repository:
   ```bash
   git clone https://github.com/sophtang/PeptiVerse.git
   cd PeptiVerse
   ```

2. Install environment:
   ```bash
   conda env create -f environment.yml
   conda activate peptiverse
   ```

3. Change the `base_path` in each file to ensure that all model weights and tokenizers are loaded correctly.

## Usage 🌟

#### 1. Hemolysis Prediction

Predicts the probability that a peptide is **not hemolytic**. Higher scores indicate safer peptides.

```python
import sys
sys.path.append('/path/to/PeptiVerse')

from functions.hemolysis.hemolysis import Hemolysis

# Initialize predictor
hemo = Hemolysis()

# Input peptide in SMILES format
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = hemo(peptides)
print(f"Non-hemolytic probability: {scores[0]:.3f}")
```

**Output interpretation:**
- Score close to 1.0 = likely non-hemolytic (safe)
- Score close to 0.0 = likely hemolytic (unsafe)

---

#### 2. Solubility Prediction

Predicts aqueous solubility. Higher scores indicate better solubility.

```python
from functions.solubility.solubility import Solubility

# Initialize predictor
sol = Solubility()

# Input peptide in SMILES format
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = sol(peptides)
print(f"Solubility probability: {scores[0]:.3f}")
```

**Output interpretation:**
- Score close to 1.0 = highly soluble
- Score close to 0.0 = poorly soluble

---

#### 3. Nonfouling Prediction

Predicts protein resistance/non-fouling properties.

```python
from functions.nonfouling.nonfouling import Nonfouling

# Initialize predictor
nf = Nonfouling()

# Input peptide in SMILES format
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = nf(peptides)
print(f"Nonfouling score: {scores[0]:.3f}")
```

**Output interpretation:**
- Higher scores = better non-fouling properties

---

#### 4. Permeability Prediction

Predicts membrane permeability on a log P scale.

```python
from functions.permeability.permeability import Permeability

# Initialize predictor
perm = Permeability()

# Input peptide in SMILES format
peptides = [
    "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
]

# Get predictions
scores = perm(peptides)
print(f"Permeability (log P): {scores[0]:.3f}")
```

**Output interpretation:**
- Higher values = more permeable
- Typical range: -10 to 0 (log scale)

---

#### 5. Binding Affinity Prediction

Predicts peptide-protein binding affinity. Requires both the peptide and the target protein sequence.

```python
from functions.binding.binding import BindingAffinity

# Target protein sequence (amino acid format)
target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."

# Initialize predictor with target protein
binding = BindingAffinity(prot_seq=target_protein)

# Input peptide in SMILES format
peptides = [
    "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
]

# Get predictions
scores = binding(peptides)
print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
```

**Output interpretation:**
- Higher values = stronger binding
- Scale: -log(Kd/Ki/IC50)
- 7.5+ = tight binding (≤ ~30 nM)
- 6.0-7.5 = medium binding (~30 nM - 1 μM)
- <6.0 = weak binding (> 1 μM)

---

## Batch Processing 🌟

All predictors support batch processing for multiple peptides:

```python
from functions.hemolysis.hemolysis import Hemolysis

hemo = Hemolysis()

# Multiple peptides
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)O",
    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
]

# Get predictions for all peptides in one call
scores = hemo(peptides)
for i, score in enumerate(scores):
    print(f"Peptide {i+1}: {score:.3f}")
```

---

## Unified Scoring with Multiple Predictors 🌟

For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.

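Because each peptide comes back as a bare score vector, it can help to pair every value with the name of the scoring function that produced it. The following is a minimal sketch, not part of the repository: `label_scores` is a hypothetical helper, and the numbers are made up to stand in for a real score array with one row per peptide and one column per scoring function.

```python
from typing import Dict, Iterable, List, Sequence


def label_scores(
    score_func_names: Sequence[str],
    scores: Iterable[Sequence[float]],
) -> List[Dict[str, float]]:
    """Pair each column of a (num_peptides x num_functions) score matrix with its name.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    labeled = []
    for row in scores:
        # Each row must have exactly one score per requested scoring function.
        if len(row) != len(score_func_names):
            raise ValueError(
                f"expected {len(score_func_names)} scores per peptide, got {len(row)}"
            )
        labeled.append(
            {name: float(value) for name, value in zip(score_func_names, row)}
        )
    return labeled


# Made-up numbers standing in for a real score array (2 peptides x 4 functions).
names = ["solubility", "hemolysis", "nonfouling", "permeability"]
example_scores = [
    [0.81, 0.92, 0.40, -6.8],
    [0.35, 0.77, 0.66, -5.2],
]

for i, row in enumerate(label_scores(names, example_scores), start=1):
    print(f"Peptide {i}: {row}")
```

The helper only iterates over rows, so it should work equally on nested lists and on a two-dimensional numpy array, provided the names are passed in the same order that was used to request the scores.
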
### Basic Usage

```python
import sys
sys.path.append('/path/to/PeptiVerse')

from scoring_functions import ScoringFunctions

# Initialize with desired scoring functions
# Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
#            'solubility', 'hemolysis', 'nonfouling'
scoring = ScoringFunctions(
    score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
    prot_seqs=[]  # Empty if not using binding affinity
)

# Input peptides in SMILES format
peptides = [
    'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
    'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
]

# Get scores (returns numpy array of shape: num_peptides x num_functions)
scores = scoring(input_seqs=peptides)
print(scores)
```

### Adding Binding Affinity

```python
from scoring_functions import ScoringFunctions

# Target protein sequence (amino acid format)
tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."

# Initialize with binding affinity for one protein
scoring = ScoringFunctions(
    score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
    prot_seqs=[tfr_protein]  # Provide target protein sequence
)

peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
scores = scoring(input_seqs=peptides)

# scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
print("Scores for peptide 1:")
print(f"  Binding Affinity: {scores[0][0]:.3f}")
print(f"  Solubility: {scores[0][1]:.3f}")
print(f"  Hemolysis: {scores[0][2]:.3f}")
print(f"  Permeability: {scores[0][3]:.3f}")
```

### Multiple Binding Targets

```python
from scoring_functions import ScoringFunctions

# For dual binding affinity prediction
protein1 = "MMDQARSAFSNLFGGEPLSYTR..."  # First target
protein2 = "MTKSNGEEPKMGGRMERFQQGV..."  # Second target

scoring = ScoringFunctions(
    score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
    prot_seqs=[protein1, protein2]  # Provide both protein sequences
)

peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
scores = scoring(input_seqs=peptides)

# scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
```

### Output Format

The `ScoringFunctions` class returns a numpy array where:
- **Rows**: Each row corresponds to one input peptide
- **Columns**: Each column corresponds to one scoring function (in the order specified)

```python
# Example with 3 peptides and 4 scoring functions
scores = scoring(input_seqs=peptides)

# Shape: (3, 4)
# scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
# scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
# scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
```

---

## Complete Example 🌟

```python
import sys
sys.path.append('/path/to/PeptiVerse')

from functions.hemolysis.hemolysis import Hemolysis
from functions.solubility.solubility import Solubility
from functions.permeability.permeability import Permeability

# Initialize predictors
hemo = Hemolysis()
sol = Solubility()
perm = Permeability()

# Test peptide
peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]

# Get all predictions
hemo_score = hemo(peptide)[0]
sol_score = sol(peptide)[0]
perm_score = perm(peptide)[0]

print("Peptide Property Predictions:")
print(f"  Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
print(f"  Solubility: {sol_score:.3f}")
print(f"  Permeability: {perm_score:.3f}")
```

---

## Model Architecture 🌟

The hemolysis, solubility, non-fouling, and permeability predictors use:
- **Embeddings**: PeptideCLM-23M (RoFormer-based peptide language model)
- **Classifier**: XGBoost gradient boosting
- **Input**: SMILES representation of peptides
- **Training**: Models trained on curated datasets with cross-validation

Permeability additionally uses molecular descriptors, and binding affinity instead uses a cross-attention transformer (ESM2 + PeptideCLM), as listed in the Predictors table.

---

## Citation

If you find this repository helpful for your publications, please consider citing our paper:

```
@article{tang2025peptune,
  title={Peptune: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
  journal={42nd International Conference on Machine Learning},
  year={2025}
}
```

To use this repository, you agree to abide by the MIT License.
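
---

## Interpreting Scores 🌟

The thresholds given in the Predictors table and in the output interpretations above can be applied in code. The sketch below is illustrative only: the helper names are hypothetical and not part of the repository, and the nanomolar conversion assumes the binding score is a base-10 negative logarithm of a concentration in mol/L, which is consistent with the stated cutoffs (7.5 ≈ 30 nM, 6.0 = 1 μM) but is not confirmed by the source code.

```python
def classify_binding(score: float) -> str:
    """Map a predicted -log(Kd/Ki/IC50) value to the classes in the Predictors table."""
    if score >= 7.5:
        return "high"
    if score >= 6.0:
        return "medium"
    return "weak"


def classify_permeability(score: float) -> str:
    """Apply the -6.0 cutoff from the Predictors table to a permeability score."""
    return "strong" if score >= -6.0 else "weak"


def affinity_to_nanomolar(score: float) -> float:
    """Convert a binding score to nM.

    Assumption: score = -log10(concentration in mol/L).
    """
    return 10.0 ** (-score) * 1e9


# Made-up scores for illustration; real values would come from the predictors.
binding_score = 7.5
permeability_score = -6.8

print(f"Binding: {classify_binding(binding_score)} "
      f"(~{affinity_to_nanomolar(binding_score):.1f} nM)")
print(f"Permeability: {classify_permeability(permeability_score)}")
```

The classification predictors (hemolysis, solubility, non-fouling) return probabilities, and the source does not specify a decision threshold for them, so none is hard-coded here.
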