Machine Learning-Driven Analysis of Small Signaling Peptides in Plants

Prediction

Figure: S²-PepAnalyst system diagram (prediction of small signalling peptides)

This website provides a comprehensive platform for predicting small signaling peptides (SSPs) in key plant species, including tomato (Solanum lycopersicum), avocado cultivars (Hass and Gwen), and the model organism Arabidopsis thaliana. The tool, S²-PepAnalyst, integrates state-of-the-art machine learning techniques to overcome traditional limitations in SSP prediction, such as reliance on canonical signal peptides.

Key Innovations

Dynamic embedding selection: Reinforcement learning (RL) optimizes the choice of feature embeddings (TAPE and ESM-2) at each iteration, enhancing prediction accuracy for both canonical and non-canonical peptides.

Image-based learning: Embeddings are transformed into 28×28 (TAPE) and 36×36 (ESM-2) matrices, enabling topological analysis via GeoTop. These representations are then processed by a modified LeNet-5 CNN (ProtConv), chosen for its robustness to noise and distortion and for its ability to capture local spatial patterns, all of which are critical for representing peptide structural diversity.

Multi-modal integration: The framework concatenates TAPE and ESM-2 embeddings, enriching feature representation before CNN-based classification.
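The reshaping and concatenation steps described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the helper name embedding_to_image is hypothetical, zero-padding the vector up to the square size is an assumption (a 768-dimensional TAPE vector does not fill 28×28 = 784 cells exactly), and the ESM-2 width of 1280 is assumed for the example only. Random vectors stand in for real embeddings.

```python
import numpy as np


def embedding_to_image(vec, side):
    """Zero-pad a 1-D embedding and reshape it to a side x side matrix.

    Assumption: padding with zeros at the end; the source does not say
    how the vector is fitted to the square grid.
    """
    vec = np.asarray(vec, dtype=np.float32).ravel()
    if vec.size > side * side:
        raise ValueError("embedding is longer than the target image")
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: vec.size] = vec
    return padded.reshape(side, side)


rng = np.random.default_rng(0)
tape_vec = rng.normal(size=768)    # stand-in for a TAPE embedding
esm_vec = rng.normal(size=1280)    # stand-in for an ESM-2 embedding (assumed width)

tape_img = embedding_to_image(tape_vec, 28)   # 28 x 28 matrix
esm_img = embedding_to_image(esm_vec, 36)     # 36 x 36 matrix

# Multi-modal integration: plain concatenation of the two embeddings.
fused = np.concatenate([tape_vec, esm_vec])
```

Each matrix can then be passed to a CNN as a single-channel image; how the concatenated vector is arranged before classification is not specified here and is left open in the sketch.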

Performance and Applications

S²-PepAnalyst reaches 99.5% accuracy, validated across curated and generalized datasets, and outperforms existing pipelines (e.g., SignalP 6.0 followed by complementary downstream computational analysis). It addresses critical challenges in plant biology, such as identifying non-canonical peptides (e.g., PEP1) and classifying peptide families (CLE, RALF, etc.). The tool's architecture uses RL to refine predictions dynamically, which keeps it adaptable to novel peptide discovery.
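To make the idea of RL-driven embedding selection concrete, the sketch below uses an epsilon-greedy two-armed bandit that picks either TAPE or ESM-2 at each iteration. The RL algorithm actually used by the tool is not stated here, so the bandit is a swapped-in illustration; run_bandit, select_embedding, and toy_reward are hypothetical names, and the reward is a simulated stand-in for a validation score.

```python
import random


def select_embedding(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the best arm so far."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))
    return max(q_values, key=q_values.get)


def run_bandit(reward_fn, arms=("TAPE", "ESM-2"), iterations=200,
               epsilon=0.1, seed=0):
    """Estimate the mean reward of each embedding with incremental averages."""
    rng = random.Random(seed)
    q_values = {arm: 0.0 for arm in arms}
    counts = {arm: 0 for arm in arms}
    for _ in range(iterations):
        arm = select_embedding(q_values, epsilon, rng)
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        # Incremental mean update: Q <- Q + (r - Q) / n
        q_values[arm] += (reward - q_values[arm]) / counts[arm]
    return q_values, counts


def toy_reward(arm, rng):
    """Hypothetical reward: made-up accuracies, not measured results."""
    base = {"TAPE": 0.80, "ESM-2": 0.90}[arm]
    return base + rng.uniform(-0.05, 0.05)


q_values, counts = run_bandit(toy_reward)
```

In a real pipeline, reward_fn would train or evaluate the classifier with the chosen embedding and return its validation score.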

Classification

The second part of this work classifies small peptides into their signalling families. The methodology consists of the following steps:

1. Compile a comprehensive, literature-based collection covering all known signalling peptide families from Arabidopsis thaliana (e.g., CEP, CRPs, SCOOPs, RALFs), together with newly identified signalling peptides that were assigned to these families through data mining in well-established databases, including NCBI.
2. Construct the geometric representation of each protein sequence i (in 768 dimensions for TAPE and more for ESM) to obtain a finite set of points X_i.
3. Compute the persistence diagrams of each set X_i in dimensions 0, 1, and 2, denoted PD₀(X_i), PD₁(X_i), and PD₂(X_i). This feature also facilitates the identification of behavioural peptides that lack a signal peptide region, such as PEP1, and potentially others.
4. Calculate the n × n distance matrix, where n is the number of sequences and the entry (i, j) is the Wasserstein distance W₀(PD₀(X_i), PD₀(X_j)) between the dimension-0 persistence diagrams.
5. Interpret the distances. If the distance from Step 4 is zero for a given pair, the two sequences have identical dimension-0 persistence diagrams. The comparison is akin to using BLAST but is invariant to changes of scale. Scale invariance helps in detecting functional domains accurately, regardless of their length, leading to better functional annotation of proteins.
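The distance-matrix computation can be sketched in a few lines, under clearly stated assumptions. The sketch takes each X_i to be a point cloud with one point per residue, builds the dimension-0 diagram of a Vietoris-Rips filtration from single-linkage merge heights (all births are 0), ignores the single infinite bar, and uses the 1-Wasserstein distance with an L∞ ground metric. None of these choices is confirmed by the text above, the function names are hypothetical, and random arrays stand in for real embeddings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.optimize import linear_sum_assignment


def h0_deaths(points):
    """Death times of the finite H0 bars (births are all 0).

    For a Vietoris-Rips filtration these equal the single-linkage
    merge heights of the point cloud.
    """
    points = np.asarray(points, dtype=float)
    return linkage(points, method="single")[:, 2]


def wasserstein_h0(d1, d2):
    """1-Wasserstein distance between two H0 diagrams given by death times."""
    n, m = len(d1), len(d2)
    cost = np.zeros((n + m, n + m))
    # Point-to-point cost: births coincide, so only deaths differ.
    cost[:n, :m] = np.abs(d1[:, None] - d2[None, :])
    # Point-to-diagonal cost: L-infinity distance from (0, d) is d / 2.
    cost[:n, m:] = d1[:, None] / 2.0
    cost[n:, :m] = d2[None, :] / 2.0
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()


def pairwise_w0(clouds):
    """n x n matrix of W0 distances between the H0 diagrams of the clouds."""
    deaths = [h0_deaths(c) for c in clouds]
    n = len(deaths)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = wasserstein_h0(deaths[i], deaths[j])
    return dist


rng = np.random.default_rng(0)
# Stand-ins for per-residue embeddings: 12 residues x 768 dimensions each.
clouds = [rng.normal(size=(12, 768)) for _ in range(3)]
clouds.append(clouds[0] + 1.0)  # translated copy of the first cloud
D = pairwise_w0(clouds)
```

Because the diagrams depend only on pairwise distances, the translated copy has a (numerically) zero distance to the original cloud. Dimensions 1 and 2 would need a persistent-homology library and are left out of this sketch.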
