

Can ML predictors detect mislabeled amyloids? 🧠🧬

publications
amyloids
A fascinating study showing that amyloid prediction tools can identify misannotated training data, even when those wrong labels were part of the training set.
Author: BioGenies Lab
Published: April 26, 2021
Keywords: amyloid, AmyloGram, machine learning, bioinformatics, IR spectroscopy, AFM, weak supervision


📌 Project highlights

  • 🧬 Tests robustness of amyloid prediction tools
  • ⚠️ Investigates mislabeled experimental datasets
  • 🤖 AmyloGram successfully identifies annotation errors
  • 🔬 Combines ML with IR spectroscopy & AFM validation
  • 🧠 Demonstrates resilience to weak supervision

🎉 New paper out!

What happens when the training data itself is wrong? 😅

👉 Bioinformatics methods for identification of amyloidogenic peptides show robustness to misannotated training data


🔗 Paper links

  • 📄 Nature Scientific Reports article
  • 🌐 AmyloGram

🎧 Audio summary

Machine learning models are only as good as their training data… right? πŸ€”

But what if:

  • experiments disagree,
  • databases contain annotation errors,
  • and the model was trained on mislabeled sequences?

This paper tested exactly that.



🔬 What is this about?

Amyloid aggregation is central to diseases such as:

  • Alzheimer’s disease
  • Parkinson’s disease
  • systemic amyloidoses

But identifying whether a peptide is amyloidogenic is difficult.

Researchers typically rely on:

  • 🔬 AFM or electron microscopy
  • 🌈 Thioflavin T staining
  • 📈 infrared spectroscopy (IR)

The problem?

👉 these methods do not always agree.

Especially for:

  • oligomers,
  • transient aggregates,
  • partially aggregated peptides.

This can produce:

❌ mislabeled databases
❌ noisy training sets
❌ weak supervision problems


🤖 The core question

Can machine learning models:

👉 detect errors in their own training data?

That is a demanding test of overfitting: a model that has simply memorized its training labels will reproduce the errors instead of flagging them 😄


🧠 What we tested

The study analyzed:

  • AmyloGram
  • PATH
  • FoldAmyloid
  • PASTA 2.0

We selected:

  • peptides whose predictions strongly agreed with their database labels ✅
  • peptides whose predictions strongly disagreed with their database labels ❌

and then re-tested them experimentally.
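
A minimal sketch of that selection step, assuming a predictor returns an amyloid probability between 0 and 1 for each peptide. The `Peptide` record, the `split_by_agreement` helper, the 0.4 margin, and the peptide names and scores are all hypothetical and made up for illustration; the selection criteria used in the paper may differ.

```python
from dataclasses import dataclass

@dataclass
class Peptide:
    name: str         # placeholder identifier, not a real sequence
    db_label: int     # 1 = annotated amyloid, 0 = annotated non-amyloid
    pred_prob: float  # predictor's amyloid probability in [0, 1]

def split_by_agreement(peptides, margin=0.4):
    """Return (strong_agree, strong_disagree) lists.

    A prediction counts as "strong" when it lies at least `margin`
    away from the 0.5 decision boundary (an assumed, arbitrary cutoff).
    """
    agree, disagree = [], []
    for p in peptides:
        if abs(p.pred_prob - 0.5) < margin:
            continue  # too uncertain to count either way
        predicted = 1 if p.pred_prob >= 0.5 else 0
        (agree if predicted == p.db_label else disagree).append(p)
    return agree, disagree

# Made-up scores, for illustration only.
toy = [
    Peptide("pep1", 1, 0.95),  # confident and matches the label
    Peptide("pep2", 0, 0.03),  # confident and matches the label
    Peptide("pep3", 1, 0.04),  # confident but contradicts the label
    Peptide("pep4", 0, 0.97),  # confident but contradicts the label
    Peptide("pep5", 1, 0.60),  # not confident enough, skipped
]

agree, disagree = split_by_agreement(toy)
print("candidates for re-testing:", [p.name for p in disagree])
```

The confident disagreements are the interesting ones: they are the peptides where either the predictor or the database annotation is likely wrong, which is what an experiment can settle.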


🔬 Experimental validation

To verify peptide behavior, the team used:

🧪 IR spectroscopy

Two complementary IR methods:

  • ATR-FTIR
  • IR microscopy

🔬 Atomic Force Microscopy (AFM)

Used for direct visualization of aggregates and fibrils.


🧬 The cool part: oligomers

The paper highlights something very important:

👉 amyloid prediction is NOT binary.

There are at least 3 experimentally relevant classes:

  • non-amyloid
  • oligomer
  • mature amyloid fibril

And oligomers are especially tricky because:

  • they are often highly toxic,
  • experimentally unstable,
  • difficult to classify.

📊 Key results

⚠️ Massive misannotation discovered

Among 24 experimentally tested “outlier” peptides:

  • 17 (about 71%) were actually misannotated in databases

That means:

👉 the ML model was often RIGHT
👉 the database labels were WRONG 😅


🤖 AmyloGram resisted overfitting

Even though those mislabeled peptides were:

👉 already present in the training set,

AmyloGram still identified many as suspicious.

This is a huge result.

It suggests:

🧠 ML models can sometimes act as quality-control filters for biological databases.


📈 Spectroscopy + PCA worked beautifully

The paper also used:

  • principal component analysis (PCA)
  • on IR spectra

to separate:

  • amyloids
  • non-amyloids
  • ambiguous oligomers
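
To give a feel for how PCA can pull spectra apart, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the spectra are synthetic single Gaussian bands, the band positions (lower wavenumbers for the "amyloid-like" pair) are picked by hand, and `gaussian_band` and `pc1_scores` are hypothetical helpers, not the analysis pipeline from the paper.

```python
import math

def gaussian_band(wavenumbers, center, width=8.0):
    """Synthetic absorbance band: a single Gaussian peak at `center`."""
    return [math.exp(-((w - center) / width) ** 2) for w in wavenumbers]

def pc1_scores(rows, iterations=100):
    """Project each row onto the first principal component.

    The rows are mean-centered, then power iteration on X^T X
    finds the direction of largest variance.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = X[0][:]  # starting guess; must not be orthogonal to PC1
    for _ in range(iterations):
        proj = [sum(x[j] * v[j] for j in range(d)) for x in X]          # X v
        v = [sum(proj[i] * X[i][j] for i in range(n)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return [sum(x[j] * v[j] for j in range(d)) for x in X]

# Assumed wavenumber grid (cm^-1) and hand-picked band centers.
wavenumbers = list(range(1600, 1701, 2))
spectra = {
    "amyloid_like_1": gaussian_band(wavenumbers, 1622),
    "amyloid_like_2": gaussian_band(wavenumbers, 1626),
    "non_amyloid_like_1": gaussian_band(wavenumbers, 1650),
    "non_amyloid_like_2": gaussian_band(wavenumbers, 1654),
}

scores = pc1_scores(list(spectra.values()))
for name, score in zip(spectra, scores):
    print(f"{name}: PC1 score = {score:+.3f}")
```

On this toy data the two hand-made groups fall on opposite sides of zero along the first component. Real spectra are noisier and typically need preprocessing such as baseline correction and normalization before PCA is meaningful.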

🔬 Experimental insights

🧪 IR microscopy outperformed ATR-FTIR

The authors conclude that:

👉 IR microscopy was generally more reliable

because ATR-FTIR could be influenced by:

  • water interactions
  • sample thickness
  • surface effects

🧬 Oligomers complicate annotations

Some sequences behaved differently depending on the method used.

Example:

  • SFLIFL formed oligomers but not mature fibrils

This explains why database labels may become inconsistent.


🧠 Why this matters

📚 For machine learning

This paper is a great example of:

👉 weak supervision in biology.

The training labels are not absolute truth.


🔬 For amyloid research

It shows that:

  • experimental methods have biases
  • annotation errors are common
  • computational tools can help detect inconsistencies

🚀 For bioinformatics pipelines

The work suggests that prediction tools can serve as:

  • classifiers
  • benchmarkers
  • and dataset-cleaning systems

💚 BioGenies perspective

This paper quietly delivers a very deep message:

👉 biological databases are messy.

And sometimes:

🤖 the model understands the biology better than the labels.

It also nicely connects:

  • AmyloGram
  • peptide aggregation
  • reproducibility
  • weak supervision
  • experimental spectroscopy

into one coherent story.


 
