Can ML predictors detect mislabeled amyloids?
amyloid, AmyloGram, machine learning, bioinformatics, IR spectroscopy, AFM, weak supervision
Project highlights
- Tests robustness of amyloid prediction tools
- Investigates mislabeled experimental datasets
- AmyloGram successfully identifies annotation errors
- Combines ML with IR spectroscopy & AFM validation
- Demonstrates resilience to weak supervision
New paper out!
What happens when the training data itself is wrong?
Paper links
- Nature Scientific Reports article
- AmyloGram
Machine learning models are only as good as their training data... right?
But what if:
- experiments disagree,
- databases contain annotation errors,
- and the model was trained on mislabeled sequences?
This paper tested exactly that.
What is this about?
Amyloid aggregation is central to diseases such as:
- Alzheimer's disease
- Parkinson's disease
- systemic amyloidoses
But identifying whether a peptide is amyloidogenic is difficult.
Researchers typically rely on:
- AFM or electron microscopy
- Thioflavin T staining
- infrared spectroscopy (IR)
The problem?
These methods do not always agree.
Especially for:
- oligomers,
- transient aggregates,
- partially aggregated peptides.
This can produce:
- mislabeled databases
- noisy training sets
- weak supervision problems
The core question
Can machine learning models detect errors in their own training data?
That is a demanding test of overfitting: a model that had simply memorized its training labels would reproduce the errors instead of flagging them.
What we tested
The study analyzed:
- AmyloGram
- PATH
- FoldAmyloid
- PASTA 2.0
We selected:
- peptides whose predictions strongly agreed with database labels
- peptides whose predictions strongly disagreed with database labels
and then re-tested them experimentally.
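The selection step above can be sketched in a few lines. This is only an illustration under assumptions, not the procedure from the paper: the `Peptide` record, the `agreement` score, the cutoff `k`, and the toy values are all hypothetical. The idea is to score each peptide by how much probability the predictor assigns to the database label, then take both ends of the ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peptide:
    name: str         # placeholder identifier, not a real sequence
    db_label: int     # database annotation: 1 = amyloid, 0 = non-amyloid
    p_amyloid: float  # predictor's probability of "amyloid" (hypothetical values)


def agreement(p: Peptide) -> float:
    """Probability the predictor assigns to the database label."""
    return p.p_amyloid if p.db_label == 1 else 1.0 - p.p_amyloid


def select_extremes(peptides, k):
    """Return (k strongest agreements, k strongest disagreements)."""
    ranked = sorted(peptides, key=agreement)       # lowest agreement first
    disagreeing = ranked[:k]
    agreeing = ranked[len(ranked) - k:][::-1]      # highest agreement first
    return agreeing, disagreeing


# Toy data, invented purely for illustration.
toy = [
    Peptide("pep_a", 1, 0.97),
    Peptide("pep_b", 1, 0.06),
    Peptide("pep_c", 0, 0.08),
    Peptide("pep_d", 0, 0.85),
    Peptide("pep_e", 1, 0.55),
    Peptide("pep_f", 0, 0.40),
]

agreeing, disagreeing = select_extremes(toy, 2)
print("agree:", [p.name for p in agreeing])
print("disagree:", [p.name for p in disagreeing])
```

In this toy set, the strongly disagreeing peptides are the candidates one would send back to the bench.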
Experimental validation
To verify peptide behavior, the team used:
IR spectroscopy
Two complementary IR methods:
- ATR-FTIR
- IR microscopy
Atomic Force Microscopy (AFM)
Used for direct visualization of aggregates and fibrils.
The cool part: oligomers
The paper highlights something very important:
Amyloid prediction is NOT binary.
There are at least 3 experimentally relevant classes:
- non-amyloid
- oligomer
- mature amyloid fibril
And oligomers are especially tricky because:
- they are often highly toxic,
- experimentally unstable,
- difficult to classify.
Key results
Massive misannotation discovered
Among 24 experimentally tested "outlier" peptides:
- 17 (about 71%) were actually misannotated in databases
That means:
- the ML model was often RIGHT
- the database labels were WRONG
AmyloGram resisted overfitting
Even though those mislabeled peptides were already present in the training set, AmyloGram still identified many as suspicious.
This is a huge result.
It suggests that ML models can sometimes act as quality-control filters for biological databases.
Spectroscopy + PCA worked beautifully
The paper also used:
- principal component analysis (PCA)
- on IR spectra
to separate:
- amyloids
- non-amyloids
- ambiguous oligomers
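To make the PCA idea concrete, here is a minimal sketch on synthetic spectra. It is not the paper's analysis pipeline: the helper names, the Gaussian band shapes, the noise level, and the two band positions (a fibril-like amide I band near 1625 cm^-1 and another near 1650 cm^-1, a common rule of thumb) are assumptions chosen only to generate toy data. For brevity it uses two groups instead of three.

```python
import numpy as np


def pca_scores(spectra, n_components=2):
    """PCA via SVD of the mean-centered matrix (rows = samples)."""
    centered = spectra - spectra.mean(axis=0)
    u, s, _vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)  # fraction of variance per component
    return scores, explained[:n_components]


def synthetic_spectrum(wavenumbers, center, rng, width=12.0, noise=0.01):
    """One Gaussian absorption band plus small random noise (toy model)."""
    band = np.exp(-0.5 * ((wavenumbers - center) / width) ** 2)
    return band + rng.normal(0.0, noise, size=wavenumbers.shape)


rng = np.random.default_rng(0)
wavenumbers = np.linspace(1580.0, 1720.0, 141)  # amide I region, cm^-1

# Assumed band positions, used only to simulate two groups of spectra.
fibril_like = [synthetic_spectrum(wavenumbers, 1625.0, rng) for _ in range(5)]
other = [synthetic_spectrum(wavenumbers, 1650.0, rng) for _ in range(5)]

spectra = np.vstack(fibril_like + other)
scores, explained = pca_scores(spectra)
pc1 = scores[:, 0]
print("PC1 scores, fibril-like:", np.round(pc1[:5], 2))
print("PC1 scores, other:      ", np.round(pc1[5:], 2))
```

With real spectra, baseline correction and normalization would normally come first, and intermediate species such as oligomers may land between the clusters rather than in a clean group of their own.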
Experimental insights
IR microscopy outperformed ATR-FTIR
The authors conclude that IR microscopy was generally more reliable,
because ATR-FTIR could be influenced by:
- water interactions
- sample thickness
- surface effects
Oligomers complicate annotations
Some sequences behaved differently depending on the method used.
Example:
- SFLIFL formed oligomers but not mature fibrils
This explains why database labels may become inconsistent.
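A small sketch shows how a three-state reality collapses into inconsistent binary labels. The enum and the `oligomers_count_as_amyloid` switch are hypothetical names introduced for illustration; only the three classes and the SFLIFL observation come from the text above.

```python
from enum import Enum


class AggregationState(Enum):
    NON_AMYLOID = "non-amyloid"
    OLIGOMER = "oligomer"
    FIBRIL = "mature amyloid fibril"


def to_binary_label(state, oligomers_count_as_amyloid):
    """Collapse three states into a binary database label (1 = amyloid)."""
    if state is AggregationState.FIBRIL:
        return 1
    if state is AggregationState.OLIGOMER:
        # The convention chosen here decides the label for oligomers.
        return 1 if oligomers_count_as_amyloid else 0
    return 0


# SFLIFL is described above as forming oligomers but not mature fibrils.
sflifl = AggregationState.OLIGOMER
strict = to_binary_label(sflifl, oligomers_count_as_amyloid=False)
lenient = to_binary_label(sflifl, oligomers_count_as_amyloid=True)
print("strict convention:", strict, "| lenient convention:", lenient)
```

The same peptide gets a different binary label depending on which convention an annotator, or a detection method, implicitly follows.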
Why this matters
For machine learning
This paper is a great example of weak supervision in biology.
The training labels are not absolute truth.
For amyloid research
It shows that:
- experimental methods have biases
- annotation errors are common
- computational tools can help detect inconsistencies
For bioinformatics pipelines
The work suggests that prediction tools can serve as:
- classifiers
- benchmarkers
- and dataset-cleaning systems
BioGenies perspective
This paper quietly delivers a very deep message: biological databases are messy.
And sometimes the model understands the biology better than the labels.
It also nicely connects:
- AmyloGram
- peptide aggregation
- reproducibility
- weak supervision
- experimental spectroscopy
into one coherent story.
