  • πŸ”— Try it yourself
  • 🎧 Audio summary
  • πŸ”¬ What is this about?
  • ⚠️ The core problem
  • 🧠 What they did
    • πŸ§ͺ Massive benchmark
    • πŸ” Full cross-evaluation
    • πŸ“Š Evaluation metric
  • πŸ” Key results
    • ⚠️ Benchmarking is biased
    • 🧬 Dataset similarity drives performance
    • πŸ€– Architecture matters… but less than you think
    • πŸ” Reproducibility crisis 🚨
  • πŸ’‘ Key insight
  • πŸš€ What we have built
    • 🌐 AMPBenchmark
  • πŸš€ Why this matters
    • 🧠 For ML in biology
    • ⚠️ For researchers
    • πŸ’Š For drug discovery
  • πŸ’š BioGenies perspective

Are AMP models lying to us? πŸ€–πŸ§¬ The benchmarking problem

Categories: publications, peptides
A large-scale study reveals that antimicrobial peptide predictors are heavily biased due to how negative datasets are constructed.
Author: BioGenies Lab

Published: September 1, 2022

Keywords: AMP, machine learning, benchmarking, negative dataset, bias, bioinformatics, reproducibility

πŸ“Œ Highlights

  • πŸ€– Built 660 ML models across 12 architectures
  • πŸ§ͺ Tested 11 negative sampling strategies
  • ⚠️ Shows benchmarking in AMP prediction is biased
  • πŸ“Š Performance depends on dataset similarity (not model quality)
  • 🌐 Introduces AMPBenchmark for fair evaluation

πŸŽ‰ New paper out! This one tackles a very uncomfortable truth:

πŸ‘‰ your β€œstate-of-the-art” AMP model might just be lucky πŸ˜„

πŸ‘‰ Benchmarks in antimicrobial peptide prediction are biased due to the selection of negative data


πŸ”— Try it yourself

  • 🌐 Web server
  • πŸ’» GitHub

πŸ‘‰ finally, a way to benchmark AMP models fairly


🎧 Audio summary

Many AMP predictors claim:

πŸ‘‰ β€œwe outperform previous methods”

But… what if the benchmark itself is broken? πŸ˜…

πŸ‘‰ Here’s a short audio overview 🎧 explaining the problem:


πŸ‘‰ Perfect if you want the big picture of ML bias in bioinformatics


πŸ”¬ What is this about?

Antimicrobial peptides (AMPs):

  • 🧬 short bioactive peptides
  • 🦠 kill bacteria, viruses, and cancer cells
  • πŸ’Š promising alternative to antibiotics

πŸ‘‰ Because of that:

  • dozens of ML predictors exist
  • each claiming better performance

πŸ’‘ BUT:

πŸ‘‰ all of them depend on training data

especially:

πŸ‘‰ negative data (non-AMPs)


⚠️ The core problem

There are:

  • thousands of known AMPs βœ…
  • almost no confirmed non-AMPs ❌

πŸ‘‰ so researchers:

  • generate artificial negatives
  • using different sampling strategies (two are sketched below)

πŸ’₯ Problem:

πŸ‘‰ these strategies create very different datasets

And ML models:

πŸ‘‰ learn dataset artifacts, not biology


🧠 What they did

πŸ§ͺ Massive benchmark

  • 12 ML architectures
  • 11 negative sampling methods
  • 660 models total

πŸ” Full cross-evaluation

Each model:

  • trained on one dataset
  • tested on ALL others

πŸ‘‰ not just β€œfriendly benchmarks”


πŸ“Š Evaluation metric

  • ROC curves
  • AUC scores
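
As a toy illustration (invented numbers, not results from the paper), the AUC is computed from predicted AMP probabilities like this:

```python
# Toy example of the evaluation metric: ROC curve + AUC from predicted probabilities.
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 1, 0, 1, 0, 0]               # 1 = AMP, 0 = non-AMP
y_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]   # model's predicted probabilities

fpr, tpr, _ = roc_curve(y_true, y_scores)   # points of the ROC curve
print(roc_auc_score(y_true, y_scores))      # AUC: 1.0 = perfect, 0.5 = random guessing
```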

πŸ” Key results

⚠️ Benchmarking is biased

πŸ‘‰ Models perform best when:

  • training set = benchmark set

πŸ“‰ Performance drops when:

  • datasets differ

πŸ‘‰ meaning:

models don’t generalize, they memorize dataset structure


🧬 Dataset similarity drives performance

Strong correlations were found between model performance and:

  • amino acid composition similarity
  • sequence length similarity

πŸ‘‰ not biology
πŸ‘‰ just dataset artifacts


πŸ€– Architecture matters… but less than you think

  • Random Forest models performed best
  • Deep learning β‰  automatically better

πŸ‘‰ but dataset choice still dominates


πŸ” Reproducibility crisis 🚨

  • ~70% of models not reproducible
  • lack of code / data sharing

πŸ‘‰ slows down the field


πŸ’‘ Key insight

πŸ‘‰ We don’t actually know:

which AMP model is the best

Because:

  • benchmarks are biased
  • comparisons are unfair
  • datasets are inconsistent

πŸ‘‰ β€œstate-of-the-art” = often dataset-specific


πŸš€ What we have built

🌐 AMPBenchmark

A platform for:

  • fair model comparison
  • standardized datasets
  • reproducible evaluation

πŸ‘‰ similar idea to:

  • Kaggle-style benchmarking

πŸ‘‰ solves:

  • hidden bias
  • unfair comparisons
  • reproducibility issues

πŸš€ Why this matters

🧠 For ML in biology

This is not just an AMP problem:

πŸ‘‰ affects:

  • protein function prediction
  • interaction prediction
  • genomics ML

⚠️ For researchers

You should:

  • question benchmarks
  • test cross-dataset generalization
  • avoid overfitting to dataset design

πŸ’Š For drug discovery

Bad models =

πŸ‘‰ missed therapeutic candidates
πŸ‘‰ false positives


πŸ’š BioGenies perspective

This paper is πŸ”₯ because it says:

πŸ‘‰ β€œyour model is not as good as you think”

And more importantly:

πŸ‘‰ shows WHY

 

Β© 2026 Website developed by BioGenies team.
Privacy Policy

Cookie Preferences