Are AMP models lying to us? 🤔🧬 The benchmarking problem
AMP, machine learning, benchmarking, negative dataset, bias, bioinformatics, reproducibility
📌 Highlights
- 🤖 Built 660 ML models across 12 architectures
- 🧪 Tested 11 negative sampling strategies
- ⚠️ Shows benchmarking in AMP prediction is biased
- 📉 Performance depends on dataset similarity (not model quality)
- 🚀 Introduces AMPBenchmark for fair evaluation
🚀 New paper out! This one tackles a very uncomfortable truth:
👉 your "state-of-the-art" AMP model might just be lucky 🍀
📄 Benchmarks in antimicrobial peptide prediction are biased due to the selection of negative data
🔗 Try it yourself
👉 Finally, a way to benchmark AMP models fairly
🎧 Audio summary
Many AMP predictors claim:
👉 "we outperform previous methods"
But… what if the benchmark itself is broken? 🙈
👉 Here's a short audio overview 🎧 explaining the problem:
🔬 What is this about?
Antimicrobial peptides (AMPs):
- 🧬 short bioactive peptides
- 🦠 kill bacteria, viruses, and cancer cells
- 💊 a promising alternative to antibiotics
👉 Because of that:
- dozens of ML predictors exist
- each claiming better performance
💡 BUT:
👉 all of them depend on training data
especially:
👉 negative data (non-AMPs)
⚠️ The core problem
There are:
- thousands of known AMPs ✅
- almost no confirmed non-AMPs ❌
👉 so researchers:
- generate artificial negatives
- using different sampling strategies
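To make this concrete, here is a hedged sketch of one common negative-sampling strategy: cutting length-matched fragments from presumed non-AMP proteins. The function name and toy data are made up for illustration; the paper itself compares 11 such strategies, each producing a different negative set.

```python
import random

random.seed(0)  # reproducible toy example

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_negatives(amp_lengths, source_proteins):
    """Cut random fragments from presumed non-AMP proteins,
    matching the length distribution of the positive (AMP) set."""
    negatives = []
    for length in amp_lengths:
        # pick a protein long enough to yield a fragment of this length
        candidates = [p for p in source_proteins if len(p) >= length]
        protein = random.choice(candidates)
        start = random.randrange(len(protein) - length + 1)
        negatives.append(protein[start:start + length])
    return negatives

# toy data: two hypothetical AMP sequences and five random "proteins"
amps = ["GIGKFLHSAKKF", "KWKLFKKIEK"]
proteins = ["".join(random.choices(AMINO_ACIDS, k=200)) for _ in range(5)]
negatives = sample_negatives([len(a) for a in amps], proteins)
```

Swap the fragment source or the length matching and you get a different negative set, which is exactly where the benchmark bias creeps in.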
🔥 Problem:
👉 these strategies create very different datasets
And ML models:
👉 learn dataset artifacts, not biology
🔧 What they did
🧪 Massive benchmark
- 12 ML architectures
- 11 negative sampling methods
- 660 models total
🔁 Full cross-evaluation
Each model:
- trained on one dataset
- tested on ALL others
👉 not just "friendly benchmarks"
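The cross-evaluation above boils down to a train-on-one, test-on-all loop. A minimal sketch with toy stand-ins for the model and scorer (the actual study trained 12 real architectures per dataset):

```python
def cross_evaluate(datasets, train_fn, score_fn):
    """Train one model per dataset, then score every model on every
    dataset's test split -> a (train_name, test_name) score matrix."""
    models = {name: train_fn(d["train"]) for name, d in datasets.items()}
    return {
        (tr, te): score_fn(models[tr], datasets[te]["test"])
        for tr in datasets
        for te in datasets
    }

# toy stand-ins: a "model" is just the mean label of its training data
datasets = {
    "A": {"train": [0, 1, 1], "test": [1, 1, 0]},
    "B": {"train": [0, 0, 1], "test": [0, 0, 1]},
}
train = lambda labels: sum(labels) / len(labels)
score = lambda model, labels: 1 - abs(model - sum(labels) / len(labels))
matrix = cross_evaluate(datasets, train, score)  # 2 x 2 = 4 entries
```

The diagonal of this matrix is the "friendly benchmark" (train set matches test set); the off-diagonal entries are where the paper shows performance collapsing.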
📊 Evaluation metric
- ROC curves
- AUC scores
📊 Key results
⚠️ Benchmarking is biased
📈 Models perform best when:
- training set = benchmark set
📉 Performance drops when:
- datasets differ
👉 meaning:
models don't generalize, they memorize dataset structure
🧬 Dataset similarity drives performance
Strong correlations found between:
- amino acid composition similarity
- sequence length similarity
and model performance
👉 not biology
👉 just dataset artifacts
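One simple way such composition-driven artifacts could be probed (a simplified sketch, not the paper's exact similarity measure): compare pooled amino acid frequency profiles between two datasets. The smaller the distance, the easier it is for a model to "transfer" without learning any biology.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences):
    """Pooled amino acid frequencies across a set of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def composition_distance(set_a, set_b):
    """L1 distance between composition profiles; 0 = identical usage,
    2 = completely disjoint amino acid usage."""
    ca, cb = composition(set_a), composition(set_b)
    return sum(abs(ca[aa] - cb[aa]) for aa in AMINO_ACIDS)

d_same = composition_distance(["KKLL", "LKLK"], ["LLKK"])  # identical profiles
d_far = composition_distance(["KKKK"], ["LLLL"])           # disjoint profiles
```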
🤖 Architecture matters… but less than you think
- Random Forest models performed best
- Deep learning ≠ automatically better
👉 but dataset choice still dominates
🔍 Reproducibility crisis 🚨
- ~70% of models are not reproducible
- lack of code / data sharing
👉 slows down the field
💡 Key insight
👉 We don't actually know:
which AMP model is the best
Because:
- benchmarks are biased
- comparisons are unfair
- datasets are inconsistent
👉 "state-of-the-art" = often dataset-specific
🚀 What we have built
🌐 AMPBenchmark
A platform for:
- fair model comparison
- standardized datasets
- reproducible evaluation
👉 similar idea to:
- Kaggle-style benchmarking
👉 solves:
- hidden bias
- unfair comparisons
- reproducibility issues
🌍 Why this matters
🧠 For ML in biology
This is not just an AMP problem:
👉 it affects:
- protein function prediction
- interaction prediction
- genomics ML
⚠️ For researchers
You should:
- question benchmarks
- test cross-dataset generalization
- avoid overfitting to dataset design
💊 For drug discovery
Bad models =
👉 missed therapeutic candidates
👉 false positives
👀 BioGenies perspective
This paper is 🔥 because it says:
👉 "your model is not as good as you think"
And more importantly:
👉 it shows WHY
