LLPS datasets & benchmarking: bringing order to condensate chaos 🧬📊

publications
LLPS
We present curated datasets of proteins involved in liquid–liquid phase separation (LLPS), enabling robust benchmarking and improved machine learning predictions in condensate biology.
Author

BioGenies Lab

Published

July 9, 2025

Keywords

LLPS, liquid-liquid phase separation, biomolecular condensates, protein datasets, machine learning, bioinformatics, benchmarking


📌 Project highlights

  • 🧬 First integrated datasets of LLPS proteins (drivers, clients, negatives)
  • ⚖️ Introduces standardized negative datasets (structured + disordered proteins)
  • 📊 Benchmark of 16 LLPS prediction tools, the most comprehensive to date
  • 🧠 Reveals limitations of current ML models for phase separation
  • 🌐 Public resource: datasets + website for reproducible research

🎉 New paper out! And this one is a big one: not just another tool, but a foundation for the whole field 😄

👉 Comprehensive protein datasets and benchmarking for liquid–liquid phase separation studies


🔗 Data & resources

This is not just a paper; it's a community resource for building and benchmarking LLPS models.

  • 🌐 Dataset website
  • 💻 GitHub

🎧 Audio summary

Not everyone wants to read about κ parameters and sticker–spacer distributions over coffee ☕😄

👉 We've attached a short audio explanation 🎧 to make things easier.



🔬 What is this about?

Liquid–liquid phase separation (LLPS) is a process in which proteins form biomolecular condensates: membrane-less compartments that organize cellular processes.

These condensates are crucial for:

  • gene regulation
  • stress response
  • disease mechanisms (e.g. neurodegeneration)

But studying LLPS computationally has a major problem:

👉 data is messy, inconsistent, and fragmented across databases


🚨 The problem we tackled

Existing LLPS databases:

  • use different definitions
  • contain inconsistent annotations
  • lack reliable negative examples

👉 This makes:

  • machine learning unreliable
  • benchmarking unfair
  • results hard to compare

🧠 What we built

We created high-confidence datasets of proteins involved in LLPS, grouped by the role each protein plays:

🧬 Protein roles

  • Drivers → form condensates independently
  • Clients → join existing condensates
  • Negatives → proteins not associated with LLPS

👉 Importantly, we introduced two types of negative datasets:

  • structured proteins (PDB-like)
  • disordered proteins (DisProt-like)

This is critical because: 👉 disorder alone ≠ LLPS (big source of bias!)
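
To make the role scheme concrete, here is a minimal Python sketch of how the four groups could be represented and turned into labels for a driver-vs-negative task. Everything in it is an assumption for illustration: the class names, the `binary_label` helper, and the placeholder accessions are hypothetical and do not reflect the actual file layout of the published datasets.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    DRIVER = "driver"
    CLIENT = "client"
    NEGATIVE_STRUCTURED = "negative_structured"   # PDB-like negatives
    NEGATIVE_DISORDERED = "negative_disordered"   # DisProt-like negatives


@dataclass(frozen=True)
class ProteinEntry:
    accession: str  # placeholder identifier, not a real accession
    role: Role


NEGATIVE_ROLES = (Role.NEGATIVE_STRUCTURED, Role.NEGATIVE_DISORDERED)


def binary_label(entry, positives=(Role.DRIVER,)):
    """Return 1 for positive roles, 0 for either negative type, None otherwise."""
    if entry.role in positives:
        return 1
    if entry.role in NEGATIVE_ROLES:
        return 0
    return None  # e.g. clients are left out of a driver-vs-negative task


entries = [
    ProteinEntry("P_DRV_1", Role.DRIVER),
    ProteinEntry("P_CLI_1", Role.CLIENT),
    ProteinEntry("P_NEG_S1", Role.NEGATIVE_STRUCTURED),
    ProteinEntry("P_NEG_D1", Role.NEGATIVE_DISORDERED),
]
labels = {entry.accession: binary_label(entry) for entry in entries}
```

Keeping the two negative types as separate roles, rather than one generic "negative" label, is what later lets a model be checked against disordered negatives on their own.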


โš™๏ธ How it works

We used:

  • ๐Ÿ“š integration of multiple LLPS databases
  • ๐Ÿงน strict filtering criteria for data quality
  • ๐Ÿงฌ annotation of:
    • disorder
    • prion-like domains
    • physicochemical features

Then we:

👉 analyzed protein properties
👉 and benchmarked prediction tools
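
The integration and filtering steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the post does not spell out the filtering criteria, so the rule used here (keep a protein only if at least `min_sources` databases list it and they all agree on its role) is a hypothetical stand-in, and the database names and accessions are placeholders.

```python
from collections import defaultdict


def integrate(sources, min_sources=2):
    """Merge {accession: role} mappings and keep consistent, well-supported entries.

    Hypothetical criteria: an entry survives only if it appears in at least
    `min_sources` databases and every database assigns it the same role.
    """
    roles_seen = defaultdict(list)
    for records in sources.values():
        for accession, role in records.items():
            roles_seen[accession].append(role)

    curated = {}
    for accession, roles in roles_seen.items():
        well_supported = len(roles) >= min_sources
        consistent = len(set(roles)) == 1
        if well_supported and consistent:
            curated[accession] = roles[0]
    return curated


# Placeholder records standing in for three source databases.
sources = {
    "db_a": {"P1": "driver", "P2": "client", "P3": "driver"},
    "db_b": {"P1": "driver", "P2": "driver"},
    "db_c": {"P1": "driver", "P3": "driver"},
}
curated = integrate(sources)
```

In this toy input, `P2` is dropped because the two databases that list it disagree on its role, which is the kind of inconsistent annotation described earlier.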


📊 Benchmarking LLPS predictors

We evaluated 16 bioinformatics tools on independent datasets.

Key findings:

  • 🤖 ML-based tools perform best
  • ⚠️ Many tools still have high false positive rates
  • 🧠 Models often confuse disorder with actual phase separation

👉 Translation: we are still not great at predicting LLPS reliably 😄
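
One way to see the disorder confusion is to compute the false positive rate separately on the structured and the disordered negatives. The sketch below assumes a predictor that outputs a score per protein and a cutoff of 0.5; both the `false_positive_rate` helper and the scores are made up for illustration and are not results from the benchmark.

```python
def false_positive_rate(scores, threshold=0.5):
    """Fraction of negative proteins scored at or above the threshold."""
    if not scores:
        raise ValueError("need at least one negative example")
    return sum(score >= threshold for score in scores) / len(scores)


# Toy scores for known negatives (illustrative values, not benchmark output).
structured_negatives = [0.05, 0.10, 0.20, 0.60]
disordered_negatives = [0.40, 0.70, 0.80, 0.90]

fpr_structured = false_positive_rate(structured_negatives)
fpr_disordered = false_positive_rate(disordered_negatives)
```

A large gap between the two rates would suggest that a tool is picking up on disorder rather than on phase separation itself.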


🧬 Key biological insights

From the analysis:

  • 🧩 LLPS is multi-factorial, not driven by a single feature
  • ⚡ properties like charge distribution, aggregation propensity, and interaction regions matter more than simple sequence features
  • 🧠 drivers and clients differ, but the differences are subtle

👉 And importantly: you cannot model LLPS properly without good negative data
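
For readers who want a feel for charge-related descriptors, here is a minimal sketch of two common ones: the fraction of charged residues (FCR) and the net charge per residue (NCPR). It assumes K and R count as positive, D and E as negative, and ignores histidine. These are composition-level values only: they say nothing about how charges are arranged along the sequence, which is what patterning measures such as κ capture, and the function is not taken from our analysis code.

```python
POSITIVE = set("KR")  # assumption: lysine and arginine carry +1
NEGATIVE = set("DE")  # assumption: aspartate and glutamate carry -1


def charge_descriptors(sequence):
    """Return (FCR, NCPR) for a protein sequence in one-letter code."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    n_pos = sum(residue in POSITIVE for residue in sequence)
    n_neg = sum(residue in NEGATIVE for residue in sequence)
    fcr = (n_pos + n_neg) / len(sequence)   # fraction of charged residues
    ncpr = (n_pos - n_neg) / len(sequence)  # net charge per residue
    return fcr, ncpr


# Toy sequence: 2 positive and 4 negative residues out of 8.
fcr, ncpr = charge_descriptors("KKDDEEGS")
```

Two sequences can share the same FCR and NCPR while differing in how their charges are clustered, which is one reason single composition values are not enough on their own.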


🚀 Why this matters

This work:

  • 📊 sets a new standard for benchmarking
  • 🧠 improves machine learning reliability
  • 🔬 enables better biological interpretation of condensates

👉 In short: we finally have a clean dataset to stop guessing and start learning properly


💚 BioGenies context

This project connects directly to our interests in:

  • amyloids 🧬
  • protein aggregation 🧊
  • phase separation ⚡

👉 and gives us a solid foundation for future tools and models
