LLPS datasets & benchmarking: bringing order to condensate chaos
LLPS, liquid-liquid phase separation, biomolecular condensates, protein datasets, machine learning, bioinformatics, benchmarking
Project highlights
- First integrated datasets of LLPS proteins (drivers, clients, negatives)
- Introduces standardized negative datasets (structured + disordered proteins)
- Benchmark of 16 LLPS prediction tools, the most comprehensive to date
- Reveals limitations of current ML models for phase separation
- Public resource: datasets + website for reproducible research
New paper out! And this one is a big one: not just another tool, but a foundation for the whole field.
Comprehensive protein datasets and benchmarking for liquid–liquid phase separation studies
Data & resources
This is not just a paper; it's a community resource for building and benchmarking LLPS models.
Audio summary
Not everyone wants to read about κ parameters and sticker–spacer distributions over coffee.
We've attached a short audio explanation to make things easier.
What is this about?
Liquid–liquid phase separation (LLPS) is a process in which proteins form biomolecular condensates: membrane-less compartments that organize cellular processes.
These condensates are crucial for:
- gene regulation
- stress response
- disease mechanisms (e.g. neurodegeneration)
But studying LLPS computationally has a major problem:
The data is messy, inconsistent, and fragmented across databases.
The problem we tackled
Existing LLPS databases:
- use different definitions
- contain inconsistent annotations
- lack reliable negative examples
This makes:
- machine learning unreliable
- benchmarking unfair
- results hard to compare
What we built
We created high-confidence datasets of proteins involved in LLPS, based on:
Protein roles
- Drivers: form condensates independently
- Clients: join existing condensates
- Negatives: proteins not associated with LLPS
Importantly, we introduced two types of negative datasets:
- structured proteins (PDB-like)
- disordered proteins (DisProt-like)
This is critical because disorder alone ≠ LLPS (a big source of bias!).
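One way to keep the three roles and the two negative pools apart in code is a small record type. This is a minimal sketch, not the published data schema: the class names, the `accession` field, and the validation rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    DRIVER = "driver"      # forms condensates independently
    CLIENT = "client"      # joins existing condensates
    NEGATIVE = "negative"  # not associated with LLPS


class NegativeType(Enum):
    STRUCTURED = "structured"  # PDB-like
    DISORDERED = "disordered"  # DisProt-like


@dataclass(frozen=True)
class Entry:
    accession: str  # hypothetical identifier field
    role: Role
    negative_type: Optional[NegativeType] = None

    def __post_init__(self):
        # Assumed rule: negatives must name their pool, positives must not.
        is_negative = self.role is Role.NEGATIVE
        if is_negative != (self.negative_type is not None):
            raise ValueError("negative_type must be set exactly for negatives")


def label(entry: Entry) -> int:
    """Binary label for an LLPS-vs-negative task: 1 for drivers and clients."""
    return 0 if entry.role is Role.NEGATIVE else 1
```

Keeping `negative_type` on every negative makes it possible to report results separately for structured and disordered negatives later on.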
How it works
We used:
- integration of multiple LLPS databases
- strict filtering criteria for data quality
- annotation of:
  - disorder
  - prion-like domains
  - physicochemical features
Then we:
- analyzed protein properties
- benchmarked prediction tools
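The integration and filtering steps above can be sketched roughly as follows. The records, database names, and the filtering rule (keep a protein only if all sources agree on its role) are hypothetical stand-ins; the actual criteria used in the paper are not reproduced here.

```python
from collections import defaultdict

# Hypothetical records: (source database, protein accession, annotated role).
records = [
    ("db_a", "ACC1", "driver"),
    ("db_b", "ACC1", "driver"),
    ("db_a", "ACC2", "driver"),
    ("db_b", "ACC2", "client"),  # conflicting annotation across sources
    ("db_c", "ACC3", "client"),
]


def integrate(records, min_sources=1):
    """Merge records by accession and keep proteins with one consistent role.

    min_sources is an assumed quality knob: the minimum number of distinct
    databases that must list the protein.
    """
    roles = defaultdict(set)
    sources = defaultdict(set)
    for source, accession, role in records:
        roles[accession].add(role)
        sources[accession].add(source)
    return {
        accession: next(iter(role_set))
        for accession, role_set in roles.items()
        if len(role_set) == 1 and len(sources[accession]) >= min_sources
    }


curated = integrate(records)
strict = integrate(records, min_sources=2)
```

In this toy input, `ACC2` is dropped because its sources disagree, and the stricter call keeps only `ACC1`, which is supported by two databases.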
Benchmarking LLPS predictors
We evaluated 16 bioinformatics tools on independent datasets.
Key findings:
- ML-based tools perform best
- Many tools still have high false positive rates
- Models often confuse disorder with actual phase separation
Translation: we are still not great at predicting LLPS reliably.
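To illustrate why the choice of negatives matters when scoring a predictor, here is a small sketch that computes the false positive rate and the Matthews correlation coefficient (MCC) against each negative set separately. The scores and the 0.5 threshold are made up for the example and do not come from any of the benchmarked tools.

```python
import math


def evaluate(pos_scores, neg_scores, threshold=0.5):
    """Return the false positive rate and MCC at a given score threshold."""
    tp = sum(1 for s in pos_scores if s >= threshold)
    fn = len(pos_scores) - tp
    fp = sum(1 for s in neg_scores if s >= threshold)
    tn = len(neg_scores) - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"fpr": fpr, "mcc": mcc}


# Made-up scores from a hypothetical predictor that mostly tracks disorder.
positives = [0.9, 0.8, 0.7, 0.4]
structured_negatives = [0.1, 0.2, 0.3, 0.1]
disordered_negatives = [0.8, 0.6, 0.3, 0.7]

vs_structured = evaluate(positives, structured_negatives)
vs_disordered = evaluate(positives, disordered_negatives)
```

With these toy numbers the predictor looks strong against structured negatives but no better than chance against disordered ones, which is the failure mode described above.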
Key biological insights
From the analysis:
- LLPS is multi-factorial, not driven by a single feature
- properties such as charge distribution, aggregation propensity, and interaction regions matter more than simple sequence features
- drivers and clients differ, but the differences are subtle
And importantly: you cannot model LLPS properly without good negative data.
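As a rough illustration of the charge-related properties mentioned above, the sketch below computes two widely used composition-level descriptors: the fraction of charged residues (FCR) and the net charge per residue (NCPR). It assumes K and R count as positive and D and E as negative, and the sequence is invented. These descriptors ignore how charges are patterned along the chain, which measures such as κ address, and they are not necessarily the features used in the paper.

```python
POSITIVE = set("KR")  # assumed positively charged residues
NEGATIVE = set("DE")  # assumed negatively charged residues


def charge_features(sequence):
    """Return (FCR, NCPR) for a protein sequence in one-letter code."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    n_pos = sum(1 for aa in sequence if aa in POSITIVE)
    n_neg = sum(1 for aa in sequence if aa in NEGATIVE)
    fcr = (n_pos + n_neg) / len(sequence)   # fraction of charged residues
    ncpr = (n_pos - n_neg) / len(sequence)  # net charge per residue
    return fcr, ncpr


# Invented 10-residue sequence: 4 positive (K, K, R, R), 3 negative (D, E, E).
fcr, ncpr = charge_features("MKKDEEGSRR")
```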
Why this matters
This work:
- sets a new standard for benchmarking
- improves machine learning reliability
- enables better biological interpretation of condensates
In short: we finally have a clean dataset to stop guessing and start learning properly.
BioGenies context
This project connects directly to our interests in:
- amyloids
- protein aggregation
- phase separation
It also gives us a solid foundation for future tools and models.
