STARLING takes flight

Excited to announce the newest member of the flock - STARLING (conSTruction of intrinsicAlly disoRdered proteins ensembles efficientLy vIa multi-dimeNsional Generative models) (Novak* et al., 2025)

STARLING is a generative model for the accurate prediction of coarse-grained disordered protein conformation ensembles.

STARLING is a collaborative project spearheaded by Borna Novak and myself. It builds on the lab’s foundational work on predicting IDR conformational ensemble properties directly from sequence (Lotthammer* et al., 2024).

Previous deep learning approaches predict average values for a subset of observables (e.g. end-to-end distance), so they are limited to the observables for which a predictive model exists.

STARLING generalizes this recent work by generating full IDR ensembles, from which any observable and its distribution can be computed.
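
To make "any observable and its distribution" concrete, here is a minimal sketch that computes two observables from a stack of distance maps using only NumPy. It does not use STARLING itself: the ensemble is a synthetic random walk, the function names are hypothetical, and the radius of gyration assumes one equal-mass bead per residue.

```python
import numpy as np

def end_to_end_distances(distance_maps: np.ndarray) -> np.ndarray:
    """Distance between the first and last residue in every conformer."""
    return distance_maps[:, 0, -1]

def radii_of_gyration(distance_maps: np.ndarray) -> np.ndarray:
    """Rg per conformer from pairwise distances (equal-mass beads assumed).

    Uses the identity Rg^2 = (1 / (2 N^2)) * sum_ij d_ij^2.
    """
    n_residues = distance_maps.shape[1]
    squared_sum = (distance_maps ** 2).sum(axis=(1, 2))
    return np.sqrt(squared_sum / (2 * n_residues ** 2))

# Synthetic stand-in for a generated ensemble: 100 random walks of 20 beads.
rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 20, 3)).cumsum(axis=1)
maps = np.linalg.norm(coords[:, :, None, :] - coords[:, None, :, :], axis=-1)

ree = end_to_end_distances(maps)
rg = radii_of_gyration(maps)
print(f"mean end-to-end distance: {ree.mean():.2f}")
print(f"mean radius of gyration:  {rg.mean():.2f}")
```

Because every conformer contributes one value, `ree` and `rg` are full distributions rather than single averages, so histograms, variances, or correlations between observables come for free.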

STARLING is a latent denoising diffusion model inspired by recent progress in text-to-image generative models.
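
For readers new to diffusion models, the sketch below shows the forward (noising) half of a generic DDPM: a clean sample is progressively mixed with Gaussian noise, and the model is trained to undo that process. This is a textbook illustration, not STARLING's implementation. The linear noise schedule, the 1000 steps, and the 8x8 array standing in for a latent representation are all assumptions.

```python
import numpy as np

def make_alpha_bar(num_steps: int, beta_start: float = 1e-4,
                   beta_end: float = 0.02) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
              rng: np.random.Generator):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(num_steps=1000)
x0 = rng.standard_normal((8, 8))  # hypothetical stand-in for a latent

x_early, _ = add_noise(x0, t=10, alpha_bar=alpha_bar, rng=rng)
x_late, _ = add_noise(x0, t=999, alpha_bar=alpha_bar, rng=rng)
print("signal fraction at t=10: ", np.sqrt(alpha_bar[10]))
print("signal fraction at t=999:", np.sqrt(alpha_bar[999]))
```

Early steps leave the sample almost untouched, while the final step is essentially pure noise. Generation runs the process in reverse, starting from noise and denoising step by step.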

We formulate IDR ensemble construction as a process of generating instantaneous distance maps in a sequence-conditioned manner, where each map represents a structure based on pairwise inter-residue distances.
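
As an illustration of what a distance map is, the sketch below builds one from 3D coordinates with NumPy. It assumes a coarse-grained representation with one bead per residue; the function name and the three-bead toy structure are hypothetical.

```python
import numpy as np

def distance_map(coords: np.ndarray) -> np.ndarray:
    """Pairwise inter-residue distances for one conformer.

    coords has shape (n_residues, 3); the result has shape
    (n_residues, n_residues), is symmetric, and has a zero diagonal.
    """
    diffs = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

# Toy three-bead structure forming a 3-4-5 right triangle.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
])
dmap = distance_map(coords)
print(dmap)
```

The map is invariant to rotating or translating the structure, which is one reason distance maps are a convenient target for a generative model.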

STARLING produces high-quality predictions at a blazingly fast rate on GPUs and Apple Silicon and is still performant on CPUs.

We benchmark STARLING against decades of elegant biophysical research on disordered proteins, including smFRET, SAXS, and NMR experiments, and find that STARLING displays remarkable agreement.

STARLING dramatically lowers the barrier to the computational interrogation of IDR function through the lens of emergent biophysical properties in addition to traditional bioinformatic approaches.

STARLING can be used to develop hypotheses as to how an IDR’s sequence may determine its conformational ensemble and/or how it may influence interactions with other IDRs.

We also show how one can integrate STARLING with protein design tools to build de novo disordered protein sequences with target ensemble properties.

Importantly, STARLING is an open-source tool targeting ease of use and widespread availability. STARLING is available to install and run locally or online through a simple interface via Google Colab.

Installation

You should really head over to the GitHub repository for this information, but… since I’m here, I wanted to give a little demo.

I recommend creating a fresh conda environment for STARLING (although in principle there’s nothing special about the STARLING environment).

conda create -n starling python=3.11 -y
conda activate starling

You can then install the released version of STARLING from PyPI using pip:

pip install idptools-starling

Or you can clone and install the bleeding-edge version from GitHub:

git clone git@github.com:idptools/starling.git
cd starling
pip install .

To check that STARLING has installed correctly, run:

starling --help

Quickstart

The easiest way to use STARLING is the starling command-line tool.

starling <amino acid sequence> -c <number of conformers> --outname my_cool_idr

This will generate an output file called my_cool_idr.starling. To convert this to a PDB trajectory, run:

starling2pdb my_cool_idr.starling

Or to convert to an xtc/pdb combo run:

starling2xtc my_cool_idr.starling
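
If you want to drive the command-line tools from a script, a small wrapper like the one below can assemble the commands shown above. This is a sketch that uses only the flags from this post (`-c` and `--outname`). The helper names, the example sequence, and the choice of 200 conformers are illustrative assumptions, and nothing is executed unless you call `run_commands` yourself.

```python
import shlex
import subprocess

def build_starling_commands(sequence: str, n_conformers: int,
                            outname: str) -> list[list[str]]:
    """Build the generate and convert commands as argument lists."""
    generate_cmd = ["starling", sequence, "-c", str(n_conformers),
                    "--outname", outname]
    # The CLI writes <outname>.starling, which starling2pdb then converts.
    convert_cmd = ["starling2pdb", f"{outname}.starling"]
    return [generate_cmd, convert_cmd]

def run_commands(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

commands = build_starling_commands("MKVIFLAVLGLGIVVTTVLY", 200, "my_cool_idr")
for cmd in commands:
    print(shlex.join(cmd))
```

Passing argument lists to `subprocess.run` avoids shell quoting issues, and `check=True` makes a failed generation stop the pipeline before the conversion step.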

Python library

Using the generate function, STARLING can produce Ensemble objects, which enable deep investigation into ensemble properties.

`generate` function documentation

The generate function is the main entry point for generating distance maps using the STARLING model. This function accepts various input types, generates conformations using a denoising diffusion probabilistic model (DDPM), and optionally returns the 3D structures. You can customize parameters such as batch size, device, and number of steps.
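
Going from a distance map back to 3D coordinates is a separate step. One standard way to do it is classical multidimensional scaling (MDS), sketched below with NumPy. This is an illustration of the general idea under the assumption of an exact, noise-free distance map; STARLING's own reconstruction procedure may differ, and the function name is hypothetical.

```python
import numpy as np

def coords_from_distance_map(distance_map: np.ndarray,
                             n_dims: int = 3) -> np.ndarray:
    """Recover coordinates from pairwise distances via classical MDS.

    The result matches the original structure only up to rotation,
    reflection, and translation.
    """
    n = distance_map.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centering the squared distances gives the Gram matrix.
    gram = -0.5 * centering @ (distance_map ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Round trip on a synthetic 15-bead random walk.
rng = np.random.default_rng(1)
true_coords = rng.normal(size=(15, 3)).cumsum(axis=0)
dmap = np.linalg.norm(true_coords[:, None] - true_coords[None, :], axis=-1)

rebuilt = coords_from_distance_map(dmap)
rebuilt_map = np.linalg.norm(rebuilt[:, None] - rebuilt[None, :], axis=-1)
print("max distance error:", np.abs(rebuilt_map - dmap).max())
```

For a map that is not perfectly Euclidean, such as a model prediction, some eigenvalues can be negative, which is why they are clipped at zero here. In that case the reconstruction is only approximate.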

To get started, first import the function:

from starling import generate

The generate function is flexible and can take in sequences in multiple formats. Here are a few examples:

# Example 1: Provide a single sequence as a string
sequence = 'MKVIFLAVLGLGIVVTTVLY'

# E is an Ensemble() object
E = generate(sequence, return_single_ensemble=True)


# Example 2: Provide a list of sequences
sequences = ['MKVIFLAVLGLGIVVTTVLY', 'MKVIFLAVLGLGIVVTTVLY']

# returns a dictionary of the Ensemble() objects
E_dict = generate(sequences)

# Example 3: Provide a dictionary of sequences
sequences = {'seq1': 'MKVIFLAVLGLGIVVTTVLY', 'seq2': 'MKVIFLAVLGLGIVVTTVLY'}

# returns a dictionary of the Ensemble() objects
E_dict = generate(sequences)
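
If your sequences come from mixed sources, it can help to tidy them into a single named dictionary and check them before calling generate. The helper below is a hypothetical pre-processing sketch, not part of STARLING. It assumes only the 20 standard amino acids are acceptable, and the `sequence_N` names it invents for unnamed inputs may differ from whatever keys STARLING assigns internally.

```python
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def to_sequence_dict(sequences) -> dict[str, str]:
    """Normalize a string, list, or dict of sequences into a named dict."""
    if isinstance(sequences, str):
        named = {"sequence_1": sequences}
    elif isinstance(sequences, dict):
        named = dict(sequences)
    else:
        named = {f"sequence_{i}": seq
                 for i, seq in enumerate(sequences, start=1)}

    # Reject empty sequences and any non-standard residue codes.
    for name, seq in named.items():
        invalid = set(seq.upper()) - VALID_AMINO_ACIDS
        if not seq or invalid:
            raise ValueError(
                f"{name}: empty or invalid residues {sorted(invalid)}")
    return named

checked = to_sequence_dict(["MKVIFLAVLGLGIVVTTVLY", "MKVIFLAVLGLGIVVTTVLY"])
print(checked)
```

The resulting dictionary can be passed to generate in the same way as Example 3, so every returned ensemble has a name you chose.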

References

2025

Intrinsically disordered proteins and regions (collectively IDRs) are found across all kingdoms of life and play critical roles in virtually every eukaryotic cellular process. In contrast to folded proteins, IDRs lack a stable 3D structure and are instead described in terms of a conformational ensemble, a collection of energetically accessible interconverting structures. This unique structural plasticity facilitates diverse molecular recognition and function; thus, a convenient way to view IDRs is through their ensembles. Here, we combine advances in physics-based force fields for IDPs with the power of modern multi-scale generative modeling to develop STARLING, an approach for the rapid and accurate prediction of IDR ensembles directly from sequence. STARLING enables ensembles of hundreds of conformers to be generated in seconds and works on GPUs and CPUs. This, in turn, dramatically lowers the barrier to the computational interrogation of IDR function through the lens of emergent biophysical properties in addition to traditional bioinformatic approaches. We evaluate STARLING’s accuracy against experimental data and offer a series of vignettes illustrating how STARLING can enable rapid hypothesis generation for IDR function or the interpretation of experimental data.

2024

Intrinsically disordered regions (IDRs) are ubiquitous across all domains of life and play a range of functional roles. While folded domains are generally well described by a stable three-dimensional structure, IDRs exist in a collection of interconverting states known as an ensemble. This structural heterogeneity means that IDRs are largely absent from the Protein Data Bank, contributing to a lack of computational approaches to predict ensemble conformational properties from sequence. Here we combine rational sequence design, large-scale molecular simulations and deep learning to develop ALBATROSS, a deep-learning model for predicting ensemble dimensions of IDRs, including the radius of gyration, end-to-end distance, polymer-scaling exponent and ensemble asphericity, directly from sequences at a proteome-wide scale. ALBATROSS is lightweight, easy to use and accessible as both a locally installable software package and a point-and-click-style interface via Google Colab notebooks. We first demonstrate the applicability of our predictors by examining the generalizability of sequence-ensemble relationships in IDRs. Then, we leverage the high-throughput nature of ALBATROSS to characterize the sequence-specific biophysical behavior of IDRs within and between proteomes.

Accurate predictions of conformational ensembles of disordered proteins
