Introducing FoldFlow-2: a state-of-the-art protein structure generative model that uses both protein sequence and structure to unlock huge potential for protein design.
Proteins are essential for almost all biological processes, and designing proteins to solve specific problems will depend on understanding both their 3D structure and amino acid sequence. However, despite this interdependence, protein design generative models have usually tackled structure and sequence separately.
To address this shortcoming, our team developed FoldFlow-2: a protein design model that builds on the success of our previous FoldFlow model while taking full advantage of sequence and structure during training and protein generation. The result is a new state of the art in generating novel, designable proteins (designability is a measure of generation quality), along with significant new capabilities for guiding protein design based on sequence and structure information.
Our approach to designing FoldFlow-2 was to combine our expertise in flow-matching models for protein structure generation with the latest architectures and best practices in protein design models. This led us to build the FoldFlow-2 architecture with the following parts:
The result is a model that shares some similarities with popular protein folding models like ESMFold[1], but FoldFlow-2 is trained to generate new proteins rather than to predict the structure of a given sequence.
Most prior protein generation models were trained only on "experimentally validated" protein structures from the Protein Data Bank (PDB), but collecting these data is slow and costly, which limits scalability. At DreamFold, we developed techniques for leveraging predicted protein structures from folding models in our training data.
One key innovation is filtering: although folding models are constantly improving, there are many examples of "low-quality, high-confidence" folding predictions that need to be filtered out to achieve good training data quality. We use filtering models to predict training example quality and remove bad training data. This allows us to train at scale on a dataset which is 8 times larger than previous datasets.
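To make the filtering idea concrete, here is a minimal sketch of how "low-quality, high-confidence" predictions could be removed from a training set. It is an illustration only, not our actual pipeline: the `PredictedStructure` record, the `quality_score` field (standing in for a filtering model's output), and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PredictedStructure:
    name: str
    confidence: float     # folding model's own confidence, assumed on a 0-100 scale
    quality_score: float  # hypothetical filtering-model score in [0, 1]


def filter_training_set(examples, min_confidence=70.0, min_quality=0.5):
    """Keep only examples that pass both thresholds (threshold values are assumptions)."""
    return [
        ex for ex in examples
        if ex.confidence >= min_confidence and ex.quality_score >= min_quality
    ]


examples = [
    PredictedStructure("good", confidence=92.0, quality_score=0.9),
    # High confidence but low quality: the case the filter exists to catch.
    PredictedStructure("overconfident", confidence=95.0, quality_score=0.1),
    PredictedStructure("uncertain", confidence=40.0, quality_score=0.8),
]

kept = filter_training_set(examples)
kept_names = [ex.name for ex in kept]
```

The point of scoring quality separately is that the folding model's confidence alone would have kept the "overconfident" example.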
During training, the model saw the "true" sequence for 50% of the training examples; for the other 50%, it saw placeholder inputs in place of the sequence. This ensured that the model learned to incorporate sequence information without relying on it too much since, for some tasks, we have only partial sequence data, or none at all.
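The 50/50 scheme can be sketched in a few lines. This is a simplified illustration, assuming the decision is made independently per training example and that the placeholder is a single mask token. The `MASK_TOKEN` string and the `maybe_mask_sequence` helper are hypothetical names, not part of the FoldFlow-2 codebase.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical placeholder token


def maybe_mask_sequence(sequence, rng, p_keep=0.5):
    """Return the true residues with probability p_keep, else all placeholders."""
    if rng.random() < p_keep:
        return list(sequence)
    return [MASK_TOKEN] * len(sequence)


rng = random.Random(0)  # seeded so the example is reproducible
sequence = "MKTAYIAKQR"

draws = [maybe_mask_sequence(sequence, rng) for _ in range(10_000)]
kept_fraction = sum(d[0] != MASK_TOKEN for d in draws) / len(draws)
```

Over many draws `kept_fraction` lands close to 0.5, and the masked version always has the same length as the original, so the structure inputs stay aligned with the sequence inputs.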
We found that this mixed training seems to help even when we don’t provide a sequence explicitly to the model; FoldFlow-2 is able to indirectly use information from sequences seen during training to generate better structures. We trained FoldFlow-2 using cloud GPUs with Anyscale.
FoldFlow-2 is highly flexible and unlocks new capabilities compared to FoldFlow. We highlight two results we’re particularly excited to share:
We explore these and other tasks in our paper.
Generating new proteins is a key challenge for generative protein models: can a model produce new, diverse, and designable (recall: a measure of generation quality) protein structures? To evaluate our model in these areas, we use computational metrics that are correlated with experimental measurements; our experiments show that almost all of FoldFlow-2’s generations are designable, and significantly more of them are new and/or diverse compared to other models.
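As a rough illustration of how a designability number is obtained, the sketch below computes the fraction of generated structures whose self-consistency RMSD falls under a cutoff. Both the choice of self-consistency RMSD as the metric and the 2.0 Å threshold are assumptions based on common practice in the field; this post does not specify them. The RMSD values are made-up inputs for the example, not results.

```python
def designable_fraction(sc_rmsds, threshold=2.0):
    """Fraction of samples with self-consistency RMSD (in angstroms) below threshold.

    The threshold is an assumed, commonly used cutoff, not a value from this post.
    """
    if not sc_rmsds:
        raise ValueError("need at least one sample")
    return sum(r < threshold for r in sc_rmsds) / len(sc_rmsds)


# Illustrative values only.
example_rmsds = [0.8, 1.5, 2.0, 3.2]
fraction = designable_fraction(example_rmsds)
```

With a strict inequality, a sample sitting exactly on the cutoff does not count as designable, so two of the four illustrative samples pass.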
Images are coloured by secondary structure with alpha helices in blue, beta sheets in red, and coils in green.
In many applications, we know partial information about a protein and need to design using that information. One example is the "motif-scaffolding problem", where a functional part of a protein, called the "motif", is identified and we need to design the rest of the protein, called the "scaffold". This problem is important for vaccine design[2] and other applications.
We fine-tuned FoldFlow-2 to perform motif scaffolding by accepting the partial structure and sequence information of a motif, and "filling in" the rest of the structure with the scaffold. We tested this fine-tuned FoldFlow-2 model on the difficult task of antibody motif-scaffolding, and our model performs well against the previous state-of-the-art model, RFdiffusion[3]. We think this is just the beginning of applications for hybrid structure and sequence protein design.
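One way to picture the motif-scaffolding inputs is as a per-residue mask: motif positions keep their known sequence and coordinates, and scaffold positions are left blank for the model to fill in. The sketch below is a hypothetical illustration of that input layout, not the interface of the fine-tuned model. `build_scaffolding_input`, `MASK_TOKEN`, and the use of `None` for unknown coordinates are all assumptions.

```python
MASK_TOKEN = "<mask>"  # hypothetical placeholder for unknown residues


def build_scaffolding_input(sequence, coords, motif_positions):
    """Keep sequence and coordinates at motif positions; blank out the scaffold."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    motif = set(motif_positions)
    seq_in = [aa if i in motif else MASK_TOKEN for i, aa in enumerate(sequence)]
    coords_in = [xyz if i in motif else None for i, xyz in enumerate(coords)]
    motif_mask = [i in motif for i in range(len(sequence))]
    return seq_in, coords_in, motif_mask


# Toy 5-residue example with made-up coordinates; residues 1 and 2 form the motif.
sequence = "ACDEF"
coords = [(float(i), 0.0, 0.0) for i in range(5)]
seq_in, coords_in, motif_mask = build_scaffolding_input(sequence, coords, [1, 2])
```

Here `seq_in` keeps "C" and "D" and masks the rest, and `motif_mask` marks which positions are fixed by the motif and which are free to be generated as scaffold.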
FoldFlow-2 represents a significant advance in computational protein design by natively integrating protein sequence and structure into training and protein generation. We use state-of-the-art model components and large-scale data to obtain a world-class model capable of generating novel proteins and protein designs.
We look forward to continuing this progress into the lab and experimentally investigating the performance of our designs, as well as pursuing even more protein design problems. Please check out our paper for lots more details.
[1] Lin, Z. et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379, 1123-1130 (2023). DOI:10.1126/science.ade2574
[2] Correia, B.E. et al. "Proof of principle for epitope-focused vaccine design." Nature 507, 201-206 (2014). DOI:10.1038/nature12966
[3] Watson, J.L. et al. "De novo design of protein structure and function with RFdiffusion." Nature 620, 1089-1100 (2023). DOI:10.1038/s41586-023-06415-8