Anyone who has experimented with AI models like OpenAI’s DALL-E 2 knows how fun it can be to create offbeat images like penguins playing soccer on the moon. The company’s ChatGPT language model can generate passable philosophy essays or epic poems in the style of Werner Herzog with equal ease.
Now biotech labs are putting this same generative AI technology toward a more serious goal: designing proteins never seen in nature. These protein generation programs can turn a process that takes months or years into a matter of minutes, and could help develop new, more effective vaccines and medicines.
Proteins are the Lego bricks of living systems, enabling most biological functions from metabolism to DNA replication. They also play a crucial role in the immune system, which is why many modern drugs are based on proteins. For example, COVID-19 vaccines were built around the spike protein that the coronavirus uses to attach to human cells.
Drug designers looking for new treatments typically sift through thousands of proteins created by evolution over millions of years, a long, expensive and inefficient process. Computational protein design – using computers to create a “recipe” for the combination of amino acids that make up a protein molecule – has made it possible to create proteins from scratch, or modify existing ones. Proteins can be designed for specific uses, opening up an essentially infinite library of ingredients for advances in biomedicine and bioengineering.
In recent years, researchers have started putting advanced AI models to use. In 2020, DeepMind announced that its protein-folding AI, AlphaFold, could predict the shape of a protein to within the width of an atom, a problem that has frustrated biochemists for decades. (How a protein folds in three dimensions determines its biological function.)
In December 2022, a team at the University of Washington led by biologist David Baker announced a program called RoseTTAFold Diffusion (RFdiffusion), which can generate designs for new proteins with much higher speed and precision than ever before.
Diffusion models, introduced in 2015, are machine-learning algorithms that learn to reverse a process of gradually adding noise to data. This is what DALL-E 2 uses to create high-quality images on demand: it starts from random pixels and removes noise step by step until an image emerges that matches the request. RFdiffusion works on a similar principle, generating complex proteins from a simple set of specifications. The model was trained for about four weeks using 64 Nvidia V100 GPUs on Microsoft Azure.
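For readers who want to see the noise arithmetic, here is a minimal Python sketch of one noising step and its inversion on a toy set of 3D points. It is not RFdiffusion's code: the function names, the linear noise schedule and the five-point "backbone" are all assumptions made for illustration, and the sketch cheats by reusing the exact noise it added, where a real diffusion model would use a trained neural network to estimate that noise.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after t noising steps.

    Uses a linear schedule of per-step noise levels (an assumed, common choice).
    """
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t, num_steps, rng):
    """Forward process: blend clean coordinates with Gaussian noise."""
    a = alpha_bar(t, num_steps)
    eps = [[rng.gauss(0.0, 1.0) for _ in point] for point in x0]
    xt = [
        [math.sqrt(a) * c + math.sqrt(1.0 - a) * e for c, e in zip(p, ep)]
        for p, ep in zip(x0, eps)
    ]
    return xt, eps


def remove_noise(xt, eps, t, num_steps):
    """Invert the forward step, given the noise that was added.

    A trained model would predict eps; here the true eps is passed in.
    """
    a = alpha_bar(t, num_steps)
    return [
        [(c - math.sqrt(1.0 - a) * e) / math.sqrt(a) for c, e in zip(p, ep)]
        for p, ep in zip(xt, eps)
    ]


rng = random.Random(0)
clean = [[float(i), 0.0, 0.0] for i in range(5)]  # hypothetical toy "backbone"
noisy, eps = add_noise(clean, t=500, num_steps=1000, rng=rng)
recovered = remove_noise(noisy, eps, t=500, num_steps=1000)
```

Generation runs this idea in reverse over many small steps, starting from pure noise and relying on the network's noise estimates at each one.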
RFdiffusion speeds up the process of protein design by multiple orders of magnitude compared to existing design methods. Coming up with the blueprint for a single protein with 100 amino acids takes about 2.5 minutes and about eight gigabytes of memory on an Nvidia RTX A4000 GPU, said Brian Trippe, a postdoctoral fellow in statistics at Columbia and part of the UW group.
“You’re doing really memory-intensive matrix multiplication operations,” Trippe explained. “But then what comes out is only on the order of kilobytes, depending on how big the protein is—it’s basically a CSV file of 3D coordinates.”
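To get a feel for that size, here is a rough Python sketch that writes made-up 3D coordinates for a 100-residue protein to CSV and measures the result. The one-point-per-residue layout, the column names and the `coords_to_csv` helper are assumptions for illustration only, not the actual file format RFdiffusion produces.

```python
import csv
import io
import random


def coords_to_csv(coords):
    """Serialize (x, y, z) tuples to CSV text with three decimal places."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["x", "y", "z"])
    for x, y, z in coords:
        writer.writerow([f"{x:.3f}", f"{y:.3f}", f"{z:.3f}"])
    return buf.getvalue()


rng = random.Random(0)
# Hypothetical data: one random 3D point per residue, 100 residues.
coords = [
    (rng.uniform(-50, 50), rng.uniform(-50, 50), rng.uniform(-50, 50))
    for _ in range(100)
]
text = coords_to_csv(coords)
size_bytes = len(text.encode("utf-8"))
print(f"{size_bytes} bytes")
```

Under these assumptions the file comes to a few kilobytes, and a fuller file with several atoms per residue would grow by a corresponding multiple.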
Designing proteins that never existed in nature is just the first step. The next is seeing if they work in the real world. The UW team set a list of goals, such as creating a symmetric protein, or one that binds to a specific site on another molecule. Not every AI-generated protein worked as designed, but by testing hundreds, the team eventually was able to meet every design goal.
“We had an astronomical experimental success rate,” Trippe said. “We weren’t expecting that to happen, but we were very happy that it did.”
One standout result came when they tasked the model with creating a protein that attaches to the parathyroid hormone, which controls blood calcium levels. The model generated a design that, in lab tests, bound to the hormone more tightly than anything another computational method could have generated, and more tightly than existing drugs.
The technology has the potential to be a game changer in biomedicine. It could be used to design protein-targeting drugs much more quickly, even if the process of manufacturing and testing still takes years.
“We have a very powerful tool we can use to make molecules quickly and efficiently,” Trippe said. “We’re always trying to push the boundary of what’s possible, and we think that RFdiffusion is going to push that boundary quite a bit further.”
The result could be better vaccines, medications for things like cancer immunotherapy, even new nanomaterials.
Could this tool be used for evil, to cook up poisons or other dangerous creations?
“That has definitely crossed my mind,” said David Juergens of the University of Washington, also a member of the Baker Lab. “That can happen with basically every new technology. But overall I think the fact that it’s public knowledge that this exists is a good thing.”
The UW team is working to improve the model, making it even faster and more efficient. They’re also interested in models that can create more than just proteins, such as nucleic acids, molecules that direct the process of protein synthesis in the body.
The field is advancing so rapidly that more breakthroughs could be around the corner, according to Trippe. He said just five years ago, the idea of using deep learning for protein design was mostly a dream.
“It’s only within the last year or so that computationally designed proteins have been possible in a useful way, and useful AI tools even more recently than that,” he said.
Julian Smith is a contributing writer. He is the executive editor of Atellan Media and author of Aloha Rodeo and Smokejumper, published by HarperCollins. He writes about green tech, sustainability, adventure, culture and history.
© 2023 Nutanix, Inc. All rights reserved.