Just as ChatGPT generates text by predicting the most likely next word in a sequence, a new artificial intelligence (AI) model can write proteins from scratch that do not occur in nature.
Scientists used the new model, ESM3, to create a novel fluorescent protein that shares only 58% of its sequence with naturally occurring fluorescent proteins, they reported in a study published July 2 to the preprint database bioRxiv. Representatives from EvolutionaryScale, a company founded by former Meta researchers, also provided details in a June 25 statement.
The research team has released a small version of the model under a non-commercial license and will make the large version of the model available to commercial researchers. According to EvolutionaryScale, the technology could be useful in fields ranging from drug discovery to designing new chemicals for plastic degradation.
ESM3 is a large language model (LLM), similar to OpenAI’s GPT-4, which powers the ChatGPT chatbot. The scientists trained their largest version on 2.78 billion proteins. For each protein, they extracted information about its sequence (the order of amino acid building blocks that make up the protein), structure (the three-dimensional folded shape of the protein), and function (what the protein does). They randomly masked out portions of this information and trained ESM3 to predict the missing pieces.
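The masking step described above can be illustrated with a toy sketch. It hides random positions in an amino acid sequence and records the hidden letters a model would be asked to recover. Everything here is an assumption for illustration: the function name `mask_sequence`, the `_` placeholder, and the 15% default are hypothetical, and the sketch covers only the sequence, not structure or function. It is not ESM3's actual training code.

```python
import random

MASK = "_"  # hypothetical placeholder symbol, not ESM3's real mask token


def mask_sequence(sequence, mask_fraction=0.15, seed=None):
    """Hide a random fraction of positions; return the masked string and the answers."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    rng = random.Random(seed)
    # Always hide at least one position so there is something to predict.
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    masked = list(sequence)
    for i in positions:
        masked[i] = MASK
    # The training targets: which amino acid was hidden at each position.
    targets = {i: sequence[i] for i in positions}
    return "".join(masked), targets


if __name__ == "__main__":
    # An arbitrary string of amino acid letters, used only as example input.
    masked, targets = mask_sequence("MSKGEELFTGVVPILVELDG", mask_fraction=0.25, seed=0)
    print(masked)
    print(targets)
```

A model trained this way is scored on how often it recovers the hidden letters in `targets` from the visible context in `masked`.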
They scaled up this model based on research the same team did while it was still at Meta. In 2022, they announced ESMFold, a precursor to ESM3 that predicted previously unknown structures of microbial proteins. That year, Alphabet’s DeepMind also predicted the structures of 200 million proteins.
Scientists later pointed out that there are limitations to the predictions of these AI models and that the predicted structures need to be verified. But the methods can still greatly speed up the search for protein structures, because the alternative is to use X-rays to map protein structures one by one, which is slow and expensive.
But ESM3 goes beyond predicting the properties of existing proteins. Using 771 billion unique pieces of structure, function, and sequence information, the model can generate new proteins with specific functions. One of EvolutionaryScale’s backers described it as a “ChatGPT moment for biology.”
In the new study, the researchers prompted the model to generate a new fluorescent protein, a type of protein that absorbs light and re-emits it at a longer wavelength, causing it to glow green. These proteins are important to biological researchers, who attach them to molecules they want to study in order to track and image them; their discovery and development earned a Nobel Prize in Chemistry in 2008.
The model generated 96 proteins with sequences and structures likely to produce fluorescence. The researchers then chose the one whose sequence was least similar to those of naturally occurring fluorescent proteins. Although this protein was 50 times dimmer than natural green fluorescent proteins, ESM3 ran another iteration that produced new sequences with increased brightness. The result was a green fluorescent protein unlike any found in nature, dubbed “esmGFP.” These iterations, performed in record time by the AI, would have taken 500 million years of evolution to achieve, the EvolutionaryScale team estimated.
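The selection step, picking the candidate least similar to known fluorescent proteins, can be sketched with a simple identity score. This is a simplified illustration, not the study's method: it assumes the sequences are already aligned and of equal length, whereas real comparisons first align the sequences. The function names `percent_identity` and `least_similar` are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def least_similar(candidates, references):
    """Return the candidate whose closest reference match is the weakest."""
    def closest_match(candidate):
        # A candidate is only as novel as its best match among known proteins.
        return max(percent_identity(candidate, ref) for ref in references)
    return min(candidates, key=closest_match)


if __name__ == "__main__":
    # Toy four-letter strings, not real protein sequences.
    known = ["ACDE", "ACDF"]
    generated = ["ACDE", "ACGG", "WWDE"]
    for seq in generated:
        print(seq, max(percent_identity(seq, ref) for ref in known))
    print("least similar:", least_similar(generated, known))
```

Under this measure, a figure such as 58% means that just over half of the aligned positions carry the same amino acid as the nearest known protein.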
“We currently lack the fundamental understanding of how proteins, especially those ‘new to science,’ behave when introduced into a living system, but this is a cool new step that allows us to approach synthetic biology in a new way. AI modeling like ESM3 will enable the discovery of new proteins that the constraints of natural selection would never allow, enabling innovations in protein engineering that evolution cannot. That’s exciting. However, the claim that 500 million years of evolution can be simulated only focuses on individual proteins, which fails to account for the many stages of natural selection that create the diversity of life we see today. AI-driven protein engineering is intriguing, but I can’t help but feel that we may be overconfident in assuming that we can outsmart the intricate processes that have been fine-tuned by millions of years of natural selection.”