Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care


Friday, March 28, 2025

Simulating 500 million years of evolution with a language model

Hayes, T., Rao, R., et al. (2025).
Science.

Abstract

More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve its fidelity. We have prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating five hundred million years of evolution.


Here are some thoughts:

ESM3 is a multimodal generative language model that applies language modeling to proteins, reasoning jointly over their sequence, structure, and function. The architecture scales to 98 billion parameters and is trained on billions of protein sequences and structures. Given prompts that combine these modalities, ESM3 generates candidate proteins that satisfy them, including functional proteins that lie far from any known natural protein.

Among the generated proteins that the team synthesized was a bright fluorescent protein, named esmGFP, with only 58% sequence identity to known fluorescent proteins. The authors estimate that this distance is equivalent to simulating 500 million years of evolution. ESM3 represents sequence, structure, and function as tokens and is trained to predict and generate them. Alignment through iterative fine-tuning improves how faithfully the model follows prompts, which helps on harder design problems such as ligand binding and tertiary coordination tasks. The model is also programmable: scientists can prompt it for a specified trait, such as fluorescence, and obtain proteins that remain functional.
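To make the "58% sequence identity" figure concrete, here is a minimal sketch of how percent identity between two already-aligned protein sequences can be computed. The function name `percent_identity`, the gap character, and the toy sequences are illustrative assumptions. So is the choice to count only columns where neither sequence has a gap, since conventions for the denominator vary. This is not the paper's pipeline, which this post does not describe.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of matching residues over columns where neither sequence has a gap.

    Assumes the two sequences are already aligned (equal length, gaps marked
    with `gap`). The denominator convention here is one of several in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # compared columns where the residues are identical
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy example (made-up fragments, not real GFP data):
# 9 ungapped columns, 8 of which match.
print(round(percent_identity("MSKGEELFTG", "MSKGAELF-G"), 1))
```

On this measure, a protein at 58% identity differs from its closest known relative at roughly four out of every ten aligned positions.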

This approach could matter a great deal for biotechnology, since it allows proteins to be designed quickly for applications ranging from medicine to materials science. By generating functional proteins far from those that natural evolution has produced, ESM3 extends what computational biology can do by combining artificial intelligence with evolutionary data. It opens new possibilities in protein design for both theoretical and applied biosciences.