Biohub Is Turning Evolution Into a Search Engine for Protein Design
"approaching programmable biology"
Recap
- 0:00-2:08: Rives frames ESMC as a step toward programmable biology and connects the thesis to the bitter lesson: scale and search can beat hand-built rules.
- 2:09-8:06: Protein sequences become learnable data; the release combines ESMC, ESMFold2, and ESM Atlas.
- 8:13-16:02: Biohub clusters sequences and uses model features to connect distant protein families, including motifs such as the nucleophilic elbow.
- 16:03-25:17: Metagenomic data restores scaling by adding noisy but diverse genetic material from environments like soil, ocean, vents, polar regions, and the human gut.
- 25:18-33:11: ESMC is contrasted with AlphaFold-style methods and applied to antibody and binder design.
- 33:12-54:20: The loop shifts from digital prediction to experimental validation, open-science tooling, and possible virtual cell models.
- 54:21-1:10:07: The close returns to bottlenecks: generated data, scalable measurement, compute, and wet-lab feedback.
Context
This episode was published by Latent Space on May 27, 2026, alongside Biohub's ESM release. The guest, Alex Rives, is Head of Science at Biohub and previously led major ESM work.
The release has three pieces. ESMC is the protein language model. ESMFold2 is the structure-prediction system built on ESM representations. ESM Atlas is the map: Biohub says it covers 6.8 billion protein sequences and 1.1 billion predicted structures. Biohub describes the tools as open and MIT-licensed.
Protein language models train on amino-acid sequences rather than human text. Those sequences carry information about physical and biological constraints: which residues tend to coexist, which folds are plausible, and which sequence changes might preserve or break function.
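To make "sequences as training data" concrete, here is a minimal sketch of how a protein sequence can be turned into tokens and partially hidden so a model has to predict the missing residues from context. It assumes a masked-token objective, which this recap does not confirm for ESMC; the `<mask>` symbol, the 15% masking fraction, and the toy sequence are all hypothetical choices for illustration.

```python
import random

# The 20 standard amino acids, one letter each, plus a hypothetical mask symbol.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
VOCAB[MASK] = len(VOCAB)


def tokenize(sequence):
    """Map each residue letter to an integer id."""
    return [VOCAB[aa] for aa in sequence]


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a fraction of residues.

    Returns the masked symbol list and a {position: hidden residue} dict,
    which is what a model would be asked to recover.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    symbols = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = symbols[pos]
        symbols[pos] = MASK
    return symbols, targets


toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence, illustration only
masked, targets = mask_sequence(toy_sequence)
masked_ids = [VOCAB[symbol] for symbol in masked]  # integer ids a model would consume

print(masked)
print(targets)
```

The point of the exercise is that the only way to fill the hidden positions well is to learn which residues tend to appear together, which is the kind of constraint the paragraph above describes.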
Biohub is using the release to push a general operating loop for biology: train on evolution's archive, search the learned model for useful designs, test them experimentally, then use the results to improve the next model.
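The loop above can be sketched in a few lines. This is a toy under stated assumptions, not Biohub's pipeline: `run_assay` stands in for a wet-lab measurement, `ToyModel` stands in for a learned model, and `HIDDEN_TARGET`, the pool size, and the per-round test budget are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HIDDEN_TARGET = "MKVLAAGIW"  # hypothetical: what only the "lab" can measure


def run_assay(sequence):
    """Stand-in for an experiment: fraction of positions matching the hidden target."""
    return sum(a == b for a, b in zip(sequence, HIDDEN_TARGET)) / len(HIDDEN_TARGET)


class ToyModel:
    """Tracks the mean assay score seen for each (position, residue) pair."""

    def __init__(self):
        self.total = {}
        self.count = {}

    def update(self, sequence, score):
        for key in enumerate(sequence):
            self.total[key] = self.total.get(key, 0.0) + score
            self.count[key] = self.count.get(key, 0) + 1

    def predict(self, sequence):
        seen = [self.total[k] / self.count[k] for k in enumerate(sequence) if k in self.count]
        return sum(seen) / len(seen) if seen else 0.0


def mutate(parent, rng):
    """Return a copy of parent with one randomly chosen residue replaced."""
    pos = rng.randrange(len(parent))
    return parent[:pos] + rng.choice(AMINO_ACIDS) + parent[pos + 1:]


def design_loop(start, rounds=10, pool_size=200, tests_per_round=8, seed=0):
    rng = random.Random(seed)
    model = ToyModel()
    best, best_score = start, run_assay(start)
    model.update(best, best_score)
    history = [best_score]
    for _ in range(rounds):
        # Search: generate many cheap candidates, keep those the model ranks highest.
        pool = list(dict.fromkeys(mutate(best, rng) for _ in range(pool_size)))
        shortlist = sorted(pool, key=model.predict, reverse=True)[:tests_per_round]
        # Test: spend the scarce "experimental" budget only on the shortlist.
        for candidate in shortlist:
            score = run_assay(candidate)
            model.update(candidate, score)  # Feedback: results inform the next round.
            if score > best_score:
                best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


best, best_score, history = design_loop("AAAAAAAAA")
print(best, best_score)
print(history)
```

The design choice worth noticing is the budget split: proposing candidates is cheap, testing them is not, so the model's only job here is to decide which few candidates earn a measurement.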
Technical Need To Know
- Protein language model: A model trained on amino-acid sequences rather than human text.
- ESM and ESMC: ESM stands for Evolutionary Scale Modeling; ESMC is the Biohub protein language model discussed here.
- ESMFold2: A structure-prediction model built from ESM representations.
- ESM Atlas: Biohub's large map of protein space, described as covering 6.8 billion sequences and predicted structures for 1.1 billion representatives.
- Metagenomics: Sequencing genetic material from environmental samples to add biological diversity.
- MSA: Multiple sequence alignment, a method for comparing related proteins position by position. Antibody regions are selected for diversity, which can make them harder for MSA-heavy methods.
- Lab-in-the-loop validation: Model proposals are tested experimentally and fed back into the next round.
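One way to see why diversity is a problem for alignment-based methods is to measure how much each column of an alignment varies. The sketch below computes per-column Shannon entropy on a made-up four-sequence alignment; the alignment and function names are hypothetical, and real MSA pipelines do far more than this.

```python
import math
from collections import Counter

# Made-up alignment for illustration: some columns never change, others always do.
toy_alignment = [
    "GHSMGAK",
    "GYSQGAR",
    "GDSLGAE",
    "GWSAGAT",
]


def column_entropy(column):
    """Shannon entropy in bits of the residues in one alignment column."""
    total = len(column)
    return sum(
        (n / total) * math.log2(total / n)
        for n in Counter(column).values()
    )


def per_column_entropy(alignment):
    """Entropy for every column; all rows must have the same length."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must have equal length")
    return [column_entropy(col) for col in zip(*alignment)]


entropies = per_column_entropy(toy_alignment)
print(entropies)  # → [0.0, 2.0, 0.0, 2.0, 0.0, 0.0, 2.0]
```

Columns at 0 bits are fully conserved and give alignment-based methods a strong signal. Columns at 2 bits, the maximum for four sequences, carry none, which is the situation a region selected for diversity tends to produce.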
What Folks Are Saying
- Outside coverage is early and mostly release-adjacent.
- Axios framed Biohub's release as a protein world model that could compress some research timelines while emphasizing that therapeutic use still requires safety testing.
- Nature News described it as an open-source model predicting shapes for roughly 1 billion proteins and reported Biohub's claims on structure tasks.
- Latent Space's own framing is enthusiastic but should be read as release narrative, not consensus.
Nuanced Take
Rives argues for scale while still leaving room for scientific priors, structure data, and experimental validation. ESMFold2 remains a structure-prediction system, and Biohub's public model cards keep validation requirements explicit.
The work shifts from encoding every biological assumption by hand toward training large representations on evolution's archive, searching protein space with those representations, and spending scarce experimental capacity on the data and hypotheses most worth testing.
Antibodies are the key test. If ESMC-derived representations help design scFvs (single-chain variable fragments) and other binders that hold up in wet-lab assays, open protein foundation models may move the bottleneck from model architecture toward assays, robotics, measurement, safety testing, and the speed of experimental feedback.
A model can propose a binder. It cannot, by itself, prove the binder is safe, manufacturable, or clinically useful. The release points to a new allocation of work in AI biology; it does not remove the hard parts of biology.