DNA is a four-letter text, which makes sequence design a next-token problem, but only with the right data and a tight feedback loop.
What is DNA? It’s text. The alphabet of DNA has four letters and they keep permutating, so it’s a perfect problem for an LLM, if it’s trained on the right data. The general models are only okay-ish for DNA. There’s one trained specifically on DNA, Evo 2 by the Arc Institute, the best out there, but still 99% of what comes out is useless. So we built a proprietary Oracle to predict what is not useless, feed it to the wet lab, and the outliers go back into the model. Every spin of the wheel, it gets smarter.
Source: Tomorrow’s Medicine, with Cyriac Roeding (eCornell Keynote)
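
The loop in the quote (generate, filter with an oracle, test in the wet lab, feed the results back) can be sketched as a toy simulation. Everything here is a hypothetical stand-in, not Evo 2's API or the proprietary Oracle: `generate` samples bases independently instead of calling a DNA language model, `oracle_score` uses GC fraction as a placeholder scoring rule, and `wet_lab_assay` is simulated as a noisy reading of that same property. The sketch only shows how the pieces connect.

```python
import random

ALPHABET = "ACGT"  # the four-letter DNA alphabet

def generate(weights, n, length, rng):
    """Toy stand-in for a DNA language model: samples each base independently."""
    return ["".join(rng.choices(ALPHABET, weights=weights, k=length)) for _ in range(n)]

def oracle_score(seq):
    """Hypothetical oracle: a cheap proxy score (here, GC fraction; an assumption)."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def wet_lab_assay(seq, rng):
    """Simulated experiment: a noisy measurement of the same toy property."""
    return oracle_score(seq) + rng.gauss(0, 0.05)

def update(weights, hits, lr=0.5):
    """Nudge the generator toward the base frequencies seen in validated hits."""
    if not hits:
        return weights
    total = sum(len(s) for s in hits)
    freqs = [sum(s.count(b) for s in hits) / total for b in ALPHABET]
    return [(1 - lr) * w + lr * f for w, f in zip(weights, freqs)]

def run_loop(rounds=5, n=200, keep=20, length=30, threshold=0.6, seed=0):
    """One pass per round: generate, shortlist with the oracle, assay, update."""
    rng = random.Random(seed)
    weights = [0.25] * 4  # start from a uniform distribution over A, C, G, T
    history = []
    for _ in range(rounds):
        candidates = generate(weights, n, length, rng)
        # The oracle decides which few candidates are worth a wet-lab slot.
        shortlist = sorted(candidates, key=oracle_score, reverse=True)[:keep]
        # Only candidates that pass the simulated assay count as hits.
        hits = [s for s in shortlist if wet_lab_assay(s, rng) >= threshold]
        # Validated hits go back into the generator.
        weights = update(weights, hits)
        history.append((len(hits), weights))
    return history

if __name__ == "__main__":
    for round_no, (n_hits, weights) in enumerate(run_loop(), start=1):
        print(round_no, n_hits, [round(w, 3) for w in weights])
```

The design point is that the oracle sits between the generator and the expensive experiment: it spends wet-lab capacity only on the shortlist, and only measured results update the generator.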