Riskgaming

Biology is becoming engineering and not just science

Description

During a recent interview, Nvidia CEO Jensen Huang emphasized his interest in how Nvidia's AI processing chips could transform the science of life. He noted that this science, when properly understood, could evolve into a new form of engineering. Currently, though, we lack the knowledge of how the extreme complexity of biology works, and we lack the models — namely AI models — to process that complexity.

We may not have a perfect understanding of biology, but our toolset has expanded dramatically over the past ten years. Now, with the combination of data, biology and AI, we're seeing the early signs of a golden era of biological progress, with large language models that are able to predict everything from protein folding to, increasingly, protein function. Entire spaces of our map are being discovered and filled in, and that is leading some bullish scientists and investors to call the period we are living in the century of biology. But much remains to be done, and that's the topic of our episode today.

Host Danny Crichton is joined by Lux Capital's bio investor Tess van Stekelenburg. Tess and Danny talk about Nvidia's recent forays into biology as well as the new foundation model Evo from the Arc Institute. They then look at what new datasets are entering biology and where the gaps remain in our global quest to engineer life. Finally, they'll project forward on where evolution might be taking us in the future once unshackled from nature.

Transcript

This is a human-generated transcript; however, it has not been verified for accuracy.

Danny Crichton:
I think one of the most interesting things going on in the world today is that NVIDIA just crossed $2 trillion in market cap, making it one of the all-time biggest companies in the world. There are only two other tech companies that have crossed this, and then Saudi Aramco is up there somewhere, the classic "data is the new oil" line in terms of resources. But one of the things that the CEO of NVIDIA said recently was to emphasize how important biological research is to the future of NVIDIA. And it was striking, at least to me, and I think for you, Tess, as well: we talk about foundation models, we talk about text, we talk about video and the future of screenwriting in Hollywood and all these creative industries. And yet the person at the center of this whole AI world is thinking about the foundations of the biological world. And maybe that's not surprising, because recently there was news (and you're going to be better at summarizing it than me) of the launch of a new foundation model called Evo, which has really taken the bio world by storm.

Tess van Stekelenburg:
One of the things that he said in particular was that biology has the opportunity to become engineering and not just science. That basically means that the moment it becomes engineering, it's not just R&D anymore; it becomes predictable and can start compounding on everything that happened in the previous years. It starts exponentially improving. And so after AlphaFold2, I think a big thing that we started seeing is this increase in the number of biological foundation models. Evo, in this long journey of what we've been seeing since AlphaFold over the past two, three, four years, is a long-context DNA foundation model. It was released by the Arc Institute and the Stanford Center for Research on Foundation Models. And one of the reasons why it was so interesting is that it actually required a change in the architecture.
DNA is not just the protein-coding regions, which code for the functions that you care about, but also a lot of regulatory regions that tell you when, where, and how to actually produce that protein. So it has all the instructions for a cell, which is why almost every cell in your body has the same DNA, but they all look very different. That all comes down to how the proteins are read, and when and where. And so, unlike proteins, genomes span much longer contexts, and a lot of the transformer architectures that we had could only take in shorter contexts. So that was a big breakthrough that we can dive into.

Danny Crichton:
I thought it was interesting. I think it was about 7 billion parameters, if I remember correctly, one of the largest DNA models we've ever seen. But I think one of the challenges for me, as someone who's interested in bio, covered it, and studied it as an undergrad, is that I'm overwhelmed by the amount of information coming out of the bio world. So to me, there was this crossover moment with AlphaFold, from Google's DeepMind division, a couple of years ago; I think it was Science's breakthrough of the year. It was this huge qualitative shift where we went from not really knowing how to fold proteins (people probably remember all the different programs where you could do protein folding at home) to all of a sudden having the problem sort of solved.
We were able to predict with pretty high accuracy from a DNA strand what a protein would look like. But since then there have just been dozens and dozens of foundation models, each of which has its own features. So for instance, this one touts long context: you can have a bigger context window than other foundation models. Why does that matter in this context? Why is that an innovation compared to what's come before?

Tess van Stekelenburg:
The typical transformer architecture was limited in the amount of context that it could take in, and genomes are very large. Evo can take in a context of about 131,000 tokens. A lot of the data that's encoded is in these long-range interactions between one part of the genome, which will have the regulatory sequences, and a part that's a lot further down. And so when you have shorter context windows, you're chopping all of that sequence data up and losing a lot of the relevant information that tells you, for example, how to make a CRISPR system. When you increase the context window, you're able to start going after these system-wide interactions and can actually access the design of more sophisticated biological functions.
I actually got a text, or multiple messages, from people saying, "Okay, wow, is this the holy grail? Have we solved synthetic biology, and this is it?" And I think the interesting thing was: no, this was trained on prokaryote data, which is basically a lot of the bacteria and microbes that we see everywhere. It was not trained on human genomes, so it will not be as therapeutically relevant. The things we're finding are really the scalpels of evolution, the tools, the scissors. So we are finding tools with this, but it's still not the holy grail of biology.
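
To make the context-window point concrete, here is a toy sketch (not Evo's actual tokenizer or architecture; the motifs and positions are invented) of how chopping a genome into short windows severs the long-range pairing between a regulatory region and the coding region it controls:

```python
# Illustrative sketch: why a short context window loses long-range
# genomic structure. The "genome" and motifs below are made up.

def chunk(seq: str, window: int) -> list[str]:
    """Split a sequence into non-overlapping windows of `window` bases."""
    return [seq[i:i + window] for i in range(0, len(seq), window)]

def any_window_sees_both(seq: str, window: int, a: str, b: str) -> bool:
    """True if some single window contains both motifs at once."""
    return any(a in w and b in w for w in chunk(seq, window))

# Toy genome: a "regulatory" motif far upstream of a "coding" motif.
genome = "TATAAT" + "A" * 500 + "ATGGCTTGA"

print(any_window_sees_both(genome, 128, "TATAAT", "ATGGCTTGA"))   # prints False
print(any_window_sees_both(genome, 1024, "TATAAT", "ATGGCTTGA"))  # prints True
```

With a 128-base window, no single window ever sees both motifs, so a model reading those windows independently cannot learn their relationship; a window longer than the whole span can.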

Danny Crichton:
And just as quick context before we move on: obviously we had COVID-19 over the last couple of years. COVID is about 30,000 base pairs, so its context window, if you will, is 30,000. Evo's is 131,000. So it is able to actually take in the entire genome of COVID and do something with it if it wanted to, as one example to give some relative scale. But to go to your elementary tools of biology, if you will: I think what has changed in the last 10 years is interesting (we're coming up on roughly the decade anniversary of the discovery of CRISPR, give or take). If you think about DNA, we discovered its structure decades ago. We discovered the building blocks with CRISPR and CRISPR-Cas9 about 10 years ago.
And now we're really starting to accelerate, at least in my view, our understanding of all of the ways in which DNA operates on itself. It self-repairs, it replicates, it copies and interacts with other pieces of DNA. We've learned a lot about epigenetics, this idea that DNA can regulate itself, and we're getting that kind of complex-systems science down to something calculable. And that's where I think AI has been so interesting: we're filling in a lot of the blank areas where scientists have just been groping for a tool in the darkness, almost like what's called the drunkard's search. We're always looking under the streetlight for our keys, saying, well, the light is over here. But now we are actually getting these core tools.
And so yes, it is just prokaryotes. Yes, it's not human cells that we're getting this DNA from. No, we can't get a therapeutic right away. But we're actually getting something much more important, which is the base layer of the biological sciences. So if you really want to go back to NVIDIA CEO Huang's comments about turning biology into engineering, this is really the first step: you have to actually understand how biology functions. And up until the last decade or two, we really didn't understand the full system. We had access to parts of it, but we're starting to get the fuller picture.

Tess van Stekelenburg:
I mean, I completely agree. I think the biggest shift that's happened is sequencing: everyone's heard of these exponential curves of sequencing costs going down, and as a result, these databases are just growing. That's databases across genomes, which is what this Evo model has been trained on, and that's metagenomics databases, which get at proteins. The way that we would look for a lot of these tools before was very much manual search. You'd say, okay, I know that this particular sequence has this function, maybe it cuts something, maybe it binds somewhere; I'm now going to take this sequence and search all of the databases to find something that's similar. What we've actually been finding out with a lot of these language models being applied to biological data is that the embeddings they create might be better search mechanisms than sequence alignment alone.
So you could have something that has almost no sequence homology but has deep functional or structural conservation, just because it has evolved under pressure to keep that structure or that function across a variety of different sequences. So it's actually proving to be a better way to search for new tools. It's completely changed the paradigm, away from manually looking through lines in a database to see what matches, toward doing it based on what something's predicted function or predicted structure would be, and how similar those are.
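
The search shift Tess describes can be sketched in a few lines. The sequences and embedding vectors below are invented for illustration, not outputs of any real protein language model; the point is only that embedding similarity can rank a functionally similar candidate first even when raw sequence identity would rank it last:

```python
# Toy sketch: ranking database hits by embedding similarity versus
# naive sequence identity. Embeddings here are made-up 3-d vectors.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def identity(a: str, b: str) -> float:
    """Fraction of matching positions (a crude stand-in for alignment)."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / max(len(a), len(b))

# Hypothetical database: sequences with model-assigned embeddings.
db = {
    "MKTAYIAK": (0.9, 0.1, 0.2),  # no homology to query, similar embedding
    "GLDWVRNQ": (0.1, 0.8, 0.5),  # near-identical sequence, distant embedding
}
query_seq, query_emb = "GLDWVRNN", (0.88, 0.12, 0.18)

by_identity = max(db, key=lambda s: identity(query_seq, s))
by_embedding = max(db, key=lambda s: cosine(query_emb, db[s]))
print(by_identity, by_embedding)  # prints: GLDWVRNQ MKTAYIAK
```

Identity-based search picks the near-duplicate sequence; embedding-based search picks the homology-free candidate whose learned representation sits closest to the query, which is the behavior she describes for structurally or functionally conserved tools.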

Danny Crichton:
And what you're getting at a little bit is that the biggest translation that is hard to do in biology starts with this core DNA, these ATCGs, the classic code of life. But just because we have those letters, we don't actually know what they do. We have to translate them into amino acids; those amino acid chains are the proteins, and they fold into a particular shape based on their biochemical and biophysical properties. And then they interact with this very complex biological system called the human body, or whatever organism we're studying, and we have to figure out how they affect other proteins and other molecules within the system: how they bind, where they can bind. That's the puzzle we're all trying to put together.
To me, what's interesting here is that I do think we're getting this core firmed up around the DNA layer, so we're really understanding the core tools. We're understanding how to cut DNA, we're understanding how to edit it. If we think of sequencing as reading, we've moved to having a lot of sophistication around how the body writes it and how we can write it as human beings intentionally trying to do things. But the challenge, at least from my perspective, and I'd be curious about your thoughts here, is how that gets translated into actual proteins. Because that is the core engine of how a life form works. We're not built around DNA, so to speak. We're built around proteins, which come from DNA, and it's that translation that I think is so hard and where so much more work has to be done to get to the kind of therapeutics that people want at the end of the lost ark, if you will, of this journey.
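
The reading step Danny describes, codons translating into amino acids, can be sketched minimally. Only a handful of entries from the standard codon table are included here, just enough to translate the toy sequence:

```python
# Minimal sketch of the DNA -> protein step: triplets of bases (codons)
# map to amino acids, and the resulting chain is what folds into a
# protein. This is a tiny subset of the standard codon table.
CODON_TABLE = {
    "ATG": "M",  # start codon, methionine
    "GCT": "A",  # alanine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TGA": "*",  # stop signal
}

def translate(dna: str) -> str:
    """Read codons left to right until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCTAAATGGTGA"))  # prints: MAKW
```

This is the part of the pipeline we understand deterministically; the hard, model-driven part is what happens next, predicting how the chain "MAKW..." folds and what it binds.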

Tess van Stekelenburg:
Yeah. Well, I mean, the way that these models are developing is almost at each level; I would call them biological modalities, where you have DNA, RNA, proteins, and then maybe small molecules and metabolites. Currently, we don't have an all-encompassing model where all of these are integrated and we're able to switch between layers and really understand the interactivity. I think that's somewhat of a pipe dream down the road.

Danny Crichton:
And why is that? Why don't we have a magic model?

Tess van Stekelenburg:
Because things break down. A lot of the outputs of these models are just predictions that might not even be physically viable. They can give us an approximation of what a function might be or what a structure might be, but it's not the ground truth. And so we need to get to a point where at least the predictive power gets even better. It's a very complex system. If you get down into the nitty-gritty, there's actual biophysics at some point: where enzymes move, how much space they take up. And if one catalyzes a particular substrate and produces a product, that product might inhibit something else from being catalyzed.
And so just because you can predict one reaction doesn't mean that you can predict all of the downstream ramifications of how that product might bump into a protein, change its course, and cause it to catalyze something else. But we will be seeing a DNA model that's pre-trained on genome sequences interact with a model that's pre-trained on RNA, and then use those two to maybe design better CRISPR-Cas9 systems or optimize the guide RNA. I think the way we'll see this is those models interacting with each other, rather than one big holistic model at the start.

Danny Crichton:
Well, when I think about how these models come together, the big benefit over the last couple of years has been this massive increase in data around biology. You talked about the exponential decrease in the cost of sequencing; that means we have so much more sequencing data available to us, so we can actually train these models really, really well. The same thing is true in protein folding. We were able to calculate, I think, thousands and thousands of protein structures. Then AlphaFold came out with hundreds of thousands of predicted proteins, and now we're going back and asking, okay, here are all these predictions, which ones did it get right? It got a lot of them right, I think, but it's not perfect, as you've pointed out. And so we're able to improve this data over time. Where are the gaps today in the data? Where is it that, if only we had more sequencing data, some AI model would exist, but because the sequencing, the testing, or the ability to actually build that dataset is very expensive, we just don't have access to it?

Tess van Stekelenburg:
Definitely at the level of protein functional screens, where the number of screens we would need to do, given the diversity of proteins that exist, is just massive. And that's where the cost comes in, 'cause it's not as standardized. You might want to understand a protein's thermodynamic stability, or its binding properties, or whether it's going to catalyze a particular reaction. All of these are independent functions, and proteins do everything from transducing signals in our eyes into neuronal action potentials, to digesting the food that we eat, to causing fluorescence in algae in the sea. Just about all of the functions that we see biology perform are carried out by proteins, and to screen what a protein might do, function by function, is very costly. So I would say, yeah, proteins and their functions are definitely the largest area where there's just a diversity of datasets, which makes it hard to scale.

Danny Crichton:
And the complexity here is that proteins can do multiple things. They interact with each other, so those interactions change each other. They can co-regulate each other, so the ability to express different parts can be controlled by other aspects of the biological system. And all those layers add up to a level of complexity where you just need more data; I'm coming from the stats world here, but the statistical power to actually make a proper prediction requires so much more information than at these other layers. And so that keeps that section of biology much more proprietary, much more controlled. That's probably the application layer, so to speak, where the most value will be created, because that is where there's more of a moat, more defensibility, than elsewhere.

Tess van Stekelenburg:
Yeah, a hundred percent.

Danny Crichton:
So that was the input question. Obviously we have way more DNA sequencing, but then we end up with protein functions and interactions, where it's much harder. But let's talk about outputs, because we've touched on this quite a few times: we keep talking about interpretations, we keep talking about predictions, but a prediction is not the same thing as truth. And that has been kind of the core debate across most of artificial intelligence and machine learning. You ask a question in ChatGPT and it's almost always right about basic questions. You can ask, what's the capital of the United States? It will almost certainly tell you Washington, DC, unless you've done something to really mess it up, you've had a lot of discussions with-

Tess van Stekelenburg:
America does not exist in my custom instructions.

Danny Crichton:
Right, right. It's possible if you don't reset the model, or whatever the case may be. But as we get to outputs and we're starting to talk about interpretations: how useful are these interpretations? Are they helping us, or just guiding us? It's giving us a map of the region, in that old expression of science. What's the old line? The map is not the territory; the model is not real life, et cetera. But I'm curious how much that is helping us move the biological sciences forward. Is it helping us move forward linearly, exponentially, or is it actually slowing down our progress as we're overwhelmed with so much? Oh my God, AlphaFold just put out a million different things to look at. My God, how do we even begin to comprehend all the stuff we have to process now?

Tess van Stekelenburg:
Yeah, I would maybe say that a couple of years ago, it was definitely in that stage where it was incomprehensible. We had sequencing, and it's great that the cost curve has been going down, but one thing you could say is that we did genomics wrong. We were looking at single-nucleotide polymorphisms and candidate genes, but the real statistical power only came once we were able to put these biobanks into transformer architectures. I don't think that, as humans, we can fully understand all the relationships and covariances that exist. We have seen that when we use transformers, they're able to give outputs that do not require us to actually understand the patterns in the data. I actually think it gives us patterns that we, as humans, might not have been able to interpret, patterns that exist there and are more relevant than just looking at a single-nucleotide polymorphism or mutation, which is how we've done it in the past.
So I think it's improved our ability to use and design and predict, but as a result, we understand less about why that's the case. If I have a new protein sequence that has been given to me by a model I'm using in my browser, which I've designed because I want it to catalyze or bind to something, I might not understand the actual dynamics that enabled it to bind better. But I just know that it could be a better prediction than anything I would've come up with 10 years ago. And so this is maybe going into an area where I'm seeing a lot happening right now: the prediction and design space that is starting to open up as a greenfield with a lot of these models coming out. That means a lot of hungry entrepreneurs going to build companies on how we actually enable biologists to access many of these models and chain different pieces together, as well as just the programmatic access that has come from these models being released as open source.
As a result, there's been a whole breakthrough in the accessibility of these models, in what it enables biologists to do in designing different sequences or proteins in much shorter timeframes, and in getting to answers in a very short period of time. Just to add an example: I bought a sequencer, a very small nanopore sequencer, for $600 or $800, and I started using it on the weekends. And I went from a Lambda phage FASTA file to a visualized structure of a DNA-packaging protein, using a browser-based tool and ChatGPT, in about eight minutes. And this is Tess van Stekelenburg, who hasn't touched sequencing in a while. Being able to do that, I think, is remarkable. It would not have been possible two years ago; it would not have been possible three years ago.
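
The first step of the weekend workflow Tess describes, getting sequences out of a FASTA file, can be sketched as follows. The record below is a made-up fragment, not actual Lambda phage data:

```python
# Minimal FASTA parser sketch: FASTA files are plain text where a ">"
# header line names a record and the following lines carry its sequence.
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {header: sequence}."""
    records, header, parts = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        records[header] = "".join(parts)
    return records

example = ">lambda_fragment\nGGGCGGCGACCT\nCGCGGGTTTTCG\n"
print(read_fasta(example))
# prints: {'lambda_fragment': 'GGGCGGCGACCTCGCGGGTTTTCG'}
```

In practice a library like Biopython handles this (plus edge cases) for you; the point is that the file format at the start of the pipeline is simple enough that the hard part is everything downstream.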

Danny Crichton:
So it's still a black box. When I think about engineering, and this conversion of the biological sciences from a science into an engineering discipline, it's not going to be like civil engineering, where you have physics, you have statics, you have concrete, you can build a bridge, and you understand exactly how it all works. You can understand the geology, the land, the wind forces, and you can have a complete model of the system. And when we make mistakes, like the famous bridge in Washington state that collapsed, we learn more about the modeling. We've gotten so good at it that it's actually exceptionally rare for things to fail, and there are usually specific human reasons when they do. In biology, the engineering is going to look very different, because we are relying on these tools that are black boxes.
We have a sense that they generally work; they will come back with the correct answer in most cases. In other cases, they can prioritize. So maybe we don't know what the right answer is, but it's one of these 20. Instead of looking at a set of a million different proteins, or whatever the case may be, one of these 20 is the right answer, and you've taken a massive project down to a small amount of work that humans can actually do. But I think as we start to think about engineering biological life, entering a new stage of drug discovery and biological advancement, we are going to have to get comfortable with the idea that we understand the basic principles, we understand how it all connects, but ultimately we are still reliant on these AI models that are figuring out the details for us. We are going to know what happens below them, and we are going to know what happens above them, but what happens in the middle is a huge open question.
And at least in my opinion, I don't think we're going to have that answer in a short period of time. The good news is we kind of don't need it. In the same way that we didn't understand biology for thousands of years, we don't necessarily have to have every molecule in the body figured out to be able to solve real, challenging biological problems. And that, to me, is the arbitrage opportunity right now: we've created a new set of capabilities that allows us to do engineering even though we don't have every fundamental figured out. It's a very tight, nuanced distinction, but to me it's the part that is the hardest to understand. And once you grasp it, techbio, all the excitement around AI/ML in biology, really opens up in terms of what the potential is over the next decade. And the really bold and ambitious, in some cases, call it the century of bio.

Tess van Stekelenburg:
There is one approach where you could say the model learns something about evolution when you train it on all these sequences, right? It is learning some type of pattern about how these sequences have evolved and what their function might be. And so it could just be extracting a lot of that latent data on what evolution is, which for us is just the principle that a mutation confers a fitness advantage that gets selected for, but we don't have the simulations to do that. And so this could be an approximation, where those are some of the patterns it's learning, but it's going to be hard to fully understand that. I think where I get really excited is being able to break out of evolution.
So, opening up the design space beyond what we've seen and beyond all of the samples that have existed, and really going into the possible combinations that lead to a function we care about but that don't yet exist on Earth. And if those are physically viable, I think it allows us to actually accelerate effective evolution without having to wait for a mutation to pass on to the next generation. So that's where I get very excited about what these models are able to do.

Danny Crichton:
I'm going to create my own central dogma, my own synthesis, which is that we're going to intelligently design evolution going forward, and we're going to end one of the great cultural debates of the 20th century by intelligently designing the future of evolution. But we have so much more to talk about on this subject, and Tess, I know you are going to come back for multiple episodes. But that is the central dogma, that is what we're talking about today on the frontiers of the biological sciences. That's what we're looking at every single day. And it does seem like there's a new model getting released every week, maybe every day, not every hour. That's probably exactly [inaudible 00:22:23]. There aren't enough NVIDIA compute chips in the world to get there right now, but we are getting close. They're coming fast and furious, Fast and Furious Part 12, Bio Edition. But nonetheless, Tess van Stekelenburg, thank you so much for joining us.

Tess van Stekelenburg:
Thank you.
