What if we could build an AI that doesn't just answer questions, but makes fundamental scientific discoveries on its own? That's the mission of Future House, and in this episode, host Allison Duettmann sits down with its co-founder, Andrew White.
Andrew shares the incredible journey that led him from chemical engineering to the forefront of the AI for Science revolution. He gives us a look under the hood at Future House's flock of specialized AI agents, like Crow, Finch, and Owl, and reveals how they recently accomplished in just three weeks what could have taken years: identifying an existing drug as a potential new treatment for a common cause of blindness.
But the conversation doesn't stop at the successes. Andrew offers a sharp critique of the current methods for evaluating AI, explaining what’s wrong with benchmarks like "Humanity's Last Exam" and why the ultimate test is real-world discovery. He also makes a compelling case for completely reinventing the slow and inefficient scientific publishing system for an era where machines are both the producers and consumers of research.
Andrew is also fundraising for the Frontiers Society at IPAM to advance this work. If you’d like to support, you can donate here: IPAM Donation Page.
Allison: Hi everyone, welcome to the Existential Hope podcast. I'm super happy to have Andrew White here. Andrew, you are the co-founder of Future House, have some other hats on as well, and are just a really amazing thinker in the AI for science space. I'm really excited to pick your brain.
Let's kick it off with a bit on your background. Your bio mentions over 50 peer-reviewed publications and books in all kinds of areas, covering LLMs, chemistry, AI, statistical mechanics, chemical engineering—you name it. How did all these interests converge? How did they make sense to you? And how did they ultimately influence you to co-found Future House?
Andrew: Yeah. All right. I'll start from the beginning. When I started my PhD, I really wanted to do work with computers, and I was in chemical engineering. I was a person that wanted to do a lot of things, so I told one of my professors that I really like chemistry, physics, engineering, and programming and I wanted to be able to do all of them. And he said, "Well, I guess chemical engineering is the only thing that kind of fits that."
So then when I went to do my PhD, I went into a lab that had basically done some work on a method called molecular dynamics. Molecular dynamics is a really popular method in biology and physical chemistry that tries to simulate atoms as single particles. It's a really cool method, and I'll talk more about it later. Anyway, I wanted to do that in a lab that was doing experiments. So I went to this lab that was working on biomaterials, and Shaoyi Jiang was my PhD adviser. We were trying to figure out how to model biomaterials that resist non-specific interactions.
These are things like, you want to put an electrode in a brain. So you want to build a brain-machine interface. The problem is that the body sees that as a wound and it tries to heal that wound by covering it up with something called a foreign body response. So we were designing materials that could kind of block that to make the body feel like it's not some sort of foreign object or some wound to heal. And we were going to try to do that with molecular dynamics. And that sounds insane, and it still sounds insane. It doesn't make any sense to try to simulate atoms and use that to design a biomaterial.
So I started off my PhD, and the only thing we could come up with was looking at water—the role of water in this—because we had this hypothesis that if the amount of water in these systems is similar to physiological levels, then maybe that would work well. Anyway, so we went down this path and I did a lot of simulations of water and a lot of bioinformatics work, like seeing what common patterns are on surfaces in the human body or in cells. And I kept getting stuck by this problem that what would happen in the lab had nothing to do with what a molecular simulation would do.
So throughout my academic journey, I kept trying to find ways to bridge first-principle simulations like molecular dynamics with experimental results. That was what my postdoc at Chicago was on, and then I formed my research lab at the University of Rochester on trying to bridge the gap between simulation and experiments.
Then in 2019, I did a sabbatical at UCLA at something called the Institute for Pure and Applied Mathematics, or IPAM. They brought together people like Yann LeCun and Frank Noé. We had this guy named Pat Rulli who did a lot of Google science work when Google science was a thing. And all these people were at this long, three-month program working on how to use machine learning in physical sciences. So people were working on physics, material science, fluid mechanics; I was working in chemical engineering; some people were working on drug discovery and protein structure. It was a really cool environment, and that gave me the time to pause and breathe.
Being an assistant professor is a very time-consuming process, and while at UCLA—I lived in L.A. with my wife and my son there—I was able to just take some time to absorb what's going on in the field. And I realized that machine learning really is a way to bridge any kind of computational modeling and experiments. I wrote a textbook on this at the time. I'm a very active learner, so rather than try to learn something by reading, I decided to write a book. It's probably not the best way. It was a living book that I iterated over for a number of years. In fact, I was working on a new version of it a couple of weeks ago.
Then I went back and started teaching a class on it, started working on machine learning and chemistry. Anyway, that's a very long explanation of how I got there, but that's sort of how I got to that point of being in the domain of machine learning and chemistry.
Then I started working on language models. During the pandemic, everyone I know kind of went a little crazy. I was starting to question whether molecules can be represented as a collection of bonds and atoms. It's kind of a funny thing about chemistry—a chemical bond is like a Platonic ideal. There's actually no definition of what a chemical bond is. So it made me start to question everything about chemistry. Again, I was isolated in my house with nothing to do for months. That's what the pandemic makes you do.
Allison: Yes. Exactly.
Andrew: So I basically was like, how do we reframe chemistry? And what I came up with was using natural language, because my idea was that the only thing that really happens in a lab is the actions you take to make something, and the best way to represent actions you take in a lab is natural language. So I had this hypothesis that natural language is the way to represent chemistry. And that was around the same time language models were becoming popular, which was really convenient because you could actually test some of these ideas. That led to some really cool work that ended up with me working on GPT-4 with OpenAI on doing red teaming.
But anyway, I do want to mention that IPAM at UCLA is a really great program. They're doing a program this fall on applying machine learning principles and simulation to electrochemistry, and the program has been suspended because of the Trump administration withholding grants from UCLA. I personally donated money to IPAM, and I'm trying to work in the philanthropy space to see if anyone can help bridge that funding. But that was a really important program for my career, and it was very sad to see that it's now in danger.
Allison: Share a link, and we'll add it to the show notes. Perhaps we can get a few more dollars dedicated to that.
I mean, you've really worn so many hats already at that point, and we're not even yet at the Future House stage. So, Future House to me is one of the more exciting projects in the Bay Area, and not only because your office is possibly the coolest office that an organization has ever had in the Bay, but also because the people that are working there and the mission are just so timely and are really dedicated to driving AI for science projects. So could you just explain what Future House does, and what is all the bird language of crows, falcons, owls, and phoenixes about?
Andrew: Yes. Okay. So, Future House's mission is to build an AI scientist. I think our mission changes on my level of optimism for the week, but basically, our mission is either to build an AI scientist, automate scientific discovery, or accelerate scientific discovery. Those are the sort of the three tiers. I would say our mission is successful when we have materially accelerated anyone's process of scientific discovery. And so, we build tools to do that. Obviously, we want to build an AI scientist, but we're trying to build it in phases. We're building what I would call AI assistants—things that can automate literature search, data analysis, and molecular design. But of course, scientific discovery is bigger than that.
Future House is a nonprofit co-founded by me and Sam Rodrigues two years ago. Sam is the CEO. I'm Head of Science, which is this nebulous title that means I want to do hands-on work, but I also want to have some level of leadership in the org.
And yeah, our mascot is the crow. This came out of a paper I wrote with a group in Rochester and Switzerland. This was about automating the process of molecular design, called ChemCrow. At the time, there was this idea of language models as a "stochastic parrot"—it's just repeating stuff. And it turns out that crows also can just parrot back words. Crows can talk. So you can go look up videos on YouTube of crows talking. They can say "hello" or "nevermore" or whatever. But they use tools, right? And so crows have language, they can talk like a language model, but they use tools. So we've been building these sorts of agents that can speak in English and natural language but also use tools, and they're maybe more intelligent than a parrot. Although, of course, after we did this, someone found a video on YouTube of a parrot using tools, and so then it kind of crashed my whole worldview. So I choose to ignore that fact.
So that was the thinking behind the name. And then, why do we have Falcon and Crow and Owl as some of our agents? Basically, we suck at naming things, and I don't know if anyone has ever learned how to name things correctly in AI. You can look at the AI companies' naming of things and see that it's a very hard thing to do. We wanted to come up with a way to separate the models that are behind the things we build from the product—or I shouldn't call it a product, you know, the agent that people use.
So, based on the idea that Crow was our first thing, we started using bird names for how you interact with the system to separate it from something like PaperQA. Crow is the literature search agent, and then PaperQA is the algorithm behind it. Or Phoenix is our chemistry one, and ChemCrow is the thing behind it. We have a protein design one called ProteinCrow, but we haven't given it a bird name. I think "Potoo" is my current proposal. Have you guys seen a potoo? It's a terrifying, crazy-looking, hilarious bird, which I feel represents the process of protein design. So anyway, that was a whole bunch of information, but that's kind of Future House.
Allison: Could you provide a little bit more info on what these various different birds do for someone who has not used any of them yet? I mean, if anyone hasn't used them yet, I would just go to the website and start playing around with a few. They're really quite phenomenal. But yeah, give us a little bit of a look under the hood. What are they useful for? What have they done? What have these birds achieved so far?
Andrew: Yes. So, Future House is 25-30 people right now. And we're building these agents which are basically a collection of tools in an environment. And they're focused around some kind of problem type.
So Crow will do a literature search. You can ask it something like, "Make a list of all of the molecular degrader drugs that have been reported in literature." It'll go through and read papers, see who cited this paper, is it from a predatory journal or not? It'll do Google Scholar searches just like you would do. It'll read through papers page by page, build up a dossier of evidence of how it can answer the question. It'll try to propose an answer and think about it. So, it's really similar to how a researcher would go through it. So, that's Crow.
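To make the shape of that loop concrete, here is a minimal, hypothetical sketch in Python of a gather-evidence-then-answer agent. The Paper class and the injected search, relevant, and answer callables are illustrative placeholders, not Crow's or PaperQA's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Paper:
    citation: str
    pages: List[str]
    predatory: bool = False

def literature_agent(
    question: str,
    search: Callable[[str], List[Paper]],                 # e.g. a Scholar-style search wrapper
    relevant: Callable[[str, str], str],                  # returns a supporting snippet, or ""
    answer: Callable[[str, List[Tuple[str, str]]], str],  # LLM call: question + evidence -> answer
    max_papers: int = 50,
) -> str:
    """Gather evidence paper by paper, then propose an answer grounded in that evidence."""
    dossier: List[Tuple[str, str]] = []
    for paper in search(question)[:max_papers]:
        if paper.predatory:                               # skip predatory journals
            continue
        for page in paper.pages:                          # read page by page
            snippet = relevant(page, question)
            if snippet:
                dossier.append((paper.citation, snippet))
    return answer(question, dossier)                      # draft an answer backed by the dossier
```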
We have something called Finch, which is a data analysis agent. This is something that basically tries to build a Jupyter notebook to explore a data set. So, you upload 20-25 files. They can all be 5 gigs each. And then you give it some kind of task. You know, "Look at this data set and make an interesting plot." Okay, that's the weakest thing you could ask it, but that's an example. It'll go through and build a Jupyter notebook cell by cell. And at the end, you'll get a Jupyter notebook that you can execute to repeat the analysis. It'll give you some plots and an answer.
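The "notebook built cell by cell" idea can be sketched with the standard nbformat library. This is not Finch's implementation; the hand-written list of snippets below simply stands in for agent-generated cells.

```python
import nbformat
from nbformat.v4 import new_notebook, new_code_cell, new_markdown_cell

def build_analysis_notebook(cells: list, path: str = "analysis.ipynb") -> None:
    """Assemble a notebook cell by cell from generated code snippets and save it."""
    nb = new_notebook()
    nb.cells.append(new_markdown_cell("# Auto-generated exploratory analysis"))
    for source in cells:                       # each snippet would come from the agent
        nb.cells.append(new_code_cell(source))
    with open(path, "w") as f:
        nbformat.write(nb, f)

# A hand-written "plan" standing in for agent-generated cells:
build_analysis_notebook([
    "import pandas as pd\ndf = pd.read_csv('data.csv')",
    "df.describe()",
    "ax = df.plot(x=df.columns[0], y=df.columns[1], kind='scatter')",
])
```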
And there's one, Owl, that does precedent search: Has anybody done X? This one can look through clinical trials, it can look through patents, it can look through previously published literature and see if anybody has tried this idea that you have before.
We have written some papers where we string these agents together to automate repurposing a drug to treat a disease. And they work really well for this. And I think what's really remarkable about these kinds of systems is the scale, right? They can entertain hundreds or thousands of hypotheses simultaneously. They can just operate at a very vast scale. And some of them are as good as humans. Some of them are better than humans. Some of them we haven't measured and we're not sure yet, like the data analysis one. We haven't really figured out what's the average human for data analysis yet. We're working on that right now. But they really just can basically do this sort of vast, scaled-up, horizontal discovery process.
Another part of our org is working on what I would call very sharp vertical models which are very intelligent at a specific task. I can talk more about that if you're interested, but that is sort of the most public-facing thing. A lot of these things are open source. We have an API that people can use to call them. So an API to do a literature search or to do a precedent search can be really enabling in your own work on building agents for discovery.
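As a rough illustration of what calling such a hosted literature or precedent-search agent from your own code might look like, here is a hedged sketch over plain HTTP. The endpoint URL, payload fields, and response shape are assumptions for illustration only; consult the Future House documentation for the real interface.

```python
import requests

# Hypothetical endpoint and payload shape; the real API will differ.
API_URL = "https://api.example.org/v1/precedent-search"   # placeholder URL
API_KEY = "your-key-here"

def precedent_check(idea: str) -> dict:
    """Ask a hosted precedent-search agent whether an idea has been tried before."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": f"Has anyone previously tried the following? {idea}"},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()   # typically an answer plus supporting citations

# Example usage (requires a real endpoint and key):
# result = precedent_check("Repurposing a ROCK inhibitor for dry AMD")
```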
Allison: Yeah, recently I think you guys proclaimed something like "a step towards superhuman insight," and that was quite an interesting and potentially quite actionable scientific discovery. Could you discuss that a little bit more?
Andrew: You know, one of the problems in my life, and I think many people's lives, is that you build tools and then you hope that someone will use the tools to do something great. And I have had problems with that in the past, and when we built Future House, we really wanted to have people inside using the tools to make discoveries because there's this chicken-and-egg problem. I'll build a tool and hope someone uses it, but maybe they use it for the wrong thing or something. So, we have a wet lab, a physical wet lab in Future House, and we have scientists that go in and try to use our tools to make discoveries.
Things have been going pretty well. We tried to build a process to discover a new treatment based on a new mechanism for a disease. We hired an ophthalmologist, Ali is his name, and he's an MD, and he helped us come up with a plan to treat dry AMD (age-related macular degeneration). Dry AMD is the most common cause of blindness in elderly people in the US, and I think internationally, and it's not really known how to treat this disease. There aren't really any good approved drugs. There are things that can reduce some of the symptoms, like taking eye drops to reduce irritation, but you will go blind over time with this disease.
So we asked our system to come up with as many ideas as possible. It hypothesized about a lot of ideas about how this disease could be treated and then it used things like Owl to see if anybody had tried this before, and if so, was it good, was it bad? And then based on these ideas that it thought could be good, what are experiments we could do in our lab? So then we did those experiments, then we provided the raw data to Finch, our data analysis agent, and said, "Analyze this data and see how it looks." And then Finch made all the plots in the eventual paper we released on this. And then it went back and said, "Okay, here are the results from the experiments, here's what we learned, what should we do next?" And then it proposed specific compounds to try to treat this in the cell model.
We came up with a compound, Ripasudil, which is an existing drug for a different condition, and we found that it actually basically hit the mechanism that we proposed, which is accelerating how your macrophages can sort of clear debris out of the eyes. And that, we think, is a potential treatment for dry AMD.
I think the exciting part about this is we picked a goal like "treat AMD" and the system went all the way through to propose a drug and give us cell-based evidence for it. And it's a drug repurposing, so there's not a lot of clinical trials that have to be done. But then you have to go put it in dry AMD patients in humans at the end to prove that it will work. But it got so much farther than research programs that could take years. The whole project took maybe three weeks, and a lot of those three weeks was the actual experiments. The actual computational work took maybe 12 hours total. And so this is like a real demonstration that this approach can work.
There are a lot of caveats here, like we had to think about how people go about repurposing drugs. We had to make sure that we said things like, "Make sure you think first about the mechanism, then think about the drug." We had to say things like, "Our lab only has these pieces of equipment and these are the cell lines we could use." So there's still some caveats, but I think it really demonstrates and gives you a glimpse of where the future lies for this work.
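A toy sketch of that hypothesize, check precedent, experiment, analyze cycle is below. Every stage is injected as a placeholder callable; this is an illustration of the loop Andrew describes, not Future House's actual pipeline.

```python
from typing import Callable, List

def discovery_loop(
    goal: str,
    hypothesize: Callable[[str], List[str]],   # Crow-style: propose candidate mechanisms
    has_precedent: Callable[[str], bool],      # Owl-style: has anyone tried this before?
    run_experiment: Callable[[str], str],      # the wet-lab step; returns raw data (a path, say)
    analyze: Callable[[str], str],             # Finch-style: analyze the data, report findings
    rounds: int = 2,
) -> List[str]:
    """Toy orchestration of the hypothesize -> check precedent -> test -> analyze cycle."""
    findings: List[str] = []
    for _ in range(rounds):
        for idea in hypothesize(goal):
            if has_precedent(idea):            # drop ideas that have already been tried
                continue
            data = run_experiment(idea)
            findings.append(analyze(data))
        goal = f"{goal}\nResults so far: {findings}"   # feed results back into the next round
    return findings
```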
Allison: I think you guys claimed that there's a path to superhuman insight. And on that note, you have also dug yourself deep into different benchmarks and how they fare versus not. One specifically I think that you guys recently published on was Humanity's Last Exam, which I'm sure many people that are listening to the podcast are familiar with. But maybe you could briefly outline that and what you think of the benchmarks of how we are currently measuring tasks of humans versus AIs, and then specifically what's wrong with Humanity's Last Exam.
Andrew: I was on a panel a couple of months ago and we were talking about how AI is impacting different domains. I was there representing science or something, and somebody was there representing medicine. They were saying that AI in medicine is something where it just makes no sense to build evals. Every patient is different, and in case reports, a doctor will pick up on things from talking to the patient that just can't really be represented. It's very rare that a doctor walks into the patient's room and all of the assays are done, all the lab results are back, there are written-up notes, and they just look at all of it and say, "Yes, you have a UTI," or something.
Their argument for AI in medicine is that you need to treat these systems like a resident. Basically, you walk around with it and you work with it for a few months or a year, and then you decide, "Yes, you are good enough to be an independent practitioner." And I think that at a certain point you just have to treat these like a trainee in your profession and evaluate them based on how well they can accomplish your job end-to-end. I think that's where we want to get to. Basically, you want to get to the point where you as an expert can work with these systems and be confident that it could do your job independently and say, "Great, you will go off and now be a doctor." Of course, the difference is that if an AI passes residency, you can copy that thing n times and now you've replaced a lot of medicine. So I think there are some consequential differences when you're done with the process. But I think the actual evaluation is maybe similar to how we evaluate humans.
Now let's bring it back. So then there are these evals, things like Humanity's Last Exam, which is an open-answer and multiple-choice test that measures how good AI systems are at doing what was characterized as frontier-level research. And I think this is great. What's really nice about a written test is you can run it over and over again in different parts of the world with different systems. You can all agree that you're measuring the same thing. Whereas if you have a doctor or a person watching the system, you may not agree on what is important or what is not important. It's really hard to standardize that kind of evaluation. I guess that's the "vibes-based eval," cast in a more favorable light than these written tests.
But we run into this problem where people are trying to hit multiple objectives in building these evals. An issue is that when you build a new eval, it might turn out that you finish it and then the LLM gets 99% on it and suddenly you wasted all this time. So when you build an eval now, you have to think carefully about making sure that LLMs can't answer it all correctly. Humanity's Last Exam was built trying to hit a bunch of different objectives. One of them was that it was crowdsourced. Basically, you have people providing questions. They're not professional question writers. They're not even necessarily experts in the domain. They're basically providing questions. Then you have an assessment process where people say, "Okay, let's see if this is a good question."
And because we're at the frontier of human knowledge, the real assessment for Humanity's Last Exam was, "Can frontier LLMs get the question right or not?" And so there's this requirement that frontier LLMs can't get the question right. This leads you down to these questions which are either: A) there's not enough context in the question for anyone to get it right; B) it can be a truly frontier-level question, like a really hard math problem that they just can't get right; or C) it can be a kind of "gotcha" question where you're sort of trying to trick the person answering it.
The example we gave in our write-up was: as of 2002, what is the rarest noble gas among terrestrial matter? So basically, if you were to take Earth and grind it all up and characterize the elements that are on Earth in the noble gas category, what's the rarest of them? If you go read any papers on this topic, they have terrestrial accounting of noble gases. It's an important question. Some of them, like xenon, are just in the atmosphere; they're very rare. Some of them, like radon, basically come out of the earth from the decay of uranium. That one is very rare on Earth, and you can disagree on how much there is.
But the trick of this one was that in 2002 in Russia, some people did an experiment where they created a few atoms that are in the same column on the periodic table of noble gases. People argue about what a noble gas is. A noble gas, according to IUPAC, is something in this column of the periodic table of elements. Some people say a noble gas has to be a gas and has to be non-reactive. This thing is not a gas. It's a solid. It's very reactive. It decays immediately. And so it's a question of whether this is really part of terrestrial matter in 2002. I don't know.
Anyway, the point is that this question has nothing to do with whether you can do research and are at the frontier of your field. The question belongs on a game show; it's one of these trivia gotcha questions. What we did in analyzing the exam is take the question and the answer and turn them into a statement. So the statement now says, "Oganesson is the rarest noble gas among terrestrial matter in 2002." Then we check to see if anybody has written a paper that contradicts this statement. Because when they made the exam, they rightfully said it needs to have what they called a univocal answer—that any scientist, the average scientist, 90% of scientists would agree that this is a true statement.
And no scientist would agree that's a true statement. There are multiple papers contradicting this, saying that xenon is the rarest gas or radon is the rarest gas. Anyway, so we applied this standard: if you turn the question into a statement, how often is it contradicted by peer-reviewed literature? And it turns out a lot of them are. We hired independent contractors to look at this, and they evaluated a large number of them. We used our AI agents to basically find the papers that contradict them ahead of time so that we could accelerate this process. And we found that 30% were contradicted by the literature in a pretty clear way.
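The scoring procedure Andrew describes can be sketched as a small loop. The to_statement and find_contradictions callables stand in for the LLM rewrite step and the literature agent; this is an illustration of the audit logic, not the actual analysis code.

```python
from typing import Callable, List, Tuple

def contradiction_rate(
    qa_pairs: List[Tuple[str, str]],                   # (question, official answer)
    to_statement: Callable[[str, str], str],           # rewrite question + answer as a declarative claim
    find_contradictions: Callable[[str], List[str]],   # literature agent: papers disputing the claim
) -> float:
    """Fraction of exam items whose question + answer, read as a claim, is contradicted in the literature."""
    contradicted = 0
    for question, answer in qa_pairs:
        claim = to_statement(question, answer)
        # e.g. "Oganesson is the rarest noble gas among terrestrial matter as of 2002."
        if find_contradictions(claim):
            contradicted += 1
    return contradicted / len(qa_pairs) if qa_pairs else 0.0
```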
At the end of the day, Humanity's Last Exam was built according to the incentives of the question writers, and the incentive of the question writers was to make sure current LLMs get the questions wrong. Why do they get them wrong? Well, sometimes it was just a trick question, or the writers actually just got it completely wrong. For one of them, somebody used a PowerPoint as the source for a question, and that PowerPoint presentation itself had a mistake, and so the question had a mistake. But then the person reviewing the question read the same PowerPoint as well, because if you Google it, that's the top result that comes back.
This is just kind of a mess, and I think we're just running out of steam here where people can actually write questions that frontier LLMs get wrong because they're not smart enough, because we're just scraping the boundary of knowledge. And so when we think about what the right questions to work on now at Future House are, we just say, "Okay, how many fundamental new discoveries can our systems make?" And if that's the measure, every time we get a new discovery, we're advancing scientific knowledge and evaluating the systems at the same time. So that's where we want to be, but of course, we're not all the way there yet.
Allison: That was a mouthful to digest. If you are trying to apply that learning to the scientific ecosystem in general, what's there to learn? Because Humanity's Last Exam is still one of the standards out there that people are trying to get after.
Andrew: Yeah. I mean, this is where it gets hard. How do you agree that systems are getting better and better? I think you'll see competition-level mathematics becoming a more popular standard because we have a really good calibration of what is a hard math problem. You go to chemistry, and it's not really universally agreed upon what a hard chemistry problem is. Sometimes things are really hard in chemistry because it's not stable at room temperature or because you need some stereoselective catalyst and it doesn't exist yet. It's a very difficult field because it's grounded in the physical world.
So how do we measure stuff like that? Well, basically at Future House, we make evals. We made one called Bioinformatics Benchmark, which is basically where you give someone a hypothesis that you're trying to study in a data set and then you ask it if it can answer questions about this hypothesis. It's okay. I don't think it's the best eval we've ever built, but that one is one that frontier LLMs get like 20% on.
We're building another one right now which we're calling internally LitQA3, and this one is what I would call high-recall tasks. Right now, agents are really good at precision. Humanity's Last Exam is a very precision-level task: go find this one fact. We're trying to build evals that are what I would call recall tasks, like "find every paper that studied X, do a meta-analysis to find out if X is true or false, and then give me statistics on whether that's true or not." That is a great question because the current agents really don't do well at incorporating a whole bunch of information. So that's an eval we're working on right now.
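For readers less familiar with the precision/recall distinction being drawn here, a tiny illustrative example follows; the DOIs are made up.

```python
def recall(retrieved: set, gold: set) -> float:
    """What fraction of the known-relevant papers did the agent actually find?"""
    return len(retrieved & gold) / len(gold) if gold else 0.0

def precision(retrieved: set, gold: set) -> float:
    """What fraction of what the agent returned is actually relevant?"""
    return len(retrieved & gold) / len(retrieved) if retrieved else 0.0

# A "find this one fact" task rewards precision; a meta-analysis-style task
# ("find every paper that studied X") is scored on recall.
found = {"doi:10.1/a", "doi:10.1/b"}
relevant = {"doi:10.1/a", "doi:10.1/b", "doi:10.1/c", "doi:10.1/d"}
print(recall(found, relevant))      # 0.5 -- half the relevant papers were missed
print(precision(found, relevant))   # 1.0 -- everything returned was relevant
```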
But I think the idea that there's going to be one number, like MMLU Pro or HLE, that characterizes models—I think we might be leaving that era, and it's going to be a lot of really hard work. So maybe composite evals. I know Scale AI has a composite eval. I think Salesforce or something has a business intelligence composite eval. I don't know how those work. I have to look at those more carefully. But I think we need to build something like a science composite eval that looks at a whole bunch of different areas, or we just need to be like, "Okay, we're going to judge them like we judge researchers." And the eval is to write a paper on these 10 topics, and we see how people grade them. And that sucks. It's not a great eval because it's not repeatable and it's very subjective. But we might be reaching the point where these systems are so good that it's not really possible to write a set of multiple-choice questions that meaningfully measures them.
Allison: As we're advancing more and especially as we're getting to this frontier of human knowledge, it's probably not going to get easier or more straightforward over time. When you think about the long-term goal of the AI for science field—the goal of automating scientific discovery and the R&D process in general—how do you see this ultimately benefiting the real world? For example, what diseases do you think we might be able to solve with the work that you and many other actors that are currently gearing up in the AI for science field are doing? Where's the maximum existential hope vision here in the next five years or so if everything goes right?
Andrew: Okay, so for diseases, I am of a generation of a whole bunch of very sad cynics. I've lived through the Human Genome Project where we thought if we could just look at the entire human genome, we could cure diseases. That didn't really pan out. Then there was high-throughput sequencing, high-throughput imaging, then transcriptomics and antibodies. There have been so many times when we thought we've basically cracked human biology, and it's really blocked us at every turn. So I'm a pessimist when it comes to curing all diseases.
Now, with that said, I have hopes that our approach is maybe one cognitive level higher than what has been done in the past. We're not trying to automate curing biology by building one really good viral delivery system or one really good modality or one really good way of looking at targets. We're trying to automate the process of scientific discovery with the understanding that there will be no one silver bullet. We're just going to try to make the bullets faster, I guess, or make more bullets faster, something like this.
So, I think we may be able to go after diseases that are more complex. Many diseases are very complex, and it may not be that there's one magic pill that solves them, and so you need to actually get a lot of understanding. Or maybe there is a magic pill for some complex disease. Maybe we'll find a magic pill for Parkinson's one day. But definitely right now, those diseases have so much research, so many papers, so much information that nobody is able to synthesize all of that and really pursue multiple hypotheses at once.
I think Alzheimer's is a great example of this. The people doing discovery are human, and in Alzheimer's research, it's such a complicated disease that human personalities and how much people could convince other people became a pretty big factor in what research directions were pursued. This led to a lot of people working on these amyloid-beta peptides as the causal agent in Alzheimer's research. And now we know from multiple failed clinical trials and cases of fraud where people were faking foundational results in the field, we know that that hypothesis is not really the most compelling one. And still, even today, I think people would rate it as the number one hypothesis among the community.
We're really scraping the limit of whether a big community of researchers can actually cure a disease, because it's just too much information for any one person. So everyone has to pick their sort of internal compass, and that internal compass that people have is influenced by the charisma of researchers. It's influenced by who's giving the keynote at the next conference or who the senior researchers are. So I have hopes that this kind of automation of discovery can basically provide a more unbiased pursuit of lots of hypotheses in complex diseases. But I don't think we've built a magic bullet. I think what we've built is an accelerator so that as we get new ideas, our agents can use them. RoseTTAFold 2 is a very new, exciting open-source model that's like AlphaFold 3 in designing small molecules. Our agents can use ESM, which is a very powerful protein language model that's used for manipulating protein designs. Every time these new tools come in, we can basically put them in our system quickly. And so we're able to accelerate the outer loop of science, which is where a PhD student has to learn how to use a method, has to learn how to install the code, and then try a couple of ideas. We're able to automate that outer loop. And so I hope we can accelerate progress on diseases that way. So anyway, I don't know if I answered the question directly, but this is sort of how I'm thinking about the process.
Allison: And is that your maxed-out, full-throttle existential hope version of where we could be in five years? If you think about full-throttle existential hope, what's the outer limit of where we could be in terms of AI for science in five years, if not only your work goes well but also that of the various other actors entering the space? What's realistic but also optimistic?
Andrew: Okay. So, if I can use my magic wand, basically we would have a big overhaul at the FDA so that approvals are automated, so that trial design is done by some kind of machine system, so that we can accelerate that process. More tolerance of creative endpoints, right? So, a big overhaul of regulatory. If you want to have an impact in 5 years, we have to have a big overhaul on the regulatory part.
On the other side of things, some people are concerned about the speed of pharmaceutical work in China, but I think that actually, this is a really great thing for patients. China's pharmaceutical industry has industrialized a lot of the pre-clinical discovery work. And so right now, we can go from target to compound in humans in China very fast. Now, a lot of the target research is still done in the US, and some people see this as an adversarial thing, but I think it's also a de-risking thing. China is making this work with regulatory changes relative to the US and with more focus on automation and scale-up. So we can do the same thing in the US.
I think if we can push really big changes on the regulatory side, on pre-clinical speed, and on the work we're doing on automating target finding and early discovery, combined with great work from companies like BigHat, which is making antibody discovery faster, and Isomorphic Labs, which is working on discovering small molecules faster—I think all of these things could work together and produce a really big speed-up.
And you can see evidence of acceleration. The GLP-1 work, right? We went from a peptide from Novo Nordisk down to a once-daily pill in a matter of, I don't know, three years or something. Very quick speed. And now Eli Lilly has this once-daily pill. So, the speed at which we can go from a target to something that can modulate the target to a once-daily pill is getting faster. And so I have high hopes about it.
But there are so many things out of my control as a player in this space, like what happens to the FDA. The feedstock for finding targets is academic research. A lot of the stuff that we do is we accelerate academic research or we can parse through academic research faster, but the input, the oil to this process, is academic research. And the NIH is getting its budget gutted, and the NSF is getting its budget gutted. And with inflation and tariffs, that input to this process is slowing down. So maybe more work has to be done. Maybe things like what Emerald Cloud Labs is doing or what Ginkgo Bioworks is doing on automating some of the fundamental research. Maybe that's an important component. But there's so much stuff going on. I know the theme is existential hope, so I think there is definitely hope for huge improvements here. All the ingredients are here, but it's such a funny thing where it's not really a techno-economic solution. It's almost a policy and will and where we focus our work.
Allison: You named a few of the bottlenecks that come up in almost every single foresight workshop. One that you particularly have your eyes on is the scientific publishing ecosystem. I know that you guys have also published something quite rebellious recently on the scientific publishing system, but you in particular also have a draft out right now that goes something like "Publishing for Machines." It's relatively universally observed that as scientific research and progress is accelerating, the traditional straitjackets of the journal-industrial complex aren't cutting it anymore. So if that is a thing that we need to reinvent or potentially entirely invent anew, what is your go-to strategy for doing that better?
Andrew: I think one of the things... it can be very easy to be pessimistic right now about the state of science, but I also think science has kind of reached an ossification as well, institutionally. We've almost reached a point where the direction of science is almost self-fulfilling. Academics choose what they think is important, and then academics pursue that, and it has become disconnected from societal impact. So I think some of the huge negative changes that are happening in science right now will lay the foundation for, I'm hoping, a revival of science in a more socially aligned direction. I have a lot of hope that some of the stuff that's happening in politics and some of the things that are changing in where people go to train or how people think about institutions... I'm hoping that's going to be sort of like a forest fire, something that then leads to a rebirth of how we consider our directions. So I'm hopeful that we'll have an opportunity here to move science in a way that I don't think was possible two or three years ago because of the entrenchment of the institutions.
But back to publishing. I think publishing is another great example of an unsustainable, ossified thing that we don't know how to break out of. And I want to also start this by saying that I don't think it's just that publishers are bad. I think that's a very simplistic idea of this system. Every day, scientists choose to continue participating in this system. Whenever we look at something, we say, "Oh, was this in Science? Well, then it must be taken seriously." "Oh, was this in Frontiers? It should not be taken seriously." Or, "Oh, was this on bioRxiv? Well, we know bioRxiv is not peer-reviewed, so we don't take it seriously." "But it's published in Cell? Well, then it's going to be taken seriously." So I think we're all active participants in this process. It's not just that publishers are holding us hostage.
But forget publishers or anything like that, we just have a problem with the speed of science. Every country has scaled up scientific researchers. Every country has realized this is an engine for their economy, and so there are just more people doing science every year. Some people like to just shut out that idea. It's very easy to ignore bad things happening in other parts of the world because our brains are wired to focus on our community. Some people just ignore science happening in India or China or in Africa because they just want to say it's all bad or dismiss it all because it's in less good journals, because it helps them focus on what they can comprehend. So I think we have an information overload as well because there is going to be a time at which there are not just 10,000 important researchers in the United States or Europe. There's going to be 100,000 or a million researchers that are all doing important work.
Publishing is, of course, a symptom of that narrow view we have right now of this industrialization of science. It's becoming not a professional guild anymore; it's like a professional, industrialized class of people doing science everywhere. So we see this, there are 250 million papers now. There are 6 to 7 million papers coming out every year. We're at an all-time high for submissions to arXiv and bioRxiv. Nobody can possibly read all the papers that are coming out. And it's not because it's all bad. I think that's just too simplistic. It's because there's a lot of science coming out. So in this blog post that I put together, this essay, I'm arguing that we need to start thinking about the consumers and producers of science as machines because we're reaching the point where human beings are not going to be able to comprehend all the information there, or even want to be. And so this was sort of the impetus of, we have to figure out how to deal with this huge influx of science and also this big tectonic shift in where and how well science is done.
Allison: If we now think about AI as being the main producer but also consumer of scientific research in the next 5 to 10 years, how does the scientific publishing ecosystem have to change to accommodate for that? How can we build for that from the get-go now that we know that it's actually happening?
Andrew: In my essay, I wrote a few ideas of what this looks like, but I have low conviction that I've solved it alone. But I think some of the most important things are latency—the time from someone finishing a project to it being reported out. As you said, in the pandemic, latency was something that we all recognized as a big problem in publishing. And then post-pandemic, the peer-review pool has been eaten up, and it just takes months and months for papers to be published. So I think people are taking peer review less seriously because it's almost become a lagging indicator, and whether a paper is great or not is already decided well before the peer reviews come back because of preprint servers and things like that. So I think latency is one of the most important things we have to change. If we really believe we're accelerating science, we can't have a six-month lag time.
Another thing is that for machines reading papers, the PDF has been a disaster, and I think paywalls have been a disaster for the scientific corpus. Basically, right now language models are trained on the internet because the internet's free and open and you can download the whole thing. Science should have been what we train language models on. It is a bigger, more faithful representation of the world. It represents hundreds of years of research. It's better connected. Every article has clear citations to other articles. But we've lost that. It didn't happen, and that's sort of a fault of ours. But there was that quote, "The best time to plant a tree is 20 years ago. The second best time is now." We need to start changing how we represent science.
In the essay, I talk a little bit about what the key components of a research paper are. I think we need to slim down research papers as well. Again, because of professional standards, articles published in Cell, Science, and Nature are just so much information, and it's all shoved into the supporting information, and the figures are so complicated. It's reaching a point where now the atomic unit of science is not really a discovery. The atomic unit of science is like a five-year project that someone needs to get out the door so that they can go on the faculty market. So that has to change as well. I think we need to have slimmer papers.
Then another thing is that it used to be that journals would publish most any important work, and whether it would go to Science or Physical Chemistry B was a matter of whether it would be interesting for the readers. Now it's become just a prestige signal. And so again, this is something that slows down science, is that we basically have to have this artificial scarcity mindset of articles, and that has to change. It just shouldn't be a part of the scientific publishing process. We shouldn't need to select how important work is when we go to publish it. I think the curation has to be separate.
And peer review as well. How do we fix peer review? It's very broken. I think we just need to separate out the types of peer review. There's peer review for making sure that it's not scummy, spammy work. There's peer review to make sure it's faithfully citing papers and there are not duplicated images and it's a coherent paper. But then there's peer review which is the debate of science. And I think that part of peer review is very important, but it shouldn't be hidden. There's some kind of coupling that happened a long time ago where the gatekeeping to make sure it's actually scientific work has been coupled with the private debate of where the field is. And I think we got to pull that out of peer review. So I think there needs to be some kind of category of publications where people critique each other. That is going to be super hard. I almost feel like scientists are exceedingly polite in public. Of course, some scientists are rude, but I think 99% of scientists are very professional and very polite, and that's become a kind of persona of being a scientist. But then when it comes time for the private critique in the peer-review process, when it's anonymous and it is determining whether a paper gets into a journal, they're not super nice. And that's where the debate happens. So we got to figure out a way to get that out of there, get that back into the public.
I had a lot of ideas in this essay. I guess it's just a mindset shift. We need to recognize that we need to be writing for machines as a big reader, and we need to be understanding that producers of research will be machines soon. And there are some changes that we should make because otherwise the alternative is that people like Future House or people like Google Co-scientist or Anthropic who are building these AI scientists to do research, they'll just publish for themselves. They don't want to mess around with submitting to Cell. We're not going to mess around submitting a bunch of papers to Nature. We can't afford publication fees, and we all wait around, and there are all these forms you have to fill out. It's a lot of work. So, we'll just start building our own parallel corpus of science. And if we don't have this shared resource anymore, this community good, then it'll be a tragedy and a pretty big setback for science.
Allison: It's interesting to zoom out and just look at the broader ecosystem and the effects that individual players have on it. I think another area where that's TBD to some extent is whether the future is mostly going to be a process of centralization of intelligence or of specialization. For example, one could imagine a future in which, if I want to have a scientific question answered, I use Deep Research with GPT-5 or one of the major labs' agents, and that one agent can pretty much answer any question on any topic. Another future is one where you have many different organizations, perhaps Future House and many other specialized entities, producing super-intelligent specialists that are really super-intelligent in one specific scientific domain, and you have more of an ecosystem in which these different super-intelligent specialists collaborate to find the right answer.
I'm really curious to hear if you are thinking about this at all, about how this is all going to play out in the race up to AGI. Do you see yourself as a player in that meta-game? And if so, how do you see Future House contributing to that larger AI and AGI ecosystem?
Andrew: I mean, I think you've hit on the crux here, and I think it comes down to whether you believe that we can solve scientific problems from first principles or if they require some sort of empirical measurement. So there may be some centralization of things like mathematics. There may be just the best system for doing math because math is, to some extent, really just symbolic. Whereas if you go to biology, maybe we build the most intelligent thing ever and it's able to intuit everything, but it still can't tell me what will make the cells in my hand change their phenotype or what will make them grow better, because that thing just can't see the cells in my hand. There has to be some kind of process of measurement. There has to be some kind of empirical measurement.
And so I believe that many scientific problems, many research problems, require some kind of tool or some kind of measuring device. And I think that's what will separate these two hypothesized futures of a centralization of intelligence and a distribution of intelligence: fundamentally, we live in a physical world, and so there's going to be some locality. At the end of the day, there are just some tools that can't be derived from pure symbolic reasoning.
So I think there will always be room, or there always will be a role, where maybe it's a simulation or maybe it's a database lookup. These things are not... ChatGPT-8 is not going to memorize all sequence information from all organisms ever on Earth. Or maybe it does, but then the next day there's a whole bunch of new sequences that come out in the world. So there's always going to be a role for these tools. And these tools are like looking up data, using simulation, and then going to be physical measurements. So I think those things just can't escape the process of discovery. Intelligence can't escape those. So I think there'll always be a role for these decentralized things.
Now, how does Future House fit into this? I don't know. I think of Future House as a research institute. We are trying to solve research problems. So the legacy of Future House is probably going to be demonstrating that this is possible. But Future House is whatever, 25 people, and we don't have an endowment, so I don't know if Future House is going to really be the big player at the billion-dollar scale. I don't know if we'll be the one that solves this or that is a huge player in this, but I think we're definitely showing that this is possible and we're definitely showing how to make progress in this area.
Allison: One normative argument that people sometimes make for having it all be centralized in the big labs is the fact that with the proliferation of scientific knowledge also comes the proliferation of potential knowledge to do harm or to combine some dangerous capabilities into something that can produce a "small-kills-all" risk in bio or cyber or what have you. As more or less believers and defenders of the open-source and open-science community, how do we address these very real risks head-on without being Pollyannaish about them?
Andrew: From a philosophical point of view, concentrating as much power into a few people as a way to prevent some sort of catastrophic event seems like the opposite idea. And so I think I don't want to sound cynical on this, but I think bio has become the mechanism for why this needs to happen. But as far as bio being a risk area, I think biology has by far the highest activation energy for actually causing risk. If you look at some very dark times in history when people tried to cause terrorist attacks or harm, biology has always been entertained by them and then immediately dismissed. In the Japanese terror attacks in the '90s, they tried to use anthrax, and it didn't really work very well, and they realized chemical weapons were a much easier thing to go after. But then it required building a big factory. So they had to find a way to buy a factory. It was actually just a ton of work.
So when I think about the risks of AI right now or in the short term, I think it's got to be more things in cybersecurity. I think that's where there's just no activation energy to use these systems. And what is a great way to protect against cybersecurity risks? I think it's a distribution of intelligence, not a concentration of intelligence. But anyway, I'm not as informed on cybersecurity. But I think to some extent, bio has become like a boogeyman for why open source can't exist. And I think it's a weird boogeyman to use because bio is intrinsically just very hard because of the physical world, not because of any open questions.
To give a specific example, people like to talk about resurrecting smallpox and killing everyone on Earth with it as a biorisk. You can download the sequence for smallpox. It's on the internet. It's been on the internet for many years. The question is not what is the information in it. It's how do you go through the tedious task of actually assembling the genome for the virus and getting it into human cells? It's not really a field which is intelligence-limited.
But now, how do we try to take these risks seriously? Because there are risks, and definitely the risks will grow as intelligence grows. Any sort of plan to control all information in one centralized source is such a weak strategy. I think it has to be something that's...
Allison: It's a honeypot, too.
Andrew: Yeah, it's a honeypot, too. Yeah, I guess so. But I think it has to be something with multiple layers of defense. In bio, people are trying to work on DNA screening technology. We could put more money into that. There's biosurveillance, people working on wastewater surveillance. People who make LLMs try to prevent them from doing dangerous work easily; they put some safeguards on there. There's also professional training. Whenever you have a lab, you have to go through safety training, and there's also some screening of people who can work in the lab with dangerous reagents. There's KYC in ordering reagents. So I think there are lots of boring, dumb ideas that just have to be done and will help protect against this. And I think any sort of grand, single-shot solution that involves concentrating all the intelligence in one area and then monitoring the input and output from the system, I just think is intuitively a very high-risk approach.
Allison: Creating a single point of failure, whether as a honeypot for extortion, for exfiltration of information, for attack, for shutdown, for accidents, and so on. I do think that a decentralization of scientific capabilities also means a decentralization of some of the risks thereof. So one thing the open science community will have to contend with is not just sitting idly by, but creating more open-source intelligence, open-source monitoring, and an open, democratic process for deciding which risks we should be watching out for, so that we can catch bad actors early in this kind of more federated intelligence network. I don't think we can just go along as we please; there are tools that need to be built, but they can be built in the open, I think, by a variety of actors keeping each other in check. And ultimately, I think that intuitively sounds like a safer way to go, but there's stuff to be built for sure.
If you think about perhaps a final question... I know that we're nearing time here. We at Foresight have been giving grants in the AI space now for a few years. We've had an eight-year fellowship where lots of people are using AI for bio, neuro, etc. We give prizes in this area, and overall it seems that the space is very rapidly emerging and really flourishing. There's just a ton of really interesting projects in the AI for science domain out there, and I think we're just getting started. This AI for science revolution, we're on the cusp of it. We've barely gotten going on this. I think it's really exciting if I just look at all the applications that we're getting. I couldn't possibly be more excited about what's ahead.
If you are an early-career scientist thinking about how to use AI to advance your work, what would you advise for someone who's getting their feet wet with some of the AI for science tools? How should they position themselves to really leverage the opportunity that the field represents?
Andrew: That's a great question. I think what you want to do is experiment a lot, but also, you need to really train your ability to critique and be honest with yourself. We've had some people get started in AI for science here at Future House, and these models are getting so good that you can really think you've built something amazing. But you have to keep that very critical scientific eye on what you're making and make sure that you really have discovered something good. So I think making sure you know how to evaluate work in an unbiased way, making sure you know you're ready to be honest and critique what you think you've made—those are important skills to keep.
How do you get started on really doing creative, cool stuff? I think the open-source ecosystem here is very mature at this point. There are a lot of really good projects. Lots of really good scaffolding for putting LLMs together with tools. We have built some stuff at Future House. We have this framework called Aviary, which is a way to put together scientific agents. You can use our platform to do precedent search. So in some ways, a lot of the tools are there, and we're looking for people to try cool ideas.
I think another difference in getting into AI for science versus other domains is that there is this issue of, "Okay, I need to get an OpenAI account or I need to get an Anthropic account to get at the cutting edge right now." And I think open-source inference is actually becoming harder. In the old days, with a relatively small number of GPUs, you could compete with open-weights models. But nowadays, GPUs are becoming more scarce for people in AI for science in an academic setting or even in a research institute. The compute requirements for running these open-source models are actually becoming an issue. And so I think there's also this question of how do you make progress on a light compute budget or using frontier LLMs?
Another thing that is exciting is I think some of the work on non-LLMs is still really cool, like people exploring graph neural networks to describe the relationships between agents or using equivariant neural networks to try to model some of these things at an all-atom level. So I think we don't want to leave behind some of the really cool directions of machine learning and go only for language models.
Anyway, this is not really great advice. I guess it's like we're at the turn of the century, right? When the atomic revolution was happening, people were discovering what is an atom, what is a neutron, discovering radiation and X-rays and relativity. And we're at another one of these crazy inflection points, and nobody knows what's the right answer. So, you know, just be a member of the community to keep pace, have an open mind, but keep that critical scientific eye because I think there are a lot of traps you can fall into these days with AI because it can be so alluring and it does really want to be helpful and harmless, and that can be a trap. But also, get that hacker mindset of exploring stuff quickly and trying it out.
Allison: We are likely about to launch some physical hubs where you have not only physically collocated compute that can be used for AI and science projects but also some grant funding and some kind of community creation in the AI for science world. Because I think often many of these projects are kind of rebels flying in the dark. They're not really quite sure where other related projects are, and they're kind of widely spread out across the world. So, I think that we're on the cusp of creating a community around this stuff, and I really encourage folks to dip their toe in it. It's a really exciting space.
Last question. How can people learn more about Future House or about ongoing projects that you have, how to support, how to get in touch with you guys, what's in the pipeline? Any final words about how to engage with you?
Andrew: Yeah, futurehouse.org. We have our research blog there. We talk about what work we've done. At futurehouse.org, you can get to our platform and use our agents and try out ideas on the Future House platform. We have a link to the docs, and that is like a cookbook of all the tools we've made and how to use them, from open-source things like PaperQA and Aviary to building your own agents to using our platform as an API for your own work. And yeah, I'm on X, I'm on LinkedIn. You can DM me, happy to chat about ideas. We try to build a community around Future House, and we're really excited to see what can be built with it.
I think what you guys are talking about is awesome, too. The physical hubs are great. I think one of the things that's just hard about this is that there are no existing professional communities of AI for science that are really established, right? I don't know what's the right mailing list or what's the right group to talk to. There's no journal for it. So, I think it's great to get more community in this space because there's a lot of fast-moving information and also a lot of siloed information that is hard to get to. So, we're doing what we can to get that out.
Allison: Awesome. Thank you so much for coming on, and I'm excited to hopefully check in with you again in the future when you've published a few more mind-boggling papers. So, thank you so much, and thanks, everyone, for listening. Bye-bye.
Andrew: Thanks for having me. Bye-bye.