There’s a lot we can learn from combining technology with science to help support the development of amazing discoveries. By using an AI system to predict protein shapes, we have the potential to accelerate research in every field of biology. Inside every cell in your body, billions of tiny molecular machines are hard at work. They are what allow your eyes to detect light, your neurons to fire, and the ‘instructions’ in your DNA to be read.
These intricate machines are known as proteins.
The protein folding puzzle
Protein folding is something that occurs naturally so that proteins become biologically functional, but it’s a complex process that sometimes fails. For decades, scientists have been trying to find a method to reliably predict a protein’s structure from its sequence of amino acids so we can better understand how proteins work.
The challenge? There are over 200 million known distinct proteins. Each one has a unique 3D shape that determines how it works and what it does. Because there are so many sequences, and determining their 3D structures experimentally is so time-consuming and expensive, scientists know the exact structure of only a tiny fraction of these proteins. And these experimental methods still fall far short of reliable statistical accuracy.
DeepMind’s gigantic leap
In 2020, Alphabet’s artificial intelligence research arm, DeepMind, made a massive breakthrough in predicting protein structures using a deep learning model called AlphaFold.
AlphaFold was trained on publicly available data consisting of about 170,000 protein structures, and it is the first computational method that can regularly predict the 3D shape of a protein at scale and with a high degree of accuracy.
AlphaFold has already made waves throughout the scientific community and has demonstrated the potential for AI to aid fundamental scientific discovery. Recently, DeepMind made AlphaFold predictions openly available to anyone. To date, more than 500,000 researchers from 190 countries have accessed the AlphaFold protein structure database to get closer to finding life-saving cures for diseases like leishmaniasis and Chagas disease.
And now DeepMind has expanded the set of available predictions by more than 200 times (from nearly 1 million to nearly 214 million) to cover almost all cataloged proteins found in nature.
Open source predictions available on Google Cloud
Together, Google Cloud and DeepMind have released this dataset of predicted protein structures for plants, bacteria, animals, and other organisms as part of the Google Cloud Public Dataset program to enable bulk downloads at no cost. That means you can also create custom queries of the dataset using BigQuery!
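To make the BigQuery option concrete, here is a minimal sketch of a custom query against the dataset's metadata. The table ID and the column names are assumptions that this post does not confirm, so check them in the BigQuery console first. The `build_organism_query` and `run_query` helpers are hypothetical names used only for illustration, and `run_query` needs the google-cloud-bigquery package plus Google Cloud credentials.

```python
# Assumed table ID for the AlphaFold metadata in the BigQuery public datasets
# project; verify it in the BigQuery console before relying on it.
METADATA_TABLE = "bigquery-public-data.deepmind_alphafold.metadata"


def build_organism_query(limit: int = 10) -> str:
    """Build a parameterized SQL query for predictions from one organism.

    The column names (uniprotAccession, organismScientificName) are
    assumptions about the table schema.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT uniprotAccession, organismScientificName\n"
        f"FROM `{METADATA_TABLE}`\n"
        "WHERE organismScientificName = @organism\n"
        f"LIMIT {int(limit)}"
    )


def run_query(organism: str, limit: int = 10) -> list:
    """Run the query with the BigQuery client (requires credentials)."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # Passing the organism as a parameter avoids string interpolation.
            bigquery.ScalarQueryParameter("organism", "STRING", organism)
        ]
    )
    rows = client.query(build_organism_query(limit), job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

With credentials configured, `run_query("Homo sapiens", limit=5)` would return up to five matching rows as dictionaries, assuming the schema matches the guesses above.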
Running AlphaFold on Google Cloud Vertex AI
Let’s say you want to run AlphaFold on your own in order to get protein structure predictions against your own set of data. There are a few challenges to keep in mind:
1. Setting up feature engineering against genetic sequence databases
2. Preprocessing data
3. Running those inputs against pre-trained models
All of this requires allocating CPUs or GPUs, hosting a notebook environment, and scaling up for larger experiments. It’s hard to build and configure an on-premises system or cloud server to use AlphaFold, whether you just want to try it out or run it at scale as a large organization.
That’s why we’re excited to share a deep integration between Google Cloud and DeepMind. On top of the Public Datasets program, we have created end-to-end code samples for AlphaFold on Vertex AI, a managed end-to-end ML platform, to help address these challenges and speed up deployment. With AlphaFold on Vertex AI, you can manage a data science or machine learning workflow in a single development environment. You get access to pre-configured compute, storage, and end-to-end production notebooks. We have removed the heavy lifting needed to set up new ML environments, automate orchestration, and manage large clusters.
The AlphaFold inference workflow can be simplified with Vertex AI, from data preparation to feature engineering and deployment. Unlike a manual setup, the orchestrator makes it possible to parallelize steps, get predictions faster, and track runs more reliably.
Try it out first using Vertex AI Workbench
For those of you who want to try out a simplified version of AlphaFold, we have a Colab notebook that uses no templates (homologous structures) and a selected portion of the BFD database. You can deploy right on Vertex AI Workbench, which lets you specify a custom container image that we’ve already created for you. You’ll be able to:
1. Configure access to genetic databases
2. Configure GPU acceleration
3. Search against genetic databases
4. Use the pre-processed results as inputs to the AlphaFold model locally
In a little over an hour you can harness the power of AlphaFold to generate 3D protein structures from amino acid sequences.
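Since the starting point is an amino acid sequence, a small input check can catch typos before you spend an hour of compute. The sketch below assumes the sequence is supplied in FASTA format, which is what DeepMind's open source inference script accepts, and that only the 20 standard one-letter residue codes are allowed. The `clean_sequence` and `to_fasta` helpers are hypothetical and not part of AlphaFold or the Vertex AI samples.

```python
# One-letter codes for the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Remove whitespace, upper-case, and reject non-standard residue codes."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("sequence is empty")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"unsupported residue codes: {', '.join(invalid)}")
    return sequence


def to_fasta(name: str, raw_sequence: str, width: int = 60) -> str:
    """Format a sequence as a single FASTA record wrapped at `width` columns."""
    sequence = clean_sequence(raw_sequence)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"
```

Writing the string returned by `to_fasta` to a `.fasta` file gives you an input you can reuse across experiments.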
Run hundreds of experiments reliably using Vertex AI Pipelines
Organizations that want to run the full-blown version of AlphaFold for many protein folding experiments a week will want an ML pipeline orchestrator. The AlphaFold Batch Inference solution is a set of code samples that uses Vertex AI Pipelines to support hundreds of concurrent inference pipelines, with higher throughput, so you can run experiments at scale. The solution uses Vertex AI Pipelines as the orchestrator and runtime, Vertex ML Metadata for metadata and artifacts, and Cloud Filestore to manage the databases.
Because it’s built on Vertex AI Pipelines, you can automate, monitor, and experiment with interdependent parts of an ML workflow. Shorter inference times mean that what would normally take days can now take hours.
The solution includes two example pipelines:
1. The universal pipeline solution mirrors the exact logic in DeepMind’s open source inference script but decouples it into discrete tasks, so you can run the same experiments faster, more efficiently, and with better tracking.
2. The customized pipeline solution shows you how to further optimize the inference workflow by parallelizing feature engineering steps, and how to plug in your own database sources.
You get example components, pipelines, and notebooks to start, analyze, and recompile pipelines on different GPUs.
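To illustrate the idea behind parallelizing feature engineering, here is a minimal sketch that runs independent database searches concurrently instead of one after another. It uses Python's standard `concurrent.futures` module as a stand-in and is not the Vertex AI Pipelines code from the solution. The `search_database` function is a hypothetical placeholder for a real search step, and the database names in the usage note are only illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def search_database(database: str, sequence: str) -> str:
    """Hypothetical placeholder for one genetic database search step."""
    return f"{database}:{len(sequence)} residues searched"


def run_searches_in_parallel(
    sequence: str,
    databases: List[str],
    search: Callable[[str, str], str] = search_database,
) -> Dict[str, str]:
    """Run one search per database concurrently and collect the results."""
    # Each search is independent, so none has to wait for another to finish.
    with ThreadPoolExecutor(max_workers=max(1, len(databases))) as pool:
        futures = {db: pool.submit(search, db, sequence) for db in databases}
        # Results are gathered in the order the databases were listed.
        return {db: future.result() for db, future in futures.items()}
```

Calling `run_searches_in_parallel("MKTAYIAK", ["uniref90", "mgnify", "bfd"])` returns one result per database. In a pipeline orchestrator, each search would instead be its own task with its own compute, which is what lets the steps overlap.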
The AlphaFold Vertex AI Workbench solution is great for experimental use, while the AlphaFold Batch Inference solution on Vertex AI Pipelines is great for doing protein folding at scale with a strong process for reproducibility and tracking.
Now go forth and save the world!
Okay, maybe that’s a bit hyperbolic, but this is inspiring stuff! From a 50-year challenge, to the arrival of AlphaFold, to being able to run it on Google Cloud: researchers, developers, and science enthusiasts now have access to one of the most pivotal advancements in the medical world. Even a non-specialist can use a Vertex AI notebook to try a simplified version of AlphaFold. The next answers to the mysteries of life and the discovery of new disease treatments have never felt more attainable. With these no-cost solutions to run AlphaFold on Vertex AI and the Public Dataset, you can help propel us forward in this worldwide endeavor.
Learn more about healthcare and life sciences solutions on Google Cloud.
If you have feedback or want to share your experience with me, reach out to me at @stephr_wong.