Research in the biological sciences has become increasingly data intensive, with vast datasets now routinely generated by genome sequencing, gene expression profiling, advanced microscopy, video imaging, and other high-throughput experimental techniques. These datasets have proven to be valuable in driving major scientific breakthroughs, but researchers face substantial challenges in keeping up with the storage, transfer, and processing of information.
When researchers have to queue up for computational resources on conventional computers or supercomputers, their projects can grind to a halt. The future of biomedical research demands the flexible resources of cloud computing, where the fastest processors are available on demand and at scale. Making this happen requires close collaboration between IT professionals and experimental biologists, all working together to advance scientific discovery.
Recognizing this new reality, in May 2022, the Salk Institute, a leading private, non-profit research institution based in La Jolla, California, launched a pilot program with Google Cloud to optimize the large-scale processing of single-cell epigenomics sequencing data. Starting in the laboratory of Joseph Ecker, a Salk Professor and director of a National Institutes of Health BRAIN Initiative Cell Census Network (BICCN) project, the pilot tested whether and how cloud computing could transform the institute’s computing infrastructure and improve workflows. The goal was to adapt existing data analysis pipelines to the cloud, with an emphasis on training Salk scientists to devise their own cloud computing solutions.
By moving to Google Cloud, Ecker’s team hoped to avoid the bottlenecks they faced when managing big data analysis of individual cells in complex systems. Ecker says that “we generate a lot of ‘omics data–for example, measuring cancer cell traits or neurological disease traits using sequence information from diseased organs versus normal cells. It’s a huge pressure just storing and processing all that data. One experiment might analyze thousands of cells, where each cell has 1-2 million sequencing reads mapping to the whole genome, detecting the status of tens of millions of individual Cytosines (methylated or not). That generates terabytes of data–and then you want to do multiple experiments, obviously. And you need to back it all up and maintain it. We needed better solutions.”
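To make the scale in Ecker's description concrete, here is a rough back-of-the-envelope estimate in Python. It is only a sketch: the read length (150 bases), the storage cost (about two bytes per base, one for the base call and one for its quality score), and the range taken for "thousands of cells" (1,000 to 10,000) are illustrative assumptions, not figures from the Ecker Lab.

```python
def estimate_raw_bytes(cells, reads_per_cell, read_length=150, bytes_per_base=2):
    """Rough uncompressed size of one experiment's raw reads.

    read_length and bytes_per_base are assumed values for illustration only.
    """
    return cells * reads_per_cell * read_length * bytes_per_base


# Lower bound: 1,000 cells at 1 million reads each (assumed range).
low = estimate_raw_bytes(cells=1_000, reads_per_cell=1_000_000)
# Upper bound: 10,000 cells at 2 million reads each (assumed range).
high = estimate_raw_bytes(cells=10_000, reads_per_cell=2_000_000)

print(f"{low / 1e12:.1f} TB to {high / 1e12:.1f} TB per experiment")
```

Under these assumptions a single experiment lands between a few hundred gigabytes and several terabytes before any intermediate files or backups, which is consistent with the terabyte scale Ecker describes.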
For the pilot, the Ecker Lab analyzed individual mouse brain cells to create a complete functional map with markers for DNA. The total raw sequencing data could run to hundreds of terabytes, and they needed to locate and understand each cell in relation to the others. Hanqing Liu, post-doctoral researcher in the Ecker Lab, explains how they automated their workflow to process multiple batch jobs with Google’s Virtual Machines (VMs): “we set up a pipeline to allow our batch jobs to routinely submit to Google Cloud. That gets executed and then automatically moved to long term storage, also in Google Cloud storage.”
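The workflow Liu describes (submit batch jobs, run them, then move the results to long-term storage) can be sketched in a few lines of Python. This is a local simulation, not the Ecker Lab's pipeline and not a Google Cloud API: `run_on_preemptible_vm`, `run_batch`, the `Preempted` exception, and the dictionary standing in for long-term storage are all hypothetical names. The retry loop reflects the general fact that preemptible VMs can be reclaimed mid-job, so interrupted work has to be resubmitted.

```python
import random


class Preempted(Exception):
    """Raised when a simulated preemptible worker is reclaimed mid-job."""


def run_on_preemptible_vm(job_id, rng, preempt_prob=0.3):
    # Stand-in for real work on a VM; fails at random to mimic preemption.
    if rng.random() < preempt_prob:
        raise Preempted(job_id)
    return f"{job_id}.result"


def run_batch(job_ids, archive, max_attempts=5, seed=0):
    """Run each job, resubmitting after preemption, then archive its output."""
    rng = random.Random(seed)
    failed = []
    for job_id in job_ids:
        for _ in range(max_attempts):
            try:
                output = run_on_preemptible_vm(job_id, rng)
            except Preempted:
                continue  # the worker was reclaimed; resubmit the same job
            archive[job_id] = output  # stand-in for a copy to long-term storage
            break
        else:
            failed.append(job_id)  # gave up after max_attempts preemptions
    return failed


job_ids = [f"batch-{i:03d}" for i in range(8)]
archive = {}
failed = run_batch(job_ids, archive)
print(len(archive), "archived;", len(failed), "failed")
```

Every job ends up either archived or reported as failed, so nothing is silently lost when a worker disappears. That property is what lets a pipeline run unattended on cheaper, interruptible machines.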
The result was a resounding success, with the whole mouse brain mapped at the molecular level for the first time. The computing costs also dropped by twenty percent compared to their original estimates. Liu says, “Using preemptible VMs is so efficient. They give us scalability and stability.” Ecker adds that “we were looking for an IT team that would be hands-on in helping us to migrate to the cloud. Now we feel quite comfortable with running the pipelines ourselves.”
Ecker says, “our first goal is to use the mouse brain as a model to really understand the diversity of cells in the brain and how they’re regulated. Once we’ve established tools to do this, we can move to working on primate and human brains.”
Following the pilot project, Ecker and his collaborators landed a five-year $126 million NIH grant to create the first complete epigenome-based molecular map of the human brain, which has 20 times more data than a mouse brain. “We showed that we could manage our own data,” Ecker explains. “The pilot demonstrated that this is now how science gets done.” Gerald Joyce, Senior Vice President and Chief Science Officer at the Salk, points to the big-picture takeaways for the Institute: “it’s taught us that IT and research need to be on equal footing. We learned that we need IT engineers and professionals shoulder to shoulder with our scientists. That means they’re in the labs.”
With the success of the pilot, other labs at the Salk Institute are now working to move their data-intensive projects to Google Cloud. “Salk scientists have been using machine learning to track and analyze vast amounts of movement data from worms, flies, and mice, and these data have led to new insights regarding neural circuits that underlie movement and complex social behaviors,” Joyce reports. “We’re moving toward a 3D view of the cell, at atomic resolution.” Other high-impact projects include Associate Professor Eiman Azim’s research in the biomechanics of human movement and Professor Joanne Chory’s work to map plant growth in response to climate change.
To accelerate these discoveries, the Salk has launched a $23 million, five-year initiative to move the organization onto cloud infrastructure. This will enable the Institute to recruit the best talent, continue attracting significant grant funding, collaborate with global partners through centralized data access, and manage the ever-expanding demands of massive datasets. “The deluge of data is already at hand,” says Joyce. “More data does not automatically translate to more scientific insights, but the ability to mine the data efficiently is a prerequisite for such insights. With a meaningful investment in computing capabilities, Salk researchers will be able to harness these varied datasets to tackle currently unaddressed scientific questions and to open fundamentally new areas of scientific inquiry.”