Building a high throughput ETL for Google Cloud’s public datasets platform

Editor’s note: This is the third of a three-part series from the BCW Group, a Web3 venture studio and enterprise consulting firm serving enterprise clients who want to integrate existing products or otherwise branch into the Web3 space. Today’s blog discusses the frameworks used for creating one-click, end-to-end blockchain deployments into Google Cloud. You can read the first post here and the second post here.

Google Cloud’s BigQuery public datasets offer a window into the world of data-driven insights and analysis. These publicly available datasets span a diverse array of domains, including economics, genomics, blockchain, and more. Researchers, analysts, and curious minds can dig into these datasets with the massive processing power of BigQuery. Whether unraveling complex trends in global trade, unlocking the mysteries of the human genome, or visualizing geographical patterns, the BigQuery public datasets empower users to find the story in the data.

BCW Technologies is working to create a framework that will help organizations and individuals access existing extract-transform-load (ETL) codebases and the Google Cloud public datasets platform (PDP) so they can build a one-click, end-to-end deployment into Google Cloud. We will be working primarily on the blockchain side of the PDP. These datasets can support many data strategies, including academic inquiry, supplying historical chain data to users, and building out a comprehensive Big Data as a Service (BDaaS) platform.

Blockchains such as Ethereum or Bitcoin already generate large amounts of data. However, for high-throughput chains such as Solana, building an ETL is not enough: developers must be cognizant of the entire pipeline. Because every state assists in validating all subsequent states, a comprehensive dataset with nothing trimmed away is needed. Developers must consider every aspect of the ETL process, calibrate their own deployments to keep pace with these networks, and design a system that can be deployed and integrated by a wide array of participants.

The Blockchain ETL platform is designed to be a modular framework. It extracts blockchain data from a data source (such as an RPC node), then packages it using the efficient Protocol Buffers format before sending it to a data loader. The data loader first transforms the data to match the storage schema before performing the insertion. In this case, the storage is Google Cloud BigQuery, which comfortably handles the many terabytes of blockchain data. Given how much data moves through modern blockchains, BigQuery’s high-volume insertion support and scalable queries make it well suited to warehousing this data. This, in turn, makes the public dataset useful for use cases such as data science, comparison with other blockchains, and indexing historical on-chain actions.
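To make the handoff concrete, the sketch below shows how an extracted record might be packaged in the Protocol Buffers wire format before being sent to the loader. It is a minimal illustration assuming the prost crate; the BlockRecord message and its fields are hypothetical and do not reflect the actual PDP schema.

```rust
use prost::Message; // assumes the `prost` crate for Protocol Buffers in Rust

// Hypothetical message describing one extracted block; the real schema differs.
#[derive(Clone, PartialEq, Message)]
pub struct BlockRecord {
    #[prost(uint64, tag = "1")]
    pub slot: u64,
    #[prost(string, tag = "2")]
    pub blockhash: String,
    #[prost(int64, tag = "3")]
    pub block_time: i64,
}

fn main() {
    let record = BlockRecord {
        slot: 1,
        blockhash: "example-hash".into(),
        block_time: 0,
    };

    // Encode to the compact Protocol Buffers wire format before handing the
    // bytes off to the data loader for transformation and insertion.
    let mut buf = Vec::new();
    record.encode(&mut buf).expect("encoding failed");
    println!("encoded {} bytes", buf.len());
}
```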

The Solana blockchain has one of the shortest block times, producing new blocks and state changes far more frequently than other chains. This presents unique challenges: for Solana, as for any blockchain network built on a rapid consensus mechanism, the data that is generated can quickly become unmanageable on the decentralized ledger where it is recorded.

BCWT seeks to build out an end-to-end, one-click solution to the rapidly mounting volume of data that highly performant chains generate. Authoring configurations and submodules as an infrastructure-as-code deployment is important to the mission of helping democratize the PDP dataset (public is in the name, after all). By defining configurations that are compatible with both BCWT and community-authored ETLs, we seek to build out a solution that leverages existing codebases alongside our performance-oriented architecture.

The Solana ETL system consists of three primary parts with accompanying modules: a non-voting validator that keeps the state up to date, the transformer, and a Google Cloud BigQuery instance. The validator connects to the Solana network and follows up to three days’ worth of full state data (account snapshots included). Next, the transformer acts as a middleware between the data source and storage, and is deployed as a separate Compute Engine instance. Finally, Google Cloud BigQuery accepts the structured data.
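As a rough illustration of the transformer’s role, the sketch below shows the kind of interface such a middleware stage might expose: it maps records from the extraction format onto rows shaped like the BigQuery tables before insertion. The trait and type names are hypothetical, not BCWT’s actual code.

```rust
use serde::Serialize;

/// Raw block data as received from the extractor (illustrative fields only).
pub struct RawBlock {
    pub slot: u64,
    pub blockhash: String,
}

/// A row shaped to match a hypothetical BigQuery `blocks` table schema.
#[derive(Serialize)]
pub struct BlockRow {
    pub slot: u64,
    pub blockhash: String,
}

/// The transformer maps extracted records onto storage-schema rows.
pub trait Transform {
    type Input;
    type Output: Serialize;
    fn transform(&self, input: Self::Input) -> Self::Output;
}

pub struct BlockTransformer;

impl Transform for BlockTransformer {
    type Input = RawBlock;
    type Output = BlockRow;

    fn transform(&self, input: RawBlock) -> BlockRow {
        BlockRow { slot: input.slot, blockhash: input.blockhash }
    }
}
```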

BigQuery offers a tremendous amount of power for both querying and inserting data, but there is also an opportunity to be thoughtful in design to help control cost. By defining efficient partitioning routines and queries, selecting an appropriate egress point, and defining tables that map to your sources, you can keep costs under control. We are currently testing and authoring a collection of Solana-flavored repositories, which can be seen here.

Creating an ETL for data at this scale requires a highly performant programming language that can handle the immense data churn. In our case, we chose the highly performant systems language Rust. Rust is also the implementation language of Solana, and it offers speed, memory, and stability benefits resulting from its unique memory management, compilation optimizations, and zero-cost abstractions. Additionally, Rust’s support for conditional compilation has enabled us to design our codebase with modular blockchain support, such that we can add future blockchains in a plugin design. With each of these future blockchains, we plan to support the open source community with publicly available Rust packages that give other developers direct library access.
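The sketch below illustrates how conditional compilation can support this plugin-style design: each chain lives behind its own Cargo feature, so a deployment compiles only the chains it needs. The feature names and module layout are assumptions for illustration, not the project’s actual crate structure.

```rust
// Cargo.toml (illustrative):
// [features]
// default = ["solana"]
// solana = []
// ethereum = []

#[cfg(feature = "solana")]
pub mod solana {
    /// Extraction entry point compiled only when the `solana` feature is enabled.
    pub fn extract() {
        println!("extracting Solana blocks, accounts, and token mints");
    }
}

#[cfg(feature = "ethereum")]
pub mod ethereum {
    /// A future chain can be added behind its own feature flag, plugin-style.
    pub fn extract() {
        println!("extracting Ethereum blocks and transactions");
    }
}

fn main() {
    #[cfg(feature = "solana")]
    solana::extract();

    #[cfg(feature = "ethereum")]
    ethereum::extract();
}
```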

One of our goals with Rust is to keep our code within community expectations to encourage open source collaboration. For example, we use well-known Rust libraries, including clap for our command-line interface, reqwest for making HTTP requests, Serde for deserializing JSON RPC responses, and Tokio to parallelize our extraction process. We also attempt to use functional programming as much as possible to accommodate data engineers interested in the project.
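As a flavor of how those libraries fit together, here is a minimal, self-contained sketch of an extraction entry point: clap parses the CLI, reqwest issues a JSON RPC call, Serde deserializes the response, and Tokio drives the async runtime. The flags and the choice of the getSlot method are illustrative assumptions rather than the project’s actual interface.

```rust
use clap::Parser;
use serde::Deserialize;

/// Command-line options parsed with clap.
#[derive(Parser)]
struct Args {
    /// Solana JSON RPC endpoint to extract from.
    #[arg(long, default_value = "https://api.mainnet-beta.solana.com")]
    rpc_url: String,
}

/// Subset of the `getSlot` JSON RPC response, deserialized with Serde.
#[derive(Deserialize)]
struct SlotResponse {
    result: u64,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args = Args::parse();

    // reqwest issues the JSON RPC request; Tokio drives the async runtime.
    let client = reqwest::Client::new();
    let resp: SlotResponse = client
        .post(&args.rpc_url)
        .json(&serde_json::json!({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "getSlot"
        }))
        .send()
        .await?
        .json()
        .await?;

    println!("latest slot: {}", resp.result);
    Ok(())
}
```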

Upon deployment and initialization, the Solana ETL will first begin extracting data from Solana Labs’ instance of Google Bigtable (BT). It will also spin up a Solana node and then begin working up from the genesis block, filling a designated BQ instance. Three data requests are necessary to extract data for the pipeline: (1) blocks, (2) accounts, and (3) token mints. The block data response includes all of the data necessary to populate the transactions, instructions, blocks, block rewards, and token transfers tables. However, Solana Labs’ BT instance only provides the block data, so an RPC node is required for the account and token data. The order of the data requests is also important: account and token data is only requested when a transaction in the block creates an account or token mint, so the block data must be parsed before these final requests can be made.
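A simplified sketch of that ordering is shown below: the block is fetched and parsed first, and account and token-mint requests are issued only for items that the block’s transactions created. Every type and function here is a hypothetical placeholder for the real extraction code.

```rust
// Hypothetical parsed block, listing what its transactions created.
struct Block {
    created_accounts: Vec<String>,
    created_token_mints: Vec<String>,
}

fn fetch_block(_slot: u64) -> Block {
    // (1) blocks: pulled from Solana Labs' Bigtable instance.
    Block { created_accounts: vec![], created_token_mints: vec![] }
}

fn fetch_account(_pubkey: &str) { /* (2) accounts: requested from an RPC node */ }
fn fetch_token_mint(_mint: &str) { /* (3) token mints: requested from an RPC node */ }

fn extract_slot(slot: u64) {
    // Block data must be parsed first...
    let block = fetch_block(slot);

    // ...because only then do we know which accounts and token mints were
    // created, and therefore which follow-up RPC requests to make.
    for pubkey in &block.created_accounts {
        fetch_account(pubkey);
    }
    for mint in &block.created_token_mints {
        fetch_token_mint(mint);
    }
}

fn main() {
    extract_slot(0); // start from the genesis block and work upward
}
```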

Initially, there will be a testing period, where Google Cloud will host a test dataset for selected Solana ecosystem participants and companies to engage with. Next, the overall apparatus will be deployed into the PDP organization and the dataset will be made accessible to all ecosystem participants. Finally, all code and deployment repositories will be added to the Blockchain-ETL repositories for access by developers. These repositories will also include accompanying Helm charts, IaC configurations, and playbooks for rapid deployment into a GCP account.

While cloud technologies are often considered a centralizing force in decentralized technologies, accessing the full scope of data generated by Solana would be unrealistic for many. Archive nodes that handle the full scope of all Solana blocks, accounts, and transactions are hard to find and access, and including Geyser-enabled state snapshots is a very heavy lift for any one team or individual. By opening up access to a public dataset on highly performant infrastructure, Google Cloud helps to solve these challenges and makes Web3 more accessible to the traditional Web2 developers, companies, and businesses that are already using GCP every day. Google Cloud’s commitment to opening data access fosters innovation and discovery, inviting users worldwide to embark on exciting journeys of exploration within the boundless realm of data science.
