October 23rd (this past Saturday!) was my 4th Googleversary and we are wrapping an incredible Google Next 2021!
When I started in 2017, we had a dream of making BigQuery an intelligent data warehouse that would power every organization’s data-driven digital transformation.
This year at Next, it was amazing to see Google Cloud’s CEO, Thomas Kurian, kick off his keynote with Walmart’s CTO, Suresh Kumar, talking about how his organization is giving its data the “BigQuery treatment”.
As I recap Next 2021 and reflect on our amazing journey over the past 4 years, I’m so proud of the opportunity I’ve had to work with some of the world’s most innovative companies, from Twitter to Walmart to Home Depot, Snap, PayPal and many others.
So much of what we announced at Next is the result of years of hard work, persistence and commitment to delivering the best analytics experience for customers.
I believe that one of the reasons customers choose Google for data is that we have shown a strong alignment between our strategy and theirs, and that we’ve been relentlessly delivering innovation at the speed they require.
Unified Smart Analytics Platform
Over the past 4 years our focus has been to build the industry’s leading unified smart analytics platform. BigQuery is at the heart of this vision and seamlessly integrates with all our other services. Customers can use BigQuery to query data in BigQuery Storage, Google Cloud Storage, AWS S3, Azure Blob Storage, and various databases like Bigtable, Spanner, Cloud SQL, etc. They can also use any engine, like Spark, Dataflow or Vertex AI, with BigQuery. BigQuery automatically syncs all its metadata with Data Catalog, and users can then run the Data Loss Prevention service to identify sensitive data and tag it. These tags can then be used to create access policies.
In addition to Google services, all our partner products also integrate with BigQuery seamlessly. Key partners highlighted at Next 21 spanned data ingestion (Fivetran, Informatica and Confluent), data preparation (Trifacta, dbt), data governance (Collibra), data science (Databricks, Dataiku) and BI (Tableau, Power BI, Qlik, etc.).
Planet Scale analytics with BigQuery
BigQuery is an amazing platform and over the past 11 years we have continued to innovate in various aspects. Scalability has always been a huge differentiator for BigQuery. BigQuery has many customers with more than 100 petabytes of data and our largest customer is now approaching an exabyte of data. Our large customers have run queries over trillions of rows.
But scale for us is not just about storing or processing a lot of data. Scale is also about how we can reach every organization in the world. This is the reason we launched BigQuery Sandbox, which enables organizations to get started with BigQuery without a credit card. This has enabled us to reach tens of thousands of customers. Additionally, to make it easy to get started with BigQuery, we have built integrations with various Google tools like Firebase, Google Ads, Google Analytics 360, etc.
Finally, to simplify adoption, we now let customers choose whether they would like to pay per query, buy flat-rate subscriptions or buy per-second capacity. With our autoscaling capabilities we can provide customers the best value by combining flat-rate subscription discounts with autoscaling through flex slots.
Intelligent Data Warehouse to empower every data analyst to become a data scientist
BigQuery ML is one of the biggest innovations that we have brought to market over the past few years. Our vision is to make every data analyst a data scientist by democratizing machine learning. 80% of the time is spent moving, prepping and transforming data for the ML platform. This also causes a huge data governance problem, as now every data scientist has a copy of your most valuable data. Our approach was very simple. We asked: “What if we could bring ML to data rather than taking data to an ML engine?”
That is how BigQuery ML was born. Simply write 2 lines of SQL code and create ML models.
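As a rough illustration of what those two lines can look like, here is a minimal sketch that assembles a BigQuery ML `CREATE MODEL` statement as a string. The model, table and label names (`mydataset.delivery_model`, `mydataset.deliveries`, `delivery_minutes`) are hypothetical, and the helper `build_create_model_sql` is invented for this example; only the `CREATE MODEL ... OPTIONS(...) AS SELECT` shape comes from BigQuery ML itself.

```python
def build_create_model_sql(model: str, source_table: str, label: str,
                           model_type: str = "linear_reg") -> str:
    """Return a two-line BigQuery ML CREATE MODEL statement.

    All identifiers passed in are placeholders chosen by the caller.
    """
    # Line 1 declares the model and its options; line 2 selects the training data.
    return (
        f"CREATE OR REPLACE MODEL `{model}` "
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label}'])\n"
        f"AS SELECT * FROM `{source_table}`"
    )


# Hypothetical names, used only for illustration.
sql = build_create_model_sql(
    model="mydataset.delivery_model",
    source_table="mydataset.deliveries",
    label="delivery_minutes",
)
print(sql)
```

The resulting statement could be run from the BigQuery console or submitted through a client library; predictions are then typically obtained with `ML.PREDICT` against the trained model.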
Over the past 4 years we have launched many models like regression, matrix factorization, anomaly detection, time series, XGBoost, DNN, etc. Customers use these models to solve complex business problems simply, ranging from segmentation and recommendations to time series forecasting and package delivery estimation. The service is very popular: 80%+ of our top customers are using BigQuery ML today. When you consider that the average adoption rate of ML/AI is in the low 30% range, 80% is a pretty good result!
We announced tighter integration of BigQuery ML with Vertex AI. Model explainability will make it possible to explain the results of predictive ML classification and regression models by showing how each feature contributes to the predicted result. Users will also be able to manage, compare and deploy BigQuery ML models in Vertex AI, and to use Vertex Pipelines to train BigQuery ML models and run predictions with them.
Real-time streaming analytics with BigQuery
Customer expectations are changing and everyone wants everything in an instant: according to Gartner, by the end of 2024, 75% of enterprises will shift from piloting to operationalizing AI, driving a 5X increase in streaming data and analytics infrastructures.
BigQuery’s storage engine is optimized for real-time streaming. BigQuery supports streaming ingestion of tens of millions of events in real time with no impact on query performance. Additionally, customers can use materialized views and BI Engine (which is now GA) on top of streaming data. We guarantee always fast, always fresh data: our system automatically updates materialized views and BI Engine.
Many customers also use our Pub/Sub service to collect real-time events and process them through Dataflow before ingesting them into BigQuery. This streaming ETL pattern is very popular. Last year, we announced Pub/Sub Lite to provide customers with a 90% lower price point and a TCO that is lower than any DIY Kafka deployment.
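To make the streaming ETL pattern concrete, here is a minimal sketch in plain Python that uses in-memory stand-ins rather than the real services: a list of byte payloads plays the role of the event stream, `to_row` plays the role of the processing step, and a Python list plays the role of the destination table. The function names, the event fields (`user_id`, `amount`) and the batch size are all assumptions made for illustration; a production pipeline would use the actual Pub/Sub, Dataflow and BigQuery APIs instead.

```python
import json
from typing import Iterable, List, Optional


def to_row(payload: bytes) -> Optional[dict]:
    """Parse one raw event and reshape it into a table row.

    Returns None for malformed events, which a real pipeline
    would typically route to a dead-letter destination.
    """
    try:
        event = json.loads(payload)
        return {
            "user_id": str(event["user_id"]),
            "amount_usd": round(float(event["amount"]), 2),
        }
    except (ValueError, KeyError, TypeError):
        return None


def run_pipeline(messages: Iterable[bytes], sink: List[dict],
                 batch_size: int = 500) -> int:
    """Transform events and load them into `sink` in batches.

    Returns the number of events dropped as malformed.
    """
    batch: List[dict] = []
    dropped = 0
    for payload in messages:
        row = to_row(payload)
        if row is None:
            dropped += 1
            continue
        batch.append(row)
        if len(batch) >= batch_size:
            sink.extend(batch)  # stand-in for a batched write to the warehouse
            batch = []
    sink.extend(batch)  # flush whatever is left at the end of the stream
    return dropped


# Hypothetical events, used only for illustration.
messages = [
    b'{"user_id": 1, "amount": "19.999"}',
    b"not json",
    b'{"user_id": 2, "amount": 5}',
]
table: List[dict] = []
dropped = run_pipeline(messages, table, batch_size=2)
print(table, dropped)
```

Keeping the per-event transform (`to_row`) separate from batching and loading mirrors how such pipelines are usually structured, so the transform can be tested on its own.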
We also announced Dataflow Prime, our next-generation platform for Dataflow. Big data processing platforms have so far focused only on horizontal scaling to optimize workloads. But we have seen new patterns and use cases, like streaming AI, where a pipeline may have a few steps that perform data prep followed by a step that runs a GPU-based model. Customers want to use different sizes and shapes of machines to run these pipelines optimally. This is exactly what Dataflow Prime does. It delivers vertical autoscaling with right fitting for your pipelines. We believe this should lower pipeline costs significantly.
With Datastream as our change data capture service (built on Alooma technology), we have solved the last key problem space for customers. We can automatically detect changes in your operational databases like MySQL, Postgres, Oracle, etc. and sync them to BigQuery.
Most importantly, all these products work seamlessly with each other through a set of templates. Our goal is to make this even more seamless over the next year.
Open Data Analytics with BigQuery
Google has always been a big believer in open source initiatives. Our customers love using open source offerings like Spark, Flink, Presto, Airflow, etc. With Dataproc and Composer, our customers have been able to run many of these open source frameworks on GCP and leverage our scale, speed and security. Dataproc is a great service and delivers massive savings to customers moving from on-prem Hadoop environments. But customers want to focus on jobs, not clusters.
That’s why we launched the Dataproc Serverless Spark offering (GA) at Next 2021. This new service adheres to one of the key design principles we started with: make data simple.
Just like with BigQuery, you can simply RUN QUERY. With Spark on Google Cloud, you simply RUN JOB. ZDNet did a great piece on this. I invite you to check it out!
Many of our customers are moving to Kubernetes and want to use it as the platform for Spark. Our upcoming Spark on GKE offering will give them the ability to deploy Spark workloads on existing Kubernetes clusters.
But for me, the most exciting capability we have is the ability to run Spark directly on BigQuery Storage. BigQuery Storage is highly optimized analytical storage. By running Spark directly on it, we again bring compute to data and avoid moving data to compute.
BigSearch to power Log Analytics
We are bringing the power of Search to BigQuery. Customers already ingest massive amounts of log data into BigQuery and perform analytics on it. Our customers have been asking us for better support for native JSON and Search. At Next 21 we announced the upcoming availability of both these capabilities.
Fast cross-column search will provide efficient indexing of structured, semi-structured and unstructured data. User-friendly SQL functions let customers rapidly find data points without having to scan all the text in their table or even know which column the data resides in.
This will be tightly integrated with native JSON, allowing customers to get BigQuery performance and storage optimizations on JSON as well as search on unstructured or constantly changing data structures.
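To illustrate the general idea behind index-backed cross-column search, here is a small sketch of an inverted index in plain Python. It is a conceptual toy under stated assumptions, not a description of how BigQuery implements search: the helper names (`tokenize`, `build_index`, `search`) and the sample log rows are invented for this example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Index = Dict[str, Set[Tuple[int, str]]]


def tokenize(value: object) -> Set[str]:
    """Split any value into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", str(value).lower()))


def build_index(rows: List[dict]) -> Index:
    """Map each token to the (row id, column) pairs where it occurs."""
    index: Index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for column, value in row.items():
            for token in tokenize(value):
                index[token].add((row_id, column))
    return index


def search(index: Index, term: str) -> List[int]:
    """Return ids of rows containing every token of `term` in any column."""
    tokens = tokenize(term)
    if not tokens:
        return []
    hits = None
    for token in tokens:
        rows_for_token = {row_id for row_id, _ in index.get(token, set())}
        hits = rows_for_token if hits is None else hits & rows_for_token
    return sorted(hits)


# Hypothetical log rows mixing plain text and a JSON-like payload.
logs = [
    {"service": "checkout", "message": "Payment timeout for order 1234"},
    {"service": "auth", "message": "login ok", "payload": '{"user": "alice"}'},
    {"service": "timeout-watcher", "message": "heartbeat"},
]
index = build_index(logs)
print(search(index, "timeout"))  # matches rows 0 and 2, in different columns
print(search(index, "alice"))    # matches row 1, inside the JSON-like payload
```

The lookup touches only the index entries for the search tokens, so the caller neither scans every row nor needs to name the column that holds the match.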
Multi & Cross Cloud Analytics
Research on multi-cloud adoption is unequivocal: 92% of businesses in 2021 report having a multi-cloud strategy. We have always believed in giving our customers choice and meeting them where they are. It was clear that our customers wanted us to take gems like BigQuery to other clouds, as their data was distributed across different clouds.
Additionally, it was clear that customers wanted cross-cloud analytics, not multi-cloud solutions that can merely run in different clouds. In short, they want to see all their data through a single pane of glass, perform analysis on top of any data without worrying about where it is located, avoid egress costs and, finally, perform cross-cloud analysis across datasets on different clouds.
With BigQuery Omni, we deliver on this vision with a new way of analyzing data stored in multiple public clouds. Unlike competitors, BigQuery Omni does not create silos across different clouds. BigQuery provides a single control plane that shows analysts all the data they have access to across all clouds. The analyst just writes the query and we send it to the right cloud, whether AWS, Azure or GCP, to execute it locally. Hence no egress costs are incurred.
We announced BigQuery Omni GA for both AWS and Azure at Google Next 21, and I’m really proud of the team for delivering on this vision. Check out Vidya’s session and learn from Johnson & Johnson how they innovate in a multi-cloud world.
Geospatial Analytics with BigQuery and Earth Engine
We have partnered with our Google Geospatial team to deliver GIS functionality inside BigQuery over the years. At Next we announced that customers will be able to integrate Earth Engine with BigQuery, Google Cloud’s ML technologies, and Google Maps Platform.
Think about all the scenarios and use cases your team is going to be able to enable: sustainable sourcing, saving energy or understanding business risks.
We’re integrating the best of Google and Google Cloud together to – again – make it easier to work with data to create a sustainable future for our planet.
BigQuery as a Data Exchange & Sharing Platform
BigQuery was built to be a sharing platform. Today, 3000+ organizations share more than 250 petabytes of data across organizational boundaries. Google also brings more than 150 public datasets that can be used across various use cases. In addition to this, we are bringing some of the most unique datasets, like Google Trends, to BigQuery. This will enable organizations to understand trends in real time and apply them to their business problems.
I am super excited about the Analytics Hub Preview announcement. Analytics Hub will provide the ability for organizations to build private and public analytics exchanges. This will include data, insights, ML Models and visualizations. This is built on top of the industry leading security capabilities of BigQuery.
Breaking Data Silos
Data is distributed across various systems in the organization, and it is critical to make it easy to break down data silos and make all this data accessible to everyone. I’m also particularly excited about the Migration Factory we’re building with Informatica and the work we are doing on data movement and intelligent data wrangling with players like Trifacta and Fivetran, with whom we share over 1,000 customers (and growing!). Additionally, we continue to deliver native Google services to help our customers.
We acquired Cask in 2018 and launched our self-service data integration service, Data Fusion. Data Fusion now allows customers to create complex pipelines with simple drag and drop. This year we focused on unlocking SAP data for our customers. We have launched various SAP connectors and accelerators to achieve this.
At Next we also announced our BigQuery Migration Service in preview. Many of our customers are migrating their legacy data warehouses and data lakes to BigQuery. BigQuery Migration Service provides end-to-end tools to simplify migrations for these customers.
And today, to make migrations to BigQuery easier for even more customers, I am super excited to announce the acquisition of CompilerWorks. CompilerWorks’ Transpiler is designed from the ground up to facilitate SQL migration in the real world and will help our customers accelerate their migrations. It supports migrations from over 10 legacy enterprise data warehouses, and we will be making it available as part of our BigQuery Migration Service in the coming months.
Data Democratization with BigQuery
Over the past 4 years we have focused a lot on making it very easy to derive actionable insights from data in BigQuery. Our priority has been to provide a strong ecosystem of partners that can provide you with great tools to achieve this but also deliver native Google capabilities.
BigQuery + Data Studio are like peanut butter and jelly. They just work well together. We launched BI Engine first with Data Studio and scaled it to all the users. More than 40% of our BigQuery customers use Data Studio. Once we knew BI Engine worked extremely well, we made it an integral part of the BigQuery API and launched it for all our internal and partner BI tools.
We announced GA for BI Engine at Next 2021, but it has already been GA with Data Studio for the past 2 years. We recently moved the Data Studio team back into Google Cloud, making the partnership even stronger. If you have not used Data Studio, I encourage you to take a look and get started for free today!
Connected Sheets for BigQuery is one of my favorite combinations. You can give every business user in your organization the ability to analyze billions of records using the standard Google Sheets experience. I personally use it every day to analyze all our product data.
We acquired Looker in Feb 2020 with a vision of providing our customers with a semantic modeling layer and a governed BI solution. Looker is tightly integrated with BigQuery, including BigQuery ML. Through our latest partnership with Tableau, Tableau customers will soon be able to leverage Looker’s semantic model, enabling new levels of data governance while democratizing access to data.
Finally, I have a dream that one day we will bring Google Assistant to your enterprise data. This is the vision of Data QnA. We are in early innings on this and we will continue to work hard to make this vision a reality.
Intelligent Data Fabric to unify the platform
Another important trend that shaped our market is the Data Mesh. Earlier this year, Starburst invited me to talk about this very topic. We have been working for years on this concept, and although we would love for all data to be neatly organized in one place, we know that our customers’ reality is that it is not (if you want to know more about this, read about my debate on this topic with Fivetran’s George Fraser, a16z’s Martin Casado and Databricks’ Ali Ghodsi).
What I’ve learned from customers over my years in this field is that they don’t just need a data catalog or a set of data quality and governance tools; they need an intelligent data fabric. That is why we created Dataplex, whose general availability we announced at Next.
Dataplex enables customers to centrally manage, monitor, and govern data across data lakes, data warehouses, and data marts, while also ensuring data is securely accessible to a variety of analytics and data science tools. It lets customers organize and manage data in a way that makes sense for their business, without data movement or duplication. It provides logical constructs – lakes, data zones, and assets – which enable customers to abstract away the underlying storage systems to build a foundation for setting policies around data access, security, lifecycle management, and so on. Check out Prajakta Damle’s session and learn from Deutsche Bank how they are thinking about a unified data mesh across distributed data.
Analysts have recognized our momentum and, as I look back at this year, I can’t thank our customers and partners enough for the support they have given my team and me across our large Data Analytics portfolio. In March, Google BigQuery was named a Leader in The Forrester Wave™: Cloud Data Warehouse, Q1 2021. And in June, Dataflow was named a Leader in The Forrester Wave™: Streaming Analytics, Q2 2021 report.
If you want to get a taste for why customers choose us over other hyperscalers or cloud data warehousing, I suggest you watch the Data Journey series we’ve just launched, which documents the stories of organizations modernizing to the cloud with us.
The Google Cloud Data Analytics portfolio has become a leading force in the industry and I couldn’t be more excited to have been part of it. I do miss you, my customers and partners, and I’m frankly bummed that we didn’t get to meet in person like we’ve done so many times before, but this Google Next was extra special, so let’s dive into the product innovations and their themes.
I hope that I will get to see you in person next time we run Google Next!