When we, the Nest team, set out to build the next-generation platform to power millions of Nest cameras, it was essential to fulfill our video storage needs at large scale with reliability at the forefront. To accomplish this, we needed to choose the right database. This article goes into detail about what our requirements were, why we picked Cloud Spanner, what some of the challenges were during the migration, and some of the operational tasks we undertook after migrating to Spanner.
Here are some of the factors we considered when choosing the next database for Nest cameras and why we selected Spanner. Specifically, we were looking for the following customer benefits:
Minimal downtime – we know that cameras are used for security purposes, so minimizing downtime was key. This requirement meant choosing a database with no maintenance windows and minimal downtime.
Lower total cost of ownership – to create new experiences for our customers and focus on developing better camera software rather than maintaining a database, we required a fully managed service.
Better end user experience – users expect a consistent and predictable experience across their security devices and need to trust that these devices will perform as expected. We wanted to ensure those requests would have consistent round trip time to provide predictability.
With a growing fleet of Nest cameras, we need to build for today and prepare for tomorrow. Switching storage databases represents thousands of hours of engineering effort, and can increase the risk of downtime for our customers during the change. We needed a system that could grow with our customers. Spanner offers unlimited scaling, and resizing takes just one click. This database works well for us because we can now scale as we grow without worrying about outgrowing the capacity constraints of our database.
Nest cameras support uploading and watching recorded videos. For our customers to get updates, performance and consistency are key considerations. During our benchmarks, there was little change to Spanner’s performance as we scaled our workload and instance size.
Originally, the Nest camera services were built on another cloud provider’s distributed data store. While performance was typically adequate, it lacked relational integrity, which placed a heavy burden on the application to handle cases where we needed to perform coordinated updates. This challenge required significant special error handling and reconciliation jobs. Since Spanner is a relational database, it handles transaction processing for us in full. This capability allowed the engineering team to focus solely on the higher-level application logic.
Nest cameras need to have high uptime, so our underlying services need to be constantly up as well. Maintenance windows, even when planned, create significant additional work for the engineering team. For example, running multiple databases via different services or in different regions means that we are constantly relying on a different service to accommodate variations in maintenance downtime. Spanner is one of the few services with 99.99% uptime for regional configurations and 99.999% uptime for multi-region configurations, with no planned maintenance windows.
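To put those two uptime figures in perspective, the arithmetic below converts an availability percentage into a yearly downtime budget. It is a rough illustration only: the function name is ours, and it assumes a 365-day year with downtime counted in whole-service minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


regional = downtime_budget_minutes(99.99)
multi_region = downtime_budget_minutes(99.999)

print(f"99.99%  allows about {regional:.1f} minutes of downtime per year")
print(f"99.999% allows about {multi_region:.1f} minutes of downtime per year")
```

Moving from four nines to five nines shrinks the yearly budget by a factor of ten, from just under an hour to a little over five minutes.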
The Nest engineering team is a distributed team with a distributed set of programming languages. Spanner has client library support in 8 languages in addition to REST and RPC, covering most of the languages used within the team. In the future, should we implement new applications in a different language, we will be able to leverage the client library in that language.
When planning for a migration, we often think about how to migrate the data, what changes to make on the application, which integrations to reconfigure, etc. Here are some additional areas that we considered to ensure a smooth migration.
Subtle SQL dialect differences can sometimes be hard to detect, especially when they are not explicitly defined in the SQL statements. For example, when it comes to sorting, Spanner sorts NULLs and NaNs first whereas PostgreSQL sorts them last. ‘^’ means bitwise XOR in Spanner whereas it means exponentiation in PostgreSQL. We followed the syntax guide to ensure that our queries did exactly what we intended.
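The two differences above can be modeled outside of any database. The Python sketch below is only an illustration of the semantics, not a query against Spanner or PostgreSQL: the helper names are hypothetical, it covers an ascending sort with NULLs only (NaNs are left out), and it leans on the fact that Python's own `^` is bitwise XOR while `**` is exponentiation.

```python
def sort_nulls_first(values):
    """Ascending sort with None first, as the text describes for Spanner."""
    # False sorts before True, so None entries (v is not None == False) lead.
    return sorted(values, key=lambda v: (v is not None, 0 if v is None else v))


def sort_nulls_last(values):
    """Ascending sort with None last, as the text describes for PostgreSQL."""
    return sorted(values, key=lambda v: (v is None, 0 if v is None else v))


rows = [3, None, 1, 2]
spanner_like = sort_nulls_first(rows)
postgres_like = sort_nulls_last(rows)

# The same expression "2 ^ 3" means different things in the two dialects.
xor_result = 2 ^ 3     # bitwise XOR, the Spanner meaning
power_result = 2 ** 3  # exponentiation, the PostgreSQL meaning

print(spanner_like, postgres_like, xor_result, power_result)
```

A query that paginates on a nullable column, or that uses `^` in a computed value, would silently return different results after a migration, which is why these cases are worth checking explicitly.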
To make this even easier, Spanner now offers a preview for PostgreSQL dialect support. Read more here.
We also had to consider abnormal latencies during our migration. We defined latency targets and referred to Cloud Monitoring to ensure that our latencies stayed below our threshold. When we suspected there were hotspots, we used the Key Visualizer to identify them and reworked our schema following Spanner’s best practices.
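One common cause of hotspots is a primary key whose leading part increases monotonically, such as a timestamp, so that every new write lands on the same range of keys. The sketch below shows one general way to spread such writes by adding a hash-derived shard prefix. It is an assumption-laden illustration, not the schema we used: `NUM_SHARDS`, `shard_for`, `make_row_key`, and the `camera_id` and `timestamp_micros` fields are all hypothetical.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count; tune for the workload


def shard_for(camera_id: str) -> int:
    """Map a camera ID to a stable shard number in [0, NUM_SHARDS)."""
    digest = hashlib.sha256(camera_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def make_row_key(camera_id: str, timestamp_micros: int) -> tuple:
    """Build a composite key whose first component is not monotonically increasing."""
    return (shard_for(camera_id), camera_id, timestamp_micros)


key_a = make_row_key("camera-123", 1_700_000_000_000_000)
key_b = make_row_key("camera-123", 1_700_000_000_000_001)
print(key_a, key_b)
```

Rows for one camera stay adjacent and ordered by time, while writes from different cameras spread across the key space. The trade-off is that a scan across all cameras has to fan out over every shard.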
There are some concepts, like Sessions, which are unique to Spanner. The client libraries that Spanner provides handle Session management transparently, whereas we would have needed to implement Session handling ourselves had we chosen to use the RPC or REST APIs. We used the Java client library, and that made our migration much simpler. We tuned various configurations for the client in our staging environment before settling on a set of parameters that worked for us.
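To give a sense of what "Session handling" involves when no client library does it for you, here is a toy fixed-size pool. It is a conceptual sketch only and does not use or mirror the Spanner client library API: `SessionPool`, `create_session`, and the fake session strings are hypothetical stand-ins, and a real implementation would also need to create sessions through the API, keep them alive, and replace ones that expire.

```python
import itertools
import queue
from contextlib import contextmanager


class SessionPool:
    """A minimal fixed-size pool that lends out reusable sessions."""

    def __init__(self, create_session, size: int):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(create_session())

    @contextmanager
    def session(self, timeout: float = 5.0):
        # Blocks until a session is free; raises queue.Empty after the timeout.
        s = self._idle.get(timeout=timeout)
        try:
            yield s
        finally:
            # Always return the session, even if the caller's work failed.
            self._idle.put(s)

    def idle_count(self) -> int:
        return self._idle.qsize()


# Hypothetical factory standing in for a "create session" API call.
_counter = itertools.count(1)
pool = SessionPool(lambda: f"session-{next(_counter)}", size=2)

with pool.session() as s:
    in_use_idle = pool.idle_count()
    borrowed = s
after_idle = pool.idle_count()
print(borrowed, in_use_idle, after_idle)
```

Pool size and checkout timeout are the kind of parameters that are worth tuning in a staging environment before settling on production values.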
After Spanner was in production, we set up alerting through Cloud Monitoring and regularly monitored our usage metrics (specifically CPU utilization and P99 latencies). We also resized our instance to ensure we were using just the right amount of resources.
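The two metrics named above lend themselves to simple calculations. The sketch below computes a nearest-rank P99 from a list of latency samples and estimates an instance size from observed CPU utilization. Both helpers are hypothetical, the target utilization is a placeholder rather than a recommendation, and the sizing estimate assumes that CPU load scales linearly with node count, which real workloads may not do.

```python
import math


def p99(latencies_ms):
    """Return the nearest-rank 99th percentile of a list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def nodes_for_target(current_nodes: int, current_cpu_percent: float,
                     target_cpu_percent: float) -> int:
    """Estimate the node count that brings CPU utilization to the target."""
    needed = current_nodes * current_cpu_percent / target_cpu_percent
    return max(1, math.ceil(needed))


samples = list(range(1, 101))  # synthetic latencies of 1..100 ms
print(p99(samples))
print(nodes_for_target(4, 80, 65))  # running hot: scale up
print(nodes_for_target(4, 30, 65))  # underused: scale down
```

In practice, an alert would fire when either value crosses a threshold for a sustained period, and any resize would be checked against the latency metric afterward.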
Choosing a database is an important decision for any large-scale application. We gave careful consideration to scalability and reliability challenges in order to ensure that Nest’s customers get a software service that is as reliable as their hardware devices. We have solved many deep technical challenges and implemented many optimizations to bring the fastest and most reliable performance to our customers. We hope this article gives you a glimpse into some of the key decisions we’ve made as part of this journey. To learn more about Nest cameras, take a look at our latest devices. You can also learn more about using Spanner in your organization.