At Nylas, we’ve been fortunate to have organizations quickly embrace our mission to make the world more productive. We provide a platform of communications APIs for business productivity, which collects data from multiple applications, compiles those insights, and creates end-to-end automation workflows. Our growth has been phenomenal, posting revenue growth of 5X over the last two years as we deliver solutions that enable developers to overcome challenges associated with building products to increase people’s productivity. Through our scalable and secure infrastructure, we enable organizations to better handle the communication data underpinning business process automation.
Along with our rapid growth, we’ve seen large enterprises, particularly those within highly regulated industries, come to us with more complex technology requirements. For example, they need us to scale our services up and down as rapidly and cost-effectively as possible, shifting from millions to billions of transactions at any time.
This past year, the Nylas team took on the challenge of reinventing our architecture to provide enterprise customers with a bi-directional universal email sync, security compliance with the highest enterprise standards, and industry-specific machine learning services.
We wanted to find a way to scale Nylas’ existing architecture to better solve for our enterprise customers’ needs. Here, we’ll dive into how, in only three months, we rebuilt our entire architecture from the ground up and cut our P90 latency in half – all while scaling to support billions of transactions weekly as needed.
Improving Sync Speed, Reliability, and Latency with GKE
The Nylas platform’s legacy architecture was built with Python Flask on top of another cloud service, which had its own set of issues and was costly to scale. It was costly because Nylas processes over 20 TB of data daily, with a high volume of reads and writes hitting hundreds of MySQL shards.
The workload was partitioned by email account, with each account statically assigned to a part of the farm. Because email volume was unpredictable for each account, many hotspots emerged throughout the day, causing high variability in latency and, in severe cases, reliability issues.
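To make the hotspot problem concrete, here is a minimal Go sketch of static partitioning. It is an illustration only, not the legacy Nylas code: the hash-based mapping, the function names `staticWorker` and `loadPerWorker`, and the per-account volumes are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// staticWorker maps an account to a worker by hashing its ID.
// Hashing is an assumed scheme; the point is that the mapping never
// changes, no matter how busy the account becomes.
func staticWorker(accountID string, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return int(h.Sum32() % uint32(workers))
}

// loadPerWorker sums message volume per worker under static assignment.
func loadPerWorker(volume map[string]int, workers int) []int {
	load := make([]int, workers)
	for id, v := range volume {
		load[staticWorker(id, workers)] += v
	}
	return load
}

func main() {
	// Hypothetical volumes: one account is far busier than the others.
	volume := map[string]int{
		"acct-a": 120, "acct-b": 90, "acct-c": 50000, "acct-d": 110,
	}
	for w, l := range loadPerWorker(volume, 4) {
		fmt.Printf("worker %d: %d messages\n", w, l)
	}
}
```

Whichever worker owns the busy account carries almost all of the load while the others sit mostly idle, and a static assignment has no way to move that account elsewhere.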
We were overprovisioning the system by a wide margin to manage latency. By simulating the same traffic load at different sizes, we found that each time we doubled the infrastructure footprint, we lowered the P90 latency by 50 percent.
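Read as a rule of thumb, that observation says P90 latency on the legacy system was roughly inversely proportional to infrastructure size. The relation below is only a sketch of that reading; the symbols n (footprint), n_0 (baseline footprint), and k (number of doublings) are introduced here for illustration, and the relation is assumed to hold only over the range that was actually simulated.

```latex
P_{90}(n) \approx P_{90}(n_0)\,\frac{n_0}{n},
\qquad\text{so}\qquad
P_{90}(2^{k} n_0) \approx \frac{P_{90}(n_0)}{2^{k}}
```

Under this assumption, an 8x latency improvement would have required roughly 8x the infrastructure, which is why buying latency with hardware was so expensive.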
These considerations, along with the needs of a global fintech customer with strict security requirements, were top of mind for us throughout the rebuild. Some of the fintech customer’s security considerations included not persistently storing PII and ensuring that the lifetime of each server was short in order to avoid potential in-memory hacks.
New World of Kubernetes and Go
We went back to the drawing board to imagine a world where latency could be improved by several orders of magnitude while automating operations and reducing operational costs. We wanted to “Think Big” and tackle this challenge with bleeding-edge solutions versus an incremental infrastructure improvement.
We had the highest degree of confidence in Go, Kubernetes, and distributed cloud storage. Since Nylas’ code does a lot of connection handling, encryption, and string manipulation, we knew fundamentally that a compiled language like Go versus an interpreted language would significantly improve performance.
After some additional vetting, the Nylas engineering team determined that the security, scalability, and stability of Google Kubernetes Engine (GKE) were second to none. Its capacity to run Kubernetes workloads securely at scale is why Nylas transitioned to a brand-new architecture. Even though this meant sacrificing years of work spent on our initial infrastructure, it was the right decision for our business and our customers.
Nylas now uses GKE to orchestrate the containers. Due to security requirements, nodes have to be cycled very quickly, so we are taking advantage of the fact that we can have 15,000 nodes in a Kubernetes cluster. We also use gVisor to run our containers and create strong isolation between the application and operating system. This helps us to lock down the host, memory, and storage access, and enforce the principle of least privilege at the operating system level.
The third update is primarily focused on short-term outcomes, and we’ve achieved excellent results. Because of resource constraints, our goal was to place our workload on horizontally scalable data stores that mask operational complexity. In favor of immediate reliability and scalability, we decided to utilize cloud services.
In the legacy Nylas platform architecture, most data was stored on a self-managed MySQL fleet. After migrating to Google Cloud, we decided to use Cloud Spanner as a relational data store. Cloud Spanner allows us to keep state in the application context. Fast read and write access is extremely critical for a large number of calls. Last but not least, Nylas does not hold PII; that information is pushed onto the customer’s event bus as soon as possible.
Cloud Pub/Sub is our event bridge and acts as another key technology. Extensive load testing at two orders of magnitude above the current peak immediately validated our choice. Hosting large, performant data stores with high reliability is difficult. The combination of Cloud Spanner and Cloud Pub/Sub solved this problem extremely well.
Our decision to leverage GKE to handle container orchestration, Cloud Spanner for relational data storage, and Pub/Sub for the message bus has paid off. The Nylas platform’s new architecture, which we rebuilt in only three months, went live in early June 2021 and has scaled effectively since then. A very nice side note is that we were able to reduce our code stack from 500 MB to just 7 MB.
The platform has performed over seven billion transactions on Cloud Spanner in less than a month, and the average latency of less than 10ms is a testament to the robustness of Google Cloud services. We are happy with our Cloud Pub/Sub experience, where performance latency is minimal even when we see spikes of thousands of writes per second.
Solid Support, Collaboration with Google for Startups
By working closely with our Startups team at Google Cloud, we had access to the Google Cloud services and expertise to address most issues before the launch. The Startup team feels like an extension of ours. Their responsiveness is incredible and stands out compared to support services we’ve seen from other providers. One of the main reasons we were able to re-platform our entire environment in only three months was Google Cloud’s amazing support.
Our participation in the Startup Program by Google Cloud for Startups has been instrumental to our success, and we know it will impact what we can achieve going forward. We are planning to continue to optimize workloads as we scale on Google Cloud. At the same time, we are thankful to have access to Google Cloud partner, Bespin Global; their team has provided vital load testing support and cloud resource optimization.
Another advantage of our participation in the Startup Program is the ability to easily tap into the Google Cloud Marketplace. It’s giving us a great way to get products into the hands of customers quickly. Having our products available on Marketplace is a frictionless experience for us, as well as for our customers who want a fast, self-service channel to get the tools they need.
Big Wins for Performance, Scalability, Security, and Throughput
The new Nylas Platform achieved all of the goals that we set out to accomplish.
Performance: P90 performance improved by orders of magnitude, where the end-to-end transaction can be completed in less than 10 seconds.
Costs: We estimate that by re-platforming our environment on Google Cloud, we cut our ongoing infrastructure costs by as much as 30X.
Throughput: Throughput (proportional to cloud cost) is improved by an order of magnitude on the same set of hardware.
Scalability: Load tested to 100x of current traffic peak. The elasticity of the platform is completely automated and not bottlenecked by any of the technologies selected.
Security: Data is short-lived at rest and in memory. The duration of storage is configurable based on the client’s security requirement and is a cost-based decision.
The elasticity of the system was well-tested during Amazon Prime Day. A surge of emails increased our server load from tens of servers to thousands of servers to handle the requests. We found the latency to be consistent as load spiked by three orders of magnitude.
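The post does not say which autoscaling mechanism drives this elasticity. As one illustration of how a Kubernetes workload can grow from tens to thousands of replicas, the Go sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). Using that autoscaler here is an assumption, and the numbers are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the core Horizontal Pod Autoscaler rule:
// ceil(current * currentMetric / targetMetric).
// Tolerances, min/max bounds, and stabilization windows are omitted.
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// Hypothetical surge: 20 replicas, each sized for 100 requests/s,
	// suddenly observe 10,000 requests/s per replica.
	fmt.Println(desiredReplicas(20, 10000, 100)) // prints 2000
}
```

Because the rule scales replicas in proportion to the observed load, per-replica load returns to its target once the new replicas are running, which is consistent with latency staying flat through a spike.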
What’s Next for the Nylas Platform
By collaborating with Google Cloud to better meet the needs of our enterprise customers, Nylas is positioned to expand our product and feature offerings beyond connectivity solutions.
Our roadmap includes partnering further with Google Cloud on advanced AI/ML use cases, such as extracting and surfacing valuable insights contained within communication channels like email, calendar and SMS. By mining these rich data resources, we’ll continue to empower development, engineering and product teams to launch high-value features faster with tools like automated workflows, expedited security reviews, and end-to-end user experience components.
Our success underscores our commitment to being agile and aggressive in bringing innovations to market. We’ve built a culture around trying new things and achieving ambitious results. Through our partnership with Google Cloud, we have a highly secure, scalable infrastructure in place to continue making the world more productive.
If you want to learn more about how Google Cloud can help your startup, visit our page here to get more information about our program, and sign up for our communications to get a look at our community activities, digital events, special offers, and more.