This post was co-authored with Nissan Modi, Staff Software Engineer at Coinbase.
In this post, we discuss how Coinbase migrated their user clustering system to Amazon Neptune Database, enabling them to solve complex and interconnected data challenges at scale.
Coinbase’s mission is to expand global economic freedom by building an open financial system that uses cryptocurrency to increase access to the global economy. The company pursues this by providing a secure online platform for buying, selling, transferring, and storing cryptocurrency. Coinbase handles vast amounts of data related to transactions, user behavior, and market trends. To efficiently manage and analyze this data, the company employs various data science techniques, including clustering. This method of data organization and analysis is particularly useful for a platform like Coinbase, which needs to understand patterns in user behavior, detect anomalies, and optimize its services. Clustering sorts entities into sets, or clusters, where items in the same cluster share key traits. These traits could be things like age, habits, or interests. The main idea is that data points in one cluster are alike, and those in different clusters are not. This method helps find natural patterns in data, making it straightforward to understand and use large datasets.
Challenge
The platform organization within Coinbase has been responsible for managing a clustering system that has been in place since 2015. Since the original datastore for the clustering system was not graph-based, clusters needed to be precomputed and stored in a NoSQL database. At any given time, the system would store approximately 150 million clusters, some of which contain over 50,000 nodes. This made it challenging to keep clusters up-to-date as user attributes changed in real time, as whenever a user attribute was updated, the system would need to re-calculate clusters.
Pre-calculating clusters became even more challenging as Coinbase expanded their product offerings and their customer base grew. Additionally, logic for grouping users became increasingly complex over time. This necessitated a high number of database updates to support each specific use case. As a result, the system began to experience performance degradation, higher storage costs, and difficulties in supporting different read patterns. This growing inefficiency made it clear that the existing approach was no longer sustainable.
The scale of the system was significant, with around 150 million clusters, some of which included over 50,000 nodes. This massive scale added to the complexity and challenges faced by the team, especially as the system’s write-heavy nature became more pronounced over time.
Initially, the system relied on a NoSQL database to store the precomputed clusters. Precomputing results can be advantageous in systems that are primarily read-heavy, because it avoids the need to repeat the same computations during read operations. However, the clustering system at Coinbase was characterized by a write-heavy workload with frequent updates, making precomputing less optimal as the system evolved. This led to performance issues, increased storage costs, and challenges in accommodating the complex and dynamic relationships within the data. Consequently, the team needed to reevaluate the database choice to better scale the system and meet the demands of Coinbase’s growing product ecosystem.
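To make the write-amplification argument concrete, the following is a minimal Python sketch, not Coinbase's implementation. It assumes a hypothetical precomputed store that keeps a copy of the cluster's member list under every member, and compares the number of records touched when one user joins a cluster with the single edge write a graph model needs. The function names and the storage layout are illustrative assumptions.

```python
def precomputed_writes(cluster_members: set[str], new_member: str) -> int:
    """Records rewritten when the member list is stored once per member.

    Assumption: the hypothetical store keeps a copy of the member list
    under every member's key, so adding one member touches every copy.
    """
    return len(cluster_members | {new_member})


def graph_writes(edges: dict[str, set[str]], entity: str, attribute: str) -> int:
    """Edges written when the same change is recorded in a graph model."""
    edges.setdefault(attribute, set()).add(entity)
    return 1  # one new entity-to-attribute edge, whatever the cluster size


# A large cluster, in the range the post describes (over 50,000 nodes).
cluster = {f"user{i}" for i in range(50_000)}
edges: dict[str, set[str]] = {}

precomputed = precomputed_writes(cluster, "new_user")
graph = graph_writes(edges, "new_user", "attr1")
print(precomputed, graph)  # prints "50001 1"
```

Under these assumptions the precomputed layout rewrites one record per cluster member, while the graph layout writes one edge and leaves cluster membership to be derived at read time.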
Solution overview
Graph databases are designed to manage complex, interconnected data structures, allowing representation and querying of relationships between entities. Because Coinbase’s use case is write-heavy and the data is highly connected, they needed a solution that can handle frequent updates to both data and relationships. Instead of relying on precomputed joins, a graph database can perform real-time traversals of connections and relationships, leading to improved query performance and reduced storage costs as compared to using a non-graph datastore to solve the same problem. Adopting a graph database for Coinbase’s clustering system represents a strategic shift towards a more flexible and scalable data architecture, which is key as both their customer base and the complexity of their customer relationships continue to grow.
Graph databases are purpose-built for storing and efficiently querying highly connected data. The following are key indicators of whether a graph is well-suited for a particular use case:
- Is the dataset highly connected, full of many-to-many relationships?
- Do your intended queries for the graph require understanding relationships between data points?
Coinbase’s clustering use case aims to group entities according to attributes shared across the entities. Therefore, when clustering is complete, entities within a single cluster will be more closely associated with one another, compared to other entities that are in different clusters.
You can represent the dataset using a series of relational tables, for example, a user_attributes table where each row represents a user, and each column represents a different attribute, as illustrated in the following figure.
You can also model it as a graph, as shown in the following figure.
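As a rough illustration of the two representations, the following Python sketch turns rows of a user_attributes-style table into entity-to-attribute edges. The column names and values are made up for this example, and `users_sharing` is a hypothetical helper, not part of Coinbase's system.

```python
from collections import defaultdict

# Hypothetical rows of a user_attributes table: one row per user,
# one column per attribute. The values are made up for illustration.
rows = [
    {"user_id": "u1", "email_domain": "example.com", "device_id": "d-100"},
    {"user_id": "u2", "email_domain": "example.com", "device_id": "d-200"},
    {"user_id": "u3", "email_domain": "example.org", "device_id": "d-200"},
]

# Graph form: each (attribute name, value) pair becomes an attribute node,
# and each non-key cell becomes a user-to-attribute edge.
users_by_attribute: dict[tuple[str, str], set[str]] = defaultdict(set)
for row in rows:
    for name, value in row.items():
        if name != "user_id":
            users_by_attribute[(name, value)].add(row["user_id"])


def users_sharing(name: str, value: str) -> list[str]:
    """Return the users connected to one attribute node."""
    return sorted(users_by_attribute.get((name, value), set()))


print(users_sharing("email_domain", "example.com"))  # ['u1', 'u2']
print(users_sharing("device_id", "d-200"))           # ['u2', 'u3']
```

In the tabular form, finding users who share a value means scanning or joining on each column. In the graph form, the shared value is itself a node, so related users are one hop away.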
The benefit of modeling this data in a graph format is that you can efficiently find groups and patterns based on mutual connections. For example, given the following sample graph of entities (ent#) and attributes (attr#), you might want to find the collection of entities that share certain attributes but not others. A shared attribute is defined as an attribute node that is connected to two or more entities.
More specifically, let’s say you want to find a collection of entities that meet the following requirements:
- All entities in the collection share at least the attributes attr1, attr2, and attr3
- All entities in the collection do not share the attributes attr4 and attr5
- All entities in the collection share any attribute with at least one other entity that shares a specific attribute with at least two other entities
And your graph contains the following entities and relationships:
With this example, only ent1 and ent2 would be returned. Both ent1 and ent2 connect to attr1, attr2, and attr3, meeting the first requirement. ent1 is connected to attr5, and ent2 is connected to attr4, but neither is a shared attribute because each connects to only one entity, which meets the second requirement. And both ent1 and ent2 share attr1, which is also shared by ent4, ent5, and ent6, which meets the third requirement.
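The three requirements can be checked with a small in-memory sketch. The edge list below is hypothetical: it is chosen to be consistent with the description above, because the original figure is not reproduced here. The third requirement is worded loosely, so the code implements one possible reading of it.

```python
from collections import defaultdict

# Hypothetical entity -> attributes mapping, consistent with the text:
# ent1 and ent2 both have attr1..attr3; attr4 and attr5 each have one entity;
# attr1 is also connected to ent4, ent5, and ent6.
graph = {
    "ent1": {"attr1", "attr2", "attr3", "attr5"},
    "ent2": {"attr1", "attr2", "attr3", "attr4"},
    "ent3": {"attr2", "attr6"},
    "ent4": {"attr1", "attr6"},
    "ent5": {"attr1"},
    "ent6": {"attr1", "attr3"},
}

# Invert the mapping: attribute -> entities connected to it.
entities_of = defaultdict(set)
for entity, attrs in graph.items():
    for attr in attrs:
        entities_of[attr].add(entity)


def is_shared(attr):
    """A shared attribute is connected to two or more entities."""
    return len(entities_of[attr]) >= 2


def matches(entity):
    attrs = graph[entity]
    # Requirement 1: connected to attr1, attr2, attr3, each of them shared.
    req1 = all(a in attrs and is_shared(a) for a in ("attr1", "attr2", "attr3"))
    # Requirement 2: attr4 and attr5 are either absent or not shared.
    req2 = not any(a in attrs and is_shared(a) for a in ("attr4", "attr5"))
    # Requirement 3 (one reading): some other entity reachable through one of
    # this entity's attributes has an attribute connected to at least two
    # entities besides itself.
    neighbours = set().union(*(entities_of[a] for a in attrs)) - {entity}
    req3 = any(
        len(entities_of[a] - {n}) >= 2 for n in neighbours for a in graph[n]
    )
    return req1 and req2 and req3


result = sorted(e for e in graph if matches(e))
print(result)  # ['ent1', 'ent2']
```

On this sample data the result matches the walkthrough: ent1 and ent2 pass all three checks, and every other entity fails the first one.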
To answer this question efficiently, you need to know not only how entities are connected to their attributes, but also how to traverse those connections across multiple levels. Although this question can be answered with a relational database, efficient query performance requires knowing all your query patterns upfront, so that table joins can be pre-calculated and stored. Keeping this data in a graph lets you recalculate queries in real time as the data changes (with no need to pre-calculate joins), and also gives you the flexibility to write new query patterns as needed.
Neptune Database addresses several technical challenges faced by large-scale graph database implementations. Because it is fully managed, Coinbase can eliminate significant operational overhead while providing flexibility in data modeling and querying. Neptune Database doesn’t enforce a schema, so adding new properties, node types, and edge types to answer evolving business use cases doesn’t require the graph to be rebuilt or reloaded. Additionally, Neptune Database is capable of querying billions of relationships with millisecond latency, allowing Coinbase to scale this system with their growing customer base.
In Coinbase’s solution, the data ingestor service writes to Neptune Database transactionally. Multiple events are batched into a single graph query, which is run as a single transaction within Neptune Database. This keeps the graph up to date in near real time with the incoming events. Because Coinbase micro-batches multiple changes into the same transaction, they can achieve their desired ingestion rates with 20 writes per second, where each write takes the place of many writes (depending on how many users’ clusters were being updated) in the old NoSQL system.
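The micro-batching pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Coinbase's ingestor: `MicroBatcher` is a hypothetical class, and `submit` is a hypothetical callback that stands in for whatever runs one graph query as one transaction.

```python
from typing import Callable


class MicroBatcher:
    """Collect events and hand them to `submit` in groups.

    `submit` stands in for whatever runs one graph query in one transaction;
    it is a hypothetical callback, not a Neptune or Coinbase API.
    """

    def __init__(self, max_events: int, submit: Callable[[list[dict]], None]):
        self.max_events = max_events
        self.submit = submit
        self.pending: list[dict] = []

    def add(self, event: dict) -> None:
        """Queue one event and flush once the batch is full."""
        self.pending.append(event)
        if len(self.pending) >= self.max_events:
            self.flush()

    def flush(self) -> None:
        """Send everything pending as one batch, then start a new one."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.submit(batch)


# Record each submitted batch so the grouping can be inspected.
transactions: list[list[dict]] = []
batcher = MicroBatcher(max_events=3, submit=transactions.append)

for i in range(7):
    batcher.add({"user": f"u{i}", "attribute": "attr1"})
batcher.flush()  # send the final partial batch

print(len(transactions), [len(t) for t in transactions])  # 3 [3, 3, 1]
```

Seven events become three submissions here. A production ingestor would likely also flush on a timer so that a slow trickle of events is not held back, which this sketch leaves out.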
The following diagram illustrates the architecture for Coinbase’s enhanced clustering solution.
Services communicate with Neptune Database through an API server, where different use cases are mapped to different queries. For example, when invoked, the get-related-users API takes an attribute name and attribute value and runs the following Gremlin query to retrieve information about a given user:
One feature that Coinbase was unable to implement with the legacy architecture was a UI for stakeholders that visualized the graph. Even though clusters could be pre-calculated, the results themselves were still stored in a tabular format. Now that the data is in a graph format, visualizations of the entities and relationships can be generated with ease. Providing a visualization allows stakeholders to see a different perspective of the data, and makes it straightforward to visually identify the common attributes used for generating clusters—enabling stakeholders to take the proper actions when linkages between common attributes are found. The following is an example visualization from Coinbase’s enhanced clustering system.
Representing graph data and querying
Neptune Database supports two open frameworks for representing graph data: the Resource Description Framework (RDF) and the Labeled Property Graph framework (LPG). You can represent graph use cases using either framework, but depending on the types of queries that you want to run, it can be more efficient to represent the graph using one framework or the other.
The types of queries that are commonly used for clustering in the Coinbase system require recursive traversals with filtering on edge properties and other traversals. Therefore, representing this use case with the LPG framework was a good fit because it’s simpler to write complex pathfinding queries using the LPG query languages openCypher and/or Gremlin.
For example, one benefit of using LPG and the Gremlin query language is support for a query results cache. Pathfinding queries that are used to generate clusters can have many results, and with the query results cache, you can natively paginate the results for improved overall performance. Additionally, to generate a visualization of subgraphs, you need to return a path, which is the sequence of nodes and edges that were traversed from your starting points to your ending points. You can use the Gremlin path() step to return this information, making it less complicated to generate paths for recursive traversals that stop on a condition, such as finding the path between a given pair of nodes.
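To show what a path result looks like, the following Python sketch runs a breadth-first search over a small, hypothetical edge list and returns the alternating sequence of nodes and edge labels between two nodes. It only illustrates the shape of the data a visualization needs; it is not how Gremlin's path() step is implemented, and the edge label `has_attribute` is an assumption.

```python
from collections import deque
from typing import Optional

# Hypothetical labelled edges: (from_node, edge_label, to_node).
edges = [
    ("ent1", "has_attribute", "attr1"),
    ("ent4", "has_attribute", "attr1"),
    ("ent4", "has_attribute", "attr6"),
    ("ent3", "has_attribute", "attr6"),
]

# Build an undirected adjacency list that remembers the edge label.
adjacency: dict[str, list[tuple[str, str]]] = {}
for src, label, dst in edges:
    adjacency.setdefault(src, []).append((label, dst))
    adjacency.setdefault(dst, []).append((label, src))


def find_path(start: str, end: str) -> Optional[list[str]]:
    """Breadth-first search returning nodes and edges in traversal order."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:  # ending condition: the target node was reached
            return path
        for label, neighbour in adjacency.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [label, neighbour])
    return None  # no connection between the two nodes


print(find_path("ent1", "ent3"))
```

The returned list alternates node, edge label, node, and so on, which is enough to draw the chain that links ent1 to ent3 through the attributes they have in common with ent4.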
Benefits and results
Coinbase’s solution with Neptune Database yielded the following benefits:
- New use cases – The new solution facilitates the discovery of related users across various product use cases without the need for hard-coded aggregation logic. Additionally, attribute lists can be passed to the get-related-users API to instantly generate a list of related users. This capability aids in debugging and allows for the efficient identification of similar users for administrative purposes.
- Performance efficiency – 99% of the queries that Coinbase runs achieve a latency of less than 80 milliseconds for the platform team while running on a smaller, cost-optimized instance, without a caching layer. This instance can scale to 300 transactions per second (TPS). These transactions are more meaningful than TPS figures on the previous NoSQL system, because each one batches writes and updates all of the users’ attributes across multiple clusters. Because the NoSQL system had to compute multiple joins, it needed multiple queries to find the same results that a single graph query now finds.
- Reliability – Because updates are now limited to a single node, the number of database operations has been drastically reduced. This optimization has effectively eliminated the race conditions that were prevalent in the previous system. Additionally, Coinbase can take advantage of automatic hourly backups through Neptune Database.
- Cost optimization – Coinbase was able to achieve 30% savings in storage costs by eliminating redundant information in the old system and computing the clusters at runtime using Neptune Database.
- Visualizations – New visualization capabilities provided through a custom-built UI help business owners and teams across the company understand fraud and risk situations and allow new ways to show useful data. This has already significantly reduced analysis time.
Conclusion
Coinbase’s journey with Neptune Database showcases the power of graph databases in solving complex, interconnected data challenges at scale. By migrating their user clustering system to Neptune Database, Coinbase has not only overcome the limitations of their previous NoSQL solution but also unlocked new capabilities and efficiencies. The fully managed nature of Neptune Database has allowed Coinbase to focus on innovation rather than operational overhead. The platform’s ability to handle billions of relationships with millisecond latency supports Coinbase’s future growth and evolving business needs.
Now that the data is in a graph format on Neptune Database, it’s less complicated for Coinbase to add more user attributes without increasing the complexity of managing the relationships. In the future, they plan to ingest more of these attributes and gain richer insights. This will lead to even greater benefits and new use cases.
The graph format also makes it straightforward to extend analyses to experiment with new graph-based techniques. Neptune Analytics is a memory-optimized graph database that helps you find insights faster by analyzing graph datasets with built-in algorithms. Graph algorithms can be used to identify outlier patterns and structures within the dataset, providing insights on new behaviors to investigate. A Neptune Analytics graph can be created directly from a Neptune Database cluster, making it effortless to run graph algorithms without having to manage additional extract, transform, and load (ETL) pipelines and infrastructure.
Get started today with Fraud Graphs on AWS powered by Neptune. You can use sample notebooks such as those in the following GitHub repo to quickly test in your own environment.
About the Authors
Joshua Smith is a Senior Solutions Architect working with FinTech customers at AWS. He is passionate about solving high-scale distributed systems challenges and helping our fastest scaling financial services customers build secure, reliable, and cost-effective solutions. He has a background in security and systems engineering, working with early startups, large enterprises, and federal agencies.
Melissa Kwok is a Senior Neptune Specialist Solutions Architect at AWS, where she helps customers of all sizes and verticals build cloud solutions according to best practices. When she’s not at her desk you can find her in the kitchen experimenting with new recipes or reading a cookbook.