When developing applications, you often need to implement a counter to accurately track actions such as votes cast, the available quantity of a resource in an eCommerce store, or tickets available for an event. These counters must be updated as the resource quantity changes.
In this post, we explore seven approaches to implementing resource counters with Amazon DynamoDB, a serverless NoSQL database service for single-digit millisecond performance at any scale. Failure modes, cost, and complexity are considered for each approach.
The challenge of counters
When using a counter, you want it to be accurate even when multiple concurrent processes are updating it. Loss of accuracy could be painful. For example, undercounting could result in overselling, making it impossible to fulfill all customer orders.
There are situations where undercounting or overcounting is tolerable, as long as it’s understood and included by design. For example, it might be acceptable to overcount orders as protection against the business overselling, as a physical resource count can be used to correct the count later.
It might also be necessary to stop consumption if the available quantity of the resource reaches a threshold (frequently zero). If this is a requirement, it’s important to avoid any race conditions that could result in some processes updating the counter after the threshold is reached.
All of the above should be considered when implementing a resource counter with DynamoDB. In particular, how the database will behave under failure scenarios at scale should be understood. Solutions that ignore what happens when failures occur, focusing only on success, increase the potential for issues in production.
Failure handling in DynamoDB
DynamoDB is a highly available, durable, distributed database. As with any distributed system, infrequent, intermittent failures can occur during operations and must be accounted for in any application.
An operation to update a counter might experience a service failure (identified by a 500-series HTTP response code being returned by DynamoDB). Failures can occur for several reasons, including a node rebooting or being replaced, or a transient network problem. The issue can occur either before or after the counter was updated, and it’s not always possible for the client to reliably differentiate between the two cases.
This type of failure is carefully considered in each of the seven approaches for managing counters presented in this post:
Optimistic concurrency control
Optimistic concurrency control with history
Transaction with a client request token
Transaction with a marker item
Counting with an item collection
Counting with a set
Approach 1 – Atomic counters
The first approach to managing a counter is uncomplicated and incurs the lowest cost, but provides lower accuracy compared to other approaches. It’s only suitable when an approximate counter can be tolerated.
The atomic counter approach uses an UpdateItem call to increment or decrement the counter. On receiving a 500-series error, the client can choose to retry the call or not, depending on if overapplication or underapplication is preferable in an ambiguous situation (by default, calls made by the SDKs will automatically retry unless configured otherwise). UpdateItem calls are naturally serialized within DynamoDB, so there are no race condition concerns with making multiple simultaneous calls.
The following AWS Command Line Interface (AWS CLI) invocation demonstrates this pattern; it subtracts five from the counter on each run.
If a 500-series error is returned, there’s no reliable way to determine whether the error occurred before or after the update. One strategy to handle this is to always retry after a failure if overapplication is tolerable, or never retry if underapplication is tolerable.
If a threshold needs to be enforced, a condition expression can be included to ensure the threshold won’t be exceeded. The following AWS CLI example reduces the counter by three on each invocation, as long as doing so won’t reduce the counter below zero. For this operation, the value of :threshold is three (the threshold plus the change):
DynamoDB applies these serially, checking the condition and applying the update before processing the next request. Adding a condition expression doesn’t increase the cost of an atomic counter, and doesn’t remove the uncertain outcome from 500-series failures.
Approach 2 – Optimistic concurrency control
Optimistic concurrency control (OCC) is a design pattern that changes an item only if the item is currently in a given state. OCC can improve the accuracy of an atomic counter by making it possible to identify situations where the update definitely did or did not happen.
The general design when using OCC is for a client to get an item, then update that item, but only if a specific attribute (such as, version, timestamp, entity tag, or resource counter value) is unchanged from the time of getting the item.
Let’s look at a practical example. The following AWS CLI example updates the counter to 5, with a condition that this should only apply if the value of the entityTag attribute is currently a458fc3d (this was the value from getting the item). The example also updates entityTag to a new random value of 781d45ac.
If a failure occurs, you can get the item again. If you see that the value of entityTag hasn’t changed, you know with certainty that the update didn’t happen and can safely retry. If you see that entityTag was changed to the new value, you know with certainty that the update happened and there is no need to retry. However, if entityTag is a third value, it indicates another client has changed the value of entityTag at around the same time and it’s still unclear if your update applied or not. The following approaches address this challenge.
To construct this OCC operation, the current value of entityTag must first be retrieved. This will require a GetItem operation.
For OCC, the process has to complete the full cycle of GetItem and then constructing and applying the update (with retries if necessary) before any other competing process makes a change. After a process makes a change, the condition will fail for any other writing processes and they will have to restart the loop, starting with GetItem.
As the concurrency increases for a counter, the number of processes having to repeat the loop due to condition failures will also increase. This contention can be self-destructive; at high concurrency, more processes are in this two-step loop, leading to more contention and more processes entering repeat loops. When this happens, the number of processes in a repeat loop can outnumber those exiting by successfully writing and moving on. If the counter isn’t given some respite, this will lead to an ever-growing backlog.
If a threshold is being enforced for the counter, the process can check if the change is acceptable after the GetItem operation. If the change would breach the threshold, the process can be stopped.
The cost impact of OCC compared to an atomic counter is an additional read for each update loop, plus any repeated loops, making OCC slightly more expensive.
Approach 3 – Optimistic concurrency control with history
OCC can be enhanced to maintain a history of which processes successfully updated the counter. This additional information will help a process to determine when to perform a retry.
In this example, we create a basic implementation and instead of replacing the entityTag value with a new one, we append the new value to the existing value in a string. This allows the entityTag to be examined to check what recent writes have succeeded and in what order.
If the process experiences a 500-series failure, it can examine the entityTag after a GetItem operation and check if the random value it tried to append is present. If it is, the update was successful. If not, the process should continue the retry loop.
As more processes successfully update the counter, the entityTag will grow in size. In our example, the entityTag stores the successful writers in sequential order. To limit the growth, the client should limit the maximum number of individual values in the entityTag. When this limit is reached, the writing process can remove an entry from the start of the entityTag (the oldest successful writer) as well as adding its own value at the end. The same approach can be implemented using a list instead of a string.
With the entityTag, this approach moves closer to being an accurate counter. The removal of the oldest successful writers could still create an edge case where ambiguity exists. At high scale, so many writers are making changes that every entry in the entityTag might have changed by the time a writer perfoms a GetItem to check if it successfully wrote. This writer is then in an ambiguous situation.
This approach doesn’t remove the potential for OCC to cause an ever-growing backlog at scale due to contention and a two-step repeat loop. The cost of this approach is the same as the basic OCC.
The most straightforward approach to maintain a fully accurate counter with multiple concurrent client calls and potential intermittent failures in DynamoDB is to use transactions. There are two approaches with transactions; using a client request token (CRT) or with a marker item.
Approach 4 – Transaction with a client request token
Transactional writes in DynamoDB allow a CRT to be passed. Any value can be used for the CRT, but it should be unique to the particular update being made (such as a random value generated by the client). Using the SDK will automatically generate and send a CRT. The following is an AWS CLI example that reduces the counter by seven with a CRT:
DynamoDB maintains an internal record of the CRTs of successful requests passed within the last ten minutes. If the same CRT is seen again within 10 minutes, DynamoDB will skip the latest attempt and return a success response, providing full idempotency. If the client receives a 500-series failure, it can safely retry the operation with the same CRT and know only one request with that CRT will apply within any 10-minute window.
If there’s a threshold to be enforced, transact-items.json can include a condition expression. The following is the previous CLI example with the addition of a condition that the quantity must not drop below zero:
The cost of a transactional write in DynamoDB is double the equivalent non-transactional write. In this example, one item is updated, making this option double the cost of an atomic counter.
Approach 5 – Transaction with a marker item
If idempotency must extend beyond the ten-minute CRT window, the equivalent functionality can be implemented manually, without a time constraint, by having the writer process track processed tokens and prohibit their reuse.
Within a transaction, the client must apply two actions:
Update the counter (as long as doing so won’t breach a threshold set by an optional condition).
Put a marker item that uniquely identifies this update operation (effectively a CRT) into a table to mark the action as completed. The marker item can be stored in the same table as the counter, or in a separate table.
The partition key of the marker item is the equivalent of a CRT and should be unique to the particular change being made. It must also have a condition to check that it doesn’t already exist in the table.
This following example reduces the counter by five, with a threshold, and stores the marker item in a separate table:
where transact-items.json contains:
In the event of a 500-series failure, it’s safe to retry this operation as it’s idempotent. If the threshold has been breached or this operation has already made this update in a previous iteration, the transaction will be explicitly rejected with a TransactionCanceledException. To verify what caused the rejection, check for the existence of the marker item. If it exists, this process successfully made the change. If not, the threshold blocked the change.
The marker item can include attributes to record why this change to the counter was made. For example, if the details of a customer order are going to be stored, they can be added as additional attributes in the marker item instead of being stored in a separate item.
In this example, there are two items being written in a transaction, making it the most expensive of the approaches at double the cost of using a CRT. To avoid a build-up of marker items in the table, a Time to Live (TTL) can be set to expire them later.
Approach 6 – Counting with an item collection
This next approach simplifies writing at the expense of reading, and is unique in that it works with multiple Regions of a global table.
It’s possible to replace a single counter item with an item collection of events, which can be used to calculate the counter value. This design is typically how a ledger would be implemented and is based on the idea of event sourcing. Each increment or decrement becomes its own item, with a unique ID, and can be written idempotently. These items can later be summed on the client-side to calculate the counter value. This approach doesn’t work well with threshold enforcement.
An item collection in DynamoDB is the formal name for a collection of items sharing the same partition key (PK) value but having different sort key (SK) values. Each item collection represents a counter with the PK as the unique ID for the counter. The SK stores a unique value (a request token) generated by the client to represent a particular update. The amount of the increment or decrement is stored as an attribute. The following example increases a counter with ID customer1-balance by three with a request token of a458fc3d:
If a 500-series failure occurs, it’s safe for the process to retry without any risk of overapplication as the item will be over-written if it already exists. This operation is idempotent and the counter will remain accurate at scale.
The current value of the counter isn’t explicitly stored in this approach. To obtain the value, the item collection must be queried and all of the returned items summed by the application. DynamoDB charges for queries by the total data returned, rounded up to the nearest 4 KB, regardless of the number of items in the returned item collection. The cost of querying the counter value will therefore increase as the item collection grows over the 4 KB boundary. The time taken to obtain the value will also increase as the results start to paginate and a return trip to the DynamoDB API is required for each page.
The time and cost of querying the counter might make this method less attractive, but can be alleviated by caching the calculated value and using the cache for a period of time, with a trade-off that the value becomes more stale than with other approaches.
The cost of storing and querying the counter can also be alleviated by merging entries (if individual entries don’t need to be retained for visibility). Multiple entries can be deleted and replaced by a singular entry containing their net change. This data modification should be completed carefully and use transactions to ensure it’s performed without any possibility for error.
If it’s helpful to obtain the changes to the counter in order (that is, to show a log of changes), the SK can be prefixed with a timestamp. It’s important to keep a random element as a suffix to avoid two changes being performed at exactly the same time being confused for a singular change. If the changes don’t have to be in order, using a truly random SK would allow for more concurrent writes to the item collection because it would enable split for heat to apply.
Given that there is no storing, checking, and modifying the current counter value on single items, it’s not possible to reliably enforce a threshold, although it would be possible to observe a threshold’s crossing and take action.
Approach 7 – Counting with a set
If the total number of updates to the counter can be limited, a set can be used to implement an accurate counter. An example would be allocating a maximum number of slots and tracking how many available slots are remaining (for example, adding a limited number of players into a multiplayer game).
The writing process generates a unique value (a request token) to represent the update being made and attempts an insert into a set attribute as part of the update. The native set functionality in DynamoDB can be used to check if this token already exists in the set and reject the request if it does.
A CLI example using a string set with a maximum size of fifty, a request token of “a458fc3d” and increasing the counter by one:
If a 500-series failure occurs, it’s safe for the process to retry without any risk of overapplication as the check of the token’s existence in the set prevents repeat application. This counter will remain accurate at scale.
If an update is rejected with a condition failure, the process must check if the token it tried to insert exists in the set by performing a GetItem operation and inspecting the values in the set. If the value exists, the rejection was because the update was already successfully performed, otherwise the update cannot be made due to the maximum size of the set being reached.
If the counter has a threshold to be enforced and the change made by each operation is one, the set maximum size is the threshold. If the change is variable, the condition expression can be extended to include a threshold check.
As the number of entries in the set grows, the total size of the item will increase. DynamoDB writes are charged by rounding up to the nearest kilobyte. As this item grows over kilobyte boundaries, the associated write costs will increase. If the total number of updates (the set size and the growth) is sufficiently limited, this is likely the most cost effective approach to implementing an accurate counter at scale. The 400 KB size limit for any single item in DynamoDB will be a hard constraint.
Once either the set maximum or the 400 KB size limit is reached, the counter can no longer be updated. A possible escape from this situation is to delete items from the set. A set is naturally unordered, making this a more complex process than deleting from a list or a string. If deleting markers is a requirement, OCC with history (Approach 3) is likely a simpler and more appropriate approach.
If additional detail of why the counter was changed is to be stored, approaches enabling an item collection with a Transaction (Approaches 4 and 5) would likely be more appropriate.
Scaling a counter
DynamoDB can achieve nearly unlimited traffic with all of the above approaches. If transactions are used at scale, transactional conflicts can occur. These conflicts can be resolved by retrying. As the load on the counter grows, the volume of transaction conflicts will likely increase with an associated increase of retries.
All of the approaches, with the exception of approach 6 (Counting with an item collection), write to a single item. These single item approaches can reach limits. If processes reach these limits, or if the volume of retries is disruptive, write sharing can be used to improve the throughput for these counters.
If the stock for a popular product is being taken from a bin in a warehouse to fulfill orders, and having only one bin is a bottleneck, the logical solution would be to split the stock across multiple bins and work them in parallel. This is conceptually the same approach as write sharding. In write sharding, the counter is split across multiple items and the items are processed in parallel.
As a practical example; a product quantity of 1,000 with stock code abc123 would be split equally between N items in the table, adding a suffix to the end of the partition key for each item. With an N of 10, this would create ten items for the counter. The first would have a partition key of abc123-0, the second would be abc123-1 (and so on), with the last being abc123-9. Each would have an initial quantity of 100 (1,000 divided by N).
During an operation to modify the counter, the writer would randomly select one of the items to update. This would distribute the updates across the items, improving throughput.
Implementing write sharding might mean that the client needs to be enhanced to understand that if an item cannot be updated due to a threshold, one of the other items might have capacity to accept that update, or the update quantity might have to be split across several items. Understanding the total value of the counter also requires reading all of the items and summing the parts.
Working with the AWS SDK
The AWS SDK for DynamoDB implements automatic retries for some situations. This has implications for the approaches detailed that aren’t idempotent or retry safe. Automatic retries by the SDK might lead to overapplication of an update. Disabling SDK retries with these approaches might lead to underapplication unless the application explicitly handles retrying.
For approaches 2, 3, and 7 (Optimistic concurrency control, OCC with history, and Counting with a set), automatic retries by the SDK can result in the update successfully applying, but a retry then returning a condition failure, leaving the application to determine what happened and what next steps to take.
For approaches 4 and 6 (Transaction with a client request token and Counting with an item collection), retries by the SDK will be transparent, as DynamoDB will return a success response for any retries of an operation that have successfully applied. There are no implications for the application. The SDK also automatically generates and sends the CRT for approach 5. This makes these approaches simpler to work with.
For approach 5 (Transaction with a marker item), retries by the SDK can result in a TransactionCanceledException, leaving the application to determine what happened and what next steps to take.
An application might need a counter to exist in more than one Region. In this case, DynamoDB global tables can be used. DynamoDB global tables are an active-active (multi-writer) solution. There are two approaches to working with a global table; writing in one Region and reading from many, or reading and writing to many Regions. The impact of these two options on the approaches to managing a counter are explored next.
If the solution is writing exclusively to a single Region but reading from many, you can use one of the approaches detailed in this post to update the counter. The accuracy of the counter is determined by the approach chosen. Any process reading from a Region other than the writing Region might obtain a counter value that is a bit more stale than if read from the writing Region, as the updates to the counter propagate, but this propagation is usually completed within a second.
If the solution is writing to multiple Regions, only approach 6 (Counting with an item collection) will maintain an accurate counter. This is because DynamoDB global tables use a last-writer-wins approach to resolve conflicting writes to the same item. Let’s illustrate this issue with an example. Some counter exists in Region A, B, and C. It initially has a value of four. A write in Region A reduces this by one (to three). A few milliseconds later, a write in Region B reduces this by two (to two) before the write in Region A has propagated. The correct calculated value for the counter is one, but the last writer wins approach will see this conflict and set the counter to two in all Regions.
Approach 6 doesn’t have this problem as there are no conflicting writes to the same item. Every change to the counter is a write to a unique item and will ultimately propagate to every Region. Changes will not be lost. If the process of merging items is implemented, the counter can be temporarily inaccurate as the individual writes and deletions are propagated (possibly at different rates and out of sequence), but this will eventually correct itself. Just ensure that the merging work is limited to one Region at a time.
With any approach, writing to multiple Regions will have a counter value that could be a bit stale in every Region as writes propagate between Regions.
In this post, we explored implementing a resource counter with Amazon DynamoDB. You learned about seven approaches, including failure modes, cost, and complexity for each. The characteristics of each approach are summarized in the following table:
Simple and low cost. Appropriate to use when a guaranteed accurate counter isn’t required. The SDK will automatically retry by default during 500-series failures, so might over apply.
Optimistic concurrency control (OCC)
Greater accuracy than atomic counters. Adds some cost and complexity.
Optimistic concurrency control with history
Greater accuracy than OCC, but with more complexity.
Transaction with a client request token
An accurate counter. Simple to implement but with a bit more cost.
Transaction with a marker item
An accurate counter. The best choice only if a retry window greater than ten minutes is required.
Counting with an item collection
The only option for global tables. Writing scales up without needing to implement write sharding. Obtaining the counter value is an application-side effort.
Counting with a set
The least expensive accurate counter. Requires that the total number of updates to the counter can be limited.
To further explore some of the topics covered in this post, you could take a deeper dive into transactions, write sharding, and global tables.
If you have any questions or feedback on this post, leave a comment below.
About the Authors
Chris Gillespie is a UK based Senior Solutions Architect. He spends most of his work time with fast- moving “born in the cloud” customers. Outside of work, he fills his time with family and trying to get fit.
Jason Hunter is a California-based Principal Solutions Architect specializing in DynamoDB. He’s been working with NoSQL Databases since 2003. He’s known for his contributions to Java, open source, and XML.
Read MoreAWS Database Blog