Amazon Quantum Ledger Database (Amazon QLDB) is a ledger database that provides a complete and verifiable history of all transactions committed to the ledger.
In Amazon QLDB, data records are called documents and are versioned. Documents use a JSON-like format called Amazon Ion. Changes to documents are called revisions and are written to the ledger in transactions. The ledger is built upon an append-only log called a journal. As transactions are committed to the ledger, they’re organized into blocks and appended to the journal, with each block containing one transaction. Once committed and appended, blocks can’t be modified or overwritten, which gives you an immutable record of every insert, update, delete, and select ever committed to the ledger and access to every revision of every document. To prove that the transaction history is immutable, Amazon QLDB provides a cryptographic verification feature that enables you to mathematically prove the integrity of your transaction history. In this post, we discuss the value of cryptographic verification in a ledger in the context of a realistic use case.
Cryptographic verification use case
Cryptographic verification is useful for cases where a business may need to provide transparency by proving the integrity of its data to another party, such as a regulator, law enforcement, auditors, the public, a business partner, or another party in a lawsuit. Verification proves that the transaction history has not been altered, tampered, or falsified. The Amazon QLDB console provides an easy-to-use feature for verifying a single document revision. However, using the console for verification is a manual process and only verifies a single document revision. You might wish to automate verifications or integrate verification functionality into your application.
To illustrate, let’s imagine that we’re a television production company hosting a live televised singing contest where contestants advance or get eliminated based on votes submitted by viewers through a mobile phone application. The fairness of game shows is serious business and most production companies use an auditing firm to review their voting process and the outcome. However, it would be easy to accuse the producers of tampering with the results to favor contestants that might boost the show’s ratings. Tampering of data directly in the database would be hard for auditors to detect, so there’s still room for scandal. If the votes are stored in an Amazon QLDB ledger, they have a complete record of every transaction ever committed, so there would be no way to hide any alleged contest-fixing and the integrity of the data could be proven with cryptography.
In this post, we look at how cryptographic verification can help our production company maintain its audience’s trust in the voting process and its results. Before we dive in to our example use case, let’s understand the concepts behind cryptographic verification.
Sending data through a cryptographic hashing algorithm produces a fixed-width output called a hash. A cryptographic hash is like a fingerprint of data. If even a single bit is changed in the input data and the hash is recalculated, the algorithm produces a very different hash. By comparing hashes of data calculated at different points in time, it’s easy to tell if the data has changed in the interim: the hashes are different! When data is written to the ledger, Amazon QLDB calculates a hash of the data using the SHA-256 hash algorithm and stores it with the data. At a high level, verifying data in Amazon QLDB is done by recalculating the hash for the data you want to verify and comparing it to the hash that was stored when the data was written.
Let’s examine a very simple example of hash comparison using Java. In this example, we calculate the SHA-256 hash of some example data. Then we simulate data tampering by changing one character in it. To perform a verification, we recalculate a SHA-256 hash for our data and compare it to our original hash. If they’re the same, we’ve proven the integrity of our data. If they’re different, we know that the source data has changed. See the following code:
In the output for this program, the hashes are different, indicating that the data has changed:
What if we wanted to verify the integrity of an entire database? We could apply the cryptographic hash technique we described to every record in the database, testing the integrity of each, as shown in the following diagram.
However, this approach has a couple of flaws. First, an attacker with low-level access could tamper with the data and its hash and it would be very difficult to discover. Second, this approach verifies each record individually, not the integrity of the database as a whole.
To solve these problems, we use hash chaining. Hash chaining is the process of applying a cryptographic hash algorithm to another hash, creating a hash of a hash. If we arrange the records in our database sequentially and make the hash of each data record depend on the hash of the record before it, then changing the data in one record affects the hash of that record and the hashes of every other record that occurs after it in the database. To verify the integrity of the entire database, we recalculate and compare the hashes of all records, starting from the first record and continuing until the hash of the last record is verified or until we discover an invalid hash. This proves not just the contents of each record, but also that each record is where it belongs in the sequential order in the database. The following diagram illustrates our hash chaining process.
Recalculating the full hash chain during a verification is inefficient, and requires more and more time and compute cycles as the journal grows. To make verifications more efficient, we can organize our hashes using a Merkle tree. A Merkle tree is a binary tree data structure whose leaf nodes contain a single data hash and whose non-leaf nodes contain a hash of their two child node hashes.
Verifications using a Merkle tree involve recomputing the hash of the data you wish to verify, concatenating that hash to the hash of its sibling node in the Merkle tree, and hashing the result. That hash becomes the value of the parent node. The parent node’s hash is then combined with the hash contained in its sibling node in the Merkle tree, and the result is hashed to produce the hash of the grandparent node. This process continues until the two nodes below the root of the Merkle tree are combined and hashed together. The resulting hash should equal the hash contained in the root node. Verifications performed using Merkle trees are much more efficient than performing a linear validation of every record in the database.
Because each node in the Merkle tree is a hash that represents everything below it in the tree, the root of the Merkle tree represents the entire database up to that moment in time. Verifying the integrity of data using a Merkle tree proves the integrity of that data and the database as a whole.
The following diagram illustrates our Merkle tree.
Let’s imagine that we want to verify data record d2 in the preceding diagram. The database in the diagram contains six data records, labeled d0 through d5. Each data record has a corresponding hash value, labeled h0 through h5. We start by calculating the hash of the data in d2. There’s a specific way of doing this; we cover the details later in this post. We combine the hash we computed for the d2 data with hash h1 and we hash that, the result of which should equal h2, which Amazon QLDB computed and stored for us when d2 was written.
Now we continue up the Merkle tree. The next step is to combine sibling hashes h2 and h3 and hash the result, giving us m0.2. We repeat the process, combining sibling hashes m0.2 and m0.1, giving us m1.0. We repeat this once more, concatenating hashes m1.0 and m0.3 and hashing the result. If the resulting hash equals hash m2.0, our database is valid. However, if even a single bit is changed in records d0 through d5, verification fails.
One complication of using the Merkle tree to perform validations is that you just don’t have some information unless you construct the entire Merkle tree yourself, which defeats its purpose. Specifically, you don’t have the sibling hashes at each level in the tree and you don’t have the final hash, the hash at the root of the Merkle tree, to compare to. Luckily, Amazon QLDB provides these for you. Amazon QLDB calls the sibling hashes proof hashes and it calls the hash at the root of the Merkle tree the ledger digest. In our verification example, hashes h3, m0.1, and m0.3 are the proof hashes provided by Amazon QLDB in that order. They’re colored in green in the following diagram. Hash m2.0 is the ledger digest, colored in blue.
The data nodes colored in grey in the diagram are called blocks in Amazon QLDB, and the ordered collection of blocks is called the journal. Each block represents a single transaction committed to the ledger, and is a complex structure that contains all the document revisions written in the transaction, transaction metadata, and other information. To calculate the hash of the block, we apply the concept of the Merkle tree to the block itself. The hashes of the individual components of a block are calculated and organized into a Merkle tree so that we can calculate its root hash, which becomes our block hash. We demonstrate this in the following sections.
Now that we understand the concepts behind cryptographic verification, let’s demonstrate how we can verify the singing contest’s votes.
Every time someone submits a vote for their favorite singer in our television show, we store a document in the votes table of our contest ledger. The document contains a unique vote ID, the phone number of the voter, the show episode, the contestant being voted for, and other information about the voter and their device. The table has indexes on the voteId and phone fields for efficient retrieval. The document looks like the following:
Let’s say that audience members can see a history of their votes for an entire season of the show, and the person that submitted the preceding vote thinks they voted for another contestant. They complain that their vote was changed and they take to social media to tell the world that the contest is rigged. As part of our investigation into the matter, we perform a cryptographic verification of the vote document to prove that it remains exactly as it was first written.
Verify the document revision hash
To perform the verification, we need several things. First, we need a copy of our vote document revision and its metadata from the ledger. The metadata contains important information like the address of the block where the document resides in the ledger as well as the cryptographic hash for the document revision that Amazon QLDB calculated and stored when the revision was written. To fetch our document revision and its metadata, we query the votes table’s committed view with the following query:
The document we get back looks like the following:
Because we queried the committed view, we get back a document where our vote data is now nested under the data field. This is why we selected our document by data.voteId instead of by voteId. The document also contains top-level blockAddress, hash, and metadata fields.
This document is our starting point in the journal for our verification. We start by verifying the document’s hash. To do that, we calculate individual hashes of the data and metadata sections. Then we combine them in a certain way and calculate a hash of the result. That hash should equal the hash already stored with the document.
The following snippet of Java code shows the overall process. Note that our document, which contains its metadata and other fields, is passed in as an IonStruct object.
The interesting bits happen in hashIonValue() and combineHashes(), which aren’t shown. Let’s dive into these.
When calculating a cryptographic hash of data, it’s important to do it the same way every time or you may produce different hashes from the same data. This is especially true when you’re dealing with a JSON-like data format like Amazon Ion where field ordering doesn’t matter and may change. Consider the following two example documents:
These two documents are equivalent from a data point of view. They contain the same fields with the same values. However, they’re not the same ordering of bytes, so their hashes calculate differently. To ensure that we’re calculating hashes for our Amazon Ion-formatted Amazon QLDB documents consistently, we use the Ion Hash library, which does the hard work for us. Our hashIonValue() method demonstrates how to use the library to calculate a hash:
Verifying a document revision hash requires concatenating the hashes of the data and metadata components of the document and then hashing the result. The concatenation must always be performed consistently to produce correct results. This is especially important later on as we verify our way up the Merkle tree, because the proof hashes are provided to us by Amazon QLDB as a list, without any reference to the tree itself. As we saw earlier, some proof hashes may be right-side siblings and some may be left-side siblings, but we won’t know this as we work our way up the tree. Therefore, we need another means of ensuring consistency as we combine hashes. We do this by sorting the hashes using a left-to-right bitwise comparison, always appending the greater hash to the smaller hash. This approach to combining hashes is unique to Amazon QLDB. Other Merkle proof systems require distinguishing left hashes from right hashes. Our combineHashes() code looks like the following code:
Get a digest
Now that we’ve verified the document revision’s own hash, we can verify the revision in the context of the ledger itself. To do that, we need a ledger digest and the proof hashes, the intermediate hashes that get us from our document hash up the Merkle tree to the digest. Amazon QLDB provides the GetDigest API action for getting the digest. The following Java code fetches the latest digest from the ledger:
The ledger digest is an Amazon Ion document that contains the digest hash as well as a pointer to the location in the journal’s transaction history where it was created. This pointer is called the tip address. The following is an example of a ledger digest:
Get the proof hashes
Now we can get the proof hashes. The proof hashes are the intermediate hashes in the Merkle tree between our document revision hash and the digest, so Amazon QLDB needs the block address of the revision and the tip address of the digest so it knows what proof hashes are needed. The following code calls the Amazon QLDB GetRevision API action to retrieve the proof hashes for our document revision and the digest we just fetched:
Recalculate the digest
We now have all the pieces we need to conduct a complete cryptographic verification of a document revision. The last step is to calculate our way up the Merkle tree to the top and compare our calculated hash to the ledger digest that Amazon QLDB gave us. The code to do that looks like the following:
We start with the verified document hash, combine it with the first proof hash in the list, and hash the result. We take that hash and combine it with the next proof hash in the list, hashing that result. We continue that process until we run out of proof hashes. The final hash should equal the ledger digest hash. If it does, we successfully verified the integrity of our document and of the ledger up to the point the digest was created. The following code snippet puts it all together:
We’ve proven the integrity of the audience member’s vote and can respond to their complaint on social media. Better yet, we can allow users of the mobile phone application to request a verification of a vote, displaying a visual indicator of the verification status next to their vote in the application.
Now that we understand the rudiments of cryptographic verification, we can expand our capabilities for building trust in our data. Besides verifying individual document revisions, we can also perform cryptographic verifications on whole journal blocks. Verifying a block uses the same techniques that we used to verify a document revision.
To illustrate, let’s imagine that our television production company uses an outside auditor to validate the results of our televised singing contest. Giving the auditors direct access to our votes database isn’t ideal. Instead, we can export our journal blocks into Amazon Simple Storage Service (Amazon S3) using the Amazon QLDB export feature, package them up, and deliver them to our auditors as a dataset. The blocks give the auditors not just the vote data, but also the transactional context and metadata, and the complete history of every insert, update, and delete performed in the database.
The auditors will naturally be concerned about the integrity of the data in the export, because the data has left the protective confines of the database. Because they have the complete journal blocks in the export, they can perform a cryptographic verification of the blocks in the export files. The export can be proven against the ledger itself using a ledger digest and proof hashes provided by Amazon QLDB.
An Amazon QLDB export creates a completed manifest file that contains an ordered list of paths to the exported data files, each data file containing one or more journal blocks. Verifying the export involves reading and verifying the blocks in order, then verifying the last block against a ledger digest. The blocks are hash-chained together, so the hash of each block is used to calculate the hash of the next block.
Verifying a block is a little different than verifying a document revision. The following code is an example of a block from our contest ledger:
The following diagram illustrates these parts of the block. At the top of the block is the address that specifies the location of the block in the journal. The address is followed by the transaction ID and timestamp (each block represents a transaction), the calculated hash of the block, and the hash of the previous block. Then we have the entriesHash and entriesHashList fields, which we explain later in this post. Next is the transactionInfo section that contains the PartiQL query statements that were run to create the transaction for this block and a mapping of the document revisions in the block to the database tables they reside in. The last section is revisions, which contains all the document revisions in this block’s transaction. The preceding block contains two document revisions.
To verify the block, start with the revisions section. Recalculate the hash of each revision and compare it to the hash stored with that revision, just as we did when we verified a revision earlier in this post. Next, iterate through the revision hashes, concatenating the first revision hash to the next revision hash and hashing the result, then repeating until the end of the list, just as we did with our proof hashes. This produces a final hash for the revisions section. That hash must appear in the entriesHash list in the top section of the block. Its location in the list doesn’t matter.
Next, calculate a hash for the transactionInfo section. We can use the hashIonValue() code that we described earlier in the article. For example:
The resulting hash must appear in the entriesHash list in the top section of the block. Its location in the list doesn’t matter. The list contains more than the two hashes we recalculated for the transactionInfo and revisions sections of our block. These other hashes represent the state of internal ledger components at the time the transaction was committed and don’t need to be recalculated for verification.
Next, calculate a hash for the entriesHash list element by processing the hashes in the list just as we did for the proof hashes. The resulting hash must equal the entriesHash field at the top of the block.
As we process the blocks from our export in sequence, the value of the previousBlockHash field at the top of the block must equal the blockHash field of the previous block in the export.
Next, combine the previousBlockHash with the entriesHash and hash the result. This should equal the blockHash field, which should also equal the previousBlockHash field from the next block in the export.
When we reached the last block in the export, we can verify it against a ledger digest. This proves that the exported blocks match the blocks in the database, giving the auditors trust in our export so they can conduct their audit.
The process for verifying the block against the digest is nearly identical to the process we followed earlier to verify a document revision against the digest. Fetch the digest using the getDigest() method we used earlier. To fetch the proof hashes for our block, use the GetBlock API action instead of GetRevision. Combine the block hash with the first proof hash and hash the result. Then proceed through the proof hashes just as we did previously until we have a final hash that should equal the digest.
Generating ledger digests on a regular basis and storing them away from the ledger is a good operational practice. First, it increases performance by reducing the number of getDigest() calls, because digests can be reused. Second, it provides additional insurance against tampering because the ledger digest, the final key to verification, is held outside of the service. An S3 bucket is a great place to keep regular ledger digests.
Cryptographic verification is a powerful mechanism for proving the integrity of data in an Amazon QLDB ledger. Verifiability instills trust in the data, and the use of a versioned database with an immutable and verifiable history of change demonstrates a commitment to transparency.
In this post, we demonstrated several ways to use cryptographic verification that are applicable to many different use cases. Verification can be performed in each of the programming languages supported by Amazon QLDB. The Amazon QLDB Developer Guide provides links to sample applications in each of these languages, which have examples of performing cryptographic verifications using that language.
If you have questions or suggestions, please leave a comment.
About the Author
Dan Blaner is a Senior Solutions Architect specializing in Amazon QLDB. He spends his days helping customers understand what ledgers are all about and helping them design and build system-of-record applications on Amazon QLDB.
Read MoreAWS Database Blog