Why does DynamoDB implement CRC32 if it uses GZIP algorithm on layer 7?
One of the few hobbies on my very short list is digging into SDK source code to broaden my horizons of how certain things work under the hood. It isn't always straightforward to understand them at a glance, but once you get accustomed to reading git history logs and watching issues/PRs, it becomes a bit more trivial. Recently, I took a closer look at the aws-sdk-java repository with the sole purpose of understanding why I was getting an exception similar to the one below when using DynamoDB:
Caused by: com.amazonaws.internal.CRC32MismatchException: Client calculated crc32 checksum did not match that calculated by server-side
at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:112)
at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:42)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:1072)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:746)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
The first thing anyone would do in this situation is google it and look for possible solutions (Stack Overflow rocks!). To my surprise, the majority of the results pointed to an issue with enabling GZIP compression at the SDK level, which can be extremely beneficial for many services (S3, CloudFront, among others) but apparently, for some reason I couldn't understand, didn't gel with DynamoDB. At that point, I was convinced the issue was fairly easy to understand. CRC32 checks aren't usually applied to raw payloads (that's what other checksum mechanisms are for), and GZIP compression, when used with HTTP, applies to the response body. Put those two together and you realise the response needs to be inflated (decompressed) before the CRC32 checksum can be verified. If, for some reason, the piece of code in charge of validating this checksum expected a String instead of an InputStream that needs to be inflated first, then it would certainly fail the validation (this happened on the Android SDK not long ago). It turns out the problem was a bit different, as you can see on this thread and the file that was committed to fix it.
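To make that ordering concrete, here is a minimal sketch (not the SDK's actual code) of such a client-side check. It assumes the server's checksum, sent in a header like x-amz-crc32, was computed over the uncompressed payload; under that assumption, checksumming the gzipped bytes directly would fail, while inflating first matches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCrcCheck {
    // Inflate the gzipped response body first, then checksum the result.
    static long crc32OfGzipped(byte[] gzippedBody) throws IOException {
        CRC32 crc = new CRC32();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzippedBody))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        return crc.getValue();
    }

    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "{\"Item\":{\"Artist\":{\"S\":\"No One You Know\"}}}"
                .getBytes(StandardCharsets.UTF_8);
        CRC32 serverSide = new CRC32();
        serverSide.update(payload);          // hypothetical x-amz-crc32 value

        byte[] body = gzip(payload);         // what travels over the wire
        // Checksumming `body` directly would NOT match; inflating first does.
        System.out.println(serverSide.getValue() == crc32OfGzipped(body)); // prints true
    }
}
```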
Hold on… if GZIP is a lossless compression algorithm that won't inflate correctly if any data gets corrupted, then why would I ever need CRC32 checksums for the same purpose?
Well, I believe this is where I need to live up to the blog's title ("Trying to understand what nobody is willing to explain") and, consequently, go a bit deeper to make sure we don't end up with the right answers to the wrong questions. Hang in there; this is going to be interesting. But first, it's essential that we are on the same page, so I will go through some of Wikipedia's definitions along with gotchas I have collected throughout my career.
Technology: CRC
Description: A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data. Blocks of data entering these systems get a short check value attached, based on the remainder of a polynomial division of their contents. On retrieval, the calculation is repeated and, in the event the check values do not match, corrective action can be taken against data corruption.
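In Java, that "short check value" idea can be shown in a few lines with java.util.zip.CRC32 (the payload below is just an illustrative example): attach a check value to a block of data, flip a single bit, and the recomputed value no longer matches.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Compute a CRC32 over a block of data, corrupt a single byte,
// and watch the check fail.
public class CrcDemo {
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "{\"TableName\":\"Music\"}".getBytes(StandardCharsets.UTF_8);
        long checkValue = crc32(original);           // attached by the sender

        byte[] corrupted = original.clone();
        corrupted[5] ^= 0x01;                        // flip one bit "in transit"

        System.out.println(checkValue == crc32(original));   // prints true
        System.out.println(checkValue == crc32(corrupted));  // prints false
    }
}
```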
Gotchas: Exactly! So let's go even deeper and understand how one of the use cases mentioned above applies this error-detection algorithm.
Technology: Network — TCP
Description: The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation in which it complemented the Internet Protocol (IP). Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered, and error-checked delivery of a stream of octets between applications running on hosts communicating over an IP network.
Gotchas: Well, I wish I could take that sentence literally, as that would undoubtedly make my life much easier, but unfortunately, this isn't the case. Let's take a look at how checksums are used in the TCP stack and determine why I can't rely on that affirmation blindly. (Spoiler: TCP itself doesn't actually use a CRC; it uses a simple 16-bit checksum, while the CRC lives in the Ethernet frame at layer 2.) As DynamoDB doesn't provide a dual-stack endpoint to support IPv6 connections at this time, I will focus on IPv4 only to avoid drawing attention to the wrong matter.
According to the TCP/IP Guide (slightly modified for better understanding): Once this 96-bit pseudo-header has been formed, it is placed in a buffer, following which the TCP segment itself is placed. Then, the checksum is computed over the entire set of data (the pseudo-header plus the full TCP segment, header and payload). The value of the checksum is placed into the Checksum field of the TCP header. You can see below in red the fields that are part of the pseudo-header and where the checksum field is located in the TCP packet. It is worth reiterating that this checksum, computed at layer 4, does cover the payload, but it is a simple 16-bit ones'-complement sum rather than a CRC; it is fairly obvious that this was chosen for performance reasons.
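For illustration, here is a sketch of that 16-bit ones'-complement sum, in the spirit of RFC 1071 (a real stack runs it over the pseudo-header, the TCP header with the checksum field zeroed, and the payload). The main method also demonstrates one of its weaknesses: reordering two aligned 16-bit words leaves the sum unchanged, so that corruption goes undetected.

```java
public class InternetChecksum {
    // 16-bit ones'-complement sum of the data (RFC 1071 style), the kind
    // of checksum carried in the TCP header's Checksum field.
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int word = (data[i] & 0xFF) << 8;
            if (i + 1 < data.length) {
                word |= (data[i + 1] & 0xFF);
            }
            sum += word;
        }
        while ((sum >> 16) != 0) {                 // fold carries back in
            sum = (sum & 0xFFFF) + (sum >> 16);
        }
        return (int) (~sum & 0xFFFF);              // ones' complement
    }

    public static void main(String[] args) {
        // Swapping two aligned 16-bit words does not change the sum, which
        // is one reason this check is weaker than a CRC and why other
        // layers add their own verification.
        byte[] a = {0x00, 0x01, 0x00, 0x02};
        byte[] b = {0x00, 0x02, 0x00, 0x01};       // words reordered
        System.out.println(checksum(a) == checksum(b));   // prints true: undetected
    }
}
```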
Okay but… if TCP will ask for another packet in case of data corruption and the TCP checksum at layer 4 of the OSI model is in charge of it, then why can't I simply rely on the protocol, since it's explicitly written on Wikipedia that TCP provides reliable, ordered, and error-checked delivery of packets?
I truly think this is where we draw an artificial line separating developers of new technologies from business-application developers. Most business-application developers would probably have changed this checksum algorithm as soon as they realised it wasn't as strong as first intended, which is considered good practice in many applications we use daily (take the IT security field whenever a flaw is discovered, for instance). However, when the TCP protocol was first designed, I'm confident its authors already knew that some corrupted packets could pass the checksum and be delivered to the next layer as if nothing were wrong. Then again, although the maths behind the polynomial formulas used to calculate the error margin is a bit complicated, you don't need to be a genius. Perhaps you might need some time to remember what you studied at high school and uni, but you will eventually figure it out.
Having said that, what might they have been thinking after all? If, like me, you enjoy spending more time reading the official documentation rather than making possibly false assumptions, you will be pleased to know that another point established in the OSI model is that each layer is responsible for implementing its own checksum mechanism, as they address different problems in a complementary way. This changed slightly in the IPv6 specification, which dropped the IP header checksum, but the layer-2 CRC and the transport-layer checksum still exist there too. There are very interesting references on this matter that go as deep as this conversation can get. I will list them below for further reference:
- The Limitations of the Ethernet CRC and TCP/IP checksums for error detection
- Can a TCP checksum produce a false positive? If yes, how is this dealt with?
Great, I believe now we are on the same page, Paulo, but why on earth does DynamoDB implement CRC32 on its responses? Please don't tell me that this is just for backward compatibility
No, not at all, although AWS is particularly famous for keeping things compatible with older clients to ensure the next release won't break your existing code. Occasionally things go out of control, but this is technology after all. However, before answering this question directly, it is worth mentioning that the RFC that established the HTTP/1.1 protocol used by DynamoDB communications nowadays defines a header called Content-MD5, which is (over-simplified explanation) responsible for checking whether there was any data corruption during the transmission of the payload. The key point here is that while both checks read the entire payload, Content-MD5 requires computing a cryptographic hash, whereas CRC32 is a far lighter computation per byte, so its verification is dramatically faster.
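A quick, unscientific sketch of that difference: both checks walk every byte of the payload, but MD5 does far more work per byte than CRC32, so the CRC typically finishes several times faster (exact numbers depend on the JVM and hardware, so none are promised in the comments).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

public class Md5VsCrc32 {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] payload = new byte[4 * 1024 * 1024]; // pretend 4 MiB response body

        long t0 = System.nanoTime();
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(payload);
        byte[] digest = md5.digest();               // 128-bit digest
        long md5Nanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        CRC32 crc = new CRC32();
        crc.update(payload);
        long check = crc.getValue();                // 32-bit check value
        long crcNanos = System.nanoTime() - t1;

        // Both walk every byte; CRC32 simply does much less work per byte
        // (and is often hardware-accelerated on modern CPUs).
        System.out.printf("MD5:   %d-byte digest, %d us%n", digest.length, md5Nanos / 1000);
        System.out.printf("CRC32: %08x, %d us%n", check, crcNanos / 1000);
    }
}
```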
After all these concepts and clarifications, I believe the answer boils down to these bullet points:
- One thing we all need to bear in mind is that DynamoDB is a service that, most of the time, provides single-digit-millisecond latency responses, so every microsecond counts.
- It's essential to understand that, as DynamoDB is a database service, data integrity is a priority-zero concern. You don't want to read corrupted data after the service has responded that everything went fine, right? So one more lightweight verification won't hurt.
- It's true that if you have GZIP compression enabled then you "no longer need it". But let's not forget a few things:
- Although GZIP strikes a good balance between compression rate and the computational overhead of compressing, it still adds a few microseconds to milliseconds, and not all workloads can afford that luxury.
- GZIP compression is a fairly recent feature (it's been a few years already) compared to how long many services have been available to customers (some since 2006).
- There are still existing and new workloads that will benefit from having CRC32 checks without GZIP, depending on how time-sensitive they are.
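As a rough illustration of that trade-off, the sketch below gzips a repetitive JSON-like payload (an invented example): the size win is real, but the deflate/inflate work is CPU time paid on every single call.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipTradeoff {
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A repetitive JSON-ish payload compresses well, but compressing it
        // is the "few more microseconds to milliseconds" the trade-off is about.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("{\"id\":").append(i).append(",\"status\":\"ACTIVE\"},");
        }
        byte[] plain = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(plain);
        System.out.printf("plain=%d bytes, gzipped=%d bytes%n", plain.length, packed.length);
        System.out.println(packed.length < plain.length);   // prints true for this payload
    }
}
```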
Conclusion
If I have learnt anything throughout all these years working in the IT field, it is that although we might deal with a so-called "exact science", we must absolutely know how to cope with the nuances of the technologies we choose to work with or, in my case, recommend to customers. This is unquestionably a great responsibility that comes with a lot of fun in the process of discovering them.
Let me know what you thought about this subject. Feedback is always welcome ;)