Hardware-accelerated CRC-32 for Node.js
Posted by Anand Suresh on 02 Dec 2013
CRC and JavaScript
A Cyclic Redundancy Check, or CRC, is a common error-detection scheme that catches accidental changes to data blocks. It is simple to implement, even in hardware, provides fast results, and can be computed progressively, making it well suited to streaming data such as network packets or disk blocks. However, it relies heavily on bit manipulation, a class of operations that has been sluggish in JavaScript.
The reason for this slowness lies in the way JavaScript represents numbers: as 64-bit floating-point values. A bitwise operation therefore requires casting the 64-bit number to a 32-bit integer, applying the operation, and then casting the result back into a 64-bit floating-point value. All of this happens under the hood, and while it usually doesn't affect performance much, the cost becomes prominent when bit-manipulation operations fall along hot code paths, such as a CRC calculation.
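To make that hot path concrete, here is a minimal sketch of a direct (bit-by-bit) CRC-32C in plain JavaScript. The function name `crc32cDirect` is hypothetical, and the sketch assumes the common CRC-32C convention of an initial value and final XOR of 0xFFFFFFFF, which this post does not specify. Every `^`, `&` and `>>>` in the inner loop goes through the double-to-integer round trip described above, eight times per input byte.

```javascript
// Reflected form of the CRC-32C (Castagnoli) polynomial.
const CRC32C_POLY = 0x82f63b78;

// Hypothetical helper: bit-by-bit CRC-32C over a Buffer or Uint8Array.
function crc32cDirect(buf) {
  let crc = 0xffffffff; // assumed conventional initial value
  for (let i = 0; i < buf.length; i++) {
    crc ^= buf[i]; // the operand is coerced to a 32-bit integer here
    for (let bit = 0; bit < 8; bit++) {
      // -(crc & 1) is 0 or -1 (all ones), so the polynomial is only
      // XORed in when the lowest bit of the register is set.
      crc = (crc >>> 1) ^ (CRC32C_POLY & -(crc & 1));
    }
  }
  // Final XOR, then >>> 0 to return an unsigned 32-bit value.
  return (crc ^ 0xffffffff) >>> 0;
}

// "123456789" is the standard CRC check string.
console.log(crc32cDirect(Buffer.from("123456789")).toString(16)); // prints "e3069283"
```

The trailing `>>> 0` matters: without it, results with the top bit set would come back as negative numbers.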
Intel's hardware-accelerated CRC
Since CRC calculation is a common operation used in many different scenarios, Intel introduced a hardware implementation of CRC-32 (specifically CRC-32C) with the Nehalem-based Core i7 product line. Part of the Streaming SIMD Extensions 4.2 (SSE4.2) instruction set, the new instruction accepts an initial CRC value and a memory address pointing to the data whose CRC is to be calculated, and processes up to 8 bytes of data in a single instruction (on 64-bit machines).
Rather than requiring developers to write assembly code to use this instruction, Intel provides the Intrinsics library, which exposes these instructions as handy C functions. Furthermore, the library provides graceful degradation to a software implementation of the instruction for platforms that do not have hardware support. This makes the intrinsics library quite valuable when attempting to expose CRC calculation as a high-level function.
JavaScript vs. C/C++
While the underlying hardware may support a super-fast implementation of CRC, there is still no viable way to exploit that feature in JavaScript. There is no explicit operator or function in the JavaScript library to calculate CRCs, nor does the V8 engine have the intelligence to detect when JavaScript code is attempting to calculate a CRC and generate the appropriate instruction sequence to utilize the hardware support. This brings us to an interesting question: when is it appropriate to leave JavaScript space and start writing code in C? Or in other words, when should we sacrifice maintainability by introducing an additional language/platform/technology into the stack/code-base?
To explore the answer to this question, we decided to investigate just how much of an improvement in performance we can gain by writing a Node Addon to calculate CRCs, instead of using a pure JS CRC-32 library, which would get the job done at the expense of some performance degradation.
The SSE4_CRC32 Node module
The SSE4_CRC32 node module uses 3 languages:
- C - to use the Intel Intrinsics library to exploit the CRC32 instruction
- C++ - to provide V8 bindings for the Node module and interface it with the C function
- JavaScript - to expose the C functionality in an easy-to-use manner
The module supports progressive CRC calculation, which is useful for large requests such as streaming audio, as well as one-off calculation, for example for JSON objects being saved to Riak.
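As an illustration of what progressive calculation means, the sketch below uses a pure-JavaScript stand-in rather than the module itself. The function `crc32c(buf, previous)` and its signature are hypothetical, not the SSE4_CRC32 API; it assumes the usual CRC-32C convention in which the running value is inverted on the way in and out, so the CRC of one chunk can seed the next.

```javascript
// Hypothetical stand-in: CRC-32C of `buf`, continuing from the CRC of any
// data that came before it (`previous` defaults to 0 for a fresh start).
function crc32c(buf, previous = 0) {
  let crc = ~previous; // undo the final inversion applied to the last chunk
  for (let i = 0; i < buf.length; i++) {
    crc ^= buf[i];
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc >>> 1) ^ (0x82f63b78 & -(crc & 1));
    }
  }
  return ~crc >>> 0; // invert again and return as unsigned 32-bit
}

// One-off use: the whole payload is available at once.
const whole = Buffer.from("123456789");
const oneShot = crc32c(whole);

// Progressive use: the same payload arrives in two chunks.
let progressive = crc32c(Buffer.from("1234"));
progressive = crc32c(Buffer.from("56789"), progressive);

console.log(oneShot === progressive); // prints true
```

Because only a single 32-bit value is carried between calls, a stream of any length can be checksummed without buffering it.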
Test results
We tested the new library under 2 specific use cases:
- A single fixed-length buffer of 1 KB
- A large number of random buffers of varying length, up to 4 KB
The tests were run on a MacBook Air with an Intel Core i7 processor and 8GB of RAM. They used buffers instead of strings to avoid placing items on the V8 heap that might cause the garbage collector to fire frequently and interfere with the test run-times.
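For readers who want to try something similar, here is a rough sketch of how the two cases could be set up. It is not the benchmark code from the module's repository: the helper names (`crc32cTable`, `benchmark`), the pool size of 1000 random buffers, and the reduced iteration count are all assumptions, and only a table-based pure-JS CRC-32C is timed.

```javascript
// 256-entry lookup table for CRC-32C (reflected polynomial 0x82F63B78).
const TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let bit = 0; bit < 8; bit++) {
    c = (c >>> 1) ^ (0x82f63b78 & -(c & 1));
  }
  TABLE[n] = c >>> 0;
}

// Hypothetical table-based CRC-32C: one table lookup per input byte.
function crc32cTable(buf) {
  let crc = 0xffffffff;
  for (let i = 0; i < buf.length; i++) {
    crc = TABLE[(crc ^ buf[i]) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Time `iterations` calls of `fn`, cycling through the given buffers.
function benchmark(label, fn, buffers, iterations) {
  let sink = 0; // accumulate results so the calls cannot be optimized away
  const start = process.hrtime();
  for (let i = 0; i < iterations; i++) {
    sink ^= fn(buffers[i % buffers.length]);
  }
  const [sec, nsec] = process.hrtime(start);
  const ms = sec * 1e3 + nsec / 1e6;
  console.log(`${label}: ${ms.toFixed(0)}ms.`);
  return { ms, sink };
}

const ITERATIONS = 10000; // assumption: smaller than the 100000 calls used in this post

// Case 1: a single fixed-length 1024-byte buffer.
const fixedBuffers = [Buffer.alloc(1024, 0xab)];

// Case 2: random-length buffers of up to 4096 bytes with random contents.
const randomBuffers = [];
for (let i = 0; i < 1000; i++) {
  const buf = Buffer.alloc(1 + Math.floor(Math.random() * 4096));
  for (let j = 0; j < buf.length; j++) {
    buf[j] = Math.floor(Math.random() * 256);
  }
  randomBuffers.push(buf);
}

const fixedRun = benchmark("Table-based, 1024 byte buffer", crc32cTable, fixedBuffers, ITERATIONS);
const randomRun = benchmark("Table-based, random length buffers", crc32cTable, randomBuffers, ITERATIONS);
```

All buffers are allocated before the timer starts, so allocation and random-number generation stay out of the measured loop.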
Below are the results from the 2 test cases:
>node benchmark/1.single_1kb_length_buffer.benchmark.js
100000 calls to calculate CRC on a 1024 byte buffer...
SSE4.2 based CRC32: 26ms.
Pure JS based CRC32 (table-based): 699ms.
Pure JS based CRC32 (direct): 3704ms.
>node benchmark/2.multi_random_length_buffer.benchmark.js
100000 calls to calculate CRC on random length buffers upto 4096 bytes long...
Avg. buffer length: 2042 bytes
SSE4.2 based CRC32: 62ms.
Pure JS based CRC32 (table-based): 1968ms.
Pure JS based CRC32 (direct): 8220ms.
The test results are quite astounding, even more so for the second case, where the C library with hardware acceleration was 31.74 times faster than the table-based pure JS implementation (62ms vs. 1968ms)!
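The speedup figures follow directly from the timings printed above. The short sketch below redoes that arithmetic for both test cases and both pure-JS variants; the object layout and the `speedup` helper are just illustrative names, and the numbers are copied from the benchmark output.

```javascript
// Timings in milliseconds, copied from the benchmark output above.
const results = {
  "single 1024 byte buffer": { sse42: 26, table: 699, direct: 3704 },
  "random length buffers": { sse42: 62, table: 1968, direct: 8220 },
};

// How many times faster the accelerated run is than the baseline.
function speedup(baselineMs, acceleratedMs) {
  return baselineMs / acceleratedMs;
}

for (const [name, t] of Object.entries(results)) {
  console.log(
    name + ": " +
    speedup(t.table, t.sse42).toFixed(2) + "x vs. table-based, " +
    speedup(t.direct, t.sse42).toFixed(2) + "x vs. direct"
  );
}
// prints:
// single 1024 byte buffer: 26.88x vs. table-based, 142.46x vs. direct
// random length buffers: 31.74x vs. table-based, 132.58x vs. direct
```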
Conclusion
In this case, it was a clear win to move away from JavaScript to C/C++ in order to boost performance. The cost of calculating CRCs in production code is now much lower than with pure JavaScript, making CRC checks more desirable to use at every level of the production stack.