Intel's hardware-accelerated CRC
Since CRC calculation is a common operation used in many different scenarios, Intel introduced a hardware implementation of CRC-32 (specifically CRC-32C) as part of the Nehalem-based Core i7 product line. Part of the Streaming SIMD Extensions 4.2 (SSE4.2) instruction set, this new instruction accepts an initial CRC value and a memory address pointing to the data whose CRC is to be calculated, allowing up to 8 bytes of data to be processed in a single instruction (on 64-bit machines).
Rather than having to write assembly code to use this instruction, Intel provides the Intrinsics library, which exposes these instructions as handy C functions. Furthermore, the library provides graceful degradation to a software implementation of the instruction for platforms that do not have hardware support. This makes the Intrinsics library quite valuable when attempting to expose CRC calculation as a high-level function.
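As a rough illustration of what calling the instruction through intrinsics can look like, here is a minimal sketch in C. The function names `crc32c_update` and `crc32c` are hypothetical and are not taken from the module's source. The software branch is a hand-written bitwise stand-in for builds without SSE4.2 enabled, not Intel's code, and the initial value and final inversion (`0xFFFFFFFF`) follow the usual CRC-32C convention, which is an assumption about how a caller would want the result.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE4_2__) && defined(__x86_64__)
#include <nmmintrin.h>  /* SSE4.2 intrinsics: _mm_crc32_u8, _mm_crc32_u64 */
#endif

/* Hypothetical helper: feeds `len` bytes into a running CRC-32C state.
 * No initial or final inversion is applied here. */
uint32_t crc32c_update(uint32_t crc, const void *data, size_t len) {
    const uint8_t *p = data;
#if defined(__SSE4_2__) && defined(__x86_64__)
    /* Hardware path: consume 8 bytes per instruction while possible. */
    uint64_t state = crc;
    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, p, sizeof chunk);  /* safe unaligned load */
        state = _mm_crc32_u64(state, chunk);
        p += 8;
        len -= 8;
    }
    crc = (uint32_t)state;
    /* Remaining tail, one byte at a time. */
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
#else
    /* Software stand-in: bitwise CRC-32C, reflected polynomial 0x82F63B78. */
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
#endif
    return crc;
}

/* One-shot CRC-32C using the conventional 0xFFFFFFFF init and final XOR. */
uint32_t crc32c(const void *data, size_t len) {
    return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

With GCC or Clang, the hardware branch is only compiled when SSE4.2 is enabled (for example with `-msse4.2`); otherwise the sketch silently uses the bitwise loop. Both branches should agree on the standard CRC-32C check value `0xE3069283` for the ASCII string `123456789`.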
To find out whether this is worth the effort, we decided to investigate just how much of a performance improvement we can gain by writing a Node addon to calculate CRCs instead of using a pure JS CRC-32 library, which gets the job done at the expense of some performance.
The SSE4_CRC32 Node module
The SSE4_CRC32 Node module uses the following languages:
- C - to use the Intel Intrinsics library to exploit the CRC32 instruction
- C++ - to provide V8 bindings for the Node module and interface it with the C function
CRC calculation can be done progressively, which is useful for large requests such as streaming audio, or in a single call, which suits one-off uses such as JSON objects being saved to Riak.
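To make the progressive mode concrete, the sketch below feeds a message into a running CRC-32C state chunk by chunk and arrives at the same value as a single pass over the whole message. The names `crc32c_begin`, `crc32c_feed`, and `crc32c_finish` are hypothetical and do not reflect the module's actual API. A portable bitwise update is used here purely so the example is self-contained; a hardware-backed update would slot into `crc32c_feed` the same way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical API: start a new running CRC-32C state. */
uint32_t crc32c_begin(void) {
    return 0xFFFFFFFFu;  /* conventional initial value (assumption) */
}

/* Feed one chunk into the running state; can be called repeatedly
 * as more data arrives (e.g. pieces of a streamed request). */
uint32_t crc32c_feed(uint32_t state, const void *data, size_t len) {
    const uint8_t *p = data;
    while (len--) {
        state ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
    }
    return state;
}

/* Turn the running state into the final CRC value. */
uint32_t crc32c_finish(uint32_t state) {
    return state ^ 0xFFFFFFFFu;  /* conventional final XOR (assumption) */
}
```

Because the state carries everything needed between calls, feeding `"1234"` and then `"56789"` should give the same result as feeding `"123456789"` at once, so a caller never has to hold the entire payload in memory.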
We tested the new library under 2 specific use cases:
- A single fixed-length buffer of 1 KB
- A large number of random buffers of varying length, up to 4 KB
The tests were run on a MacBook Air with an Intel Core i7 processor and 8 GB of RAM. We used buffers instead of strings to avoid placing items on the V8 heap, which might cause the garbage collector to fire frequently and interfere with the test run times.
Below are the results from the 2 test cases:
    >node benchmark/1.single_1kb_length_buffer.benchmark.js
    100000 calls to calculate CRC on a 1024 byte buffer...
    SSE4.2 based CRC32: 26ms.
    Pure JS based CRC32 (table-based): 699ms.
    Pure JS based CRC32 (direct): 3704ms.

    >node benchmark/2.multi_random_length_buffer.benchmark.js
    100000 calls to calculate CRC on random length buffers upto 4096 bytes long...
    Avg. buffer length: 2042 bytes
    SSE4.2 based CRC32: 62ms.
    Pure JS based CRC32 (table-based): 1968ms.
    Pure JS based CRC32 (direct): 8220ms.
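The test results are quite astounding, even more so for the second case, where the C library with hardware acceleration was 31.74 times faster than the table-based pure JS implementation (62 ms versus 1968 ms) and roughly 133 times faster than the direct one (62 ms versus 8220 ms)!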
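For readers unfamiliar with the "table-based" and "direct" labels in the benchmark output, the sketch below shows the general idea behind the two software approaches, written in C for consistency with the addon code rather than in JS. It is not the code of the JS library that was benchmarked, and the function names are hypothetical. The CRC-32C polynomial is used here as an assumption; a pure JS library may well use a different CRC-32 polynomial.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u  /* reflected CRC-32C polynomial */

/* "Direct": eight shift-and-XOR steps for every input byte. */
uint32_t crc32c_direct(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *data++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

static uint32_t crc32c_lut[256];
static int crc32c_lut_ready = 0;

/* Precompute the effect of the eight bit steps for each byte value. */
static void crc32c_build_lut(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t crc = i;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1u)));
        crc32c_lut[i] = crc;
    }
    crc32c_lut_ready = 1;
}

/* "Table-based": one table lookup, one shift and one XOR per input byte. */
uint32_t crc32c_table(const uint8_t *data, size_t len) {
    if (!crc32c_lut_ready)
        crc32c_build_lut();
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = (crc >> 8) ^ crc32c_lut[(crc ^ *data++) & 0xFFu];
    return crc ^ 0xFFFFFFFFu;
}
```

Both functions compute the same checksum; the table variant trades 1 KB of precomputed data for far fewer operations per byte, which is consistent with the gap between the two pure JS timings above. The hardware instruction goes a step further by handling up to 8 bytes per instruction.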