TurboSFV - BLAKE3 with SSE support

TurboSFV
2025-05-31 12:38:46
TurboSFV v10.50 - BLAKE3 with SSE support
Notes on TurboSFV v10.50: This new version comes with a special SSE (Streaming SIMD Extensions) based implementation of the BLAKE3 algorithm. The SSE instructions used process input data in parallel, which makes the calculation of BLAKE3 checksums much faster.

The BLAKE3 algorithm basically works as follows: The input data is divided into chunks of 1024 bytes each. These chunks are further split into blocks of 64 bytes. Each block is then processed in 7 rounds of basic arithmetic operations. During these rounds, sixteen 32-bit state variables hold the intermediate results. After the last round, they form, together with the next block, the input for processing that next block. The output of the last block of a chunk represents the output of the chunk.

This chunk output is then added as a node to a binary tree, in which each parent node has two child nodes. As chunks are added to the tree as child nodes, two of them can be combined into a parent node by following a specific algorithm. Then the next chunks are processed, until all input bytes are consumed. The result is a root node, which delivers the final hash value.

The SSE implementation in TurboSFV uses XMM registers, each 128 bits wide. SSE lets us divide an XMM register into four parts of 32 bits each, which are called lanes. With the help of special SSE instructions, we can apply the same arithmetic to all four lanes at the same time. As BLAKE3 internally works with 32-bit numbers, we can process four chunks in parallel: one XMM register holds the same state variable for four chunks, and a single instruction updates all four. Furthermore, in 64-bit mode sixteen XMM registers are available, which allows us to keep the sixteen 32-bit state variables of every lane in registers. This means we can process four chunks, separated into lanes, on one single CPU, and at the end we add the results to the binary tree as a single assembled node.
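To make the lane idea more concrete, here is a small Python sketch. It is not TurboSFV's code (the actual implementation uses SSE instructions on XMM registers); it only models one XMM register as a list of four 32-bit integers. The function `g_scalar` follows the add/xor/rotate mixing step of BLAKE3 as published in its specification, while `add_lanes`, `xor_lanes`, `rotr_lanes` and `g_lanes` are illustrative names chosen for this sketch. The point is that the lane-wise version computes the same result as running the scalar step once per chunk.

```python
MASK32 = 0xFFFFFFFF  # BLAKE3 works on 32-bit words


def rotr32(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def g_scalar(a, b, c, d, mx, my):
    """One mixing step on four 32-bit state words of a single chunk."""
    a = (a + b + mx) & MASK32
    d = rotr32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 12)
    a = (a + b + my) & MASK32
    d = rotr32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 7)
    return a, b, c, d


# Each helper below stands in for one SIMD operation that touches all four
# lanes at once (comparable to a packed 32-bit add or a 128-bit xor).
def add_lanes(x, y):
    return [(p + q) & MASK32 for p, q in zip(x, y)]


def xor_lanes(x, y):
    return [p ^ q for p, q in zip(x, y)]


def rotr_lanes(x, n):
    return [rotr32(p, n) for p in x]


def g_lanes(a, b, c, d, mx, my):
    """Same mixing step, but every argument is a 4-lane 'register'.

    Lane i of each register belongs to chunk i, so four chunks advance
    together with the same sequence of operations.
    """
    a = add_lanes(add_lanes(a, b), mx)
    d = rotr_lanes(xor_lanes(d, a), 16)
    c = add_lanes(c, d)
    b = rotr_lanes(xor_lanes(b, c), 12)
    a = add_lanes(add_lanes(a, b), my)
    d = rotr_lanes(xor_lanes(d, a), 8)
    c = add_lanes(c, d)
    b = rotr_lanes(xor_lanes(b, c), 7)
    return a, b, c, d
```

In this model, sixteen such four-lane lists would correspond to the sixteen state variables held in sixteen XMM registers.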
Processing four chunks per pass in this way more or less doubles the calculation speed. By the way, this is the reason why the SSE-based implementation is only available in TurboSFV x64 and not in TurboSFV x86: In 32-bit mode, only eight XMM registers are available. In addition, nothing keeps us from doing the same on multiple CPUs: because of the binary tree, we use 1, 2, 4, 8 CPUs and so on. This accelerates the process further, provided the storage medium can deliver the input data fast enough. The CPU must support the instructions used (up to instruction set SSE4.1); otherwise TurboSFV falls back to the legacy instructions. Comments regarding this new version can be added here.
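The chunk layout and the power-of-two CPU counts can be illustrated with a short Python sketch. The chunk and block sizes come from the description above. The tree rule used in `tree_shape` (the left subtree takes the largest power-of-two number of chunks that still leaves at least one chunk for the right side) follows the published BLAKE3 specification. `worker_count` is a purely hypothetical helper: it only shows one way of rounding a CPU count down to 1, 2, 4, 8 and so on, and is not taken from TurboSFV.

```python
CHUNK_LEN = 1024  # bytes per chunk
BLOCK_LEN = 64    # bytes per block, so a full chunk has 16 blocks


def chunk_count(input_len):
    """Number of chunks for an input; an empty input still forms one chunk."""
    return max(1, -(-input_len // CHUNK_LEN))  # ceiling division


def left_subtree_chunks(count):
    """Largest power of two strictly smaller than count (for count >= 2)."""
    return 1 << ((count - 1).bit_length() - 1)


def tree_shape(first, count):
    """Return the tree over `count` chunks as nested tuples of chunk indices.

    A single chunk is a leaf (its index); otherwise the node is a pair
    (left subtree, right subtree).
    """
    if count == 1:
        return first
    left = left_subtree_chunks(count)
    return (tree_shape(first, left), tree_shape(first + left, count - left))


def worker_count(cpus):
    """Hypothetical helper: largest power of two not exceeding `cpus`."""
    return 1 << (cpus.bit_length() - 1)
```

For example, `tree_shape(0, 5)` yields `(((0, 1), (2, 3)), 4)`: the first four chunks form a complete subtree, which is why power-of-two groups of chunks can be hashed independently and joined afterwards.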
TurboSFV Cologne, Germany