TurboSFV - Blog
TurboSFV
2025-05-31 12:38:46
TurboSFV v10.50 - BLAKE3 with SSE support
Notes on TurboSFV v10.50:
This new version comes with a special SSE (Streaming SIMD Extensions) based implementation of the BLAKE3 algorithm: the SSE instructions used allow the input data to be processed in parallel, resulting in a much higher calculation speed for BLAKE3 checksums.
The BLAKE3 algorithm basically works as follows: The input data is divided into chunks of 1024 bytes each. These chunks are further split into blocks of 64 bytes. Each block is then processed in 7 rounds by applying basic arithmetic operations (32-bit addition, XOR and bit rotation). During these rounds, sixteen 32-bit state variables hold the intermediate results. At the end of the seven rounds, the result derived from them serves as input for processing the next block, together with that block's data.
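The splitting described above can be sketched in a few lines of Python. This is a minimal illustration, not TurboSFV's code: the function names `split_chunks`, `split_blocks` and `block_words` are made up for this example, and the seven-round processing itself is left out. The one detail not stated above is that BLAKE3 reads each block as sixteen little-endian 32-bit words, zero-padding a short final block.

```python
import struct

CHUNK_LEN = 1024  # bytes per chunk
BLOCK_LEN = 64    # bytes per block

def split_chunks(data: bytes) -> list[bytes]:
    """Cut the input into chunks of up to 1024 bytes (the last may be shorter)."""
    if not data:
        return [b""]  # empty input is still treated as one (empty) chunk
    return [data[i:i + CHUNK_LEN] for i in range(0, len(data), CHUNK_LEN)]

def split_blocks(chunk: bytes) -> list[bytes]:
    """Cut one chunk into blocks of up to 64 bytes."""
    if not chunk:
        return [b""]
    return [chunk[i:i + BLOCK_LEN] for i in range(0, len(chunk), BLOCK_LEN)]

def block_words(block: bytes) -> tuple[int, ...]:
    """Zero-pad a block to 64 bytes and read it as sixteen 32-bit words."""
    return struct.unpack("<16I", block.ljust(BLOCK_LEN, b"\0"))

data = bytes(range(256)) * 10            # 2560 bytes of sample input
chunks = split_chunks(data)
print([len(c) for c in chunks])          # chunk sizes
print(len(split_blocks(chunks[0])))      # number of blocks in a full chunk
print(len(block_words(chunks[0][:64])))  # number of words in one block
```

A full chunk therefore always consists of 16 blocks, and each block supplies 16 words to the rounds.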
The output of the last block of a chunk represents the output of the chunk. This output is then added as a node to a binary tree: in a binary tree, one node, the parent node, has two child nodes. As chunks are added to the tree as child nodes, two of them can be combined into a parent node by following a specific algorithm. Then the next chunks are processed, until all input bytes are consumed. This ends in a root node, which delivers the final hash value.
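How chunk outputs could be folded into such a tree is sketched below in Python. It is a simplified form of the stack-based merging used by the public BLAKE3 reference implementation and not necessarily what TurboSFV does internally. The real parent computation is replaced by a hypothetical stand-in, `parent`, which only records which two children were combined, so the result shows the shape of the tree rather than a hash.

```python
def parent(left, right):
    # Stand-in for the real parent-node computation: just record the pairing.
    return (left, right)

def build_tree(chunk_outputs):
    """Fold chunk outputs, in order, into a single root node.

    Expects at least one chunk output.
    """
    stack = []  # completed subtrees waiting for a right-hand sibling
    for count, node in enumerate(chunk_outputs, start=1):
        # Each trailing zero bit in the running chunk count means that a
        # subtree of that size has just been completed and can be merged.
        total = count
        while total % 2 == 0:
            node = parent(stack.pop(), node)
            total //= 2
        stack.append(node)
    # Input exhausted: merge the leftover subtrees from right to left.
    root = stack.pop()
    while stack:
        root = parent(stack.pop(), root)
    return root

# Chunks are represented by their index here; show the tree for 5 chunks.
print(build_tree(range(5)))
```

With a chunk count that is a power of two the tree is perfectly balanced; otherwise the left side always holds the larger, complete subtree.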
The SSE implementation in TurboSFV uses XMM registers, each 128 bits wide. SSE allows us to divide an XMM register into four parts of 32 bits each, which are called lanes. With the help of special SSE instructions, we can apply the same arithmetic to all four lanes at the same time. As BLAKE3 internally works with 32-bit numbers, we can process four chunks in parallel: a single instruction on a single XMM register performs one step for all four chunks.
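To illustrate the lane idea, the following Python sketch models an XMM register as a list of four 32-bit integers and applies BLAKE3's mixing step to all four lanes at once. The helper names (`vadd`, `vxor`, `vrotr`, `g_lanes`) are invented for this example; in real SSE code each helper would correspond to one or a few instructions. It is meant to show the principle only and says nothing about how TurboSFV's code is actually written.

```python
MASK = 0xFFFFFFFF  # keep every value at 32 bits

def rotr(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK

def g_scalar(a, b, c, d, mx, my):
    """BLAKE3 mixing step on four state words of ONE chunk."""
    a = (a + b + mx) & MASK
    d = rotr(d ^ a, 16)
    c = (c + d) & MASK
    b = rotr(b ^ c, 12)
    a = (a + b + my) & MASK
    d = rotr(d ^ a, 8)
    c = (c + d) & MASK
    b = rotr(b ^ c, 7)
    return a, b, c, d

# Lane-wise helpers: each works on a "register" of four 32-bit lanes.
def vadd(x, y):
    return [(p + q) & MASK for p, q in zip(x, y)]

def vxor(x, y):
    return [p ^ q for p, q in zip(x, y)]

def vrotr(x, n):
    return [rotr(p, n) for p in x]

def g_lanes(a, b, c, d, mx, my):
    """Same mixing step, but every argument holds one word per chunk."""
    a = vadd(vadd(a, b), mx)
    d = vrotr(vxor(d, a), 16)
    c = vadd(c, d)
    b = vrotr(vxor(b, c), 12)
    a = vadd(vadd(a, b), my)
    d = vrotr(vxor(d, a), 8)
    c = vadd(c, d)
    b = vrotr(vxor(b, c), 7)
    return a, b, c, d

# Arbitrary sample values: lane k belongs to chunk k.
regs = [[(1000 * r + 7 * k) & MASK for k in range(4)] for r in range(6)]
wide = g_lanes(*regs)
for k in range(4):
    narrow = g_scalar(*[reg[k] for reg in regs])
    print(k, narrow == tuple(out[k] for out in wide))
```

Each lane of the wide result matches the scalar computation for the corresponding chunk, which is why four chunks can share one instruction stream.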
Furthermore, in 64-bit mode, sixteen XMM registers are available, allowing us to keep the sixteen 32-bit state variables for each lane in registers. This means we can process four chunks, separated into lanes, on one single CPU, and at the end we add the results as a single assembled node to the binary tree. By doing this, we get roughly double the calculation speed. Incidentally, this is the reason why the SSE-based implementation is only available in TurboSFV x64 and not in TurboSFV x86: in 32-bit mode, only eight XMM registers are available.
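The register layout described above amounts to a transposition: instead of storing the sixteen state words of one chunk next to each other, word i of all four chunks shares one register. A rough Python sketch, with lists standing in for XMM registers and the hypothetical helper names `to_registers` and `from_registers`:

```python
def to_registers(states):
    """Turn 4 chunk states (16 words each) into 16 registers (4 lanes each)."""
    assert len(states) == 4 and all(len(s) == 16 for s in states)
    return [[states[lane][word] for lane in range(4)] for word in range(16)]

def from_registers(registers):
    """Inverse: collect lane k of every register back into chunk state k."""
    return [[registers[word][lane] for word in range(16)] for lane in range(4)]

# Four made-up chunk states: chunk k holds the words 100*k + 0 .. 100*k + 15.
states = [[100 * k + w for w in range(16)] for k in range(4)]
regs = to_registers(states)
print(len(regs), "registers with", len(regs[0]), "lanes each")
print(regs[5])  # state word 5 of chunks 0..3, side by side
```

With only eight registers, as in 32-bit mode, half of these sixteen values would not fit and would have to be kept elsewhere.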
In addition, nothing keeps us from doing the same procedure on multiple CPUs: because of the binary tree, we use 1, 2, 4, 8 CPUs and so on. This again accelerates the process, provided the storage medium can deliver the input data at an appropriate speed.
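Why powers of two fit the binary tree so well can be shown with a small Python sketch. How TurboSFV actually distributes the work is not described here, so everything below is an assumption made for illustration: the function names are hypothetical, the chunk count is assumed to be a multiple of the worker count, and each worker merely reports which chunks it covered instead of hashing them. Python threads are used only to show the structure, not to gain speed.

```python
from concurrent.futures import ThreadPoolExecutor

def usable_workers(available: int) -> int:
    """Largest power of two that does not exceed the available CPU count."""
    workers = 1
    while workers * 2 <= available:
        workers *= 2
    return workers

def subtree_ranges(chunk_count: int, workers: int) -> list[range]:
    """Give every worker an equally sized, contiguous run of chunks."""
    per_worker = chunk_count // workers
    return [range(w * per_worker, (w + 1) * per_worker) for w in range(workers)]

def combine_pairwise(nodes):
    """Merge neighbouring subtree results until a single node remains."""
    while len(nodes) > 1:
        nodes = [(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

def process(chunk_count: int, available_cpus: int):
    workers = usable_workers(available_cpus)
    ranges = subtree_ranges(chunk_count, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stand-in work: report the covered chunk range as (start, stop).
        results = list(pool.map(lambda r: (r.start, r.stop), ranges))
    return combine_pairwise(results)

# 32 chunks on a machine with 6 CPUs: only 4 of them are used.
print(process(32, 6))
```

Because the worker count is a power of two, the per-worker results can be paired up level by level without any leftover node.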
The CPU must support the instructions used (up to the SSE4.1 instruction set); otherwise TurboSFV falls back to the legacy instructions.
TurboSFV Cologne, Germany