TurboSFV - BLAKE3 with AVX2 support

TurboSFV
2025-08-16 12:16:12
TurboSFV v10.60 - BLAKE3 with AVX2 support
Notes to TurboSFV v10.60: This new version of TurboSFV offers a special implementation of the BLAKE3 algorithm, which is based on the CPU instruction set extension AVX2. AVX2 was introduced by Intel in 2013 and extends AVX (Advanced Vector Extensions). Along with new instructions, AVX2 added support for integer operations on the YMM registers.

Compared to SSE (Streaming SIMD Extensions), the main difference is that AVX2 doubles the register width: AVX2 uses YMM registers (256 bits), while SSE is based on XMM registers (128 bits). As with SSE, a register can be divided into 32-bit lanes, but now it can hold eight 32-bit numbers instead of four, and SIMD (Single Instruction, Multiple Data) arithmetic operations can be performed on all of them at once. In the end, this doubles the possible parallelism when dealing with 32-bit numbers, as the BLAKE3 algorithm does.

BLAKE3 was designed for parallel computation, because it is based on a binary tree: nodes of the tree can be computed in parallel, combined into parent nodes and added to the tree afterwards. With AVX2, input data can be processed lane-wise with one CPU instruction, eight lanes at the same time. Compared to SSE, this doubles the throughput of a single instruction.

The actual speed improvement depends on the hardware used, especially the CPU type and the quality of its AVX2 implementation, and on how fast data can be transferred between the components of the board. Furthermore, the input data needs to be prepared before it can be processed in a SIMD manner. This takes extra time, which isn't needed in non-SIMD implementations. In other words, the achievable improvement over SSE ranges from a small gain up to, at most, double the speed.

As with SSE, parallel computation on multiple CPUs (because of the binary tree: 1, 2, 4, ... CPUs) is also possible and can be enabled in the configuration of TurboSFV.
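The lane-wise processing described above can be modelled in a few lines of Python. This is only an illustrative sketch, not TurboSFV's implementation and not real SIMD code: a YMM register is modelled as a list of eight 32-bit integers, and the helper names (`vadd`, `vxor`, `vrotr`, `g_lanes`, `transpose`) as well as the demo input values are assumptions made up for this example. The mixing step shown is the add/xor/rotate quarter-round (rotations 16, 12, 8, 7) that BLAKE3 uses in its compression function. The sketch applies it to eight independent inputs at once and checks that the result matches processing them one by one.

```python
MASK32 = 0xFFFFFFFF
LANES = 8  # a 256-bit YMM register holds eight 32-bit lanes (SSE/XMM: four)


def rotr32(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def g_scalar(a, b, c, d, mx, my):
    """Quarter-round on plain 32-bit words (non-SIMD reference)."""
    a = (a + b + mx) & MASK32
    d = rotr32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 12)
    a = (a + b + my) & MASK32
    d = rotr32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 7)
    return a, b, c, d


# Hypothetical lane-wise helpers: each call stands for ONE vector
# operation that touches all eight lanes at the same time.
def vadd(x, y):
    return [(p + q) & MASK32 for p, q in zip(x, y)]


def vxor(x, y):
    return [p ^ q for p, q in zip(x, y)]


def vrotr(x, n):
    return [rotr32(p, n) for p in x]


def g_lanes(a, b, c, d, mx, my):
    """Same quarter-round, but every argument is a vector of eight lanes."""
    a = vadd(vadd(a, b), mx)
    d = vrotr(vxor(d, a), 16)
    c = vadd(c, d)
    b = vrotr(vxor(b, c), 12)
    a = vadd(vadd(a, b), my)
    d = vrotr(vxor(d, a), 8)
    c = vadd(c, d)
    b = vrotr(vxor(b, c), 7)
    return a, b, c, d


def transpose(rows):
    """Input preparation: regroup the data so that lane i of each vector
    holds the corresponding word of input i (and back again)."""
    return [list(col) for col in zip(*rows)]


# Eight independent inputs (a, b, c, d, mx, my); arbitrary demo values.
inputs = [
    tuple((i * 0x9E3779B9 + k * 0x01234567) & MASK32 for k in range(6))
    for i in range(LANES)
]

vectors = transpose(inputs)                  # six vectors of eight lanes
lane_result = transpose(g_lanes(*vectors))   # back to one row per input
scalar_result = [list(g_scalar(*row)) for row in inputs]

print(lane_result == scalar_result)
```

In real AVX2 code, `vadd` and `vxor` would each correspond to a single instruction on a YMM register (for example via the intrinsics `_mm256_add_epi32` and `_mm256_xor_si256`), while the rotation is typically composed from shifts or a byte shuffle, since AVX2 has no dedicated 32-bit rotate instruction. The `transpose` step stands for the extra data preparation that a non-SIMD implementation does not need.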
Computing on multiple CPUs further increases the overall calculation speed.

As with the SSE implementation, the CPU must support the required AVX2 instructions, and the system must provide the YMM registers. In addition, AVX2 support is only available in the 64-bit version of TurboSFV, because in 32-bit (x86) mode only half of the YMM registers are available (8 instead of 16), which would have a negative impact on speed.

If you would like to comment on this new version, or even report the calculation speed on your system (some information about the hardware used would be helpful), this is the right place to do it.
TurboSFV Cologne, Germany