crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
Add an ARM scalar optimized implementation of BLAKE2s.

NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too small for NEON to help. Each NEON instruction would depend on the previous one, resulting in poor performance.

With scalar instructions, on the other hand, we can take advantage of ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an implementation that runs much faster than the C implementation.

Performance results on Cortex-A7 in cycles per byte using the shash API:

	4096-byte messages:
		blake2s-256-arm:     18.8
		blake2s-256-generic: 26.0

	500-byte messages:
		blake2s-256-arm:     20.3
		blake2s-256-generic: 27.9

	100-byte messages:
		blake2s-256-arm:     29.7
		blake2s-256-generic: 39.2

	32-byte messages:
		blake2s-256-arm:     50.6
		blake2s-256-generic: 66.2

Except on very short messages, this is still slower than the NEON implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8,...
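As a rough illustration of where the "free" rotations mentioned above come from, here is a minimal C sketch of the BLAKE2s G mixing step. It is not the code from blake2s-core.S; the names ror32 and blake2s_g_sketch are hypothetical and chosen only for this example. Every rotation in G is applied to the result of an XOR, and 32-bit ARM data-processing instructions accept a rotated second operand (for example "eor r0, r1, r2, ror #16"), so an assembly implementation can in principle leave a value unrotated and fold the rotation into the next instruction that consumes it. How the actual assembly schedules this is not shown here.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (valid for 0 < n < 32). */
static inline uint32_t ror32(uint32_t w, unsigned int n)
{
	return (w >> n) | (w << (32 - n));
}

/*
 * Hypothetical sketch of one BLAKE2s G step on four state words
 * v[0..3] = (a, b, c, d), mixing in two message words x and y.
 * The rotation amounts 16, 12, 8, 7 are the ones BLAKE2s defines.
 * Each ror32() below is the kind of rotation that ARM can fold
 * into the operand of a following add or eor instruction.
 */
static void blake2s_g_sketch(uint32_t v[4], uint32_t x, uint32_t y)
{
	uint32_t a = v[0], b = v[1], c = v[2], d = v[3];

	a = a + b + x;
	d = ror32(d ^ a, 16);
	c = c + d;
	b = ror32(b ^ c, 12);

	a = a + b + y;
	d = ror32(d ^ a, 8);
	c = c + d;
	b = ror32(b ^ c, 7);

	v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}
```

A full BLAKE2s round applies this step to the four columns and then the four diagonals of the 16-word working state, which is why each operation depends on the one before it and leaves little room for NEON to work on independent data in parallel.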
Showing
- arch/arm/crypto/Kconfig: 9 additions, 0 deletions
- arch/arm/crypto/Makefile: 2 additions, 0 deletions
- arch/arm/crypto/blake2s-core.S: 285 additions, 0 deletions
- arch/arm/crypto/blake2s-glue.c: 78 additions, 0 deletions