SIMD stands for Single Instruction, Multiple Data. It describes a class of computers that can perform the same operation on several pieces of data at once. For example, a 1 GHz SIMD computer can add (or subtract, multiply, ...) four pairs of numbers in one nanosecond, whereas a non-SIMD computer needs four nanoseconds, one for each of the four additions.
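The four-pairs example can be sketched in C++ as follows. This is a minimal illustration, not NSIMD code: it assumes GCC or Clang, whose `vector_size` extension provides a four-float vector type, and the names `v4f`, `add_scalar` and `add_simd` are hypothetical. Whether the vector addition really becomes a single machine instruction depends on the target CPU and compiler flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumption: GCC or Clang, which provide the vector_size extension.
// v4f is a hypothetical name for a pack of four 32-bit floats.
typedef float v4f __attribute__((vector_size(16)));

// Scalar version: one addition per pair of numbers.
void add_scalar(const float *a, const float *b, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

// SIMD version: each iteration adds four pairs with one vector addition.
// In this sketch n must be a multiple of 4.
void add_simd(const float *a, const float *b, float *out, std::size_t n) {
  assert(n % 4 == 0);
  for (std::size_t i = 0; i < n; i += 4) {
    v4f va, vb;
    std::memcpy(&va, a + i, sizeof(va)); // load four floats, no alignment needed
    std::memcpy(&vb, b + i, sizeof(vb));
    v4f vc = va + vb;                    // four additions expressed as one operation
    std::memcpy(out + i, &vc, sizeof(vc));
  }
}
```

Both functions produce the same results; the SIMD version simply needs a quarter of the loop iterations.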
NSIMD is a vectorization library that abstracts SIMD programming. It was designed to exploit the maximum power of processors at a low development cost.
To achieve maximum performance, NSIMD relies mainly on the compiler's inlining pass. With any mainstream compiler, such as GCC, Clang, MSVC, XL C/C++ or ICC, NSIMD is therefore a zero-cost SIMD abstraction library.
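To illustrate why inlining makes such an abstraction free, here is a minimal sketch of a thin wrapper around a four-float vector. It is not NSIMD's actual API: `Pack4f`, `load`, `store` and `multiply_add` are hypothetical names, and the `vector_size` extension assumes GCC or Clang. Once the compiler inlines the small functions, the loop body is expected to reduce to the same vector operations one would write by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumption: GCC or Clang vector extension; raw4f holds four floats.
typedef float raw4f __attribute__((vector_size(16)));

// Hypothetical wrapper type, a stand-in for a library's pack type.
struct Pack4f {
  raw4f v;
};

// Tiny inline helpers: after inlining they leave no call overhead.
inline Pack4f load(const float *p) {
  Pack4f r;
  std::memcpy(&r.v, p, sizeof(r.v));
  return r;
}

inline void store(float *p, Pack4f x) {
  std::memcpy(p, &x.v, sizeof(x.v));
}

inline Pack4f operator+(Pack4f a, Pack4f b) {
  Pack4f r;
  r.v = a.v + b.v;
  return r;
}

inline Pack4f operator*(Pack4f a, Pack4f b) {
  Pack4f r;
  r.v = a.v * b.v;
  return r;
}

// Computes out[i] = a[i] * b[i] + c[i], four elements per iteration.
// In this sketch n must be a multiple of 4.
void multiply_add(const float *a, const float *b, const float *c,
                  float *out, std::size_t n) {
  assert(n % 4 == 0);
  for (std::size_t i = 0; i < n; i += 4) {
    store(out + i, load(a + i) * load(b + i) + load(c + i));
  }
}
```

The kernel reads like ordinary arithmetic, while the wrapper itself adds no data beyond the vector it holds.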
Most of the library is open source on GitHub and can be downloaded and tested at will thanks to its MIT license.
A small part of it is a proprietary binary, priced at 49.90 €/user, which can be purchased at store.agenium-scale.com. It contains, among others:
- trigonometric functions
- inverse trigonometric functions
- hyperbolic functions
- inverse hyperbolic functions
- exponentials
- logarithms
We have integrated NSIMD into GROMACS to demonstrate its potential. GROMACS is a versatile package for molecular dynamics, i.e. it simulates the Newtonian equations of motion for systems with hundreds to millions of particles. It is heavily used in the HPC community to benchmark supercomputers and has become a reference in this area.
As GROMACS is already fully optimized software, our goal is to obtain similar running times, and we do! This also proves NSIMD's claims, namely low development cost for high performance and portability. We have replaced nearly 11000 lines of GROMACS code with 4700 lines of NSIMD code.
We work for the French Army and use NSIMD as the base library for our neural network inference engine. Its C++ API allows us to write all layer kernels once and get better performance than Caffe on Intel workstations and Arm mobile devices (such as smartphones). We speed up neural networks using quantization and fixed-point arithmetic, both of which are supported by NSIMD.
We have several times encountered very well optimized code written for one specific CPU using its vendor-specific API. This becomes a problem when upgrading hardware, even from the same vendor. Many people buy newer Xeons that are AVX-512 capable but have written their code for older Xeons that only support AVX. That is where our translator comes into play. It is a Clang-based program that takes your C/C++ code as input and tracks down all vendor-specific code. The output is C/C++ code in which calls to vendor APIs have been replaced by portable NSIMD code. This program saves you roughly 80% of the translation time. The resulting code is portable and uses the latest SIMD capabilities.
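The idea behind such a translation can be sketched as follows. This is only an illustration of the before/after shape, not the translator's real output and not NSIMD's API: `Pack`, `loadu`, `storeu` and `add` are hypothetical names, and the pack is a plain array that the compiler may or may not vectorize. The vendor-specific original appears as a comment and uses AVX intrinsics, which hard-code a width of 8 floats.

```cpp
#include <cassert>
#include <cstddef>

// Vendor-specific original (AVX only, shown as a comment):
//   for (i = 0; i + 8 <= n; i += 8)
//     _mm256_storeu_ps(out + i,
//         _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));

// Hypothetical portable pack: the number of lanes N is a parameter
// instead of being baked into the function names.
template <std::size_t N> struct Pack {
  float lane[N];
};

template <std::size_t N> inline Pack<N> loadu(const float *p) {
  Pack<N> r;
  for (std::size_t k = 0; k < N; ++k) r.lane[k] = p[k];
  return r;
}

template <std::size_t N> inline void storeu(float *p, const Pack<N> &x) {
  for (std::size_t k = 0; k < N; ++k) p[k] = x.lane[k];
}

template <std::size_t N>
inline Pack<N> operator+(const Pack<N> &a, const Pack<N> &b) {
  Pack<N> r;
  for (std::size_t k = 0; k < N; ++k) r.lane[k] = a.lane[k] + b.lane[k];
  return r;
}

// Width-agnostic kernel: the same source serves 8 lanes (AVX-sized)
// or 16 lanes (AVX-512-sized). Leftover elements are handled one by one.
template <std::size_t N>
void add(const float *a, const float *b, float *out, std::size_t n) {
  assert(a != nullptr && b != nullptr && out != nullptr);
  std::size_t i = 0;
  for (; i + N <= n; i += N) {
    storeu(out + i, loadu<N>(a + i) + loadu<N>(b + i));
  }
  for (; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}
```

Calling `add<8>` or `add<16>` on the same arrays gives identical results; only the number of elements processed per iteration changes, which is the property that lets translated code follow a hardware upgrade.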