High Performance Computing
SIMD Library

Overview of NSIMD, a SIMD vectorization library

What is SIMD programming?

SIMD stands for Single Instruction, Multiple Data. It is a class of processors capable of performing the same operation on several pieces of data at once. For example, a 1 GHz SIMD-capable computer can perform the addition (or subtraction, multiplication, ...) of four pairs of numbers in one nanosecond, whereas a non-SIMD computer needs 4 nanoseconds, one for each of the four additions.

NSIMD: a vectorization library for optimized SIMD programming

NSIMD is a vectorization library that abstracts SIMD programming. It was designed to exploit the maximum power of processors at a low development cost.

To achieve maximum performance, NSIMD mainly relies on the inline optimization pass of the compiler. Therefore using any mainstream compiler such as GCC, Clang, MSVC, XL C/C++, ICC and others with NSIMD will give you a zero-cost SIMD abstraction library.

NSIMD for enhanced SIMD Optimization

Now that we have seen what SIMD is, what NSIMD is and how it works, let's see in which cases this library can improve performance.

For C and C++ optimization

NSIMD provides C89, C++98, C++11 and C++14 APIs. All APIs allow writing generic code. For the C API this is achieved through a thin layer of macros; for the C++ APIs it is achieved using templates and function overloading. The C++ APIs also provide operator overloading and higher-level type definitions that allow unrolling. The C++11 and C++14 APIs add, for instance, templated type definitions and templated constants.

Binary compatibility is guaranteed by the fact that only a C ABI is exposed. The C++ API only wraps C calls.

NSIMD and SIMD intrinsics

SIMD intrinsics are low-level functions provided by chip vendors to access SIMD units and speed up computations. However, using them directly is cumbersome and error-prone. This is where NSIMD comes in: it simplifies their use by providing an interface that allows writing more intelligible, easily maintainable code and facilitates portability across different architectures.

SIMD extensions supported by NSIMD

The list of supported SIMD instruction sets follows:
    - Intel:
        • SSE 2
        • SSE 4.2
        • AVX
        • AVX 2
        • AVX-512 as found on KNLs
        • AVX-512 as found on Xeon Skylake CPUs

    - Arm:
        • NEON 128 bits as found on ARMv7 CPUs
        • NEON 128 bits as found on Aarch64 CPUs
        • SVE

Support for the following architectures is on the way:
    - NVIDIA:
        • CUDA
        • HIP

    - AMD GPUs:
        • HIP

    - IBM POWER:
        • VSX
        • VMX

Open source NSIMD and NSIMD Enterprise

Most of the library is open sourced on GitHub and can be downloaded and tested at will thanks to its MIT license.

A small part of it is a proprietary binary, priced at 49.90 €/user, which can be purchased at store.agenium-scale.com. It contains, among others:
- trigonometric functions
- inverse trigonometric functions
- hyperbolic functions
- inverse hyperbolic functions
- exponentials
- logarithms

Buy NSIMD for enhanced SIMD Optimization


We have integrated NSIMD into GROMACS to demonstrate its potential. GROMACS is a versatile package for molecular dynamics, i.e. simulating the Newtonian equations of motion for systems with hundreds to millions of particles. It is heavily used in the HPC community to benchmark supercomputers and has become a reference in this area.

As GROMACS is already fully optimized software, our goal was to obtain similar running times, and we do! This also proves the claims of NSIMD, namely low development cost for high performance and portability. We have replaced nearly 11,000 lines of GROMACS code with 4,700 lines of NSIMD code.

NSIMD for optimizing deep neural networks

We work for the French Army and use NSIMD as the base library for our neural network inference engine. Its C++ API allows us to write all layer kernels once and obtain better performance than Caffe on Intel workstations and Arm mobile devices (such as smartphones). We speed up neural networks using quantization and fixed-point arithmetic, both of which are supported by NSIMD.

Use NSIMD to speed up code translation

NSIMD addition loop
for (int i = 0; i < n; i += len(pack<float>())) {
    storea(&c[i], loada(&a[i]) + loada(&b[i]));
}

Code translator from vendor-specific code to NSIMD

We have several times encountered very well optimized code written for one specific CPU using its vendor-specific API. This becomes a problem when upgrading hardware, even from the same vendor: many people buy newer, AVX-512-capable Xeons but have written their code for older, AVX-only Xeons. That is where our translator comes into play. It is a Clang-based program that takes your C/C++ code as input and chases down all vendor-specific code. The output is C/C++ code in which calls to vendor APIs have been replaced by portable NSIMD code. This program saves you roughly 80% of the translation time, and the resulting code is portable and uses the latest SIMD capabilities.

Aarch64 addition loop
for (int i = 0; i < n; i += 4) {
    vst1q_f32(&c[i], vaddq_f32(
        vld1q_f32(&a[i]), vld1q_f32(&b[i])));
}

SSE addition loop
for (int i = 0; i < n; i += 4) {
    _mm_store_ps(&c[i], _mm_add_ps(
        _mm_load_ps(&a[i]), _mm_load_ps(&b[i])));
}

AVX addition loop
for (int i = 0; i < n; i += 8) {
    _mm256_store_ps(&c[i], _mm256_add_ps(
        _mm256_load_ps(&a[i]), _mm256_load_ps(&b[i])));
}

AVX-512 addition loop
for (int i = 0; i < n; i += 16) {
    _mm512_store_ps(&c[i], _mm512_add_ps(
        _mm512_load_ps(&a[i]), _mm512_load_ps(&b[i])));
}

NSIMD: Optimize your SIMD programming now!

Open source version


A part of the library is open sourced on GitHub and can be downloaded and tested at will thanks to its MIT license.

Proprietary license

49.90 €/month

A small part of it is a proprietary binary, priced at 49.90 €/user, which can be purchased at store.agenium-scale.com