CuBLonD usage and performance

CuBLonD is easy to merge, so it should be integrated into the basic BLonD code. The code can run in a mixed environment that combines MPI across multiple CPUs with GPUs. A cluster with four or more GPUs would be ideal; in most heavy simulations, the CPUs were used only for small calculations.
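To illustrate what a mixed MPI and GPU layout could look like, here is a minimal sketch of one way to decide which MPI ranks drive a GPU and which stay CPU-only. It is not taken from CuBLonD: the function name `assign_gpu`, its parameters, and the assumption that ranks are placed on nodes in contiguous blocks are all hypothetical and serve only as an example.

```python
from typing import Optional


def assign_gpu(rank: int, ranks_per_node: int, gpus_per_node: int) -> Optional[int]:
    """Return the local GPU index for an MPI rank, or None for a CPU-only rank.

    Hypothetical helper, not part of BLonD or CuBLonD.
    Assumes ranks are placed on nodes in contiguous blocks
    (ranks 0..ranks_per_node-1 on the first node, and so on).
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if ranks_per_node <= 0:
        raise ValueError("ranks_per_node must be positive")
    if gpus_per_node < 0:
        raise ValueError("gpus_per_node must be non-negative")

    # Position of this rank within its own node.
    local_rank = rank % ranks_per_node

    # The first gpus_per_node ranks on each node get one GPU each;
    # the remaining ranks on that node do CPU-only work.
    if local_rank < gpus_per_node:
        return local_rank
    return None


if __name__ == "__main__":
    # Example layout: 2 nodes, 8 ranks per node, 4 GPUs per node.
    for rank in range(16):
        gpu = assign_gpu(rank, ranks_per_node=8, gpus_per_node=4)
        role = f"GPU {gpu}" if gpu is not None else "CPU only"
        print(f"rank {rank:2d}: {role}")
```

In a real run the rank would typically come from the MPI library (for example `MPI.COMM_WORLD.Get_rank()` in mpi4py) rather than a loop, and the actual device selection would follow whatever mechanism the GPU backend provides.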