Lattice QCD is a fundamental non-perturbative approach to solving quantum chromodynamics (QCD), the theory of quarks and gluons. QCD is solved as a lattice gauge theory formulated on a grid, or lattice, of points in space and time. SU(3) operations and the D-Slash operator in high dimensions are typical data-intensive computations. In recent years, the SIMD capabilities of Intel processors have improved greatly; in particular, wide AVX-512 SIMD units are now readily available. Although SIMD parallelism has been studied and applied to lattice QCD, two basic problems have not been solved well. The first is that vectorization depends strongly on the SIMD byte length of the implementation, which leads to poor portability. The second is the open question of which data-parallel algorithm is optimal for lattice QCD applications.
In this work, we study data-parallel computation for lattice QCD applications with SIMD acceleration and present a unified vectorization model. The goal is to improve computational performance without loss of portability. We also discuss the potential data parallelism in lattice QCD calculations. The tests were run on Intel processors: Intel KNL, the Intel Xeon Gold Skylake processor, and the current Intel Xeon Gold Cascade Lake processor. The measured parallel efficiency agrees well with the theoretically expected performance improvement as the SIMD byte length increases. This work also compares its results with the SIMD optimization of lattice QCD on the TaihuLight supercomputer, which ranked first in the TOP500 list from June 2016 until November 2017. The talk will report the related experimental results and theoretical analysis.
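To illustrate the portability problem described above, the following is a minimal C++ sketch of a width-agnostic SU(3) matrix-vector product. It is not the unified vectorization model of this work, whose details are not given here. It only assumes one common approach: several lattice sites are packed into the lanes of one vector, with the lane count `W` as a compile-time parameter, so the same source can target different SIMD byte lengths. The names `Lanes`, `SU3Matrix`, `ColorVector`, `fma_complex`, and `su3_mul_vec` are hypothetical.

```cpp
#include <array>
#include <cstddef>

// W = number of lattice sites processed together (an assumed "virtual SIMD width").
// For double precision, W = 8 would fill a 512-bit register, W = 4 a 256-bit one.
template <std::size_t W>
struct Lanes {
    std::array<double, W> re{};  // real parts, one per lattice site
    std::array<double, W> im{};  // imaginary parts, one per lattice site
};

// One 3x3 complex matrix (or 3-component color vector) per lane.
template <std::size_t W>
using SU3Matrix = std::array<std::array<Lanes<W>, 3>, 3>;
template <std::size_t W>
using ColorVector = std::array<Lanes<W>, 3>;

// acc += a * b (complex multiply-accumulate), applied lane by lane.
// The fixed-length inner loop is written so a compiler can auto-vectorize it.
template <std::size_t W>
inline void fma_complex(Lanes<W>& acc, const Lanes<W>& a, const Lanes<W>& b) {
    for (std::size_t l = 0; l < W; ++l) {
        acc.re[l] += a.re[l] * b.re[l] - a.im[l] * b.im[l];
        acc.im[l] += a.re[l] * b.im[l] + a.im[l] * b.re[l];
    }
}

// out = U * v for W lattice sites at once.
template <std::size_t W>
ColorVector<W> su3_mul_vec(const SU3Matrix<W>& U, const ColorVector<W>& v) {
    ColorVector<W> out{};  // zero-initialized accumulator
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            fma_complex(out[i], U[i][j], v[j]);
    return out;
}
```

In this sketch, changing the target SIMD byte length only means changing `W`; the arithmetic code is unchanged. Whether a compiler actually emits AVX-512 instructions for the inner loop depends on the compiler and its flags.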