A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)

How developers implement their algorithms, and how those implementations behave on modern CPUs, are governed by the CPUs’ design and organization. The vectorization units (SIMD) are among the few parts of a CPU that can and must be explicitly controlled. In the HPC community, the x86 CPUs and thei...

Bibliographic Details
Main author:	Bérenger Bramas
Format:	article
Language:	EN
Published:	PeerJ Inc. 2021
Subjects:	ARM; SVE; Vectorization; Sort; Electronic computers. Computer science; QA75.5-76.95
Online access:	https://doaj.org/article/5ff224ab5cf5418281ea9aac2c7aa5ca

id	oai:doaj.org-article:5ff224ab5cf5418281ea9aac2c7aa5ca
record_format	dspace
spelling	oai:doaj.org-article:5ff224ab5cf5418281ea9aac2c7aa5ca; 2021-11-21T15:05:13Z; A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE); DOI 10.7717/peerj-cs.769; ISSN 2376-5992; https://doaj.org/article/5ff224ab5cf5418281ea9aac2c7aa5ca; 2021-11-01T00:00:00Z; https://peerj.com/articles/cs-769.pdf; https://peerj.com/articles/cs-769/; https://doaj.org/toc/2376-5992; Bérenger Bramas; PeerJ Inc.; article; ARM; SVE; Vectorization; Sort; Electronic computers. Computer science; QA75.5-76.95; EN; PeerJ Computer Science, Vol 7, p e769 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	ARM; SVE; Vectorization; Sort; Electronic computers. Computer science; QA75.5-76.95
description	How developers implement their algorithms, and how those implementations behave on modern CPUs, are governed by the CPUs’ design and organization. The vectorization units (SIMD) are among the few parts of a CPU that can and must be explicitly controlled. In the HPC community, x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new release of an instruction set usually doubled the vector length and added new operations, and each generation pushed developers to adapt and improve previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE’s interface differs from the x86 extensions in several respects: it provides different instructions, uses a predicate to control most operations, and has a vector size that is known only at execution time. Using SVE therefore raises new challenges in adapting algorithms, including those that are already well optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We also explain how we efficiently implement the sorting kernels. Our approach needs only an array of size O(log N) for the recursive calls in the partitioning phase, in both the sequential and the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double-precision floating-point numbers, and key/value pairs of integers. Our results show that our approach is faster than the GNU C++ sort algorithm, with an average speedup factor of 4.
format	article
author	Bérenger Bramas
title	A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)
publisher	PeerJ Inc.
publishDate	2021
url	https://doaj.org/article/5ff224ab5cf5418281ea9aac2c7aa5ca
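
The hybrid scheme summarized in the description (Quicksort-style partitioning, a Bitonic sort for small partitions, and an O(log N) array in place of recursion) can be illustrated with a minimal scalar sketch. This is not the author's SVE implementation: it uses no SVE intrinsics or predicates, the compare-exchange steps are plain scalar code, and every name here (`kSmall`, `bitonic_sort_pow2`, `small_sort`, `partition_range`, `hybrid_sort`) as well as the threshold of 16 elements is an assumption made for illustration. The O(log N) bound comes from a standard technique, pushing the larger sub-range and continuing with the smaller one, which may differ from what the paper does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical threshold: ranges of at most kSmall elements use the bitonic network.
constexpr std::size_t kSmall = 16;

// Bitonic sorting network (ascending); n must be a power of two.
void bitonic_sort_pow2(int* a, std::size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t k = 2; k <= n; k *= 2) {          // size of the bitonic sequences
        for (std::size_t j = k / 2; j > 0; j /= 2) {   // compare-exchange distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t l = i ^ j;
                if (l > i) {
                    const bool ascending = (i & k) == 0;
                    if ((a[i] > a[l]) == ascending) std::swap(a[i], a[l]);
                }
            }
        }
    }
}

// Sort n <= kSmall elements by padding a fixed-size buffer with the maximum int.
void small_sort(int* a, std::size_t n) {
    int buf[kSmall];
    std::fill(buf, buf + kSmall, std::numeric_limits<int>::max());
    std::copy(a, a + n, buf);
    bitonic_sort_pow2(buf, kSmall);
    std::copy(buf, buf + n, a);   // padding values end up after the first n slots
}

// Scalar Lomuto partition of [lo, hi); returns the final index of the pivot.
std::size_t partition_range(int* a, std::size_t lo, std::size_t hi) {
    std::swap(a[lo + (hi - lo) / 2], a[hi - 1]);       // middle element as pivot
    const int pivot = a[hi - 1];
    std::size_t store = lo;
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        if (a[i] < pivot) std::swap(a[i], a[store++]);
    }
    std::swap(a[store], a[hi - 1]);
    return store;
}

// Hybrid sort driven by an explicit stack instead of recursion.
void hybrid_sort(std::vector<int>& v) {
    struct Range { std::size_t lo, hi; };              // half-open [lo, hi)
    Range stack[64];                                   // depth <= log2(N) < 64
    std::size_t top = 0;
    std::size_t lo = 0, hi = v.size();
    while (true) {
        while (hi - lo > kSmall) {
            const std::size_t p = partition_range(v.data(), lo, hi);
            // Push the larger side, keep working on the smaller one.
            if (p - lo < hi - (p + 1)) {
                stack[top++] = {p + 1, hi};
                hi = p;
            } else {
                stack[top++] = {lo, p};
                lo = p + 1;
            }
        }
        small_sort(v.data() + lo, hi - lo);
        if (top == 0) break;
        --top;
        lo = stack[top].lo;
        hi = stack[top].hi;
    }
}
```

Calling `hybrid_sort(v)` on a `std::vector<int>` sorts it in place. Because the range that stays active after each split is at most half the size of its parent, the number of pending ranges never exceeds log2(N), which is why a fixed array of 64 entries is enough on a 64-bit platform.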
