In computational science we often encounter data sets arranged as a large number of arrays or vectors. Among the numerous examples of such data sets across physics, chemistry, and biology, one of the most common is mass spectrometry (MS) data. Each mass spectrum is simply an array of pairs, where each pair consists of a mass-to-charge ratio and its corresponding intensity. Algorithms that operate on such data need to sort these arrays as an integral part of their processing. Sorting is time-consuming, and when the number of arrays is large (which is very common for MS data) it becomes a bottleneck. GPU-ArraySort addresses this problem and helps speed up existing algorithms in the computational sciences that require sorting a large number of arrays.
The algorithm can sort a large number of moderately sized arrays in a few seconds on a GPU. Its design exploits both levels of parallelism in a GPU while making maximal use of the on-chip shared memory, which is roughly 100x faster than global memory. We use the popular sample-sort strategy to distribute the workload across thousands of concurrent GPU threads. Sample sort disintegrates each array into smaller, data-independent units that can be stored in shared memory and sorted by individual threads. This strategy lets hundreds of threads work on sorting a single array, yielding an order-of-magnitude speed-up. At the higher level, we map individual arrays to unique thread blocks, fully exploiting both levels of parallelism. To compensate for the limited in-core memory of the GPU, the algorithm can operate in an out-of-core manner: if the data to be sorted is larger than the GPU's in-core memory, we break it into batches and sort each batch separately. To counter the memory-transfer bottleneck introduced by this batch processing, we use CUDA streams, which let us overlap memory transfers with computation and thus almost completely hide the transfer delays.
All the details of the algorithm are available in our technical report: http://scholarworks.wmich.edu/pcds_reports/5/
If you would like to use the algorithm, an open-source implementation is available at https://github.com/pcdslab/GPU-ArraySort-2.0
Please cite the following paper if you use this work:
Muaaz Awan and Fahad Saeed, “GPU-ArraySort: A Parallel, In-Place Algorithm for Sorting Large Number of Arrays,” 2016 45th International Conference on Parallel Processing Workshops (ICPPW), Philadelphia, PA, 2016, pp. 78-87. doi: 10.1109/ICPPW.2016.27
This blog post has been contributed by Muaaz Awan and edited by Fahad Saeed.