Sorting Large Number of Arrays using Graphical Processing Units (GPU’s)

In the field of computational science, we often run into data sets which are arranged in the form of large number of arrays or vectors. Among numerous examples of such datasets, across the fields of physics, chemistry and biology one of the most common is that of mass spectrometry data. Each mass spectrum is nothing but an array of pairs. Each pair consists of a mass-to-charge ratio and its corresponding intensity. Algorithms which work on such data need to sort these arrays as integral part of their processing. Since sorting is a time consuming process and when number of arrays is large (which is very common for MS data); sorting becomes a bottle neck for efficient processing. GPU-ArraySort provides a solution to such problems and helps boost the existing algorithms in computational sciences which require sorting of large number of arrays.

The algorithm is capable of sorting large number of moderately sized arrays in a few seconds on a GPU. Its design exploits the two levels of parallelism in a GPU along with the maximal use of its 100x faster on-chip shared memory. We use the popular sample sort strategy to distribute the work load across thousands of concurrent GPU threads. Sample sort helps disintegrate each array into smaller data independent units which can then be stored in shared memory and sorted by individual threads. This strategy leads to hundreds of threads working on sorting a single array, which provides a magnitude of speed-up. On a higher level, we map individual arrays to unique blocks thus fully exploiting the two levels of parallelism. To compensate for the limited in-core memory of GPU, the algorithm can operate in out-of-core manner. If the size of data to be sorted is larger than the in-core memory of GPU, we break it down into chunks and sort each batch separately. To counter the memory transfer bottlenecks created by using this approach of batch processing, we use CUDA streams. Using CUDA streams we can overlap memory transfer times with the computational times thus almost nullifying the delays caused by memory transfer bottlenecks.

You can read about all the details of the algorithm in our technical report

Also if you would like to use the implemented algorithm it is available as an open-source code at

Please cite the following paper if you use the work:

Muaaz Awan and Fahad Saeed, “GPU-ArraySort: A Parallel, In-Place Algorithm for Sorting Large Number of Arrays,” 2016 45th International Conference on Parallel Processing Workshops (ICPPW), Philadelphia, PA, 2016, pp. 78-87. doi: 10.1109/ICPPW.2016.27


This blog post has been contributed by Muaaz Awan and edited by Fahad Saeed.


Surfing with the Algorithms

Our research on compressive- and reductive-analytics for big Mass Spectrometry (MS) data was showcased in ScienceNode and can read at this linkScienceNode is an online publication which is funded by NSF and CERN.

Our research on using reductive data sets to make sense of big data from Mass Spectrometry instruments was included due to the significance and novelty of the algorithm. Since this is an entirely new way to think about analyzing vast amount of MS data; we feel that our solutions will be adopted by a wide variety of proteomics practitioners and tools developers.

Our paper describing the research appeared in Oxford Bioinformatics and can be read at this link.

In order to facilitate the tool developers we have made our implemented code available as an opensource. The code can be downloaded from lab portal here.

Please feel free to send us your questions or comment below. We will be happy to help if you are a computational scientists who is developing a tool or a system biologist/proteomics expert who might be interested in using our tool.

Intel Excellence in Computer Science Award for PCDS K-12 Student

Lily Kitagawa is a high-school student who has been working in PCDS Lab from the past few months. She has been working on developing algorithms that can deal with big genomic data sets.

She recently participated in the 16th Annual Southwest Michigan Regional (Intel Science and Engineering) ISEF Science Fair for high school students (K-12) Kalamazoo Area Math and Science Center (KAMSC).

Lily presented her work using a research poster entitled “Efficient k-mer Counting of DNA Sequencing”. She designed and implemented a trie-based structure which allowed her to count the k-mer frequencies in large genomic data sets much faster as compared to other solutions.

For her efforts she won the Intel Excellence in Computer Science Award. Congratulations Lily !


MS Thesis Defended

Congratulations to  Dana Abdul-Qader for defending her MS Thesis titled “High Performance Architecture for an Exact Match Short-Read Aligner Using Burrow Wheeler Transform on FPGA’s”. Her thesis establishes a method that runs on FPGA hardware and performs 82x times faster than software version of the aligner.

Dana is the first student to graduate from PCDS Lab. We wish her all the best in her future endeavors !


Dana with her Thesis Committee: Prof. Janos Grantner, Prof. Fahad Saeed, Dana Abdul-Qader and Prof. Elise de Doncker

Visit by Prof. Khokhar and talk by Umer Cheema

Recently, we invited Umer Cheema from University of Illinois at Chicago to give a seminar about his latest research. He works on developing hardware-centric solutions to computational problems. He was accompanied by Prof. Ashfaq Khokhar who joined us from Illinois Institutes of Technology in Chicago.

IMG_5819Umer giving a talk on his FPGA based computational solutions to various problems (the light was not good and his eyes might be shut right now)

IMG_5820Various students from ECE and CS departments, including from PCDS Lab, listening intently to Umer’s talk and what he had to say.

IMG_5822Profs. Khokhar and Saeed can be seen discussing an issue related to research. Prof. Khokhar also got a grand tour of the lab and the computational facilities that are available to the students/researchers.

IMG_5826Drs. Saeed and Khokhar.

IMG_5831From Left to Right: Muhammad, Muaaz, Sandino, Prof. Khokhar, Prof. Saeed and Umer. We were all standing in PCDS Lab located in B115 Parkview Campus WMU

IMG_5835At the end of the research tour we ended up in a power-boat and enjoyed the cool breeze of the lake here in Kalamazoo (all thanks to Prof. Gupta)

IMG_5845Profs. Gupta and Khokhar at the end of our cool boat ride

High Performance Compression Technique for Genomic Big Data using Memory-Distributed Compute Clusters

High-throughput Next-generation sequencing (NGS) techniques are used to sequence the DNA and to structured the genome of an organism, for example the footprints that make us human. The amount of data that is generated by this process is huge and are generally referred as Big Data. Keeping these datasets intact before the processing stage is important due to the preservation of key elements that may become meaningful in future analyses, like finding genes that can predispose a patient to develop cancer.

Big data challenges the way we compute and deal with analysis, storage, visualization and transfer of data. One of the big challenges is how to store data in an efficient manner so that costs can be minimized ( smaller the data, the smaller the data disks that you would need and hence less cost).

Storing and transferring these massive amounts of data represents a challenge that has led to the creation of highly specialized compression techniques. Compression allows us to reduce the file size such that the resulting file is smaller than the original and by applying the opposite process (decompression) the exact original file can be obtained. However, these specialized algorithms suffer from poor scalability with increasing size of the datasets i.e. the compression takes more time as the size of the data increases. The best available solutions can take hours to compress Gigabytes of data, which is a major drawback since a single run of a sequencing machine can generate several Terabytes of genetic code in a few hours.

Recently we have presented a novel technique to compress big genomic data using large number of interconnected compute nodes ( known as compute clusters). Our finding were reported in a paper titled “A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets” that was accepted in the Proceedings of  Proceedings of Parallel and Distributed Processing with Applications (IEEE ISPA-15). In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application using a message passing model.

A message passing model utilizes the many nodes of the modern processors and distribute the work equally among them. Hence instead of a processor doing all the job serially, the jobs is distributed on each node in the cluster and compresses part of the data. Since many processors are being used in parallel to accomplish the task the time of compression can be reduced significantly. The results presented in the paper showed a reduction in compression time by a factor inversely proportional to the amount of processing units (cores or processes). The experiments showed that the more processing units we utilized the less time it took for the datasets to be compressed.

Our algorithm paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughput’s (amount of work per unit of time) of up to 8.4GB/s on a moderate size cluster using 128 cores. This means that every second around 8.4 gigabytes of data was compressed. If we have a 100 gigabyte NGS file, it  will take approximately 22 seconds to compress the whole dataset using our implementation versus hours using general compressors like WinRar, WinZip, GZip or 7-Zip.

Our findings represent a first attempt to use the message passing model and memory-distributed machines to parallelize the compression process specific to NGS data.  Our hope is that the proposed high-performance technique will help scientists (specially biologists) to compress NGS data in a very short amount of time. Also the transportation of these files will be faster once compressed, saving a lot of bandwidth.

The implementation of our algorithm paraDSRC can be accessed at this link.

Our paper entitled “A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets” accepted in the Proceedings of Proceedings of Parallel and Distributed Processing with Applications (IEEE ISPA-15) can be accessed as a Technical Report at this link.

_________________________________________________________________This blog post has been contributed by Sandino Perez and edited by Fahad Saeed.

Generating graphical user interfaces for domain scientists

Domain sciences such as biology, astronomy, physics, chemistry are being afflicted with big data deluge due to high-throughput technologies. This makes most of the sciences rely on computational tools to process their data and draw conclusions. However, domain scientists might have received inadequate training in computational science, especially in parallel computing. In our lab we are trying to make parallel computing tools accessible to domain scientists.

One of the problem faced by many domain scientists who might not have been trained in a computational space is the “fear” of using a command line tool. Although CS people might be very comfortable with a piece of code that accepts arguments on a command line, a domain scientists might not get what is going on.

To this end, we have developed a java based program that takes in a java program (that you intend to run) and converts it into a graphical user interface (GUI). A domain scientists then can potentially use the GUI with ease instead of being intimidated by a command line program.

IMG_5416Left to Right: Dr. Saeed, Binh Le, Adam Coles

Two high-school students (Binh Le and, Adam Coles) from Kalamazoo Area Math and Science Center (KAMSC) have been working under Dr. Saeed’s supervision to create a program as described in the para above. As a proof-of-concept they developed a java based program that takes as input our Pyro-Align program and converts it into a GUI based program. They then demonstrated that the Pyro-Align program can be run using the GUI.

In the picture above Binh and Adam are shown presenting their research poster at Southwest Michigan Science and Engineering Fair for high school students (K-12) Kalamazoo Area Math and Science Center (KAMSC) which was held at Western Michigan University on 20th March 2015. We would like to thank Muaaz Awan (PhD student PCDS) who helped us during this process.