Project 3
cs599 Parallel Programming and Algorithms
20 Points
Due Apr. 2, 2015 at Midnight
You must run your program on Stampede to receive full points for this assignment
Instructions for running your program on Stampede are provided at the end of this document

Description

Objective:  To learn about MPI collective operations and parallel sorting.

Sample Sort

Write a program that performs sample sort on a set of randomly generated data. You may assume that your program takes exactly one command line parameter: the total number of elements to process. This number of elements will be divided as evenly as possible among the processes used by your program. After reading in the number of elements, each process should generate its share of the random data, generally numElements/npes elements, where numElements is the total number of elements to process and npes is the number of processes. Next, each process should sort its data locally using a tool such as qsort.
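
One way to organize this setup phase is sketched below (names such as numElements, localData, and cmpInt are illustrative choices, not requirements, and the range of the random values is an assumption made only for this sketch):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Comparison function for qsort on ints. */
int cmpInt(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, npes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);

    double startTime = MPI_Wtime();      /* start the run-time clock */

    long numElements = atol(argv[1]);    /* total element count from the command line */
    long localN = numElements / npes;    /* this process's share of the data */

    int *localData = malloc(localN * sizeof(int));
    srand(rank + 1);                     /* seed each process differently */
    for (long i = 0; i < localN; i++)
        localData[i] = rand() % 1000;    /* assumed value range, for illustration only */

    qsort(localData, localN, sizeof(int), cmpInt);   /* local sort */

    /* ... sample selection, splitter broadcast, and bucket exchange go here ... */

    free(localData);
    MPI_Finalize();
    return 0;
}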

Once you have obtained the locally sorted data, select local samples that will later be used to divide the data into npes buckets. Each process can find npes-1 sample values in its sorted array by stepping through the array with a stride of numElements/(npes*npes). An example of this sampling step is sketched below. After performing this "local sort and sample selection", combine the samples in the master process (i.e., process rank 0) by using the MPI_Gather function to gather up all the samples. Then pick npes-1 splitter values from the gathered samples by first sorting the samples in process 0 and then stepping through the array of all the samples with a stride of npes-1.
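
This step might take the following shape, written as a fragment meant to slot into the program sketched above (it assumes localData is already sorted, localN = numElements/npes, and the cmpInt comparator from the earlier sketch; the function name chooseSplitters is made up for illustration):

/* Select local samples, gather them in rank 0, and choose npes-1 splitters.
   On return, splitters[] (npes-1 entries) is filled only in rank 0. */
void chooseSplitters(int *localData, long localN, int npes, int rank,
                     int *splitters)
{
    long stride = localN / npes;                    /* = numElements/(npes*npes) */
    int *localSamples = malloc((npes - 1) * sizeof(int));
    for (int i = 1; i < npes; i++)
        localSamples[i - 1] = localData[i * stride];

    /* Collect npes*(npes-1) samples in the master process. */
    int *allSamples = NULL;
    if (rank == 0)
        allSamples = malloc(npes * (npes - 1) * sizeof(int));
    MPI_Gather(localSamples, npes - 1, MPI_INT,
               allSamples, npes - 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        qsort(allSamples, (size_t)npes * (npes - 1), sizeof(int), cmpInt);
        for (int i = 1; i < npes; i++)              /* stride of npes-1 */
            splitters[i - 1] = allSamples[i * (npes - 1)];
        free(allSamples);
    }
    free(localSamples);
}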

The splitter values chosen in process 0 must be broadcast to all other processes. Once every process has received them, these values can be used to divide each process's local data into "buckets". Since each process's local data is already sorted, the bucket boundaries can be determined by performing a binary search for each splitter; performing these searches in order yields a displacement and a count for each of the npes buckets. Each process's rank identifies the bucket it will own: process 0 should receive, from every process, all data less than the first splitter; process 1 should receive the next chunk of data from every process, and so on. These transfers can be performed by first exchanging the bucket sizes with a call to MPI_Alltoall, then computing the displacement of each incoming chunk in each process, and finally transferring the data with a call to MPI_Alltoallv.
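
A sketch of this exchange is given below, again as a fragment for the program above (it assumes localData is sorted and splitters holds the npes-1 values filled in rank 0; a linear scan is shown where the boundary search happens, but a binary search serves the same purpose; exchangeBuckets and the output parameters recvData/recvTotal are illustrative names):

/* Broadcast the splitters, compute per-bucket counts and displacements,
   and move the data so that process p ends up holding bucket p. */
void exchangeBuckets(int *localData, long localN, int *splitters, int npes,
                     int **recvData, int *recvTotal)
{
    /* Every process needs the splitters chosen in rank 0. */
    MPI_Bcast(splitters, npes - 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Because localData is sorted, bucket b ends at the first index whose
       value is >= splitters[b]; a scan is shown, a binary search also works. */
    int *sendCounts = calloc(npes, sizeof(int));
    int *sendDispls = calloc(npes, sizeof(int));
    long idx = 0;
    for (int b = 0; b < npes - 1; b++) {
        sendDispls[b] = (int)idx;
        while (idx < localN && localData[idx] < splitters[b])
            idx++;
        sendCounts[b] = (int)idx - sendDispls[b];
    }
    sendDispls[npes - 1] = (int)idx;
    sendCounts[npes - 1] = (int)(localN - idx);

    /* Exchange bucket sizes, then compute receive displacements. */
    int *recvCounts = malloc(npes * sizeof(int));
    MPI_Alltoall(sendCounts, 1, MPI_INT, recvCounts, 1, MPI_INT, MPI_COMM_WORLD);

    int *recvDispls = malloc(npes * sizeof(int));
    recvDispls[0] = 0;
    for (int p = 1; p < npes; p++)
        recvDispls[p] = recvDispls[p - 1] + recvCounts[p - 1];
    *recvTotal = recvDispls[npes - 1] + recvCounts[npes - 1];

    /* Move the actual data into this process's bucket. */
    *recvData = malloc((*recvTotal) * sizeof(int));
    MPI_Alltoallv(localData, sendCounts, sendDispls, MPI_INT,
                  *recvData, recvCounts, recvDispls, MPI_INT, MPI_COMM_WORLD);

    free(sendCounts); free(sendDispls); free(recvCounts); free(recvDispls);
}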

Once the appropriate bucket data has been obtained by each process, each process should sort its bucket data using a tool such as qsort. Now, all data is sorted with the smallest data being stored in the process with the lowest rank and the largest data being stored in the process with the highest rank. Such data can be gathered in one process and printed. Your program should perform these operations.
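
The last phase might look like the following fragment (it assumes recvData and recvTotal come from the exchange sketched above, startTime was recorded with MPI_Wtime right after MPI_Init, and cmpInt is the comparator defined earlier; sortAndPrint is an illustrative name):

/* Sort the local bucket, then gather all buckets in rank 0 and print them
   along with the elapsed run time. */
void sortAndPrint(int *recvData, int recvTotal, int rank, int npes,
                  double startTime)
{
    /* Buckets are already ordered across ranks, so sorting each one locally
       makes the concatenation of all buckets fully sorted. */
    qsort(recvData, recvTotal, sizeof(int), cmpInt);

    /* Gather the (unequal) bucket sizes, then the data itself, in rank 0. */
    int *counts = NULL, *displs = NULL, *all = NULL, total = 0;
    if (rank == 0) {
        counts = malloc(npes * sizeof(int));
        displs = malloc(npes * sizeof(int));
    }
    MPI_Gather(&recvTotal, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (int p = 0; p < npes; p++) {
            displs[p] = total;
            total += counts[p];
        }
        all = malloc(total * sizeof(int));
    }
    MPI_Gatherv(recvData, recvTotal, MPI_INT,
                all, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("The run time is %f s\n", MPI_Wtime() - startTime);
        printf("The sorted elements of the array are:\n");
        for (int i = 0; i < total; i++)
            printf("%d\n", all[i]);
        free(counts); free(displs); free(all);
    }
}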

An example of the output that your program should print is provided below.

The run time is 0.003498 s
The sorted elements of the array are:
68
81
127
415
418
515
534
559
602
691
706
709
853
872
930
949

The example above was produced with the following command:

mpirun -np 4 ./SampleSortSoln.exe 16

Note that your program should report the run time as shown above.

Submission and Execution

Execute your program on the Stampede supercomputer using 16, 32, and 64 processes to sort 128,000,000 values.

Submit the following to the eCompanion dropbox for Project 3: your program file, your three runtimes and a comparison of them, your sbatch file, your makefile, and a README file describing how to compile and run your program.