Program 2
cs550 Operating Systems
Due date:  Tuesday Sept. 30, 2014 at 11:59:59 p.m.
15 points

Objective:  To learn about processes, threads, and parallel decomposition.

Specification:

For Program 2, you are to write two programs that will perform a blurring operation on a matrix.  The first program must be written in standard C and must use only pThreads as its parallel construct, and the second must be written in C using MPI and must use only MPI-style processes as its parallel construct.  The details of image blurring are quite simple and this program will keep them in an even simpler format: you may assume you have one square (N by N) matrix called A with generated data or data provided as input (0 to 255) and one square (3 by 3) matrix B that contains values of 1/9 in every cell. You will apply matrix B to every element of matrix A to obtain a "blurred" image in a third matrix (N by N) C. For this program, you may assume that the number of rows or columns (both are equal to N) will be divisible by the number of threads or processes available.  This means that N % NUM_THREADS = 0 and N % NUM_PROCESSES = 0.  In your programs, you are to divide blurring operations evenly between processes and threads, respectively.  This means each thread or process must perform the amount of work, which is the same number of blurring operations.  Note that a blurring operation is accomplished for one location (i,j) in matrix A by performing point to point multiplications of the contents of matrix B to A(i,j) and all of its 8 surrounding points (i-1,j-1), (i-1,j), (i-1,j+1), (i,j-1), (i,j+1), (i+1,j-1), (i+1,j), and (i+1,j+1). Be careful when working with the borders of matrix A, you may assume any locations that do not exist are equal to zero.

Threaded Program (7.5 points)

Inputs

For the threaded program, you are required to provide as input the size of the matrices A and C (this will only be one value) and the number of threads to run.  Read in this data from the console (stdin) on one line providing the matrix size first and the number of threads second.  Information on the data to be stored within the matrices A and C is provided later in this document.  For the purposes of computation, you may assume that the matrix will be of at least size 4 by 4 and that the number of threads will be greater than or equal to one.

Outputs

Upon valid input, your program should print out the first two rows of the matrix C and the last two rows of the matrix C.  This output should be neat.  That means you should include return characters between multiple lines of output and denote which row you are providing as output.  Upon invalid input, your main thread must print an error message.  For example, you must indicate that there is invalid input if the matrix size is less than 4 by 4, if the number of threads requested is less than 1, and if the number of rows of matrix C to be computed by each thread is a non-integer value (i.e. N % NUM_THREADS != 0).  Additionally, you must provide the run time of your program in seconds as output.  This time must start before you allocate the memory for your matrices and must stop after the matrix blurring is complete, but before you print your result.  You will need to #include <time.h> and use clock() to get the starting time.  Clock time values are displayed in clock units, and are stored in primitive variables of type clock_t.   You can get the run time by subtracting the start time from a call to clock() after all computations are complete.  This clock time can be converted to seconds by dividing by CLOCKS_PER_SEC, a constant in time.h.  Do NOT use another function to compute the run time.

Shared Data

You may use the following "global" variables.  These variables will be shared between the matrix blurring threads and the main thread.

static const int NUM_THREADS;  //The number of threads.
static const int N;  //This is the size of one of the sides of the matrix.  It should be possible to read in one value and store it in N.  Once set this value cannot be changed.
#define ninth (1.0/9.0) static double ** A;  //The matrix A (dynamically allocated).
static double B[3][3] = {{ninth, ninth, ninth}, {ninth, ninth, ninth}, {ninth, ninth, ninth}};  //The blurring filter.
static double ** C;  //The result of A blurred by B is stored here. This is dynamically allocated. matrix C.

Matrices A, B, and C may be shared between the main thread and the addition threads.  These matrices will be represented as 2D arrays, but A and C will be dynamically allocated.  Therefore, these must be of type double **.  The values for the number of threads and the matrix size will be provided as input.  Even though these are constants, if they are not initialized, you should be able to set their values exactly once (for example, by using an input value).

Computations

You must implement the following function in your program.   You may use additional helper functions if you so desire; however, you must use at minimum the function listed below.

void * matrix_blur_operation(void *) - this thread function must take only the thread id number as a parameter and use it to compute N/NUM_THREADS rows of the matrix C.

Main Function

The main function of this program must read in the input data as explained above, perform error checking on the data, and print the output (the first and last two rows) of the matrix blurring.  Note that the input data must be provided at the console and will be two values on one line -- the matrix size and the number of threads.  If these values are invalid as described above, you are to print an error message and exit the program immediately.  Otherwise, allocate memory for the matrices A and C.  Fill the each element of the matrices A with the current row number.  That is, if you are working with A[1][2], its value should be 1, and if you are working with A[4][3], its value should be 4.  This way it will be quite easy to tell if your matrix addition is working properly.  Assuming valid input and that the matrices A, B, and C have been created, start the appropriate number of threads from main using pthread_create within a loop.  Be sure to perform error checking to ensure these threads have started.  After all threads have started, have the main thread wait for each worker thread (those that are performing addition) to complete.  This is done by calling the pthread_join function on every worker thread.  Finally, print the results as described in the output section.

Compilation

Your program must read exactly one line of input from the keyboard (or a redirected input file).  Failure to do so will result in a loss of points on this programming assignment.  Save the source code for the threaded portion of program 2 in a file named prog2Threads.c.  Do NOT give the file a different name.  You will lose points if you give the file any other name than prog2Threads.c.

To compile your program use the following command:

    gcc prog2Threads.c -pthread -o prog2Threads.exe

This will compile your program using the gcc compiler, and it will create an executable file named prog2Threads.exe.  Assuming that there are no errors in your program, you may run it using the command ./prog2Threads.exe.

To run this program, it is suggested that you use the UNIX redirection commands to provide input and obtain output from your program.  Run your program using the following command:

prog2Threads < p2t.in > p2t.out

To obtain input from the file p2t.in, and to send the output of your program to the file named p2t.out.  You may give these input and output files any name that you wish since you will not be submitting them.  The names provided are only examples.

MPI Program (7.5 points)

Inputs

For the MPI program, you are required to provide as command line input the size of the matrices A and C (this will only be one value).  This input data will be provided on the same line as the ibrun prog2Procs.exe in the batch script.  Note that the value will be provided immediately after the .exe file.  Command line inputs are stored as parameters to the main function.  Main always has at least one command line input -- the name of the program.  This ensures the integer argc is at least one and that argv[0] is a string containing the name of the program.  Since you will use only one command line input, argc will be 2 and argv[1] will be a string containing the matrix size.  This will need to be converted to an integer and stored as the value N.  The number of processes and nodes will be provided using the -n and -N options.  The -n option allows you to specify the total number of processes to run, and the -N option allows you to specify the number of nodes to use.  On Stampede, you are limited to use at most 16 processes per node (equal to the number of CPU cores available).  Using more computational processes than there are CPU cores is called oversubscribing and is not usually permitted by the operating system on High Performance Computers.  Information on the data to be stored within the matrices A and C is provided later in this document.  For the purposes of computation, you may assume that the matrix will be of at least size 10 by 10 and that the number of processes will be greater than or equal to two.

Outputs

Upon valid input, your program should print out the first two rows of the matrix C and the last two rows of the matrix C.  This output should be neat.  That means you should include return characters between multiple lines of output and denote which row you are providing as output.  Upon invalid input, your main function must print an error message.  For example, you must indicate that there is invalid input if the matrix size is less than 10 by 10, if the number of processes requested is less than two, and if the number of rows of matrix C to be computed by each worker process is a non-integer value (i.e. N % (NUM_PROCESSES-1) != 0).  Additionally, you must provide the run time of your program in seconds as output.  This time must start before you allocate the memory for your matrices and must stop after the matrix addition is complete, but before you print your result.  You will need to use MPI_Wtime() to get the starting time.  MPI time values are displayed in seconds units, and are returned as type double including partial seconds after the decimal point.   You can get the run time by subtracting the start time from a call to MPI_Wtime() after all computations are complete.  Note that you should plan to divide your work between a server process (for example, the process with rank 0) and worker/client processes that will perform matrix computations.  Output should be provided in the server process, and the calls to the MPI wall time function should only be made from the server process.  Do NOT use another function to compute the run time.

Matrix Data

You should use the following variables.  Since you will be using processes you need to remember that these variables will NOT be shared between processes.

const int NUM_THREADS;  //The number of threads.
const int N;  //This is the size of one of the sides of the matrix.  It is possible to read in one value and store it in N.  Once set this value cannot be changed.
#define ninth (1.0/9.0) double ** A;  //The matrix A (dynamically allocated).
static double B[3][3] = {{ninth, ninth, ninth}, {ninth, ninth, ninth}, {ninth, ninth, ninth}};  //The blurring filter.
double ** C;  //The result of blurring A with B is stored in matrix C.

Matrix A may be allocated and initialized in every process.  You should only need to allocate matrix C in the server process. These matrices will be represented as 2D arrays and will be dynamically allocated.  Therefore, these must be of type double **.  The values for the number of processes and the matrix size will be provided as input.  Even though these are constants, if they are not initialized, their values may be set once (for example, by using an command line input value).

Computations (Client/Worker Processes)

You must implement both a server process (rank 0) and client/worker processes (all other ranks) in your program.  Each of the worker processes must compute the  N/(NUM_PROCESSES-1) rows of the matrix C, and send the results to the server process.

Main Function and Server Process

All processes of this program must read in the input data as explained above, and perform error checking on the data.  The server process must print the output (the first and last five rows) of the matrix addition.  Note that the input data must be provided at the command line.  If the input values are invalid as described above, you are to print an error message and exit the program immediately.  Otherwise, allocate memory for the matrix A in every process except the server process.  Every worker process should allocate and fill each element of the matrix A with the current row number.  That is, if you are working with A[1][2], its value should be 1, and if you are working with B[4][3], its value should be 4.  This way it will be quite easy to tell if your matrix blurring is working properly.  Assuming valid input and that the matrices A, B, and C have been created, the program will automatically start the requested number of processes using MPI.  The server process must attempt to receive all the data from the worker processes and store it in matrix C.  It is recommended that you perform non-blocking receives so you gain experience with the constuct.  After calling all of the non-blocking receives, you should call the MPI_Waitall function to wait for the data.  Finally, print the data and the run time as output.

Compilation

Your program must read one input from the command line.  Failure to do so will result in a loss of points on this programming assignment.  Save the source code for the MPI portion of program 2 in a file named prog2Procs.c.  Do NOT give the file a different name.  You will lose points if you give the file any other name than prog2Threads.c.

To compile your program use the following command:

    mpicc prog2Procs.c -O3 -o prog2Procs.exe

This will compile your program using the mpicc compiler, and it will create an executable file named prog2Procs.exe.  Assuming that there are no errors in your program, you may run it using the command sbatch prog2.sbatch, where a sample batch file follows below:

#!/bin/bash
#SBATCH -A TG-SEE120004
#SBATCH -n 16 -N 1
#SBATCH -J prog2Procs
#SBATCH -o prog2Procs.o%j
#SBATCH -p normal
#SBATCH -t 00:15:00
ibrun prog2Procs.exe 1500

You must provide your output file and batch script for running this program on a matrix of size 6300 by 6300 on 4 nodes distributed across 64 processes.  It is highly recommended that you try this program on LittleFe first and then attempt to scale up to using Stampede.

Submitting Your Program:

Note that you must provide informative comments for your code and follow the program guidelines.

To turn in your programs, zip your .c files, your Stampede output files, and your batch script and submit them to the eCompanion dropbox for Program 2.