Homework 2
CS 345
Computer Organization
10 Points
Due Friday, Jan. 30, 2015 at the beginning of class
You must submit a hard copy of this assignment or turn it in to the dropbox and show all work!

1. Assume you have a program that runs in 84 s.  Assume that two-thirds of the program can be parallelized.  What are the run time and speedup if the parallelism is implemented ideally using 4 processors? 8 processors? 16 processors? What is the theoretical shortest run time for the program? Show all work.

2. Find two free (or at least free to try) benchmarking tools.  Hint: one such tool is on the links page of the class website.  For each tool, answer the following questions.
    What is the benchmark tool called and where can it be found?  Provide a URL.
    Is the tool synthetic?
    What kind(s) of measurements are performed by the tool?
    What are the results of running the tool on two different computers?
    Summarize your experiments, explaining whether or not you think the tools are useful and in which situations.

3. a) Write your own simple benchmarking tool to test a particular floating-point operation (e.g. addition, subtraction, multiplication, division, sine, cosine, etc.).  Your tool should run the same set of computations for a significant number of iterations (e.g. 10 million, 100 million, 1 billion iterations, etc.) and time the run using a system utility such as System.nanoTime().  Run your program several times on your laptop computer.  Be sure to average the results.  How do your results compare to those from the previous question?  Explain.

b) Log on to the LittleFe cluster computer at littlefe2.nwmissouri.edu using the PuTTY login.  Login information will be provided in class.  Instructions to use the system are provided at the following links: PuTTY.html, LittleFeTutorial.html .  You do not need to submit the tutorial files as part of this homework.

After learning how to log in and create a .c file, create a file called OMPTest.c with the following code:

#include <stdio.h>
#include <time.h>
#include <omp.h>

#define NUM_ITERATIONS 100000000L
#define NUM_THREADS 8
int main(int argc, char ** argv)
{
  long i;
  double x = 3.8;

  double start = clock()/(double)CLOCKS_PER_SEC;

  #pragma omp parallel for num_threads(NUM_THREADS)
  for(i = 0; i < NUM_ITERATIONS/NUM_THREADS; i++)
    x = 8.2 * 7.2 / 1000.24;

  printf("run time: %lf s\n",(clock()/(double)CLOCKS_PER_SEC - start)/NUM_THREADS);

  return 0;
}


Compile the code above on LittleFe using the following command:

gcc OMPTest.c -fopenmp -o OMPTest.exe

Run the OMPTest.exe as follows:

./OMPTest.exe

Change NUM_THREADS to 6, 4, and 2, recompiling and re-running the program each time.  Record the run time of each run and explain the results.

4. The newest supercomputers are able to offload computations to a graphics card such as the NVIDIA Tesla K80 or a co-processor such as the Xeon Phi.  Processor counts differ greatly between these two architectures: 4992 CUDA cores and 24 GB RAM for the NVIDIA card vs. 60 processors for the Xeon Phi.  The two report approximately 2.91 and 1.2 TFLOPS of double-precision performance, respectively.  Why is this the case?  Are these two architectures best suited for the same types of tasks?  What kinds of tasks are these? Explain.  Be sure to perform an internet search on both products, examine their specifications, and determine the difference between a CUDA core (processor) and a Xeon Phi core (processor).

5.  A program executes 300,000 instructions, of which 50% are ADD, 29% are JUMP, 11% are DIV, and 10% are MOVE.  Assume that ADD instructions cost 4 cycles, JUMP instructions cost 1 cycle, DIV instructions cost 8 cycles, and MOVE instructions cost 4 cycles.  Compute the user run time of the program on a 2.2 GHz machine by first computing the total cycles and then the run time.