Homework 2
      CS 345
      Computer Organization
      10 Points
      Due Friday, Jan. 30, 2015 at the beginning of class
      Submit a hard copy of this assignment or turn it in to the dropbox. Show all work!
    
    
    1. Assume you have a program that runs in 84 s. Assume that two-thirds
    of the program can be parallelized. What is the run time and speedup
    if the parallelism is implemented ideally using 4 processors? 8
    processors? 16 processors? What is the theoretical shortest run time
    for the program? Show all work.
    
    2. Find two free (or at least free to try) benchmarking tools. 
    Hint: one such tool is on the links page of the class website. 
    For each tool, answer the following questions.
        What is the benchmark tool called and where can it be found? Provide a URL.
        Is the tool synthetic?
        What kind(s) of measurements are performed by the tool?
        What are the results of running the tool on two different computers?
        Summarize your experiments, explaining whether or not you think the tools are useful and in which situations.
    
    3. a) Write your own simple benchmarking tool to test a particular
    floating-point operation (e.g. addition, subtraction, multiplication,
    division, sine, or cosine). Your tool should run the same set of
    computations for a significant number of iterations (e.g. 10 million,
    100 million, or 1 billion) and time the run using a system timer such
    as System.nanoTime(). Run your program several times on your laptop
    and average the results. How do your results compare to those from
    the previous question? Explain.
    
    b) Log on to the LittleFe cluster computer at
    littlefe2.nwmissouri.edu using PuTTY. Login information will be
    provided in class. Instructions for using the system are provided
    at the following links: PuTTY.html, LittleFeTutorial.html.
    You do not need to submit the tutorial files as part of this
    homework.
    
    After learning how to log in and create a .c file, create a file
    called OMPTest.c with the following code:
    
      #include <stdio.h>
      #include <time.h>
      #include <omp.h>

      #define NUM_ITERATIONS 100000000L
      #define NUM_THREADS 8

      int main(int argc, char ** argv)
      {
        long i;
        double x = 3.8;

        /* CPU time used so far, in seconds */
        double start = clock()/(double)CLOCKS_PER_SEC;

        /* split the loop iterations among NUM_THREADS threads */
        #pragma omp parallel for num_threads(NUM_THREADS)
        for(i = 0; i < NUM_ITERATIONS/NUM_THREADS; i++)
          x = 8.2 * 7.2 / 1000.24;

        /* CPU time used since start, divided by the number of threads */
        printf("run time: %lf s\n",
               (clock()/(double)CLOCKS_PER_SEC - start)/NUM_THREADS);

        return 0;
      }
    
    Compile the code on LittleFe using the following command:
    
    gcc OMPTest.c -fopenmp -o OMPTest.exe
    
    Run OMPTest.exe as follows:
    
    ./OMPTest.exe
    
    Change NUM_THREADS to 6, 4, and 2, recompiling and re-running the
    program each time. Record the run time of each run and explain the
    results.
    
    4. The newest supercomputers are able to offload computations to a
    graphics card such as the NVIDIA Tesla K80 or a co-processor such as
    the Xeon Phi. Processor counts differ enormously between these two
    architectures: 4992 CUDA cores (with 24 GB of RAM) for the NVIDIA card
    vs. 60 processors for the Xeon Phi. They report approximately
    2.91 and 1.2 TFLOPS of double-precision performance, respectively.
    Why is this the case? Are these two architectures best suited for
    the same types of tasks? What kinds of tasks are these? Explain. Be
    sure to perform an internet search on both products, examine their
    specifications, and determine the difference between a CUDA core
    (processor) and a Xeon Phi core (processor).
    
    5. Given a program that executes 300,000 instructions, of which 50%
    are ADD, 29% are JUMP, 11% are DIV, and 10% are MOVE, compute the user
    run time of the program on a 2.2 GHz machine. Compute the total
    cycles first and then the run time. Assume that ADD instructions cost
    4 cycles, JUMP instructions cost 1 cycle, DIV instructions cost 8
    cycles, and MOVE instructions cost 4 cycles.