1. 84s / 3 = 28s (one third of the program is serial), leaving 56s (two thirds) parallelizable.

   For 4 processors:  parallel run time = 28s + 56s/4  = 42s;   speedup = 84s / 42s   = 2
   For 8 processors:  parallel run time = 28s + 56s/8  = 35s;   speedup = 84s / 35s   = 2.4
   For 16 processors: parallel run time = 28s + 56s/16 = 31.5s; speedup = 84s / 31.5s = 2.67

   The theoretical shortest run time is 28s (the serial portion), so the theoretical maximum
   speedup is 84s / 28s = 3.

2. Any reasonable answer was acceptable provided it listed the tool's name, link, type
   (synthetic/non-synthetic), results, and a description. Example tools include SiSoftware
   Sandra, NovaBench, and PassMark.

3. a) An example program follows below. Other answers are acceptable.

      import java.util.*;

      public class FPBenchmark {
          public static void main(String[] args) {
              final double ONE_NS = 1000000000;   // nanoseconds per second
              final int TIMES_TO_RUN = 10;
              int seed = 12345;
              Random rgen = new Random(seed);     // fixed seed for repeatable runs
              double x = 0, y = 0, f = 0;
              double avgTime = 0.0;
              for (int n = 0; n < TIMES_TO_RUN; n++) {
                  long start = System.nanoTime();
                  for (int i = 0; i < 10000000; i++) {
                      x = rgen.nextDouble();
                      y = rgen.nextDouble();
                      // Rosenbrock function: (1 - x)^2 + 100(y - x^2)^2
                      f = (1 - x) * (1 - x) + 100 * (y - x * x) * (y - x * x);
                  }
                  long end = System.nanoTime();
                  avgTime += (end - start) / ONE_NS;  // accumulate elapsed seconds
              }
              System.out.println("Average time = " + avgTime / TIMES_TO_RUN + "s");
          }
      }

      Average run time on a 2.4 GHz Core i7:      0.4384s
      Average run time on a 2.5 GHz AMD Opteron:  0.4651s

   b) The results below are from a laptop with an i7-4960HQ processor.

      Run time: 0.155512 s - 2 threads
      Run time: 0.069136 s - 4 threads
      Run time: 0.050390 s - 6 threads
      Run time: 0.039552 s - 8 threads

      These results show a reduction in run time as the number of threads increases. The tool
      used to parallelize the code for this problem, OpenMP, lets a programmer parallelize a
      loop or code segment by adding one or two #pragma statements.

4.
   Results of internet search:

   The Xeon Phi 7120X contains 61 x86 (standard Intel architecture) cores that run at
   1.238 GHz. It also provides 16GB of main memory and runs a basic Linux operating system.
   The performance claim of 1.208 TFLOPs is based on the peak theoretical double-precision
   performance of a single co-processor. The use of general-purpose x86 cores makes the unit
   well suited to both mathematical and logical operations.

   Reference:
   http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html

   The NVidia Tesla K80 contains 4992 CUDA cores and 24GB of main memory. This
   general-purpose graphics processing unit (GPGPU) uses NVidia's GK110 GPU, which runs at
   706 MHz. A GK110 may include 13 to 15 streaming multiprocessors (SMXs). Each SMX contains
   192 single-precision CUDA cores, 64 double-precision units, 32 special function units, and
   32 load/store units. CUDA cores are the units on an SMX capable of executing thread
   instructions; essentially, they are small mathematics processors. The high quantity of
   mathematics processors makes the K80 well suited to mathematical problems that can be
   parallelized, such as matrix mathematics or image processing.

   References:
   http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
   http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf

   Why do these units report 2.91 and 1.2 TFLOPs double-precision performance, respectively?
   Both units report _peak_ performance of over 1 TFLOPs for double-precision computation
   because this is the maximum theoretical throughput of each unit: the best it can do using
   its fastest operations.

   Are the units best suited for the same types of tasks? Both architectures are well suited
   to highly parallel mathematical computations; however, GPGPUs (i.e., the NVidia Kepler)
   excel at purely mathematical computations that can be broken into many small "chunks",
   while the Intel unit uses many general-purpose CPUs, which can be programmed for both
   logical and mathematical problems. The choice of architecture is highly dependent on the
   problem type. Xeon Phi cores are fully functional general-purpose processors, whereas CUDA
   cores are components of a stream-processing unit and must all perform (or ignore) the same
   operation at the same time.

5. Weighted average CPI:

   (0.50 add * 4 cycles/add) + (0.29 jump * 1 cycle/jump)
     + (0.11 div * 8 cycles/div) + (0.10 move * 4 cycles/move)
     = 2.0 + 0.29 + 0.88 + 0.4
     = 3.57 cycles per instruction (average)

   300,000 instructions * 3.57 CPI = 1,071,000 cycles

   1,071,000 cycles / 2.2 GHz = (1.071 * 10^6 cycles) / (2.2 * 10^9 cycles/sec)
                              = 4.868 * 10^-4 s
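The speedup figures in problem 1 can be checked with a short Java sketch of Amdahl's law; the class and method names here are illustrative, not part of the original solution.

```java
public class AmdahlCheck {
    // Parallel run time: serial part plus parallel part divided among p processors
    static double runTime(double serial, double parallel, int p) {
        return serial + parallel / p;
    }

    public static void main(String[] args) {
        final double TOTAL = 84.0;               // total sequential run time (s)
        final double SERIAL = TOTAL / 3;         // one third is serial -> 28 s
        final double PARALLEL = TOTAL - SERIAL;  // two thirds parallelizable -> 56 s

        for (int p : new int[] {4, 8, 16}) {
            double t = runTime(SERIAL, PARALLEL, p);
            System.out.printf("%2d processors: run time = %.1f s, speedup = %.2f%n",
                              p, t, TOTAL / t);
        }
        // As p grows without bound, the run time approaches the serial portion alone
        System.out.printf("Limit: run time = %.1f s, max speedup = %.2f%n",
                          SERIAL, TOTAL / SERIAL);
    }
}
```

Running this reproduces the table above: 42 s / 2.00, 35 s / 2.40, 31.5 s / 2.67, and the limiting speedup of 3.00.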
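Problem 3(b) used OpenMP, which targets C/C++/Fortran; the same loop-splitting idea can be sketched in Java with plain threads. This is an illustrative analogue, not the code that produced the timings above, and the thread counts and iteration count are borrowed from the benchmark in 3(a).

```java
import java.util.Random;

public class FPBenchmarkParallel {
    static final int ITERATIONS = 10000000;

    // Split the loop iterations across numThreads threads; return elapsed seconds.
    static double runBenchmark(int numThreads) throws InterruptedException {
        Thread[] workers = new Thread[numThreads];
        long start = System.nanoTime();
        for (int t = 0; t < numThreads; t++) {
            workers[t] = new Thread(() -> {
                // Each thread gets its own RNG; sharing one Random would serialize the threads
                Random rgen = new Random();
                double x, y, f = 0;
                for (int i = 0; i < ITERATIONS / numThreads; i++) {
                    x = rgen.nextDouble();
                    y = rgen.nextDouble();
                    f = (1 - x) * (1 - x) + 100 * (y - x * x) * (y - x * x);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();  // wait for every chunk to finish
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int n : new int[] {2, 4, 8}) {
            System.out.println("Run time: " + runBenchmark(n) + " s - " + n + " threads");
        }
    }
}
```

OpenMP accomplishes the equivalent chunking with a single `#pragma omp parallel for` ahead of the loop; the manual thread bookkeeping above is what that one pragma hides.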
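The arithmetic in problem 5 can be reproduced with a few lines of Java; the instruction-mix fractions and cycle counts are taken directly from the answer above, and the class name is illustrative.

```java
public class CpiCheck {
    // Weighted-average cycles per instruction for a given instruction mix
    static double weightedCpi(double[] fraction, double[] cycles) {
        double cpi = 0;
        for (int i = 0; i < fraction.length; i++) {
            cpi += fraction[i] * cycles[i];
        }
        return cpi;
    }

    public static void main(String[] args) {
        // Mix fractions and cycle costs from problem 5: add, jump, div, move
        double cpi = weightedCpi(new double[] {0.50, 0.29, 0.11, 0.10},
                                 new double[] {4, 1, 8, 4});
        double totalCycles = 300000 * cpi;     // instructions * CPI
        double seconds = totalCycles / 2.2e9;  // divide by the 2.2 GHz clock rate

        System.out.printf("CPI = %.2f, cycles = %.0f, time = %.4e s%n",
                          cpi, totalCycles, seconds);
    }
}
```

This prints a CPI of 3.57, 1,071,000 total cycles, and a run time of about 4.868 * 10^-4 s, matching the hand calculation.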