Lab 11 (5 points)
        CS550, Operating Systems
        Caching
     Name: _____________________________________________
    To submit this assignment, you may copy and paste parts of the
      assignment into a text editor or word processor such as nano, vi,
      Notepad, MS Word, or OpenOffice Writer.  Zip any code and scripts
      you create, together with the output of your solutions, and submit
      the zip file to the dropbox for lab 11.  Be sure to include a text
      document containing any written, typed, or graphed results.  You
      may work with a partner on this lab, but each person must submit
      his or her own solution.
    
    The following lab is based in part upon labs provided at the CUDA
    and C++11 sessions from the SC13 conference.  Within this lab,
    you will work with a matrix multiplication program and learn how
    data locality within the CPU cache affects performance, and how
    that locality in turn depends upon the order in which data is
    accessed.
    
    Within this lab, you will test scaling and caching by using matrix
    multiplication.
    
    Download the files at the following link.
    
    Review the batch file provided below.
    
      #!/bin/bash
      #SBATCH -A TG-SEE120004
      #SBATCH -n 16
      #SBATCH -J matMult
      #SBATCH -o mm.o%j
      #SBATCH -p development
      #SBATCH -t 00:15:00
      echo 'Starting job'
      ibrun mmnf.exe 5000
      echo 'Completed job'
    
    1. What are the command line parameters provided in this batch file?
    
    2. What do you think the echo command does?
    
    Review the two C programs provided in the zip file.
    
    3. What is the purpose of the code in each file?
    
    4. What design pattern is used within the code?
    
    5. What data does each process send?
    
    6. What data does each process receive?
    
    7. Are all of these data transfers necessary?  Explain.
    
    8. Compile and run the code on Stampede using the batch scripts
    provided.  Record the resulting run times.
    
    9. Change the number of cores used to 16, 32, 128, 256, and
    512.  Record the run times and graph them against the number of
    cores, including the results from problem 8.
    
    10. Explain the results.  Consider caching and data locality in
    your answer.  Hint: consider the difference in memory location
    between matrix data in the same row (A[i][j] and A[i][j+1]) vs
    matrix data in the same column (A[i][j] and A[i+1][j]).  In which
    case would both values likely be pulled into cache together?