Wednesday, 19 April 2017

False Sharing

Calculation of pi:

The program will numerically compute the integral of 4/(1+x*x) from 0 to 1. 
The value of this integral is pi.


Serial Code:

The serial implementation of this code is here.
The serial code turns out to be faster than the parallel code; we will see the parallel code next.


Parallel Code using OpenMP:

The program is parallelized using OpenMP with an SPMD (single program, multiple data) algorithm.

The program will show low performance due to false sharing. In particular, sum[id] is unique to each thread, but adjacent elements of this array share a cache line, causing cache thrashing as the program runs.

You can see the complete code here


Output:

Serial Code, Output:



Parallel code using OpenMP, output:



We can clearly see that, every time we run it, the parallel version takes more time than the serial code.

Experimental Setup:

4-core Intel i7 @ 3.07 GHz, 16 GB RAM
GCC compiler - 4.9.4

Why this is slower:

The OpenMP program shows low performance due to false sharing. Although sum[id] is logically private to each thread, adjacent elements of the array live on the same cache line, so every write by one thread invalidates that line in the other cores' caches. The resulting cache thrashing outweighs the benefit of running on multiple cores.





Tuesday, 18 April 2017

Matrix Multiplication using MapReduce

Matrix Multiplication:


To multiply two matrices A and B, the necessary and sufficient condition is that the number of columns in matrix A equals the number of rows in matrix B.


How serial code works:


1. Loop over each row i of matrix A.
2. Loop over each column j of matrix B and initialize the output cell C[i][j] to 0. (This loop runs once for each row of A.)
3. Loop over each index k of the inner dimension, multiply A[i][k] by B[k][j], and add the product to C[i][j].
4. Return the output matrix C.


Map-Reduce logic:



The logic here is to send the computation of each output cell of the result matrix to its own reducer.
In matrix multiplication, the first cell of the output, (0,0), is the sum of products of the elements from row 0 of matrix A with the elements from column 0 of matrix B. To compute the value of output cell (0,0) in a separate reducer, the map phase must use (0,0) as the output key, and the value must carry the elements of row 0 of matrix A and column 0 of matrix B.

In general, each map output is a <key, value> pair whose key identifies one output cell and whose value carries the operands needed to compute that cell.

To multiply an n x n matrix M with an n-element vector v, compute

x_i = sum over j of m_ij * v_j

Assuming v fits into memory:

map: for each element m_ij, output the pair (i, m_ij * v_j)

reduce: for each key i, sum the values to obtain x_i

Implementation

Map
In the map function, each input record from the dataset is organised into a key-value pair so that a reducer can do the entire computation of the corresponding output cell. The code is given here.



Reducer
The input to the reducer is a key that corresponds to one cell of the resultant matrix, together with the values required to compute that cell. The code is given here.


Output:

You can see the output here

Experimental Setup:

4-core Intel i7 @ 3.07 GHz, 16 GB RAM
Hadoop version - 2.0
Java version - 1.8

How to run?

Follow this link

Why this will be faster:

This implementation works in phases and can run many map and many reduce tasks at once: the map tasks prepare the operands in parallel, and each reduce task finalises one output cell independently.
Because the map and reduce tasks work independently of one another, this is faster than the serial code for large inputs.







Estimation of the PI value:

How to do:

Generate N random points in the unit square. Count M, the number of points that are in the quarter circle. Then PI is approximately equal to the ratio 4 * M / N.

First, we will see the implementation of this in serial fashion:

In the serial version, a single processor generates all the random points and computes the value of PI from them.
This takes a large amount of time because every iteration must generate a random point and update the count.

You can see the serial implementation of this, here

Now let us parallelize it using MPI:

Each processor must use different random numbers. One way to ensure this is to have a single master processor generate all the random numbers and then divide them up among the workers.

You can see the implementation of the above program using MPI, here

By parallelising with MPI we see a huge difference in time: this implementation is much faster than the serial code.


From the graph we can clearly see how the time taken changes with the number of processes (Np = number of processes).

Experimental Setup:


4-core Intel i7 @ 3.07 GHz, 16 GB RAM
Hadoop version - 2.0.3
mpicc version - 2.0.2

How to run?

Follow this link

Why this is faster:


Each processor calculates its share of the PI estimate independently, so with n processors the speedup can approach n (ideally, when communication overhead is negligible).

You can see the output of the program here