Wednesday, 19 April 2017

False Sharing

Calculation of pi:

The program will numerically compute the integral of 4/(1+x*x) from 0 to 1. 
The value of this integral is pi.


Serial Code:

The serial implementation of this code is here.
As we will see, the serial code turns out to be faster than the parallel code, which we look at next.
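The linked code is not reproduced in this post, but a minimal serial sketch of this numerical integration (a midpoint rule with an assumed step count num_steps; the linked version may differ) could look like:

#include <stdio.h>

/* Serial pi: integrate 4/(1+x*x) over [0,1] with the midpoint rule.
   Compile e.g.: gcc -std=c99 -O2 pi_serial.c */
int main(void)
{
    const long num_steps = 100000000;          /* assumed step count */
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;           /* midpoint of the i-th slice */
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi = %.15f\n", step * sum);
    return 0;
}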


Parallel Code using OpenMP :

The program is parallelized using OpenMP with an SPMD algorithm.

The program shows low performance due to false sharing. In particular, sum[id] is unique to each thread, but adjacent elements of this array share a cache line, causing cache thrashing as the program runs.

You can see the complete code here
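For reference, here is a minimal sketch of the SPMD pattern described above, with the per-thread sum[id] array that triggers the false sharing (NUM_THREADS and the other names are assumptions; the linked code may differ in detail):

#include <omp.h>
#include <stdio.h>

#define NUM_THREADS 4                      /* assumed thread count */

/* Compile e.g.: gcc -std=c99 -fopenmp -O2 pi_spmd.c */
int main(void)
{
    const long num_steps = 100000000;
    const double step = 1.0 / (double)num_steps;
    double sum[NUM_THREADS];               /* one partial sum per thread: adjacent
                                              elements share a cache line -> false sharing */
    int nthreads;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;

        sum[id] = 0.0;
        /* SPMD: thread id handles iterations id, id+nthrds, id+2*nthrds, ... */
        for (long i = id; i < num_steps; i += nthrds) {
            double x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);   /* repeated writes to a shared cache line */
        }
    }

    double pi = 0.0;
    for (int i = 0; i < nthreads; i++)
        pi += sum[i] * step;
    printf("pi = %.15f\n", pi);
    return 0;
}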


Output:

Serial Code, Output:



Parallel code using OpenMP, output:



We can clearly see that the parallel version takes more time than the serial code every time.

Experimental Setup:

4-core Intel i7 @ 3.07 GHz, 16 GB RAM
GCC compiler - 4.9.4

Why this is slower:

The OpenMP program shows low performance due to false sharing.
In particular, sum[id] is unique to each thread, but adjacent elements of this array share a cache line, causing cache thrashing as the program runs.





Tuesday, 18 April 2017

Matrix Multiplication using MapReduce

Matrix Multiplication:


To multiply two matrices, the necessary and sufficient condition is: "number of columns in matrix A = number of rows in matrix B".


How serial code works:


  • Loop over each row i of matrix A.
  • For each row of A, loop over each column j of matrix B and initialize the output element C[i][j] to 0.
  • Loop over each column k of matrix A, multiply A[i][k] by B[k][j], and add the product to C[i][j].
  • Return the output matrix C.
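As a sketch, these steps look like the following in C (matrix dimensions and names are illustrative):

#include <stdio.h>

/* Serial matrix multiplication: C = A x B, where A is n x m and B is m x p.
   Compile e.g.: gcc -std=c99 matmul.c */
void matmul(int n, int m, int p,
            double A[n][m], double B[m][p], double C[n][p])
{
    for (int i = 0; i < n; i++) {            /* each row of A */
        for (int j = 0; j < p; j++) {        /* each column of B */
            C[i][j] = 0.0;                   /* initialise the output cell */
            for (int k = 0; k < m; k++)      /* each column of A / row of B */
                C[i][j] += A[i][k] * B[k][j];
        }
    }
}

int main(void)
{
    double A[2][3] = {{1, 2, 3}, {4, 5, 6}};
    double B[3][2] = {{7, 8}, {9, 10}, {11, 12}};
    double C[2][2];

    matmul(2, 3, 2, A, B, C);
    printf("%g %g\n%g %g\n", C[0][0], C[0][1], C[1][0], C[1][1]);
    return 0;
}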


Map-Reduce logic:



The logic here is to send the computation of each output cell of the result matrix to a reducer.
In matrix multiplication, the first output cell (0,0) is the sum of products of the elements of row 0 of matrix A and the elements of column 0 of matrix B. To compute the value of output cell (0,0) in a separate reducer, we use (0,0) as the key emitted by the map phase, and the value should carry the array of values from row 0 of matrix A and column 0 of matrix B.

The output of the map phase is therefore a set of <key, value> pairs, where each key identifies an output cell and the value holds the elements needed to compute it.

A simpler related case: to multiply an n x n matrix M by an n-element vector v, we compute x_i = sum over j of (m_ij * v_j). Assuming v fits into memory:

map: for each matrix element m_ij, output the pair (i, m_ij * v_j)

reduce: for each key i, sum the values to obtain x_i
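To make the key/value flow concrete, here is a small single-process C sketch that only simulates the map and reduce phases for this matrix-vector case (values are illustrative; this is not Hadoop code):

#include <stdio.h>

#define N 3

struct pair { int key; double value; };        /* one emitted <key, value> record */

/* Compile e.g.: gcc -std=c99 mapreduce_sim.c */
int main(void)
{
    double m[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double v[N]    = {1, 0, 2};

    /* map phase: each element m[i][j] emits <i, m[i][j] * v[j]> */
    struct pair emitted[N * N];
    int count = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            emitted[count++] = (struct pair){ .key = i, .value = m[i][j] * v[j] };

    /* reduce phase: for each key i, sum every value emitted with that key */
    for (int i = 0; i < N; i++) {
        double x_i = 0.0;
        for (int r = 0; r < count; r++)
            if (emitted[r].key == i)
                x_i += emitted[r].value;
        printf("x[%d] = %g\n", i, x_i);
    }
    return 0;
}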

Implementation

Map
In the map function, each input record from the dataset is organised into a key-value pair such that the reducer can do the entire computation of the corresponding output cell. The code is given here.



Reducer
The input to the reducer is a key that corresponds to an output cell of the resultant matrix, together with the values required to compute that cell. The code is given here.


Output:

You can see the output here

Experimental Setup:

4-core Intel i7 @ 3.07 GHz, 16 GB RAM
Hadoop version - 2.0
Java version - 1.8

How to run?

Follow this link

Why this will be faster:

This implementation works in multiple phases and can run many map and reduce tasks. The map tasks do their part of the computation in parallel, and the reduce tasks finalise each output value.
Because the map and reduce tasks work independently of one another, this is faster than the serial code.







Estimation of the PI value:

How to do:

Generate N random points in the unit square. Count M, the number of points that are in the quarter circle. Then PI is approximately equal to the ratio 4 * M / N.

First, we will see the implementation of this in serial fashion:

A single processor generates all the random points and computes the value of PI from them.
This takes a large amount of time, because every iteration has to generate a random point and update the PI estimate.

You can see the serial implementation of this, here
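The linked code is not shown here; a minimal serial sketch of this Monte Carlo estimate (the sample count n is an assumption) might be:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Compile e.g.: gcc -std=c99 -O2 pi_monte_carlo.c */
int main(void)
{
    const long n = 10000000;                    /* number of random points (assumed) */
    long m = 0;                                 /* points inside the quarter circle */

    srand((unsigned)time(NULL));
    for (long i = 0; i < n; i++) {
        double x = (double)rand() / RAND_MAX;   /* random point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            m++;                                /* inside the quarter circle */
    }

    printf("pi is approximately %f\n", 4.0 * (double)m / (double)n);
    return 0;
}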

Now let us parallelize it using MPI:

Each processor must use different random numbers. One way to ensure this is to have a single master processor generate all the random numbers and then divide them up among the processes.

You can see the implementation of the above program using MPI, here
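Again, the linked program is not reproduced here. A minimal MPI sketch of the same idea is shown below; for simplicity it just seeds each rank differently instead of having a master distribute the random numbers, so it illustrates the approach rather than the linked code:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Compile e.g.: mpicc -std=c99 -O2 mpi_pi.c ; run with: mpirun -np 4 ./a.out */
int main(int argc, char **argv)
{
    const long n_per_rank = 10000000;         /* samples per process (assumed) */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    srand(1234u + 17u * (unsigned)rank);      /* different random numbers per rank */

    long local_hits = 0;
    for (long i = 0; i < n_per_rank; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_hits++;
    }

    /* combine the per-process counts on rank 0 */
    long total_hits = 0;
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double pi = 4.0 * (double)total_hits / (double)(n_per_rank * size);
        printf("pi is approximately %f (using %d processes)\n", pi, size);
    }

    MPI_Finalize();
    return 0;
}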

By parallelising with MPI we get a huge difference in time: this implementation is much faster than the serial code.


From the graph we can clearly see how the time taken changes as the number of processes changes (Np = number of processes).

Experimental Setup:


4-core Intel i7 @ 3.07 GHz, 16 GB RAM
Hadoop version - 2.0.3
mpicc version - 2.0.2

How to run?

Follow this link

Why this is faster:


Each processor calculates its part of the PI estimate independently, so with n processors the speedup approaches n.

You can see the output of the program here




Tuesday, 31 January 2017

What is Parallel computing?

Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously.

In simple terms, here is an example:
Suppose you and three of your friends are asked to do an assignment containing four questions, and each team member answers one question at the same time. I would call that parallel processing/computing!

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
  • A problem is broken into discrete parts that can be solved concurrently
  • Each part is further broken down to a series of instructions
  • Instructions from each part execute simultaneously on different processors
  • An overall control/coordination mechanism is employed
Advantages of Parallel Computing:

It usually provides:
  • More computational power
  • Fault tolerance
  • A larger amount of memory
  • A greater speed-up factor

Parallel Computing vs Concurrent Computing

Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn't necessarily mean they'll ever both be running at the same instant, e.g. multitasking on a single-core machine.
Concurrent execution is possible on a single processor.









Parallelism is when tasks literally run at the same time, e.g. on a multicore processor.
Parallel execution is not possible on a single processor; it requires multiple processors (or cores).
    


Monday, 30 January 2017

Before diving into Parallel computing:


Without using any parallel computing techniques, we can make our code faster by making it cache-friendly.
Let us see an example:

One classic example is iterating over a multidimensional array "inside out":
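A representative sketch of such an inside-out traversal (the array name and size are illustrative):

#include <stdio.h>

#define N 1024

static double matrix[N][N];

/* Compile e.g.: gcc -std=c99 -O2 loop_order.c */
int main(void)
{
    double total = 0.0;

    /* Cache-unfriendly: the inner loop varies the row index j, so consecutive
       accesses matrix[j][i] are a whole row (N doubles) apart in memory. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            total += matrix[j][i];

    printf("total = %f\n", total);
    return 0;
}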







The reason this is cache-inefficient is that when you access a single memory address, modern CPUs load a whole cache line of "nearby" addresses from main memory. Here the inner loop steps through the row index j, so consecutive accesses to the [j][i] entries are a full row apart in memory, and on each trip through the inner loop a cache line is evicted and reloaded with addresses near the current [j][i] entry. If this is changed to the equivalent:
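Continuing the sketch above, the traversal becomes:

    /* Cache-friendly: the inner loop varies the column index j, so consecutive
       accesses matrix[i][j] touch adjacent addresses on the same cache line. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            total += matrix[i][j];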



It will run much faster.

Friday, 20 January 2017

Parallel Computing Models


1. Shared Memory Model (without threads)
2. Threads Model

3. Distributed Memory / Message Passing Model

4. Data Parallel Model

5. Hybrid Model

6. SPMD and MPMD


Shared Memory Model (without threads):

In this programming model, processes/tasks share a common address space, which they read and write to asynchronously.

Various mechanisms such as locks/semaphores are used to control access to the shared memory, resolve contentions and to prevent race conditions and deadlocks.






Disadvantages in terms of performance:
  • It becomes more difficult to understand and manage data locality. Keeping data local to the process that works on it conserves the memory accesses, cache refreshes and bus traffic that occur when multiple processes use the same data, but controlling data locality is hard and may be beyond the control of the average user.

Threads Model:

This programming model is a type of shared memory programming.

In this model of parallel programming, a single "heavy weight" process can have multiple "light weight", concurrent execution paths.

For example,

  • The main program a.out is scheduled to run by the native operating system. a.out loads and acquires all of the necessary system and user resources to run. This is the "heavyweight" process.
  • a.out performs some serial work and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.
  • Each thread has local data, but also, shares the entire resources of a.out. This saves the overhead associated with replicating a program's resources for each thread ("lightweight"). Each thread also benefits from a global memory view because it shares the memory space of a.out.
  • A thread's work may best be described as a subroutine within the main program. Any thread can execute any subroutine at the same time as other threads.
  • Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that more than one thread is not updating the same global address at any time.
  • Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.
Implementations:

The programmer is responsible for determining the parallelism (although compilers can sometimes help)

Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.
  • POSIX Threads
    • Specified by the IEEE POSIX 1003.1c standard (1995). C Language only.
    • Part of Unix/Linux operating systems
    • Library based
    • Commonly referred to as Pthreads.
    • Very explicit parallelism; requires significant programmer attention to detail.
  • OpenMP
    • Industry standard jointly defined and endorsed by a group of major computer hardware and software vendors, organizations, and individuals.
    • Compiler directive based
    • Portable / multi-platform, including Unix and Windows platforms
    • Available in C/C++ and Fortran implementations
    • Can be very easy and simple to use - provides for "incremental parallelism". Can begin with the serial code.
  • Other threaded implementations exist, such as Microsoft's Windows threads.
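To make the threads model concrete, here is a minimal Pthreads sketch of the pattern described above: the heavyweight process spawns lightweight threads that share global memory (the thread count and names are illustrative):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                          /* illustrative thread count */

static double shared_data[NTHREADS];        /* global memory shared by all threads */

/* The "subroutine" each lightweight thread executes. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    shared_data[id] = id * id;              /* each thread updates its own slot */
    return NULL;
}

/* Compile e.g.: gcc -std=c99 -pthread threads_model.c */
int main(void)                              /* the "heavyweight" process (a.out) */
{
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        printf("shared_data[%d] = %.0f\n", i, shared_data[i]);
    return 0;
}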
More Information:

Distributed Memory / Message Passing Model:

This model demonstrates the following characteristics:

  • A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
  • Tasks exchange data through communications by sending and receiving messages.
  • Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
Implementations:

From a programming perspective, message passing implementations usually comprise a library of subroutines. Calls to these subroutines are embedded in the source code. The programmer is responsible for determining all parallelism.

Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was released in 1996 and MPI-3 in 2012. All MPI specifications are available on the web at http://www.mpi-forum.org/docs/.
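As an illustration of the cooperative send/receive operations described above, a minimal MPI sketch (run with at least two processes) might look like:

#include <mpi.h>
#include <stdio.h>

/* Task 0 sends a value from its local memory; task 1 posts the matching receive.
   Compile e.g.: mpicc -std=c99 msg.c ; run with: mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) printf("run with at least two processes\n");
    } else if (rank == 0) {
        int data = 42;                        /* local memory of task 0 */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int data = 0;                         /* local memory of task 1 */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d from task 0\n", data);
    }

    MPI_Finalize();
    return 0;
}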

More Information:

Data Parallel Model:

The data parallel model demonstrates the following characteristics:

  • Address space is treated globally
  • Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or cube.
  • A set of tasks work collectively on the same data structure, however, each task works on a different partition of the same data structure.
  • Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".
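For example, the "add 4 to every array element" operation from the last bullet could be written as a minimal OpenMP sketch (the array size is illustrative):

#include <stdio.h>

#define N 1000000

/* Compile e.g.: gcc -std=c99 -fopenmp data_parallel.c */
int main(void)
{
    static int a[N];

    /* Data parallel: every iteration applies the same operation ("add 4")
       to a different partition of the array, split across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] += 4;

    printf("a[0] = %d, a[N-1] = %d\n", a[0], a[N - 1]);
    return 0;
}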




Hybrid Model:

  • A hybrid model combines more than one of the previously described programming models.
  • Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with the threads model (OpenMP).
    • Threads perform computationally intensive kernels using local, on-node data
    • Communications between processes on different nodes occurs over the network using MPI
  • This hybrid model lends itself well to the most popular (currently) hardware environment of clustered multi/many-core machines.
  • Another similar and increasingly popular example of a hybrid model is using MPI with CPU-GPU (Graphics Processing Unit) programming.
    • MPI tasks run on CPUs using local memory and communicating with each other over a network.
    • Computationally intensive kernels are off-loaded to GPUs on-node.
    • Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).
  • Other hybrid models are common:
    • MPI with Pthreads
    • MPI with non-GPU accelerators
    • ...

SPMD and MPMD:

Single Program Multiple Data (SPMD):

  • SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
  • SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid.
  • MULTIPLE DATA: All tasks may use different data
  • SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it.
  • The SPMD model, using message passing or hybrid programming, is probably the most commonly used parallel programming model for multi-node clusters.
Multiple Program Multiple Data (MPMD):

  • Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
  • MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid.
  • MULTIPLE DATA: All tasks may use different data
  • MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition.

Thursday, 12 January 2017

Designing Parallel Programs:

  • Identify the program's Hotspots:
    • Know where most of the real work is being done. The majority of scientific and technical programs usually accomplish most of their work in a few places.
    • Profilers and performance analysis tools can help here
    • Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage.
  • Identify bottlenecks in the program:
    • Are there areas that are disproportionately slow, or cause parallelizable work to halt or be deferred? For example, I/O is usually something that slows a program down.
    • May be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessary slow areas
  • Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as in the Fibonacci sequence, where each term depends on the two terms before it.

Wednesday, 11 January 2017

OpenMP implementation of the Dining philosophers problem

Dining philosophers problem:

Five silent philosophers sit at a round table with bowls of spaghetti. Forks are placed between each pair of adjacent philosophers.
Each philosopher must alternately think and eat. However, a philosopher can only eat spaghetti when they have both left and right forks. Each fork can be held by only one philosopher and so a philosopher can use the fork only if it is not being used by another philosopher. After an individual philosopher finishes eating, they need to put down both forks so that the forks become available to others. A philosopher can take the fork on their right or the one on their left as they become available, but cannot start eating before getting both forks.
Eating is not limited by the remaining amounts of spaghetti or stomach space; an infinite supply and an infinite demand are assumed.
The problem is how to design a discipline of behavior (a concurrent algorithm) such that no philosopher will starve; i.e., each can forever continue to alternate between eating and thinking, assuming that no philosopher can know when others may want to eat or think


Here is the link for the OpenMP implementation of the Dining philosophers problem (see the comments for an explanation):

https://github.com/DHEERAJRK/Parallel-computing/blob/master/opmdining.c

Monday, 9 January 2017

CUDA Programming Basics

CUDA Programming Model Basics:

The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
  1. Declare and allocate host and device memory.
  2. Initialize host data.
  3. Transfer data from the host to the device.
  4. Execute one or more kernels.
  5. Transfer results from the device to the host.
Let us look at some CUDA programs in C:

1. A CUDA C program that adds two arrays of elements and stores the result in a third array.


Problem statement: You are given two arrays with integer/real values; your task is to add the corresponding elements of both arrays and store the results in another array.

Solution: The main task is to write the CUDA kernel, and writing the kernel is not a big task. Let us see how.
Say each array has N elements, represented by "arraySize" (here it is 5; change it accordingly). We have a helper function, addWithCuda(…), for invoking the kernel and allocating memory on the device. d_a and d_b are the device arrays that store the input elements, and d_c is the device array that stores the sum of d_a and d_b.
We launch arraySize threads to add the elements of the arrays.
So adding corresponding elements is quite easy: just keep track of the thread id and we are done. We store the id of the thread within the block in "i" and add the corresponding elements of both arrays.

Here is the link for the code :
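The linked code is not reproduced here; the following is a minimal sketch of the kernel and a simplified addWithCuda as described above (error checking omitted; an illustration, not the linked program):

#include <stdio.h>
#include <cuda_runtime.h>

const int arraySize = 5;                       /* N elements per array */

/* Each of the arraySize threads adds one pair of elements. */
__global__ void addKernel(int *d_c, const int *d_a, const int *d_b)
{
    int i = threadIdx.x;                       /* thread id within the block */
    d_c[i] = d_a[i] + d_b[i];
}

/* Simplified addWithCuda: allocate device memory, copy inputs, launch the
   kernel with arraySize threads, and copy the result back. */
void addWithCuda(int *c, const int *a, const int *b, int size)
{
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size * sizeof(int));
    cudaMalloc(&d_b, size * sizeof(int));
    cudaMalloc(&d_c, size * sizeof(int));
    cudaMemcpy(d_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<1, size>>>(d_c, d_a, d_b);     /* one block, size threads */

    cudaMemcpy(c, d_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}

/* Compile e.g.: nvcc add_arrays.cu */
int main(void)
{
    int a[arraySize] = {1, 2, 3, 4, 5};
    int b[arraySize] = {10, 20, 30, 40, 50};
    int c[arraySize] = {0};

    addWithCuda(c, a, b, arraySize);
    for (int i = 0; i < arraySize; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}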



2. CUDA C program for Matrix Multiplication :


Problem statement:
To multiply two matrices, the necessary and sufficient condition is: "number of columns in matrix A = number of rows in matrix B".

Solution: 
Loop over each row of matrix A; for each row, loop over each column of matrix B and initialize the output element C[i][j] to 0; then loop over each column k of matrix A, multiply A[i][k] by B[k][j], and add the product to C[i][j]; finally, return the output matrix C.

Here is the link for the code :