High Performance Computing Lesson 2

A good introduction to parallel computing memory models is available at LLNL.
Begin parallel programming by copying simple example paradigms and adapting them for your use.
OpenMP

OpenMP is shared-memory parallelism, typically enabled by the -fopenmp flag when compiling.
OpenMP parallelism demands care with thread safety.
Generally, operations which read data are thread-safe, but operations which write data may not be.
One easy way to ensure thread safety is to make each thread write to a different array index.
OpenMP code fragment:

	
		#ifdef HAVE_OPENMP
		#pragma omp parallel default(shared)
		#endif
		{
		#ifdef HAVE_OPENMP
		#pragma omp for
		#endif
		  for(size_t it=0;it<n_threads;it++) {

		    // Copy from the initial points array into current point
		    size_t ip_size=initial_points.size();
		    for(size_t ipar=0;ipar<n_params;ipar++) {
		      current[it][ipar]=initial_points[it % ip_size][ipar];
		    }

		    if (it<ip_size) {
		      // If we have a new unique initial point, then
		      // perform a function evaluation using a function pointer
		      func_ret[it]=func[it](n_params,current[it],w_current[it],
		                            data_arr[it]);
		    } else {
		      func_ret[it]=0;
		    }
		  }
		}
		// End of parallel region
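
The thread-safety rule can also be sketched in a smaller, self-contained form. The helper names `square_all` and `sum_all` are hypothetical and are not part of the fragment above. The first loop is safe because each iteration writes only to its own array element. The second loop writes to a single shared variable, so this sketch assumes the OpenMP `reduction` clause is an acceptable way to handle it. Without -fopenmp the pragmas are ignored and the code runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: each loop iteration writes only to out[i], so no
// two threads ever write to the same array element.
std::vector<double> square_all(const std::vector<double> &in) {
  std::vector<double> out(in.size());
  // A signed loop index keeps the loop valid for older OpenMP versions
  const long n=static_cast<long>(in.size());
#pragma omp parallel for default(shared)
  for(long i=0;i<n;i++) {
    out[i]=in[i]*in[i];
  }
  return out;
}

// Hypothetical helper: every iteration adds to the same variable, which
// is a shared write. The reduction clause gives each thread a private
// copy of 'total' and combines the copies at the end of the loop.
double sum_all(const std::vector<double> &in) {
  double total=0.0;
  const long n=static_cast<long>(in.size());
#pragma omp parallel for reduction(+:total)
  for(long i=0;i<n;i++) {
    total+=in[i];
  }
  return total;
}
```

Calling `square_all()` on a vector returns the element-wise squares, and `sum_all()` returns the sum, regardless of how many threads are used.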

MPI

MPI ranks are separate processes that do not share memory, so communication between ranks requires an explicit call to, e.g., MPI_Send() or MPI_Recv().
Code must be compiled with a special compiler wrapper, e.g. mpic++ or mpif90, and executed with a launcher such as mpirun.
Code must call MPI_Init() before any other MPI call and MPI_Finalize() after the last one.
MPI code fragment:

		// Get current rank and number of ranks
		int mpi_rank=0, mpi_size=1;
		MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
		MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);
		
		// If necessary, wait for the previous rank to finish before
		// we start. Ranks coordinate by passing a simple message. The
		// message is typically a pointer to the top of an array of some
		// specified size. This is a 1-length integer array.
		int tag=0, buffer=0;
		if (mpi_size>1 && mpi_rank>=1) {
		  MPI_Recv(&buffer,1,MPI_INT,mpi_rank-1,
		           tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
		}

		// Read the file here
		string temp_string;
		ifstream fin("input.txt");
		fin >> temp_string;
		fin.close();

		// Send a message to the next rank to allow it to proceed
		if (mpi_size>1 && mpi_rank<mpi_size-1) {
		  MPI_Send(&buffer,1,MPI_INT,mpi_rank+1,
		           tag,MPI_COMM_WORLD);
		}

A system physically designed as a shared memory configuration can be logically partitioned into several systems using MPI.

Hybrid OpenMP/MPI

Most systems employ this model and/or involve GPUs (see below).
"Cores per node": the number of cores per node sets the maximum number of shared-memory threads per node and, with one MPI rank per node, the number of nodes used is the "MPI_Size" parameter.
Most systems are inefficient unless each job uses all of the cores in each node.
Can treat isospin as a hybrid OpenMP/MPI system with e.g. MPI_Size=2 and 2 threads per logical "node" (this is how I debug before moving to an HPC system).
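
The bookkeeping behind a hybrid run can be sketched as follows. The helper names `total_threads` and `rank_range` are hypothetical, and the even block split shown here is only one possible way to divide work between ranks; it assumes every rank starts the same number of OpenMP threads. No MPI calls are made, so the rank and size are passed in as plain integers.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: total number of threads in a hybrid run, assuming
// every MPI rank starts the same number of OpenMP threads.
std::size_t total_threads(std::size_t mpi_size,
                          std::size_t threads_per_rank) {
  return mpi_size*threads_per_rank;
}

// Hypothetical helper: half-open range [begin,end) of work items owned
// by 'mpi_rank' when n_items are split as evenly as possible. The first
// (n_items % mpi_size) ranks receive one extra item.
std::pair<std::size_t,std::size_t> rank_range(std::size_t n_items,
                                              std::size_t mpi_rank,
                                              std::size_t mpi_size) {
  std::size_t base=n_items/mpi_size;
  std::size_t extra=n_items%mpi_size;
  std::size_t begin=mpi_rank*base+(mpi_rank<extra ? mpi_rank : extra);
  std::size_t end=begin+base+(mpi_rank<extra ? 1 : 0);
  return std::make_pair(begin,end);
}
```

With MPI_Size=2 and 2 threads per rank, `total_threads(2,2)` gives 4, and each rank could then loop over its own range with an OpenMP `for` so that its threads share the items in that range.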

GPUs

Often require entirely new code design and a different compiler

File I/O on HPC systems

Extremely time-consuming
Ensure one read/write for each set of OpenMP threads
Sometimes necessary to send data across MPI ranks to minimize I/O
Another option: parallel HDF5
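
A minimal sketch of the "one read for each set of OpenMP threads" idea: the file is read once, outside any parallel region, and the threads then work on the in-memory copy. The helper names `read_once` and `scale_all` and the plain-text file format are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read every number in a text file in a single
// pass, performed by one thread before any parallel region starts.
std::vector<double> read_once(const std::string &path) {
  std::vector<double> data;
  std::ifstream fin(path.c_str());
  double x;
  while (fin >> x) data.push_back(x);
  return data;
}

// Hypothetical helper: the threads share the in-memory copy read-only
// and each iteration writes to its own element of 'out', so no file
// access occurs inside the parallel loop.
std::vector<double> scale_all(const std::vector<double> &data,
                              double factor) {
  std::vector<double> out(data.size());
  const long n=static_cast<long>(data.size());
#pragma omp parallel for default(shared)
  for(long i=0;i<n;i++) {
    out[i]=factor*data[i];
  }
  return out;
}
```

The same pattern applies to output: collect results in memory inside the parallel loop and write them with one call after the loop ends.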

PBS scripts

PBS is often used with Torque/Moab.
PBS scripts are job scripts which contain the instructions the HPC system needs to run a job.
They differ between systems, so writing one often requires browsing the HPC documentation.

Example (I have intentionally made some mistakes here that you will have to fix):

		  #!/bin/bash

		  # Job name. Set this to something useful
		  #PBS -N newmcmc_debug

		  # Select machine (important for ACF)
		  #PBS -l partition=beacon

		  # Account specification (important for ACF)
		  #PBS -A UT-ACF-051

		  # (This setting is different
		  # for each different HPC system, but walltime is a very common
		  # parameter, and is typically of the form HH:MM:SS).
		  #PBS -l nodes=12,walltime=0:20:00

		  # The stdout and stderr files. These are useful in debugging the
		  # code
		  #PBS -e newmcmc_debug.err
		  #PBS -o newmcmc_debug.out

		  # Use an environment variable to change to the proper directory.
		  # Useful on ACF.
		  cd $PBS_O_WORKDIR

		  # Load modules
		  module load gsl
		  # Final entry in PBS file is often the main command to run.
		  # Note absence of & sign and use of 'mpirun'. This command
		  # must often be reworked for each HPC system
		  mpirun -n 2 -ppn=8 ./newmcmc -initial-point last_out -mcmc \
		  > newmcmc_debug.scr 2> newmcmc_debug.err

Submitting jobs

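qsub [file.pbs]: submit a job script
qdel [job id]: delete a queued or running job
qstat -u [username] or qstat -a: list the status of one user's jobs or of all jobs
showstart [job id]: show the estimated start time of a job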

Comparison of PBS and Slurm here
HW 2: Create a .pbs script for a small bamr run and set up the required files in Lustre, based on the bamr command:
```
bamr -set max_time 900 -set prefix bamr_debug -run default.in \
  -model twop -mcmc
```
(but don't submit the job yet). Notice I use 900 seconds rather than the full 20 minutes?
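
One way to read these numbers is as a time margin between the code's own limit and the scheduler's walltime. The sketch below converts a walltime string of the form HH:MM:SS into seconds so the two can be compared. The helper name `walltime_seconds` is hypothetical and is not part of bamr or PBS.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: convert a walltime string of the form HH:MM:SS
// into seconds. Returns -1 if the string does not match that form.
long walltime_seconds(const std::string &walltime) {
  std::istringstream iss(walltime);
  long h=0, m=0, s=0;
  char c1=0, c2=0;
  if (!(iss >> h >> c1 >> m >> c2 >> s)) return -1;
  if (c1!=':' || c2!=':') return -1;
  return h*3600+m*60+s;
}
```

For a walltime of 0:20:00 this gives 1200 seconds, so a max_time of 900 leaves 300 seconds between the moment the code stops itself and the moment the scheduler would stop the job.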