High Performance Computing Lesson 2

  • A good introduction to parallel computing memory models is available at LLNL.
  • Begin parallel programming by copying simple example paradigms and adapting them for your use.
  • OpenMP
    • OpenMP provides shared-memory parallelism, typically enabled by the -fopenmp flag when compiling.
    • OpenMP parallelism demands care with thread safety
    • Generally, operations which read data are thread-safe, but operations which write data may not be.
    • One easy way to ensure thread-safety is to have each thread write to a different array index (a minimal self-contained sketch follows the code fragment below).
    • OpenMP code fragment:

        #ifdef HAVE_OPENMP
        #pragma omp parallel default(shared)
        #endif
        {
        #ifdef HAVE_OPENMP
        #pragma omp for
        #endif
          for(size_t it=0;it<n_threads;it++) {

            // Copy from the initial points array into the current point
            size_t ip_size=initial_points.size();
            for(size_t ipar=0;ipar<n_params;ipar++) {
              current[it][ipar]=initial_points[it % ip_size][ipar];
            }

            if (it<ip_size) {
              // If we have a new unique initial point, then perform a
              // function evaluation using a function pointer
              func_ret[it]=func[it](n_params,current[it],w_current[it],
                                    data_arr[it]);
            } else {
              func_ret[it]=0;
            }
          }
        }
        // End of parallel region

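    • A minimal self-contained sketch of the same idea (not taken from the code above): each thread writes only to its own element of a result array, so no explicit synchronization is needed. It uses the compiler-defined _OPENMP macro rather than the HAVE_OPENMP build macro, and can be compiled with, e.g., g++ -fopenmp.

        #include <cmath>
        #include <cstddef>
        #include <cstdio>
        #include <vector>

        int main() {
          const size_t n=8;
          std::vector<double> result(n);

        #ifdef _OPENMP
        #pragma omp parallel for
        #endif
          for (size_t it=0;it<n;it++) {
            // Each iteration writes to a distinct index, so this loop
            // is thread-safe
            result[it]=std::sqrt(static_cast<double>(it));
          }

          // The results are examined only after the parallel loop ends
          for (size_t it=0;it<n;it++) {
            printf("result[%zu] = %g\n",it,result[it]);
          }
          return 0;
        }
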
  • MPI
    • MPI ranks are separate processes which do not share memory, so communication between them requires explicit calls such as MPI_Send() or MPI_Recv().
    • Code must be compiled with a special compiler wrapper, e.g. mpic++ or mpif90, and executed with a launcher such as mpirun.
    • Code must start with MPI_Init() and end with MPI_Finalize().
    • MPI code fragment (a complete, compilable sketch follows at the end of this list):

        // Get the current rank and the number of ranks
        int mpi_rank=0, mpi_size=1;
        MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
        MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);

        // If necessary, wait for the previous rank to finish before we
        // proceed, by receiving a simple message. The message is
        // typically a pointer to the top of an array of some specified
        // size; here it is an integer array of length 1.
        int tag=0, buffer=0;
        if (mpi_size>1 && mpi_rank>=1) {
          MPI_Recv(&buffer,1,MPI_INT,mpi_rank-1,
                   tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
        }

        // Read the file here
        ifstream fin("input.txt");
        fin >> temp_string;
        fin.close();

        // Send a message to the next rank to allow it to proceed
        if (mpi_size>1 && mpi_rank<mpi_size-1) {
          MPI_Send(&buffer,1,MPI_INT,mpi_rank+1,
                   tag,MPI_COMM_WORLD);
        }

    • A system physically designed as a shared memory configuration can be logically partitioned into several systems using MPI.
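    • A complete, compilable sketch of the same pattern (assuming an input file named input.txt exists), which can be built with, e.g., mpic++ and run with, e.g., mpirun -n 2:

        #include <cstdio>
        #include <fstream>
        #include <string>
        #include <mpi.h>

        int main(int argc, char *argv[]) {

          // MPI code must start with MPI_Init() and end with MPI_Finalize()
          MPI_Init(&argc,&argv);

          int mpi_rank=0, mpi_size=1;
          MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
          MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);

          // Wait for the previous rank before touching the file, so that
          // only one rank reads at a time
          int tag=0, buffer=0;
          if (mpi_size>1 && mpi_rank>=1) {
            MPI_Recv(&buffer,1,MPI_INT,mpi_rank-1,tag,
                     MPI_COMM_WORLD,MPI_STATUS_IGNORE);
          }

          // Read the file here
          std::string temp_string;
          std::ifstream fin("input.txt");
          fin >> temp_string;
          fin.close();

          // Allow the next rank to proceed
          if (mpi_size>1 && mpi_rank<mpi_size-1) {
            MPI_Send(&buffer,1,MPI_INT,mpi_rank+1,tag,MPI_COMM_WORLD);
          }

          printf("Rank %d of %d read: %s\n",mpi_rank,mpi_size,
                 temp_string.c_str());

          MPI_Finalize();
          return 0;
        }
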
  • Hybrid OpenMP/MPI
    • Most systems employ this model and/or involve GPUs (see below)
    • "Cores per node": Cores indicate maximum number of shared memory threads and the number of nodes used is the "MPI_Size" parameter.
    • Most systems are inefficient unless each job utilizes the maximum number of cores in each node
    • Can treat isospin as a hybrid OpenMP/MPI system with e.g. MPI_Size=2 and 2 threads per logical "node" (this is how I debug before moving to an HPC system); a minimal sketch of this model follows below.
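    • A sketch of the hybrid model (not taken from any particular HPC code): each MPI rank spawns its own team of OpenMP threads. Running with, e.g., OMP_NUM_THREADS=2 mpirun -n 2 ./a.out mimics the two-rank, two-thread debugging setup mentioned above.

        #include <cstdio>
        #include <mpi.h>
        #ifdef _OPENMP
        #include <omp.h>
        #endif

        int main(int argc, char *argv[]) {

          // Request thread support; MPI_THREAD_FUNNELED means only the
          // master thread makes MPI calls
          int provided;
          MPI_Init_thread(&argc,&argv,MPI_THREAD_FUNNELED,&provided);

          int mpi_rank=0, mpi_size=1;
          MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
          MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);

          // Each MPI rank runs its own team of shared-memory threads
        #ifdef _OPENMP
        #pragma omp parallel
        #endif
          {
            int i_thread=0, n_threads=1;
        #ifdef _OPENMP
            i_thread=omp_get_thread_num();
            n_threads=omp_get_num_threads();
        #endif
            printf("Rank %d of %d, thread %d of %d\n",
                   mpi_rank,mpi_size,i_thread,n_threads);
          }

          MPI_Finalize();
          return 0;
        }
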
  • GPUs
    • Often require entirely new code design and a different compiler
  • File I/O on HPC systems
    • Extremely time-consuming
    • Ensure one read/write for each set of OpenMP threads
    • Sometimes necessary to send data across MPI ranks to minimize I/O (a sketch using MPI_Bcast() follows this list)
    • Another option: parallel HDF5
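    • A sketch of minimizing I/O by communicating instead (the file name input.txt and the array size are placeholders): rank 0 reads the data and MPI_Bcast() distributes it to all other ranks.

        #include <cstdio>
        #include <fstream>
        #include <vector>
        #include <mpi.h>

        int main(int argc, char *argv[]) {

          MPI_Init(&argc,&argv);

          int mpi_rank=0, mpi_size=1;
          MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
          MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);

          // Only rank 0 touches the filesystem
          const int n_data=100;
          std::vector<double> data(n_data,0.0);
          if (mpi_rank==0) {
            std::ifstream fin("input.txt");
            for (int i=0;i<n_data && fin;i++) fin >> data[i];
            fin.close();
          }

          // Send the data to every other rank rather than having each
          // rank read the same file
          MPI_Bcast(&data[0],n_data,MPI_DOUBLE,0,MPI_COMM_WORLD);

          printf("Rank %d has data[0]=%g\n",mpi_rank,data[0]);

          MPI_Finalize();
          return 0;
        }
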
  • PBS scripts
    • Often managed by the TORQUE/Moab scheduler.
    • PBS scripts are job scripts which contain the instructions the HPC system needs to run a job.
    • They differ between systems and often require browsing the HPC documentation.
    • Example (I have intentionally made some mistakes here you will have to fix):
      		  #!/bin/bash
      
      		  # Job name. Set this to something useful
      		  #PBS -N newmcmc_debug
      
      		  # Select machine (important for ACF)
      		  #PBS -l partition=beacon
      
      		  # Account specification (important for ACF)
      		  #PBS -A UT-ACF-051
      
      		  # (This setting is different
      		  # for each different HPC system, but walltime is a very common
      		  # parameter, and is typically of the form HH:MM:SS).
      		  #PBS -l nodes=12,walltime=0:20:00
      
      		  # The stdout and stderr files. These are useful in debugging the
      		  # code
      		  #PBS -e newmcmc_debug.err
      		  #PBS -o newmcmc_debug.out
      
      		  # Use an environment variable to change to the proper directory.
      		  # Useful on ACF.
      		  cd $PBS_O_WORKDIR
      
      		  # Load modules
      		  module load gsl
      		  # Final entry in PBS file is often the main command to run.
      		  # Note absence of & sign and use of 'mpirun'. This command
      		  # must often be reworked for each HPC system
      		  mpirun -n 2 -ppn=8 ./newmcmc -initial-point last_out -mcmc \
      		  > newmcmc_debug.scr 2> newmcmc_debug.err
      		
  • Submitting jobs
    • qsub [file.pbs] submits a job
    • qdel [job id] deletes a job
    • qstat -u [username] or qstat -a lists queued and running jobs
    • showstart [job id] estimates when a job will start
  • A comparison of PBS and SLURM commands is available here.
  • HW 2: Create a .pbs script for a small bamr run and set up the required files in lustre based on the bamr command
    		bamr -set max_time 900 -set prefix bamr_debug -run default.in \
    		-model twop -mcmc
    	      
    (but don't submit the job yet). Notice that I use 900 seconds rather than the full 20 minutes of walltime, so that the run has time to finish and write its output before the walltime limit kills the job.
