Nicholas Henke
Liniac Project, University of Pensylvania
http://www.liniac.upenn.edu
Date: 12/15/2003, HEAD
This guide will help you to get familiar with the user environment, scheduling commands, and job parameters that are used by CLUBMASK. If you have any ideas or questions about this document, please feel free to contact us.
There are various ways to get in contact with the Clubmask developers.
Clubmask Home Page
Clubmask's Sourceforge.net Project Page
Clubmask Mailing Lists
Clubmask Bugs and Feature Requests
#clubmask channel on IRC server, irc.freenode.net
For normal problems and requests, it is most effective to subscribe first read the archives for the clubmask-users mailing list, and if the issue has not been addressed, to email the list. For help in debugging strange or complex issues, you may first check to see if we are on IRC, as we are most days during business hours ( Eastern Standard Time).
This is just a quick note on the style of this guide. There should be a lot more content here, but for now just the important stuff. In this guide, there are examples, and you may notice lines ending in '\'. This is not a literal character to quote, rather an indication of a line that is too long being continued on the next line. In the following example, the command is really one line, but appears on 2 lines.
There are a few basic conceps that need to be understood before using a cluster.
A node is a machine. The machine may have multiple processors, different network adapters, and varying amounts of ram and diskspace, but all in all, it is just a machine. Simple eh?
A job is a basic 'unit of work'. Jobs typically consist of common computations grouped together. There are two main types of jobs that can be run on a cluster.
A Batch job is a job that consists of computations that are run from a shell script which runs individual computations on nodes. Batch jobs are usually scripts that start the computations on the nodes, without any user interaction. A Batch job can also be a script to setup and run an MPI program, or someother type of paralell job.
An Interactive job is a job where the nodes are reserved for the user to log into and issues commands by hand. Interactive jobs should only be used for testing batch scripts, as they will decrease the utilization of your cluster.
In Batch style jobs, there is a script that is written and submitted to the scheduler that does of the work that the job needs to do. In comparison, in an Interactive job, the scheduler merely reserves the nodes and changes the permissions on each so that you can run commands on the nodes. Interactive jobs are usually used so that a person can become familiar with the computing environment and test their code before attempting a long production run.
When a script is submitted to the scheduler, the scheduler will take the request, and process it through it's algorithms and determine on which nodes it will run, and at what time the job will start. The same is done for an Interactive job as well. must
A common misconception is that a Batch or MPI style job will just setup the nodes for a job, and then allow you to run commands from the command line. This is simply not true. If you are going to use Batch or MPI jobs, the script that is submitted to the scheduler do all of the work. The scheduler will automatically detect when the script has exited, and return the nodes to the scheduler for further use.
When submitting a script to the scheduler, you must remember that this script will be copied to the Clubmask spool directory. If you want to use files or programs relative to the script, you will need to code in the paths, or change the working directory via cmsubmit.
When a job is submitted, cmsubmit will take a snapshot of your current environment. This snapshot will be used to setup the environement for the job just before executing your job script.
In addition to your predefined environment variables, there are a few environment variables that Clubmask uses to inform your job script.
0 0 5 5 2 2 9 9
In a script job, there is always some output that is written to either standard out, or standard error. In Clubmask, the two output streams are combined into one stream, and redirected to a file. This file is named $JOBID.stdout. If you submitted a job, and were assigned a jobid of henken.1, the output would be redirected to a file named henken.1.stdout. This file is usually placed in the home directory of the user, but the placement can be changed with job submission arguments.
Clubmask is based on the Single System Image (SSI) paradigm of Beowulf clustering. When using the SSI method of executing processes on the nodes, the cluster utilizes the head node as a starting point for all of the processes. Yes that is right - the head node is used to start any process you wish to run on the nodes.
Clubmask uses a distributed process control layer called bproc
Bproc allows you to see all of the processes running on remote nodes as if they were running on the main login server. To start programs on nodes using Bproc, you will need to use the command 'bpsh'. In Clubmask, the nodes are numbered starting at 0 and increase as per your setup. Bpsh uses that node number when executing a process.
node2
When executing your processes on remote nodes, Bproc places a ghost process on the front-end machine that are used to communicate with the remote process. This ghost process is basically a place holder for the real remote process. For more inforamation, refer to the Bproc website.
This is easier to see in the following example, as you can see the process in the output on the front-end, but the output indicates the process is actually running on the remote node.
PID TTY STAT TIME COMMAND
5289 pts/9 S 0:00 bpsh 1 sleep 5
5291 pts/9 SW 0:00 \_ [sleep]
You may be used to using the shell input and output redirection characters, < or >, to grab input from a file, or to redirect output to a file. When using bpsh, it is recommended to NOT do that, rather you should use the -I ,-O, or -E flags to bpsh to redirect. For more information see man bpsh.
Depending on your cluster, the following information may or may not be applicable. Consult your cluster administrator to determine if ssh is installed.
Ssh is the standard secure shell for linux. As stated above, the nodes are numbered starting at 0, but for ssh to correctly resolve the nodenumber to a host, you must prefix the word 'node', or the appropriate node name, to the beginning of the nodenumber.
node2
There are several tradeoffs between bpsh and ssh. First and foremost is the difference in the time in which each takes to start executing the process on the node. Bpsh is several times faster than ssh, and it supports complex node list arguments that allow the processes to be started in near parallel fashion.
node1.internal
node2.internal
node3.internal
node4.internal
node5.internal
node8.internal
node19.internal
node20.internal
node21.internal
node22.internal
node23.internal
node24.internal
node25.internal
node26.internal
node27.internal
node28.internal
node29.internal
node30.internal
node31.internal
real 0m0.082s
user 0m0.040s
sys 0m0.170s
#!/bin/bash
for i in $(seq 100 104) $(seq 107 119); do (ssh node$i hostname) & done
wait
[henken@testcluster henken]$ time bash test.sh
node100.genomics.cluster
node103.genomics.cluster
node102.genomics.cluster
node107.genomics.cluster
node101.genomics.cluster
node104.genomics.cluster
node108.genomics.cluster
node109.genomics.cluster
node111.genomics.cluster
node113.genomics.cluster
node110.genomics.cluster
node112.genomics.cluster
node114.genomics.cluster
node115.genomics.cluster
node116.genomics.cluster
node117.genomics.cluster
node119.genomics.cluster
node118.genomics.cluster
real 0m1.496s
user 0m1.820s
sys 0m0.150s
There are a few applications that do not behave properly when run under bpsh. Below are some sections for each, with solutions when possible.
The only problem with Matlab, is that it uses the absolute path to the script to find files. As bproc changes this path for running scripts on the remote node, it fails. To work around this, you will need to explicitly use bash to run the script.
Java and Bproc just do not play nicely together. It is due to the glibc mess associated with pthreads and the clone system call. This results in the thread not getting the correct pid, and much badness ensues. You will either need to use ssh or just not use Java.
Here we will walk you through the necessary steps to create, submit, and run jobs. We will cover both Batch and Interactive jobs, each with their own unique set of steps. We are assuming that you have read and understand the two primer chapers on basic job information (see 2.2) and job environements (see 3).
The most basic Batch job will take a given set of nodes and run a command on each node. The following Batch job will use bpsh to run 'hostname' on each node.
A Batch job script needs to parse the environemnt variables given by the scheduler, and start the processes.
# Echo variables from scheduler
/bin/echo "Starting job:$JOBID on $NUMNODES nodes:$NODES"
# now run bpsh on each node
for node in $NODES;
do
(bpsh $node hostname)&
done
Now that we have a script for the scheduler to run, we need to submit it to be run. There are 2 important arguments that are needed when submitting a script. These are the number of nodes on which to run, and the wall clock duration of the job. To run a job on 4 processors for 10 minutes:
_______________________________________________________________
Welcome to Clubmask!!!
_______________________________________________________________
==================================================
Job Summary
==================================================
Job Identification Number: 0212141641.546417
Username: henken
Group: 2940
Queue: Batch
Number of Processors: 4
Maximum Wall-clock Run-Time (minutes): 10
Program: /mnt/io1/genomics/home/henken/jobs/basic.sh
Program Arguments:
Initial Working Directory: /mnt/io1/genomics/home/henken/jobs
Job Output Directory: /home/henken
==================================================
Commit or Abort: ('c', 'a') ? c
please wait while database is updated...
please wait while database connection is closed...
showq
ACTIVE JOBS----------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
0212141641.546417 henken Running 4 0:10:00 Wed Feb 12 14:17:31
1 Active Job 4 of 64 Processors Active (6.25%)
2 of 32 Nodes Active (6.25%)
IDLE JOBS-----------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
0 Idle Jobs
BLOCKED JOBS--------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
Total Jobs: 1 Active Jobs: 1 Idle Jobs: 0 Blocked Jobs: 0
Now that the job is running, you may wish to watch the progress of the job. This can be done by viewing the stdout file for the job.
Starting job:0212141641.546417 with command:/usr/local/clubmask/packages/maui/run/0212141641.546417.program and args: to run on
nodes:1 1 21 21
Starting job:0212141641.546417 on 4 nodes:1 1 21 21
node1.internal
node1.internal
node21.internal
node21.internal
The stdout file above will contain all of the output for your job. It is there you should look for any errors.
The baic idea behind an Interactive job is asking the scheduler to reserve some set of nodes so that commands may be issued to the nodes from the command line. The first step is the request the nodes from the scheduler, as the following does for 4 processors for 30 minutes.
_______________________________________________________________
Welcome to Clubmask!!!
_______________________________________________________________
==================================================
Job Summary
==================================================
Job Identification Number: 0212155550.389526
Username: henken
Group: 2940
Job Classification: Interactive
Number of Processors: 4
Maximum Wall-clock Run-Time (minutes): 30
Program:
Program Arguments:
Initial Working Directory: /home/henken
Job Output Directory: /home/henken
==================================================
Commit or Abort: ('c', 'a') ? c
please wait while database is updated...
please wait while database connection is closed..
Now that the job has reserved some nodes for us, we need to find what nodes we own. The function takes a JOBID as an arguments, and returns the nodes for that job
11 11 16 16
node11.internal
node16.internal
job '0212160055.851849' cancelled
The standard batch job script works just fine for a simple command, but when you really start to run jobs, there are a few types of batch processing that may help you get your work done.
The MPI environment in Clubmask is lam-mpi In order to write a script for a MPI job, the basics described in the Batch job section will be used. MPI stands for the Message Passing Interface, and it is one of the most popular standard for writing programs that will need inter-node communication.
The difference between a basic Batch job and an MPI job lie in the script that is submitted to the scheduler. In order to use lma-mpi, there are a few specific setup steps, as well as a different method of running the program. Below is a sample MPI script that will run the classic MPI sample program cpi. This program can be found in the section. Please pay carefull attention to the comments in the script for a description of what each step is doing.
# prog name
PROG="/home/henken/cpi"
# Your jobid from the scheduler
/bin/echo "JOBID:$JOBID"
# The nodes the scheduler has assigned your job
/bin/echo "NODES:$NODES"
# create lamhosts file
# This tells the lam environment what nodes you can use.
# The '-1' is neccessary to tell lam that we are going to start the mpi
# processes from the master node.
echo "-1" >> /tmp/lamnodes.$JOBID
for i in $NODES; do
echo "$i">> /tmp/lamnodes.$JOBID
done
# boot up lam - absoluteley necessary
# This will start a lamd on each node.
lamboot -v /tmp/lamnodes.$JOBID
# run the mpi program with some args
# see mpirun -help for more details
# The recommended way of running MPI jobs utilizes one of the following two
# mpirun commands. It is usually best to use the C and N notation, as the will
# allow you to run your program on all of your nodes/cpus, but will not run a
# process on the head node. If you use an explicit c0 or n0, it will use the
# head node for processing ( which is a BadThing(tm) )
# run one copy per cpu
mpirun C $PROG
# run one copy per node
mpirun N $PROG
# clean up - not absoluteley necessary, but a darn good idea
lamclean -v
#end lam session - absoluteley necessary
# if not done, running another mpi job may fail
lamhalt -v
Starting job:0212153059.262914 with command:/usr/local/clubmask/packages/maui/run/0212153059.262914.program and args:./cpi to run on
nodes:11 11 16 16
JOBID:0212153059.262914
NODES:11 11 16 16
NUMNODES: 4
LAM 6.6b2cvs/MPI 2 C++/ROMIO/bproc - Indiana University
Executing hboot on n0 (-1 - 1 CPU)...
Executing hboot on n1 (11 - 2 CPUs)...
Executing hboot on n2 (16 - 2 CPUs)...
topology done
Process 2 on 16
Process 0 on 11
Process 1 on 11
Process 3 on 16
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 2.386186
killing processes, done
closing files, done
sweeping traces, done
cleaning up registered objects, done
sweeping messages, done
LAM 6.6b2cvs/MPI 2 C++/ROMIO/bproc - Indiana University
Shutting down LAM
LAM halted
The scheduler is setup by your system administrator with a set of policies that will determine where and when your job can run.
For a complete listing of the options and arguments to cmsubmit, there is a cmsubmit manpage (see 9).
Your job may take up to one minute to appear in the queue after submission.
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
1021213635.183226 feisha Running 2 1:51:50 Mon Oct 21 21:36:46
1021192537.570371 libin Running 2 2:07:00:54 Mon Oct 21 19:25:50
2 Active Jobs 4 of 46 Processors Active (8.70%)
2 of 23 Nodes Active (8.70%)
IDLE JOBS-----------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
0 Idle Jobs
BLOCKED JOBS--------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
Total Jobs: 2 Active Jobs: 2 Idle Jobs: 0 Blocked Jobs: 0
job '0212155550.389526' cancelled
There are a few more scheduler commands that you will find useful. The links for each command are to the official Maui scheduler webpages.
Listed here are the scripts and programs that are used in this user guide.
This chapter contains a manpage formatted document for each of the Clubmask commands.
cmsubmit
cmsubmit [options] -p <processors> -t <timelimit> <script [args>]
cmsubmit [options] -p <processors> -t <timelimit> -I
cmsubmit [options] -U <exiting_jobid>
When run, cmsubmit will add a job to the Clubmask database, or update the configuration of an existing job. To submit a new job, one must specify the number of processors to use, as well as the wallclock timelimit for the job.
Most jobs submitted to Clubmask are of the batch style, which is to say that the job is a script that will initialize the environment, run the desired computation, and cleanup. The other type of job is an interactive session on the nodes. In this case, the nodes are reserved for a certain amount of time, during which the user may invoke any command to those nodes.
It is preferrable to use batch style jobs whenever possible, as the use of interactive sessions greatly reduces the utilization and efficiency of the cluster.
When submitting a batch job to the cluster, supply the path to the script, as well as any command line arguments to the script that you may need. If you desire an interactive session, replace the script and arguments with a -I.
cmsubmit also allows you to modify the configuration of an already submitted job, as long as the job as not yet started running. To change an existing job, use -U <jobid>.
All of the previous invocations of cmsubmit can be modified with the following optional arguments:
cmsubmit will search your script for lines starting with the text marker # CM, and use the arguments in those lines just as if you would have typed them in on the command line. This allows you to just call cmsubmit on a job script without remembering or typing the arguments. Also, when running lots of different jobs, only one script for each job needs to be maintained, not a script to submit the job, as well as a script that is the job.
# CM -y
# CM -t 123
A regular batch job running thing.sh with the args hello world.
cmsumit -p 2 -t 120 thing.sh hello world
As of version 0.6, Clubmask uses the amount of virtual memory swap space as the disk space resource. This means that requesting a certain amount of scratch space will just request a certain amount of swap space. This is hoped to go away in 0.6.1 or 0.7, whenever we add monhole support for Supermon.
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -no_subdir -show_section_numbers -toc_stars user_guide.tex
The translation was initiated by Nicholas Henke on 2003-12-15