next_inactive up previous


Clubmask User Guide

Nicholas Henke
Liniac Project, University of Pensylvania
http://www.liniac.upenn.edu


Date: 12/15/2003, HEAD


Contents

1 Introduction

This guide will help you to get familiar with the user environment, scheduling commands, and job parameters that are used by CLUBMASK. If you have any ideas or questions about this document, please feel free to contact us.

1.1 Contact Information

There are various ways to get in contact with the Clubmask developers.

Clubmask Home Page

Clubmask's Sourceforge.net Project Page

Clubmask Mailing Lists

Clubmask Bugs and Feature Requests

#clubmask channel on IRC server, irc.freenode.net

For normal problems and requests, it is most effective to subscribe first read the archives for the clubmask-users mailing list, and if the issue has not been addressed, to email the list. For help in debugging strange or complex issues, you may first check to see if we are on IRC, as we are most days during business hours ( Eastern Standard Time).

1.2 Style

This is just a quick note on the style of this guide. There should be a lot more content here, but for now just the important stuff. In this guide, there are examples, and you may notice lines ending in '\'. This is not a literal character to quote, rather an indication of a line that is too long being continued on the next line. In the following example, the command is really one line, but appears on 2 lines.

henken@roughneck docs $ ls -la ~henken/src/clubmask/clubmask/lib/\  
python/database/ClubmaskClass.py 

2 Basic Cluster Information

There are a few basic conceps that need to be understood before using a cluster.

2.1 Nodes

A node is a machine. The machine may have multiple processors, different network adapters, and varying amounts of ram and diskspace, but all in all, it is just a machine. Simple eh?


2.2 Jobs

A job is a basic 'unit of work'. Jobs typically consist of common computations grouped together. There are two main types of jobs that can be run on a cluster.

2.2.1 Batch Job

A Batch job is a job that consists of computations that are run from a shell script which runs individual computations on nodes. Batch jobs are usually scripts that start the computations on the nodes, without any user interaction. A Batch job can also be a script to setup and run an MPI program, or someother type of paralell job.

2.2.2 Interactive Job

An Interactive job is a job where the nodes are reserved for the user to log into and issues commands by hand. Interactive jobs should only be used for testing batch scripts, as they will decrease the utilization of your cluster.

2.2.3 Choosing Batch vs Interactive Jobs

In Batch style jobs, there is a script that is written and submitted to the scheduler that does of the work that the job needs to do. In comparison, in an Interactive job, the scheduler merely reserves the nodes and changes the permissions on each so that you can run commands on the nodes. Interactive jobs are usually used so that a person can become familiar with the computing environment and test their code before attempting a long production run.

When a script is submitted to the scheduler, the scheduler will take the request, and process it through it's algorithms and determine on which nodes it will run, and at what time the job will start. The same is done for an Interactive job as well. must

A common misconception is that a Batch or MPI style job will just setup the nodes for a job, and then allow you to run commands from the command line. This is simply not true. If you are going to use Batch or MPI jobs, the script that is submitted to the scheduler do all of the work. The scheduler will automatically detect when the script has exited, and return the nodes to the scheduler for further use.

When submitting a script to the scheduler, you must remember that this script will be copied to the Clubmask spool directory. If you want to use files or programs relative to the script, you will need to code in the paths, or change the working directory via cmsubmit.


3 Job Environment

3.1 Environement variables

3.1.1 Submit time capture

When a job is submitted, cmsubmit will take a snapshot of your current environment. This snapshot will be used to setup the environement for the job just before executing your job script.

3.1.2 Clubmask injected variables.

In addition to your predefined environment variables, there are a few environment variables that Clubmask uses to inform your job script.

JOBID
A string indicating the name of the job. By default your username is used, but this can be changed via cmsubmit.
NODES
Space delimited string of node numbers that your job has been assigned. If your job has been assigned a dual cpu machine, the node number for it will appear twice, similarily for nodes with quad cpus, it will appear 4 times. There is no guarantee that NODES will contain contiguous nodes, as the nodes may be scattered across the cluster. Example:
echo $NODES

0 0 5 5 2 2 9 9 

3.2 Job Standard Output and Error (stdout/stderr)

In a script job, there is always some output that is written to either standard out, or standard error. In Clubmask, the two output streams are combined into one stream, and redirected to a file. This file is named $JOBID.stdout. If you submitted a job, and were assigned a jobid of henken.1, the output would be redirected to a file named henken.1.stdout. This file is usually placed in the home directory of the user, but the placement can be changed with job submission arguments.

4 Bproc Process Environment

Clubmask is based on the Single System Image (SSI) paradigm of Beowulf clustering. When using the SSI method of executing processes on the nodes, the cluster utilizes the head node as a starting point for all of the processes. Yes that is right - the head node is used to start any process you wish to run on the nodes.

4.1 bpsh

Clubmask uses a distributed process control layer called bproc

Bproc allows you to see all of the processes running on remote nodes as if they were running on the main login server. To start programs on nodes using Bproc, you will need to use the command 'bpsh'. In Clubmask, the nodes are numbered starting at 0 and increase as per your setup. Bpsh uses that node number when executing a process.

[henken@alpha henken]$ bpsh 2 hostname

node2

        

4.1.1 Ghost Processes

When executing your processes on remote nodes, Bproc places a ghost process on the front-end machine that are used to communicate with the remote process. This ghost process is basically a place holder for the real remote process. For more inforamation, refer to the Bproc website.

This is easier to see in the following example, as you can see the process in the output on the front-end, but the output indicates the process is actually running on the remote node.

[root@struggles clubmask]# bpsh 1 sleep 5

            

Here is the output from ps:

[root@struggles autoconf-2.57]# ps -axf

  PID TTY      STAT   TIME COMMAND

  5289 pts/9    S      0:00  bpsh 1 sleep 5

  5291 pts/9    SW     0:00  \_ [sleep]

            

As you can see, the sleep process is running as if it was a kernel thread ( the square brackets ). This sleep process is actually running on node 1.

4.1.2 Standard IO Redirection

You may be used to using the shell input and output redirection characters, < or >, to grab input from a file, or to redirect output to a file. When using bpsh, it is recommended to NOT do that, rather you should use the -I ,-O, or -E flags to bpsh to redirect. For more information see man bpsh.

4.2 ssh

Depending on your cluster, the following information may or may not be applicable. Consult your cluster administrator to determine if ssh is installed.

Ssh is the standard secure shell for linux. As stated above, the nodes are numbered starting at 0, but for ssh to correctly resolve the nodenumber to a host, you must prefix the word 'node', or the appropriate node name, to the beginning of the nodenumber.

[henken@alpha henken]$ ssh node2 hostname

node2

4.3 bpsh vs ssh

There are several tradeoffs between bpsh and ssh. First and foremost is the difference in the time in which each takes to start executing the process on the node. Bpsh is several times faster than ssh, and it supports complex node list arguments that allow the processes to be started in near parallel fashion.

[henken@testcluster henken]$ time bpsh 1-5,8,19-31 hostname

node1.internal

node2.internal

node3.internal

node4.internal

node5.internal

node8.internal

node19.internal

node20.internal

node21.internal

node22.internal

node23.internal

node24.internal

node25.internal

node26.internal

node27.internal

node28.internal

node29.internal

node30.internal

node31.internal

real    0m0.082s

user    0m0.040s

sys     0m0.170s

        

To achieve the same thing via ssh you would need to invoke a separate ssh for each node.

[henken@testcluster henken]$ cat test.sh 

#!/bin/bash

for i in $(seq 100 104) $(seq 107 119); do (ssh node$i hostname) & done

wait

[henken@testcluster henken]$ time bash test.sh 

node100.genomics.cluster

node103.genomics.cluster

node102.genomics.cluster

node107.genomics.cluster

node101.genomics.cluster

node104.genomics.cluster

node108.genomics.cluster

node109.genomics.cluster

node111.genomics.cluster

node113.genomics.cluster

node110.genomics.cluster

node112.genomics.cluster

node114.genomics.cluster

node115.genomics.cluster

node116.genomics.cluster

node117.genomics.cluster

node119.genomics.cluster

node118.genomics.cluster

real    0m1.496s

user    0m1.820s

sys     0m0.150s

        

As you can see the ssh version took around 1.5 seconds, where the bpsh invocation took just under .1 seconds.

4.4 Incompatibilities

There are a few applications that do not behave properly when run under bpsh. Below are some sections for each, with solutions when possible.

4.4.1 Matlab matlab.sh

The only problem with Matlab, is that it uses the absolute path to the script to find files. As bproc changes this path for running scripts on the remote node, it fails. To work around this, you will need to explicitly use bash to run the script.

[henken@testcluster henken]$ bpsh 2 bash /path/to/matlab.sh

            

4.4.2 Java

Java and Bproc just do not play nicely together. It is due to the glibc mess associated with pthreads and the clone system call. This results in the thread not getting the correct pid, and much badness ensues. You will either need to use ssh or just not use Java.

5 Running Jobs

Here we will walk you through the necessary steps to create, submit, and run jobs. We will cover both Batch and Interactive jobs, each with their own unique set of steps. We are assuming that you have read and understand the two primer chapers on basic job information (see 2.2) and job environements (see 3).

5.1 Batch Job

The most basic Batch job will take a given set of nodes and run a command on each node. The following Batch job will use bpsh to run 'hostname' on each node.

5.1.1 Batch Job Script

A Batch job script needs to parse the environemnt variables given by the scheduler, and start the processes.

#!/bin/bash

# Echo variables from scheduler

/bin/echo "Starting job:$JOBID on $NUMNODES nodes:$NODES"

# now run bpsh on each node

for node in $NODES;

do

    (bpsh $node hostname)&

done

The (...)& around the bpsh $node hostname executes the bpsh in a background subshell. This allows all of the commands to run in parallel.

5.1.2 Submitting the Batch Job

Now that we have a script for the scheduler to run, we need to submit it to be run. There are 2 important arguments that are needed when submitting a script. These are the number of nodes on which to run, and the wall clock duration of the job. To run a job on 4 processors for 10 minutes:

[henken@testcluster jobs]$ cmsubmit -p 4 -t 10 basic.sh 

_______________________________________________________________

Welcome to Clubmask!!!

_______________________________________________________________

==================================================

Job Summary

==================================================

Job Identification Number:  0212141641.546417

Username:  henken

Group:  2940

Queue:  Batch

Number of Processors:  4

Maximum Wall-clock Run-Time (minutes):  10

Program:  /mnt/io1/genomics/home/henken/jobs/basic.sh

Program Arguments:  

Initial Working Directory: /mnt/io1/genomics/home/henken/jobs

Job Output Directory: /home/henken

==================================================

Commit or Abort: ('c', 'a') ? c

please wait while database is updated...

please wait while database connection is closed...

         showq

Now that the job is submitted, we want to know if the job is executing. The use of the command will show the state of the scheduler.

[henken@testcluster jobs]$ showq

ACTIVE JOBS----------

JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

0212141641.546417    henken    Running     4     0:10:00  Wed Feb 12 14:17:31

     1 Active Job        4 of   64 Processors Active (6.25%)

                         2 of   32 Nodes Active      (6.25%)

IDLE JOBS-----------

JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS--------

JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0

        

You can see that the job is executing.

5.1.3 Batch Job Output

Now that the job is running, you may wish to watch the progress of the job. This can be done by viewing the stdout file for the job.

[henken@testcluster jobs]$ tail -f ~/0212141641.546417.stdout 

Starting job:0212141641.546417 with command:/usr/local/clubmask/packages/maui/run/0212141641.546417.program and args: to run on

nodes:1 1 21 21

Starting job:0212141641.546417 on 4 nodes:1 1 21 21

node1.internal

node1.internal

node21.internal

node21.internal

        

You can now watch the your job produce output as it executes. The job will run to completion, and when the scheduler detects that all of the proceeses are finished, your job will be marked as done, with the nodes that were assigned placed back in the queue for further use.

The stdout file above will contain all of the output for your job. It is there you should look for any errors.

5.2 Interactive job

The baic idea behind an Interactive job is asking the scheduler to reserve some set of nodes so that commands may be issued to the nodes from the command line. The first step is the request the nodes from the scheduler, as the following does for 4 processors for 30 minutes.

[henken@beta henken]$ cmsubmit -p 4 -I -t 30

_______________________________________________________________

Welcome to Clubmask!!!

_______________________________________________________________

==================================================

Job Summary

==================================================

Job Identification Number:  0212155550.389526

Username:  henken

Group:  2940

Job Classification:  Interactive

Number of Processors:  4

Maximum Wall-clock Run-Time (minutes):  30

Program:  

Program Arguments:  

Initial Working Directory: /home/henken

Job Output Directory: /home/henken

==================================================

Commit or Abort: ('c', 'a') ? c

please wait while database is updated...

please wait while database connection is closed..

    

Now that the nodes are requested, you must wait until the job is started. The use of the showq and checkjob commands will give you a good idea when it will start. cmgetnodes

Now that the job has reserved some nodes for us, we need to find what nodes we own. The function takes a JOBID as an arguments, and returns the nodes for that job

[henken@beta henken]$ cmgetnodes 0212160055.851849

11 11 16 16

    

Now that we 'own' nodes 11 and 16, it is time to do some work on them.

[henken@beta henken]$ bpsh 11,16 hostname

node11.internal

node16.internal

    

With that exhausting amount of computation done, we will release the nodes back into the queue.

[henken@beta henken]$ canceljob 0212160055.851849

job '0212160055.851849' cancelled

    

The use of canceljob is not entirely necessary here, as the job will be stopped at the end of the allotted time, but it is a courtesy to the other users to return them when done. If your cluster administrator notices abuse of this 'nicety', the scheduler may be instructed to limit the time and number of Interactive jobs to prevent users from tying up the entire system and preventing the execution of scripted jobs.

6 Special types of Batch jobs

The standard batch job script works just fine for a simple command, but when you really start to run jobs, there are a few types of batch processing that may help you get your work done.

6.1 MPI Job

The MPI environment in Clubmask is lam-mpi In order to write a script for a MPI job, the basics described in the Batch job section will be used. MPI stands for the Message Passing Interface, and it is one of the most popular standard for writing programs that will need inter-node communication.

The difference between a basic Batch job and an MPI job lie in the script that is submitted to the scheduler. In order to use lma-mpi, there are a few specific setup steps, as well as a different method of running the program. Below is a sample MPI script that will run the classic MPI sample program cpi. This program can be found in the section. Please pay carefull attention to the comments in the script for a description of what each step is doing.

#!/bin/bash

# prog name

PROG="/home/henken/cpi"

# Your jobid from the scheduler

/bin/echo "JOBID:$JOBID"

# The nodes the scheduler has assigned your job

/bin/echo "NODES:$NODES"

# create lamhosts file 

# This tells the lam environment what nodes you can use.

# The '-1' is neccessary to tell lam that we are going to start the mpi

# processes from the master node.

echo "-1" >> /tmp/lamnodes.$JOBID

for i in $NODES; do

    echo "$i">> /tmp/lamnodes.$JOBID

done

# boot up lam - absoluteley necessary

# This will start a lamd on each node.

lamboot -v /tmp/lamnodes.$JOBID

# run the mpi program with some args

# see mpirun -help for more details

# The recommended way of running MPI jobs utilizes one of the following two

# mpirun commands. It is usually best to use the C and N notation, as the will

# allow you to run your program on all of your nodes/cpus, but will not run a

# process on the head node. If you use an explicit c0 or n0, it will use the

# head node for processing ( which is a BadThing(tm) )

# run one copy per cpu

mpirun C $PROG

# run one copy per node

mpirun N $PROG

# clean up - not absoluteley necessary, but a darn good idea

lamclean -v

#end lam session  - absoluteley necessary

# if not done, running another mpi job may fail

lamhalt -v

        

And now for the ever interesting output.

[henken@beta alpha]$ more ~/0212153059.262914.stdout 

Starting job:0212153059.262914 with command:/usr/local/clubmask/packages/maui/run/0212153059.262914.program and args:./cpi to run on

nodes:11 11 16 16

JOBID:0212153059.262914

NODES:11 11 16 16

NUMNODES: 4

LAM 6.6b2cvs/MPI 2 C++/ROMIO/bproc - Indiana University

Executing hboot on n0 (-1 - 1 CPU)...

Executing hboot on n1 (11 - 2 CPUs)...

Executing hboot on n2 (16 - 2 CPUs)...

topology done      

Process 2 on 16

Process 0 on 11

Process 1 on 11

Process 3 on 16

pi is approximately 3.1416009869231249, Error is 0.0000083333333318

wall clock time = 2.386186

killing processes, done      

closing files, done      

sweeping traces, done      

cleaning up registered objects, done      

sweeping messages, done      

LAM 6.6b2cvs/MPI 2 C++/ROMIO/bproc - Indiana University

Shutting down LAM

LAM halted

        

7 Job Submission and Cancelation

The scheduler is setup by your system administrator with a set of policies that will determine where and when your job can run.

7.1 cmsubmit

For a complete listing of the options and arguments to cmsubmit, there is a cmsubmit manpage (see 9).

7.2 showq

Your job may take up to one minute to appear in the queue after submission.

ACTIVE JOBS----------

           JOBNAME USERNAME      STATE  PROC   REMAINING            STARTTIME

 1021213635.183226   feisha    Running     2     1:51:50  Mon Oct 21 21:36:46

 1021192537.570371    libin    Running     2  2:07:00:54  Mon Oct 21 19:25:50

     2 Active Jobs       4 of   46 Processors Active (8.70%)

                         2 of   23 Nodes Active      (8.70%)

IDLE JOBS-----------

           JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS--------

           JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 2   Active Jobs: 2   Idle Jobs: 0   Blocked Jobs: 0

        

Once you have submitted your job, the next command you will use is 'showq'. Showq will show you the running and idle queues in the scheduler. Here is an example output from showq, the ACTIVE jobs are jobs that are currently running, IDLE jobs are those waiting for free nodes on which to run, and BLOCKED jobs are those jobs failing to run. For more information on showq, see showq.

7.3 canceljob

[henken@beta henken]$ canceljob 0212155550.389526

job '0212155550.389526' cancelled

        

To cancel a job, you will use the 'canceljob' canceljob. Canceljob takes the jobid of a job as it's argument.

7.4 Additional Commands

There are a few more scheduler commands that you will find useful. The links for each command are to the official Maui scheduler webpages.

canceljob
Cancel a job.
checkjob
Provide detailed status report for specified job.
showbf
Show backfill window - show resources available for immediate use.
showq
Show queued jobs.
showres
Show existing reservations.
showstart
Show estimates of when job can/will start.

8 Samples

Listed here are the scripts and programs that are used in this user guide.

Simple Batch Job
basic.sh

MPI job script
cpi.sh

MPI test cpi program
cpi.c

9 Manpages

This chapter contains a manpage formatted document for each of the Clubmask commands.


cmsubmit

NAME

cmsubmit
Clubmask program for job submission and modification.

SYNOPSIS

cmsubmit

cmsubmit [options] -p <processors> -t <timelimit> <script [args>]

cmsubmit [options] -p <processors> -t <timelimit> -I

cmsubmit [options] -U <exiting_jobid>

DESCRIPTION

When run, cmsubmit will add a job to the Clubmask database, or update the configuration of an existing job. To submit a new job, one must specify the number of processors to use, as well as the wallclock timelimit for the job.

Most jobs submitted to Clubmask are of the batch style, which is to say that the job is a script that will initialize the environment, run the desired computation, and cleanup. The other type of job is an interactive session on the nodes. In this case, the nodes are reserved for a certain amount of time, during which the user may invoke any command to those nodes.

It is preferrable to use batch style jobs whenever possible, as the use of interactive sessions greatly reduces the utilization and efficiency of the cluster.

When submitting a batch job to the cluster, supply the path to the script, as well as any command line arguments to the script that you may need. If you desire an interactive session, replace the script and arguments with a -I.

cmsubmit also allows you to modify the configuration of an already submitted job, as long as the job as not yet started running. To change an existing job, use -U <jobid>.

All of the previous invocations of cmsubmit can be modified with the following optional arguments:

-c <queue>
Submit the job into a special queue in the Maui scheduler.
-d <directory>
Place the job standard output/error file in a directory. By default the user's home directory is used.
-E
Do not send email on job completion.
-n <name>
Use <name> as the jobid prefix. By default, the username of the user is used.
-P <processors_per_node>
Request that the job only be run on nodes with a certain number of processors.
-r <reservation>
Run the job in an existing advanced or standing reservation. These reservations are usually setup by your cluster administrator.
-q
do not print any summary information before submitting the job
-Q <qos_name>
Use a quality of service designation for Maui when running this job.
-w <directory_name>
Use this directory as the initial working directory when running your job. By default, the directory from where cmsumbit is called will be used.
-y
Do not ask for confirmation when submitting this job, just go ahead and submit it.
A job can also request certain resource requirements be available on the nodes. These resource requests take the form of:

-ARG '<comparator> value>'
 

-ARG
One of the following resource designators:

-m
RAM in the machine, in MB.
-s
Scratch space configured on disk.
-S
Virtual memory swap space.
<comparator>
One of < <= == >= >
value
An integer number

Placing arguments in the script

cmsubmit will search your script for lines starting with the text marker # CM, and use the arguments in those lines just as if you would have typed them in on the command line. This allows you to just call cmsubmit on a job script without remembering or typing the arguments. Also, when running lots of different jobs, only one script for each job needs to be maintained, not a script to submit the job, as well as a script that is the job.

# CM -p 32

# CM -y

# CM -t 123

You could now submit the above script by typing cmsubmit job.sh, with the args in the script taking place of the normal placement on the command line. If you specify arguements on the command line and in the script, the arguments from the command line will override those in the script.

EXAMPLES

A regular batch job running thing.sh with the args hello world.

cmsumit -p 2 -t 120 thing.sh hello world

BUGS

As of version 0.6, Clubmask uses the amount of virtual memory swap space as the disk space resource. This means that requesting a certain amount of scratch space will just request a certain amount of swap space. This is hoped to go away in 0.6.1 or 0.7, whenever we add monhole support for Supermon.

About this document ...

Clubmask User Guide

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 -no_subdir -show_section_numbers -toc_stars user_guide.tex

The translation was initiated by Nicholas Henke on 2003-12-15


next_inactive up previous
Nicholas Henke 2003-12-15