|
Velocity Scheduler:
How to Run a Parallel Batch Job
This document provides step-by-step instructions for running a parallel batch
job on a CTC compute node, followed by a sample session that demonstrates the
steps.
|
Instructions |
This document assumes that you have successfully run a serial batch job, that
you have written a parallel program using MPI, and that you have compiled the
code and linked it with the proper libraries.
All instructions in this document should be issued from a command prompt window
on a login node unless otherwise specified.
-
Before you can submit any batch jobs, you must register your password with the
scheduler. Do this before you use the scheduler for the first time, and again
each time you change your password:
H:\Users\yourID> vsched -passwd
Before you can run an MPI job, you must also register your password with MPI.
The command below saves your encrypted password in the file .mpipass in your
home directory. You must re-register your password with MPI whenever you
change your password.
H:\Users\yourID> mpipasswd
-
Prepare a job file in the XML format shown here. All of the XML tags shown in
the example are required. The file specifies the number of nodes, the number
of minutes, the job type, your affiliation, and the script or executable to
run. The main difference between serial and parallel jobs is the number of
nodes specified.
| MyJob.xml |
<?xml version="1.0" ?>
<!-- Sample XML Job File -->
<job>
<nodes>4</nodes>
<minutes>60</minutes>
<type>batch</type>
<affiliation>vplustest</affiliation>
<run>\\tc.cornell.edu\tc\users\your_userid\your.bat</run>
</job>
|
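If you submit many jobs, the job file can be generated instead of edited by hand. The following is a minimal sketch in Python, assuming Python is available where you prepare your files. It is not part of the Velocity Scheduler, and the function name build_job_xml is hypothetical. It writes the five tags from the sample above, in the same order.

```python
import xml.etree.ElementTree as ET

# Tags taken from the sample job file above, in the same order.
REQUIRED_TAGS = ("nodes", "minutes", "type", "affiliation", "run")


def build_job_xml(nodes, minutes, affiliation, run, job_type="batch"):
    """Return the text of a job file containing all five required tags."""
    values = {
        "nodes": nodes,
        "minutes": minutes,
        "type": job_type,
        "affiliation": affiliation,
        "run": run,
    }
    job = ET.Element("job")
    for tag in REQUIRED_TAGS:
        ET.SubElement(job, tag).text = str(values[tag])
    # encoding="unicode" returns a str without an XML declaration,
    # so the declaration from the sample is added by hand.
    body = ET.tostring(job, encoding="unicode")
    return '<?xml version="1.0" ?>\n' + body + "\n"


if __name__ == "__main__":
    # Values copied from the sample job file; replace them with your own.
    print(build_job_xml(4, 60, "vplustest",
                        r"\\tc.cornell.edu\tc\users\your_userid\your.bat"))
```

The sketch puts the whole job element on one line. Whether the scheduler is sensitive to whitespace or line breaks in the job file is not stated in this document, so compare the result with a hand-written file before relying on it.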
-
Within the <run>...</run> tags you can specify any script or
executable. Unlike the script for a serial job, the script for a parallel job
will:
  - run the command vsched -m to create a machines file for mpirun to use
  - run a script to set up all the compute nodes in the machines file, e.g.
    to copy files
  - start the executable using mpirun
  - call a script to clean up all the compute nodes in the machines file, e.g.
    to delete files
Here is a sample with comments:
| wave.bat |
REM Move to the T drive
cd /D T:\
REM Create a file called "machines" on the master node
vsched -m
REM Use mpirun to run the setup script on all nodes in this job
mpirun -np 4 \\tc.cornell.edu\tc\users\%USERNAME%\batch\setup.bat
REM Copy the input file to the master node
copy \\tc.cornell.edu\tc\users\%USERNAME%\batch\wave.in T:\%USERNAME%
REM - - - - At this point, all of the nodes in the job have
REM - - - - the necessary files.
REM Move the machines file from T: to T:\%USERNAME%
cd T:\%USERNAME%
move T:\machines T:\%USERNAME%
REM Run the MPI program with mpirun.
REM Set -np to the number of tasks.
REM -wd is the working directory used by mpirun.
mpirun -wd T:\%USERNAME% -np 8 wavesend.exe 1>waveOutput.txt 2>waveError.txt
REM Copy any output unique to the master task back to the H drive.
copy /y T:\%USERNAME%\wave*.* \\tc.cornell.edu\tc\users\%USERNAME%\batch
REM Use mpirun to run the cleanup script on all nodes in this job
mpirun -np 4 \\tc.cornell.edu\tc\users\%USERNAME%\batch\cleanup.bat
REM Release the nodes
vsched -cancel
|
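The sample uses -np 4 for the setup and cleanup scripts, which matches the 4 nodes in the sample job file, and -np 8 for the program itself. When you change the node count it is easy to update one value and miss another. The following Python sketch checks a script for such mismatches. It is a hypothetical helper, not a CTC tool, and its rule (every -np value should be a whole multiple of the node count) is an assumption about how you want tasks placed, not a requirement stated by the scheduler.

```python
import re

# Matches "-np <number>" on a command line.
NP_PATTERN = re.compile(r"-np\s+(\d+)")
# Matches a batch comment line such as "REM Set -np to ...".
REM_PATTERN = re.compile(r"rem\b", re.IGNORECASE)


def find_np_values(script_text):
    """Return the -np value of every mpirun command in a batch script."""
    values = []
    for line in script_text.splitlines():
        stripped = line.strip()
        # Skip comment lines, which may mention mpirun or -np in prose.
        if REM_PATTERN.match(stripped):
            continue
        if "mpirun" not in stripped.lower():
            continue
        match = NP_PATTERN.search(stripped)
        if match:
            values.append(int(match.group(1)))
    return values


def mismatched_np(nodes, np_values):
    """Return the -np values that are not a whole multiple of nodes."""
    return [value for value in np_values if value % nodes != 0]


# The mpirun lines from the sample script, with two of its comments.
SAMPLE = r"""
REM Use mpirun to run the setup script on all nodes in this job
mpirun -np 4 \\tc.cornell.edu\tc\users\%USERNAME%\batch\setup.bat
REM Set -np to the number of tasks.
mpirun -wd T:\%USERNAME% -np 8 wavesend.exe 1>waveOutput.txt 2>waveError.txt
mpirun -np 4 \\tc.cornell.edu\tc\users\%USERNAME%\batch\cleanup.bat
"""

if __name__ == "__main__":
    np_values = find_np_values(SAMPLE)
    print(np_values)
    print(mismatched_np(4, np_values))
```

An empty list from mismatched_np means that every mpirun command in the script is consistent with the node count you passed in.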
Notice this script calls setup.bat, which copies files to the job's compute
nodes:
| setup.bat |
REM setup.bat
REM Create a clean local temp folder, T:\%USERNAME%, on each node
call TDirCreate.bat
REM Copy the executable (and data files, if necessary) to each node
copy \\tc.cornell.edu\tc\users\%USERNAME%\batch\wavesend.exe T:\%USERNAME%
|
The main batch script also calls cleanup.bat, which removes files from all of
the job's compute nodes:
| cleanup.bat |
REM cleanup.bat
REM Copy the output files to the H drive.
REM If data files are created on all nodes, be careful
REM to use unique file names, e.g. by having the program
REM build each name from its task id.
REM Note: in this sample code, only the master node has output files.
copy /Y T:\%USERNAME%\wave*.* \\tc.cornell.edu\tc\users\%USERNAME%\batch
REM Delete the local temp folder and everything in it
call TDirDelete.bat
|
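The comment in cleanup.bat advises unique file names when every node writes output. One way to get them is to build each name from the task id inside the program. The sketch below shows only the naming scheme, in Python for brevity. The function name and the wave_task prefix are hypothetical, and in a real MPI program the task id would be the MPI rank. The names are chosen so that they still match the wave*.* pattern used by the copy command.

```python
from fnmatch import fnmatch


def output_name(task_id, prefix="wave_task", ext="txt"):
    """Build a per-task output file name from the task id."""
    return f"{prefix}{task_id:03d}.{ext}"


if __name__ == "__main__":
    names = [output_name(task_id) for task_id in range(4)]
    print(names)
    # Each name matches the wildcard used by the copy command in cleanup.bat.
    # fnmatch only approximates Windows wildcard rules, but agrees here.
    print(all(fnmatch(name, "wave*.*") for name in names))
```

Zero-padding the task id keeps the files in task order when they are listed with dir.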
-
Submit the xml file from the command prompt:
H:\Users\yourID> vsched -submit job_name.xml
-
Your job should now either be running or be in the queue waiting to start. At
this point you can simply wait for it to finish, or you can view the queue:
H:\Users\yourID> vsched -q
or cancel your job:
H:\Users\yourID> vsched -c <JobID>
or restart your job:
H:\Users\yourID> vsched -r <JobID>
or use Remote Desktop Connection to log into the node where your job is
running, either to check that the job is running properly or to issue commands.
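If you want to check on a job from a script, the queue listing can be parsed. The following is a minimal sketch in Python. It assumes the column layout shown in the sample session in this document: ten whitespace-separated fields per job, with End Time split into a time and a date. The function names are hypothetical, and the layout on your system may differ.

```python
FIELDS = ("job_id", "user", "nodes", "time", "type", "stat",
          "end_time", "end_date", "master", "affiliation")


def parse_queue(listing):
    """Parse the text printed by vsched -q into a list of dicts."""
    jobs = []
    for line in listing.splitlines():
        parts = line.split()
        # Job rows start with a numeric job id; header and rule lines do not.
        if len(parts) != len(FIELDS) or not parts[0].isdigit():
            continue
        jobs.append(dict(zip(FIELDS, parts)))
    return jobs


def find_job(listing, job_id):
    """Return the row for job_id, or None if it is not in the listing."""
    for job in parse_queue(listing):
        if job["job_id"] == str(job_id):
            return job
    return None


# Queue listing copied from the sample session in this document.
SAMPLE = """\
JobId User Nodes Time Type Stat End Time Master Affiliation
-----------------------------------------------------------------------------
1737 susan 2 0:15 B C 13:52 04/27 ctc065 vplustest
"""

if __name__ == "__main__":
    job = find_job(SAMPLE, 1737)
    print(job["master"], job["stat"])
```

Rows that do not have exactly ten fields are skipped, so a job that is listed with some columns empty would not be reported by this sketch.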
|
|
Example |
This sample session begins after you have logged into a CTC login node and have
opened a command prompt window. If you have a CTC computing account, you can
use the files found in \\tc.cornell.edu\tc\VWLabs\vsched\parallel\
to run this example. Be sure to copy the files to your home folder and modify
the paths in the .xml and .bat files.
H:\Users\yourID>set PATH=%PATH%;c:\program files\velocity
H:\Users\yourID>vsched -passwd
Please enter your password : ***********
Please confirm your password : ***********
CTC_ITH\yourID data updated
H:\Users\yourID>mpipasswd
Registering password for: CTC_ITH\yourID
Login password:
Confirm password:
Encrypted password saved to H:\users\yourID\.mpipass
Press ENTER to continue...
H:\Users\yourID>dir wave*.*
Volume in drive H has no label.
Volume Serial Number is A417-76E2
Directory of H:\Users\yourID
04/27/2006 11:19 AM 1,241 wave.bat
04/28/2000 10:25 AM 11 wave.in
04/26/2006 01:36 PM 232 wave.xml
04/27/2006 08:50 AM 69,632 wavesend.exe
4 File(s) 71,116 bytes
0 Dir(s) 263,468,470,272 bytes free
H:\Users\yourID>vsched -s wave.xml
1737
H:\Users\yourID>vsched -q
JobId User Nodes Time Type Stat End Time Master Affiliation
-----------------------------------------------------------------------------
1737 susan 2 0:15 B C 13:52 04/27 ctc065 vplustest
H:\Users\yourID>dir wave*.*
Volume in drive H is fsrv10_J
Volume Serial Number is F428-FA2F
Directory of H:\Users\yourID
04/27/2006 11:19 AM 1,241 wave.bat
04/28/2000 10:25 AM 11 wave.in
04/26/2006 01:36 PM 232 wave.xml
04/27/2006 01:36 PM 0 waveError.txt
04/27/2006 01:36 PM 376 waveOutput.txt
04/27/2006 08:50 AM 69,632 wavesend.exe
6 File(s) 71,492 bytes
0 Dir(s) 263,468,470,272 bytes free
H:\Users\yourID>type waveOutput.txt
1: 1: Wave Program running
2: 2: Wave Program running
0: 0: Wave Program running
3: 3: Wave Program running
2: 2: first = 51, npoints = 25
1: 1: first = 26, npoints = 25
0: 0: points = 100, steps = 1000
3: 3: first = 76, npoints = 25
0: 0: first = 1, npoints = 25
0: first 10 points (for validation):
0: 0.00 0.06 0.12 0.19 0.25 0.31 0.36 0.42 0.48 0.53
H:\Users\yourID>
|
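The lines in waveOutput.txt appear interleaved, as is usual when several tasks write to the same stream. To read one task's output at a time, the lines can be grouped by the number at the start of each line. The following is a minimal sketch in Python. Treating that leading number as the task id is an inference from the sample output, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# A leading "<number>:" followed by the rest of the line.
PREFIX = re.compile(r"^(\d+):\s?(.*)$")


def group_by_task(output_text):
    """Group output lines by the number that prefixes each line."""
    groups = defaultdict(list)
    for line in output_text.splitlines():
        match = PREFIX.match(line.strip())
        if match:
            groups[int(match.group(1))].append(match.group(2))
    # Return a plain dict ordered by task id.
    return dict(sorted(groups.items()))


# A few lines copied from the sample output above.
SAMPLE = """\
1: 1: Wave Program running
0: 0: Wave Program running
1: 1: first = 26, npoints = 25
0: 0: points = 100, steps = 1000
0: first 10 points (for validation):
"""

if __name__ == "__main__":
    for task_id, lines in group_by_task(SAMPLE).items():
        print(f"task {task_id}:")
        for text in lines:
            print(f"  {text}")
```

Within each group the lines keep the order in which they appear in the file.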
|
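The first and npoints values in the sample output are consistent with splitting the 100 points into equal contiguous blocks, one per task. The source of wavesend.exe is not shown in this document, so the following Python sketch is only a reconstruction of that arithmetic. The function name and the handling of a remainder (extra points go to the lowest-numbered tasks) are assumptions.

```python
def block_partition(points, tasks):
    """Return one (first, npoints) pair per task, with first counted from 1."""
    base, extra = divmod(points, tasks)
    blocks = []
    first = 1
    for task_id in range(tasks):
        # Assumption: when points do not divide evenly, the lowest-numbered
        # tasks each take one extra point.
        npoints = base + (1 if task_id < extra else 0)
        blocks.append((first, npoints))
        first += npoints
    return blocks


if __name__ == "__main__":
    for task_id, (first, npoints) in enumerate(block_partition(100, 4)):
        print(f"{task_id}: first = {first}, npoints = {npoints}")
```

For 100 points and 4 tasks this gives first values of 1, 26, 51, and 76 with 25 points each, the same values that appear in the sample output.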