Velocity Scheduler:
How to Run a Parallel Batch Job

This document provides step-by-step instructions to run a parallel batch job on a CTC compute node. The instructions are followed by a sample session demonstrating the steps.

Instructions
    This document assumes that you have successfully run a serial batch job, that you have written a parallel program using MPI, and that you have compiled the code and linked it with the proper libraries.

    All instructions in this document should be issued from a command prompt window on a login node unless otherwise specified.

  1. Before you can submit any batch jobs, you must register your password with the scheduler. Do this before you use the scheduler for the first time, and again each time you change your password:

    H:\Users\yourID> vsched -passwd

    Before you can run an MPI job, you must also register your password with MPI. This command puts the encrypted file .mpipass in your home directory. You must re-register your password with MPI whenever you change your password.

    H:\Users\yourID> mpipasswd

  2. Prepare an XML job file in the format shown here. All of the XML tags shown in the example are required. The file specifies the number of nodes, the number of minutes, the job type, your affiliation, and the script or executable to run. The main difference between a serial and a parallel job file is the number of nodes specified.
    MyJob.xml
    <?xml version="1.0" ?>
    <!-- Sample XML Job File -->
    <job>
    <nodes>4</nodes>
    <minutes>60</minutes>
    <type>batch</type>
    <affiliation>vplustest</affiliation>
    <run>\\tc.cornell.edu\tc\users\your_userid\your.bat</run>
    </job>
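
    Because every tag is required, it can help to generate the job file from a small script rather than edit it by hand. The following is a minimal Python sketch, not part of the Velocity tools: the function names build_job_xml and missing_tags are hypothetical, and the only thing taken from this document is the set of five tags and their order.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The five tags used in the sample job file, in the same order.
REQUIRED_TAGS = ("nodes", "minutes", "type", "affiliation", "run")


def build_job_xml(nodes, minutes, job_type, affiliation, run_path):
    """Return job-file text laid out like the sample, one tag per line."""
    if nodes < 1 or minutes < 1:
        raise ValueError("nodes and minutes must be positive integers")
    values = (str(nodes), str(minutes), job_type, affiliation, run_path)
    lines = ['<?xml version="1.0" ?>', "<job>"]
    for tag, value in zip(REQUIRED_TAGS, values):
        lines.append(f"<{tag}>{escape(value)}</{tag}>")
    lines.append("</job>")
    return "\n".join(lines) + "\n"


def missing_tags(xml_text):
    """Return the required tags that are absent or empty in a job file."""
    job = ET.fromstring(xml_text)
    return [t for t in REQUIRED_TAGS if not (job.findtext(t) or "").strip()]


if __name__ == "__main__":
    text = build_job_xml(4, 60, "batch", "vplustest",
                         r"\\tc.cornell.edu\tc\users\your_userid\your.bat")
    print(text)
    print("missing tags:", missing_tags(text))
```

    The check only confirms that the tags are present and non-empty. Whether the scheduler accepts particular values (for example, other job types) is not something this sketch can verify.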

  3. Within the <run>...</run> tags you can specify any script or executable. The main difference between serial and parallel jobs is that the script for a parallel job will:

    • run the command vsched -m to create a machines file for mpirun to use
    • run a script to set up all the compute nodes in the machines file, e.g. copy files
    • start the executable using mpirun
    • call a script to clean up all the compute nodes in the machines file, e.g. delete files

    Here is a sample with comments:

    wave.bat
    
    
    REM Move to the T drive
    cd /D T:\
    
    REM Create a file called "machines" on the master node
    vsched -m
                   
    REM Use mpirun to run the setup script on all nodes in this job
    mpirun -np 4 \\tc.cornell.edu\tc\users\%USERNAME%\batch\setup.bat
    
    REM Copy the input file to the master node
    copy \\tc.cornell.edu\tc\users\%USERNAME%\batch\wave.in      T:\%USERNAME%
                    
    REM - - - - At this point, all of the nodes in the job have
    REM - - - - the necessary files.
            
    REM Move the machines file from T: to T:\%USERNAME%
    cd T:\%USERNAME%
    move T:\machines T:\%USERNAME%
            
    REM Run the MPI program with mpirun.  
    REM Set -np to the number of tasks.
    REM -wd is the working directory used by mpirun.
    mpirun  -wd T:\%USERNAME% -np 8 wavesend.exe 1>waveOutput.txt 2>waveError.txt
            
    REM Copy any output unique to the master task back to the H drive.
    copy /y T:\%USERNAME%\wave*.* \\tc.cornell.edu\tc\users\%USERNAME%\batch
            
    REM Use mpirun to run the cleanup script on all nodes in this job
    mpirun -np 4 \\tc.cornell.edu\tc\users\%USERNAME%\batch\cleanup.bat
    
    REM Release the nodes
    vsched -cancel
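
    The node count in the job file and the -np values in the batch script are maintained separately, so they can drift apart. The following Python sketch, which is not part of the Velocity tools, lists the -np value on each mpirun line and compares it with <nodes>. The function names are hypothetical, and the rule it applies (each mpirun line should request at least as many tasks as there are nodes) is an assumption based on the sample above, where the setup and cleanup scripts run with -np equal to the node count.

```python
import re
import xml.etree.ElementTree as ET

# Matches lines that start with mpirun (not REM comments) and captures -np N.
NP_PATTERN = re.compile(r"^\s*mpirun\b.*?-np\s+(\d+)", re.IGNORECASE)


def mpirun_task_counts(bat_text):
    """Return (line_number, np) for every mpirun command line in a script."""
    counts = []
    for number, line in enumerate(bat_text.splitlines(), start=1):
        match = NP_PATTERN.match(line)
        if match:
            counts.append((number, int(match.group(1))))
    return counts


def check_np(bat_text, xml_text):
    """Return warnings for mpirun lines whose -np is below <nodes>.

    Assumption: a value below the node count means some nodes would be
    skipped by that command. This is inferred from the sample, not
    documented scheduler behavior.
    """
    nodes = int(ET.fromstring(xml_text).findtext("nodes"))
    return [
        f"line {number}: -np {np} is less than <nodes> {nodes}"
        for number, np in mpirun_task_counts(bat_text)
        if np < nodes
    ]


if __name__ == "__main__":
    bat = "mpirun -np 4 setup.bat\nmpirun -wd T:\\me -np 8 wavesend.exe\n"
    xml = "<job><nodes>4</nodes></job>"
    print(check_np(bat, xml) or "no mismatches found")
```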
    
    Notice this script calls setup.bat, which copies files to the job's compute nodes:
    setup.bat
    
    REM setup.bat
            
    REM Create a clean local temp folder, T:\myuserid, on each node
    call TDirCreate.bat 	 
                 
    REM Copy the executable (and data files, if necessary) to each node
    copy \\tc.cornell.edu\tc\users\%USERNAME%\batch\wavesend.exe T:\%USERNAME%
    
    The main batch script also calls cleanup.bat, which removes files from all of the job's compute nodes:
    cleanup.bat
    
    REM cleanup.bat
    
    REM Copy the output files to the H drive.
    REM If data files are created on all nodes, be careful
    REM to use unique file names, e.g. by naming them from
    REM within the program making use of the task id.
    REM Note: in this sample code, only the master node has output files.
    
    copy /Y T:\%USERNAME%\wave*.* \\tc.cornell.edu\tc\users\%USERNAME%\batch
    
    REM Delete the local temp folder and everything in it 
    call TDirDelete.bat 
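
    The comment in cleanup.bat about unique file names can be illustrated with a short naming helper. This is a Python sketch of the idea only: the function task_output_name and its naming scheme are hypothetical, and in a real MPI program the task id would come from the MPI library rather than a loop. The names are chosen so that they still match the wave*.* pattern used by the copy command.

```python
from fnmatch import fnmatch


def task_output_name(rank, prefix="waveOutput", ext="txt"):
    """Build an output file name that embeds the task id, zero-padded."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return f"{prefix}_{rank:04d}.{ext}"


if __name__ == "__main__":
    names = [task_output_name(rank) for rank in range(8)]
    for name in names:
        # Each name is distinct per task and still matches wave*.*
        print(name, fnmatch(name, "wave*.*"))
```

    With names like these, the copy step on one node cannot overwrite a file produced by a different task.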
    

  4. Submit the xml file from the command prompt:

    H:\Users\yourID> vsched -submit job_name.xml

  5. Your job should now either be running or be in the queue waiting to start. At this point you can simply wait for it to finish, or you can view the queue

    H:\Users\yourID> vsched -q

    or cancel your job

    H:\Users\yourID> vsched -c <JobID>

    or restart your job

    H:\Users\yourID> vsched -r <JobID>

    or use Remote Desktop Connection to log in to the node where your job is running, either to check that the job is running properly or to issue commands.
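
    If you want to check the queue from a script, the listing printed by vsched -q can be split into fields. The following Python sketch assumes the column layout shown in the sample session in this document and assumes that every job row has all columns filled in; rows that do not fit are skipped rather than guessed at. The names QueueEntry and parse_queue are hypothetical, and the meaning of the Type and Stat codes is not interpreted.

```python
from typing import NamedTuple


class QueueEntry(NamedTuple):
    job_id: int
    user: str
    nodes: int
    time: str
    type: str
    stat: str
    end_time: str
    master: str
    affiliation: str


def parse_queue(text):
    """Parse queue-listing text into QueueEntry records."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # A job row has 10 whitespace-separated tokens (End Time is two)
        # and starts with a numeric job id; anything else is skipped.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        (job_id, user, nodes, time, job_type, stat,
         end_clock, end_date, master, affiliation) = fields
        entries.append(QueueEntry(int(job_id), user, int(nodes), time,
                                  job_type, stat, f"{end_clock} {end_date}",
                                  master, affiliation))
    return entries


if __name__ == "__main__":
    sample = (
        "JobId   User     Nodes  Time  Type Stat  End Time    Master       Affiliation\n"
        "-----------------------------------------------------------------------------\n"
        "1737    susan        2  0:15    B   C    13:52 04/27 ctc065       vplustest\n"
    )
    for entry in parse_queue(sample):
        print(entry.job_id, entry.user, entry.master)
```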
Example
This sample session begins after you have logged in to a CTC login node and opened a command prompt window. If you have a CTC computing account, you can use the files found in
\\tc.cornell.edu\tc\VWLabs\vsched\parallel\
to run this example. Be sure to copy the files to your home folder and modify the paths in the .xml and .bat files.

H:\Users\yourID>set PATH=%PATH%;c:\program files\velocity

H:\Users\yourID>vsched -passwd
Please enter your password : ***********
Please confirm your password : ***********
CTC_ITH\yourID data updated

H:\Users\yourID>mpipasswd

Registering password for: CTC_ITH\yourID
Login password:
Confirm password:

Encrypted password saved to H:\users\yourID\.mpipass
Press ENTER to continue...


H:\Users\yourID>dir wave*.*
 Volume in drive H has no label.
 Volume Serial Number is A417-76E2

 Directory of H:\Users\yourID

04/27/2006  11:19 AM             1,241 wave.bat
04/28/2000  10:25 AM                11 wave.in
04/26/2006  01:36 PM               232 wave.xml
04/27/2006  08:50 AM            69,632 wavesend.exe
               4 File(s)         71,116 bytes
               0 Dir(s)  263,468,470,272 bytes free


H:\Users\yourID>vsched -s wave.xml
1737

H:\Users\yourID>vsched -q
JobId   User     Nodes  Time  Type Stat  End Time    Master       Affiliation
-----------------------------------------------------------------------------
1737    susan        2  0:15    B   C    13:52 04/27 ctc065       vplustest

H:\Users\yourID>dir wave*.*
 Volume in drive H is fsrv10_J
 Volume Serial Number is F428-FA2F

 Directory of H:\Users\yourID

04/27/2006  11:19 AM             1,241 wave.bat
04/28/2000  10:25 AM                11 wave.in
04/26/2006  01:36 PM               232 wave.xml
04/27/2006  01:36 PM                 0 waveError.txt
04/27/2006  01:36 PM               376 waveOutput.txt
04/27/2006  08:50 AM            69,632 wavesend.exe
               6 File(s)         71,492 bytes
               0 Dir(s)  263,468,470,272 bytes free

H:\Users\yourID>type waveOutput.txt
1: 1: Wave Program running
2: 2: Wave Program running
0: 0: Wave Program running
3: 3: Wave Program running
2: 2: first = 51, npoints = 25
1: 1: first = 26, npoints = 25
0: 0: points = 100, steps = 1000
3: 3: first = 76, npoints = 25
0: 0: first = 1, npoints = 25
0: first 10 points (for validation):
0: 0.00  0.06  0.12  0.19  0.25  0.31  0.36  0.42  0.48  0.53

H:\Users\yourID>