Faster Cleanup of Batch Jobs (Detailed MPI Example)
Issue: Multi-node MPI jobs that do not end cleanly take longer to clear from the queue, and clearing must complete before other jobs can start.
Solution: Each person running MPI jobs can improve the situation by adding some cleanup steps to their batch scripts.
The referenced set of files provides a scheme for the clean shutdown of an MPI job so that it clears faster from the queue. Just for completeness, the list also contains a script for periodically copying files from T: on the master node to H:.
Typical MPI job: three mpirun commands
- setup to create T:\%USERNAME% on each node and copy files from H: to T:
- job execution, all files written to T:
- cleanup to copy files from T: to H: after execution completes
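As a rough illustration of the setup and cleanup steps, the following Python sketch creates a per-user scratch directory and copies files in and out on a single machine. It is not one of the referenced .bat scripts: the function names `stage_in` and `stage_out` and their directory arguments are hypothetical, and in the real job each step is launched by mpirun so that it runs on every node. The sketch assumes Python 3.8 or later for `dirs_exist_ok`.

```python
import getpass
import shutil
from pathlib import Path


def stage_in(home_dir, scratch_root, username=None):
    """Create scratch_root/<username> and copy the contents of home_dir into it.

    Stands in for the setup step (H: to T:). Returns the scratch path.
    """
    user = username or getpass.getuser()
    scratch = Path(scratch_root) / user
    scratch.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok lets the copy merge into the directory created above.
    shutil.copytree(home_dir, scratch, dirs_exist_ok=True)
    return scratch


def stage_out(scratch, home_dir):
    """Copy everything in the scratch directory back to home_dir.

    Stands in for the cleanup step (T: to H:).
    """
    shutil.copytree(scratch, home_dir, dirs_exist_ok=True)
```

Between the two calls the job itself would run with the scratch directory as its working directory, so that all output lands on local disk first.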
Additional Elements
- Limiting duration of a job so that it is not killed by the system
Suppose that your job times out and is killed by the system. The scheme described here is an easy way to ensure that your jobs clear in the minimum amount of time.
A background script sleeps for most of the job, then wakes up and kills the mpirun associated with the executable, which gives you an opportunity for a clean shutdown of the job. After the mpirun is killed on the master node, the next mpirun in the .bat file kills the executables on each node. You need to know the names of the executables so that they can be put in a file. The job then proceeds to the cleanup step.
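The limiting idea can be sketched in Python as follows. This is only an illustration of the logic, not the referenced limiting script: instead of a separate background script that sleeps, it waits on the launcher process with a timeout, and the name `run_with_limit` is hypothetical. Killing the launcher does not by itself stop the executables on the other nodes, which is why the scheme above follows it with a separate kill step.

```python
import subprocess
import sys


def run_with_limit(cmd, limit_seconds):
    """Run cmd and kill it if it is still running after limit_seconds.

    Returns (timed_out, return_code). The caller can then move on to the
    kill-executables and cleanup steps instead of being killed mid-run.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=limit_seconds)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()  # plays the role of killing mpirun on the master node
        proc.wait()  # reap the process so it does not linger
        return True, proc.returncode


if __name__ == "__main__":
    # Stand-in for a long-running job: a child process that sleeps for 60 s.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    timed_out, code = run_with_limit(sleeper, limit_seconds=1)
    print("timed out:", timed_out)
```

The limit should be chosen somewhat shorter than the time requested from the scheduler, so that the remaining steps have time to finish before the system would kill the job.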
- Copying standard output and standard error from T: to H: during the job
If you want to see standard output and standard error as the job progresses, you can run a background script to periodically copy these files from T: to H:. Because the files are initially written to T:, the job is not impeded by a brief problem writing to H:. Keeping a file open on H: for a long time is a liability: if the connection to H: fails while the file is open, the job may fail.
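A periodic copy of this kind might look like the Python sketch below. It is an illustration under stated assumptions rather than the referenced script: the function names, the file names passed in, and the fixed number of passes are all hypothetical. The point it shows is that a failed copy is simply skipped and retried on the next pass, so a brief problem with the destination never touches the job's own output files.

```python
import shutil
import time
from pathlib import Path


def copy_logs_once(src_dir, dst_dir, names):
    """Copy each named file from src_dir to dst_dir; return the names copied."""
    copied = []
    for name in names:
        try:
            shutil.copy2(Path(src_dir) / name, Path(dst_dir) / name)
            copied.append(name)
        except OSError:
            # File not written yet, or destination briefly unavailable:
            # skip it and try again on the next pass.
            continue
    return copied


def copy_logs_periodically(src_dir, dst_dir, names, interval_seconds, passes):
    """Repeat copy_logs_once a fixed number of times, sleeping in between."""
    for _ in range(passes):
        copy_logs_once(src_dir, dst_dir, names)
        time.sleep(interval_seconds)
```

Each pass opens the destination file only for the duration of one copy, which avoids holding a file open on the network drive for the life of the job.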
Files (Note: If you save these files to your home machine, be sure to use the .bat file extension)
The .xml file is the one that you submit to batch. It runs the .bat file.
The remaining files do the following:
- determines how long the job will run, kills mpirun
- creates local directories and copies files from H: to T:
- periodically copies stdout and stderr from T: to H:
- kills executables on all nodes after mpirun is killed by limiting script
- copies files from T: to H: after execution stops