Serial Jobs Run in Parallel

Issue: How can you simultaneously use all the available processors on a set of parallel machines to run independent serial jobs? Your starting point is a serial job that already runs in the CTC batch system.

Solution: There are two ways to do this, and which one suits you better should become apparent as you read about them. Both are explained in detail so that you can run your serial jobs without having to learn about parallel programming. Both methods use the mpirun command, which is normally used to invoke a parallel program. We refer to the first method as Deterministic and to the second as Farm Out Work. Sample files and templates are provided for each.

Important Consideration:
Since you are potentially running the same application many times, you must guarantee that the input and output are uniquely specified for each serial job you wish to run in parallel. You can do this by: (1) using different names for the files pertaining to each serial run; and/or (2) using different directories for the files pertaining to each serial run. This applies to both the H: and T: drives.
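One way to meet this requirement is to generate the per-run names programmatically rather than typing them by hand. The Python sketch below is only an illustration of option (2), one directory per run. The function name run_paths, the run_NNN directory pattern, the file names input.dat and output.dat, and the H:\myjob base directory are all assumptions made for the example; they are not part of the CTC templates.

```python
from pathlib import PureWindowsPath


def run_paths(base_dir, run_index, input_name="input.dat", output_name="output.dat"):
    """Return a unique (directory, input file, output file) triple for one serial run.

    Every name used here is hypothetical; adapt them to your own application.
    """
    # A zero-padded run number keeps the directories distinct and sorted.
    run_dir = PureWindowsPath(base_dir) / f"run_{run_index:03d}"
    return run_dir, run_dir / input_name, run_dir / output_name


if __name__ == "__main__":
    # Show the locations for four independent runs under a hypothetical H:\myjob.
    for i in range(1, 5):
        run_dir, infile, outfile = run_paths(r"H:\myjob", i)
        print(run_dir, infile, outfile)
```

Because the run number is part of every path, no two runs can read or overwrite each other's files, whether the base directory is on H: or T:.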

It is to your advantage to implement a clean shutdown of your job. You keep control of what happens in the rest of your .bat file, and with a clean shutdown the system takes much less time to free up the machines so that they can be used for another job.

The list of tasks in the file farm_out_work_commands.txt may not complete before the time limit for the job is reached. Suppose that you have set a limit of 1440 minutes (24 hours) with the <minutes> XML tag, and that all commands other than the mpirun used to execute the job take less than 900 seconds (15 minutes) together. The command to use to limit the execution time of the mpirun is H:\CTC Tools\job_limiter.exe, which is in your default path. It requires two arguments: the first is a number of seconds, and the second is the command that you would normally issue. After the specified number of seconds, the command specified by the second argument and all of its child processes are killed.

The original .bat file contains

mpirun -np %NPROC% farm_out_work.exe farm_out_work_commands.txt >stdout.out 2>stderr.err

For a job where you have asked for 1440 minutes (86400 seconds), subtract the 900 seconds reserved for the other commands (86400 - 900 = 85500) and modify your .bat file to add job_limiter 85500 before the mpirun command.

job_limiter 85500 mpirun -np %NPROC% farm_out_work.exe farm_out_work_commands.txt >stdout.out 2>stderr.err
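If your time limit or reserve differs from this example, the first argument to job_limiter can be worked out the same way. The short Python sketch below just reproduces that arithmetic; the function name limiter_seconds and its parameters are made up for illustration and are not part of the CTC tools.

```python
def limiter_seconds(job_limit_minutes, reserve_seconds):
    """Seconds to pass to job_limiter so that reserve_seconds remain
    for the other commands in the .bat file (hypothetical helper)."""
    total_seconds = job_limit_minutes * 60
    if reserve_seconds >= total_seconds:
        raise ValueError("reserve must be shorter than the job time limit")
    return total_seconds - reserve_seconds


if __name__ == "__main__":
    # 1440-minute job limit with 900 seconds reserved for the other commands.
    print(limiter_seconds(1440, 900))
```

For the values used in this document, the result is the 85500 shown in the command line.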

If you have any questions, please send email to consult@tc.cornell.edu or call (607) 254-8686 and press 2 to speak to a consultant.