Promote Speed and Job Success: Write to Local Disk
CTC Monthly Tips
January 2004
Revised June 2006 for vsched
 

Tip  All batch jobs should read and write to local disk (T:) instead of using the remote fileserver (H:).
 
Audience   Everyone running batch jobs on Velocity.
 
Issue   Reading and writing files locally is faster and safer than doing I/O over the network. Besides a significant speedup, it can mean the difference between a job that works and one that fails, either because the job ran out of time or because a network hiccup occurred while a file on the fileserver was open.
 
Solution   Each compute node has a local disk (T:). All I/O during your job run should go to T:. This requires minor changes to your batch script, but it can save significant execution time and can keep a job from failing. Your batch script should do the following:
  1. copy input files and the executable from H: to T:;
  2. use T: for all read/write operations during the course of the run, including both your data files and standard output files;
  3. copy output files from T: to H: when your run is finished.
Speed-up Example   An experiment was performed using two versions of a C++ code. The first version wrote 185 MB of data to T: and then copied the file back to H:. The second version wrote the 185 MB output file directly to H: during the job run. The results follow; we believe they offer convincing evidence that writing directly to H: is counterproductive.

 
 
                Total CPU    I/O CPU     File Transfer    Total I/O
Write to T:     669 secs     40 secs     47 secs          87 secs
Write to H:     1152 secs    523 secs    0 secs           523 secs
 
Details The main issue is to read and write directly to T:. Check your program and batch script for these situations:
  1. If your program reads from or writes to files other than standard output, check for explicit file paths to your folder on H:. Reading and writing directly on H: is slower, so your job takes longer to run, which increases its exposure to network interruptions that can hang the job and cause it to fail. Specifically:
    • Does your program use input files? If so, copy those files to T: at the beginning of the batch script and modify your program to read the local input files.
    • Does your program write output files? If so, modify your program to write the output files to T: and copy those files to H: after your program has ended.
  2. Do you write standard output and/or standard error directly to H:? If so, change your script to write those files to T:.

    REM WRONG! Do not write output files directly to H:
    quick.exe 1>\\tc.cornell.edu\tc\users\%USERNAME%\quick.out 2>\\tc.cornell.edu\tc\users\%USERNAME%\quick.err
    REM Change to this:
    quick.exe 1>T:\%USERNAME%\quick.out 2>T:\%USERNAME%\quick.err
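
If separate files for standard output and standard error are not needed, cmd.exe can also send both streams to a single local file. The line below is a minimal sketch of that variant; the file name quick.log is hypothetical and does not appear in the original scripts.

```batch
REM Send standard output to a file on T:, then redirect standard error to the same file
quick.exe 1>T:\%USERNAME%\quick.log 2>&1
```

The order matters: the 2>&1 must come after the file redirection so that standard error follows standard output into the file.
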
Sample Batch Scripts
 
speed_test.xml

<?xml version="1.0" ?>
<!-- Sample XML Job File -->
<job>
<nodes>1</nodes>
<minutes>20</minutes>
<type>batch</type>
<affiliation>development</affiliation>
<run>\\tc.cornell.edu\tc\users\your_userid\speed_test.bat</run>
</job>

speed_test.bat

cd /D T:\
del /Q T:\%USERNAME%
mkdir %USERNAME%
cd %USERNAME%
REM Executable copied from H: to T:
copy \\tc.cornell.edu\tc\users\%USERNAME%\quick.exe T:\%USERNAME%\quick.exe
REM Input file(s) copied from H: to T:
copy \\tc.cornell.edu\tc\users\%USERNAME%\input.txt T:\%USERNAME%\input.txt
 
REM HIGHLY RECOMMENDED
REM Notice standard output and standard error are written to T:
REM Output files generated by the program should go to T: as well

quick.exe 1>T:\%USERNAME%\quick.out 2>T:\%USERNAME%\quick.err
REM After the run, copy standard output and standard error to the user folder on the fileserver (H:)
copy quick.* \\tc.cornell.edu\tc\users\%USERNAME%
REM Also copy output files generated by your program to the user folder on the fileserver (H:)
copy quickOUTPUT.txt* \\tc.cornell.edu\tc\users\%USERNAME%
 
REM Normal job cleanup, remove all files from T: at the end of your run.
del /S /Q T:\%USERNAME%
vsched -c
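
The sample script deletes everything under T:\%USERNAME% whether or not the copies back to H: succeeded. The following is a sketch of a more defensive ending, not part of the original tip. It assumes that copy returns a nonzero ERRORLEVEL when it fails (for example, when the fileserver cannot be reached or no file matches) and that leaving files on T: after a failed copy is acceptable. The variable name COPY_FAILED and the message text are illustrative.

```batch
REM Copy results back to H: and remember whether any copy failed
set COPY_FAILED=0
copy quick.out \\tc.cornell.edu\tc\users\%USERNAME%
if errorlevel 1 set COPY_FAILED=1
copy quick.err \\tc.cornell.edu\tc\users\%USERNAME%
if errorlevel 1 set COPY_FAILED=1
copy quickOUTPUT.txt* \\tc.cornell.edu\tc\users\%USERNAME%
if errorlevel 1 set COPY_FAILED=1

REM Remove the local copies only when everything reached H:
if "%COPY_FAILED%"=="0" (
    del /S /Q T:\%USERNAME%
) else (
    echo One or more copies to H: failed; leaving files on T:\%USERNAME%
)
vsched -c
```

Whether files left on T: can still be reached after "vsched -c" runs is not covered here, so treat this as a starting point and not a tested recipe.
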
     
Note
 
If the program ends without any output being written back to H:, there could be several causes, e.g.
 
  1. The program was running fine, but the user underestimated how long it would take.
  2. The program was running poorly and reporting its problems to stdout and/or stderr, and it ran out of time as a result.
  3. The program hung without saying anything.

    A simple way to diagnose this type of problem is to remove "vsched -c" from your batch script, submit the job, and then, once the job has begun, make a remote connection to the compute node. There you can check which files have been copied and created, and try running your program interactively.
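
When you connect to the compute node, it helps to know how far the script got. One possible approach, which is not part of the original tip, is to append timestamped markers to a log file on T: before each stage. The sketch below assumes T:\%USERNAME% has already been created, as in the sample script; the file name progress.log is hypothetical.

```batch
REM Append a timestamped marker to a local log before each stage
echo %DATE% %TIME% copying input files >> T:\%USERNAME%\progress.log
copy \\tc.cornell.edu\tc\users\%USERNAME%\input.txt T:\%USERNAME%\input.txt

echo %DATE% %TIME% starting quick.exe >> T:\%USERNAME%\progress.log
quick.exe 1>T:\%USERNAME%\quick.out 2>T:\%USERNAME%\quick.err

REM Record the exit code; the trailing period keeps the digit away from the redirection operator
echo %DATE% %TIME% quick.exe finished, exit code %ERRORLEVEL%. >> T:\%USERNAME%\progress.log
```

The last marker in progress.log then shows which stage was running when the job stalled or ran out of time.
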
 
References