Most of the time our Torque jobs run fine. Every now and then, though, we get an email like this:

PBS Job Id: 1234.master.example.com
Job Name:   myjob
Exec host:  worker1.example.com/38
An error has occurred processing your job, see below.
request to copy stageout files failed on node
'worker1.example.com/38' for job
1234.master.example.com

Unable to copy file
/var/spool/torque/spool/1234.master.example.com.OU to
/home/someuser/myjob.log,
error 1
*** error from copy
/bin/cp: cannot stat
`/var/spool/torque/spool/1234.master.example.com.OU': No
such file or directory
*** end error output

We have usecp set up correctly, and /home is mounted on every machine. Most of the time everything works: the log files are copied to their destination and no error email is sent. The error emails only show up intermittently. The weird thing is that even when we get one, the log file actually exists at the expected destination (e.g. /home/someuser/myjob.log). So it looks like the log files were copied successfully and the only sign of a problem is the email.

What I think may be happening is something like:

  1. The job finishes successfully and the log files are copied from /var/spool to the destination on the shared NFS directory.
  2. The log files on the execution host under /var/spool are deleted.
  3. The mom is instructed to run the job exit procedure again (maybe there was a breakdown in communication between the mom and the server and the server didn't think the job exited yet).
  4. The mom tries to copy the log files from /var/spool to the destination on NFS again and fails because they were already deleted in step 2 after the successful copy.
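
To make the suspected sequence concrete, here is a minimal Python sketch that reproduces the same symptom outside of Torque. It is only a simulation of my guess, not how the mom actually works: `stage_out` and `simulate` are hypothetical names, and temporary directories stand in for /var/spool and /home.

```python
import shutil
import tempfile
from pathlib import Path


def stage_out(spool_file: Path, dest: Path) -> str:
    """Hypothetical stand-in for the stageout step: copy, then delete the spool file."""
    try:
        shutil.copy(spool_file, dest)
    except FileNotFoundError:
        # Mirrors "/bin/cp: cannot stat ...: No such file or directory"
        return "error: cannot stat spool file"
    spool_file.unlink()  # step 2: the spool copy is removed after a successful copy
    return "ok"


def simulate() -> tuple:
    """Run the exit procedure twice, as in steps 1-4, and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        spool = Path(tmp) / "spool"  # stand-in for /var/spool/torque/spool
        home = Path(tmp) / "home"    # stand-in for the NFS-mounted /home
        spool.mkdir()
        home.mkdir()

        spool_file = spool / "1234.master.example.com.OU"
        spool_file.write_text("job output\n")
        dest = home / "myjob.log"

        first = stage_out(spool_file, dest)   # steps 1 and 2
        second = stage_out(spool_file, dest)  # steps 3 and 4: repeated exit procedure
        return first, second, dest.exists()


if __name__ == "__main__":
    print(simulate())
```

If the hypothesis is right, the second call is what produces the email, while the file written by the first call is still sitting at the destination.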

But it's hard to debug because it only happens intermittently.
