PBS Exit Codes
Interpreting PBS Exit Codes
The PBS Server logs and accounting logs record an 'exit status' of jobs.
Zero or positive exit status is the status of the top level shell.
Certain negative exit status values are used internally and are never reported to the user.
Positive exit status values above a system-dependent base value indicate which signal killed the job.
Depending on the system, values greater than 128 (or on some systems 256; see wait(2) or waitpid(2) for more information) encode the number of the signal that killed the job.
To interpret (or 'decode') the signal contained in the exit status value, subtract the base value from the exit status.
For example, an exit status of 143 indicates that the job was killed by SIGTERM: 143 - 128 = 15, and signal 15 is SIGTERM.
See the kill(1) manual page for a mapping of signal numbers to signal names on your operating system.
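The subtraction described above can be sketched in a few lines of Python. This is only an illustration: the function name `decode_exit_status` and its `base` parameter are hypothetical, and the signal name is looked up with Python's standard `signal` module, so it reflects the platform the script runs on rather than the platform the job ran on.

```python
import signal

def decode_exit_status(status, base=128):
    """Return (signal_number, signal_name) if status encodes a signal, else None.

    `base` defaults to 128; the text notes that some systems use 256 instead.
    """
    if status <= base:
        return None  # ordinary exit status of the top-level shell
    signum = status - base
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = None  # not a named signal on this platform
    return signum, name

# 143 - 128 = 15, which is SIGTERM
print(decode_exit_status(143))
```

If the job ran on a system that uses a different base value, pass it explicitly, for example `decode_exit_status(status, base=256)`.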
The exit code from a batch job follows the standard Unix termination status conventions.
Typically, exit code 0 means successful completion.
Codes 1-127 are generated from the job calling exit() with a non-zero value to indicate an error.
Exit codes 129-255 represent jobs terminated by Unix signals; the exit code is 128 plus the signal number.
Each signal has a corresponding number, which is reflected in the job exit code.
| Signal Name | Signal Number | Exit Type | Reason |
|---|---|---|---|
| SIGHUP | 1 | Term | Hangup detected on controlling terminal or death of controlling process |
| SIGINT | 2 | Term | Interrupt from keyboard |
| SIGQUIT | 3 | Core | Quit from keyboard |
| SIGILL | 4 | Core | Illegal Instruction |
| SIGABRT | 6 | Core | Abort signal from abort(3) |
| SIGFPE | 8 | Core | Floating point exception |
| SIGKILL | 9 | Term | Kill signal |
| SIGSEGV | 11 | Core | Invalid memory reference |
| SIGPIPE | 13 | Term | Broken pipe: write to pipe with no readers |
| SIGALRM | 14 | Term | Timer signal from alarm(2) |
| SIGTERM | 15 | Term | Termination signal |
NOTE: Consult the signal(7) man page for a complete list of signals.
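As a rough sketch, the ranges and the signal table above can be combined into a small classifier. The names `SIGNALS` and `classify_exit_code` are made up for this example, the dictionary only contains the signals listed in the table, and the code assumes the common base value of 128. Exit code 128 is not covered by the ranges in the text, so the sketch reports it as unclassified.

```python
# Signal numbers, names, and exit types copied from the table above.
SIGNALS = {
    1: ("SIGHUP", "Term"),
    2: ("SIGINT", "Term"),
    3: ("SIGQUIT", "Core"),
    4: ("SIGILL", "Core"),
    6: ("SIGABRT", "Core"),
    8: ("SIGFPE", "Core"),
    9: ("SIGKILL", "Term"),
    11: ("SIGSEGV", "Core"),
    13: ("SIGPIPE", "Term"),
    14: ("SIGALRM", "Term"),
    15: ("SIGTERM", "Term"),
}

def classify_exit_code(code):
    """Classify a job exit code using the ranges given in the text."""
    if code == 0:
        return "success"
    if 1 <= code <= 127:
        return "error"  # the job called exit() with a non-zero value
    if 129 <= code <= 255:
        name, exit_type = SIGNALS.get(code - 128, ("unlisted signal", "unknown"))
        return f"signal {name} ({exit_type})"
    return "unclassified"

for code in (0, 1, 139, 143):
    print(code, classify_exit_code(code))
```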
| Exit Code | Reason |
|---|---|
| 9 | Ran out of CPU time. |
| 64 | The job ended cleanly, but it was running out of CPU time. The solution is to submit the job to a queue with more resources (a higher CPU time limit). |
| 125 | An ErrMsg(severe) was reached in your job. |
| 127 | Something wrong with the machine? |
| 130 | The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks. |
| 131 | The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks. |
| 134 | The job was killed with an abort signal and probably dumped core. Often this is caused by either an assert() or an ErrMsg(fatal) being hit in your job. There may be a run-time bug in your code. Use a debugger like gdb or Totalview to find out what's wrong. |
| 137 | The job was killed because it exceeded the time limit. |
| 139 | Segmentation violation. Usually indicates a pointer error. |
| 140 | The job exceeded the "wall clock" time limit (as opposed to the CPU time limit). |
Interpreting PBS Error Codes
The following error codes can be returned in response to a batch request (qstat, qdel, qsub).
Each error name is prefixed with the string PBSE_, for Portable (POSIX) Batch System Error. The numeric values start at 15000 because the POSIX Batch Extensions working group is 1003.15.
| PBS Variable | Error Code | Description |
|---|---|---|
| PBSE_NONE | 0 | no error |
| PBSE_UNKJOBID | 15001 | Unknown Job Identifier |
| PBSE_NOATTR | 15002 | Undefined Attribute |
| PBSE_ATTRRO | 15003 | attempt to set READ ONLY attribute |
| PBSE_IVALREQ | 15004 | Invalid request |
| PBSE_UNKREQ | 15005 | Unknown batch request |
| PBSE_TOOMANY | 15006 | Too many submit retries |
| PBSE_PERM | 15007 | No permission |
| PBSE_BADHOST | 15008 | access from host not allowed |
| PBSE_JOBEXIST | 15009 | job already exists |
| PBSE_SYSTEM | 15010 | system error occurred |
| PBSE_INTERNAL | 15011 | internal server error occurred |
| PBSE_REGROUTE | 15012 | parent job of dependent in rte que |
| PBSE_UNKSIG | 15013 | unknown signal name |
| PBSE_BADATVAL | 15014 | bad attribute value |
| PBSE_MODATRRUN | 15015 | Cannot modify attrib in run state |
| PBSE_BADSTATE | 15016 | request invalid for job state |
| PBSE_UNKQUE | 15018 | Unknown queue name |
| PBSE_BADCRED | 15019 | Invalid Credential in request |
| PBSE_EXPIRED | 15020 | Expired Credential in request |
| PBSE_QUNOENB | 15021 | Queue not enabled |
| PBSE_QACESS | 15022 | No access permission for queue |
| PBSE_BADUSER | 15023 | Bad user - no password entry |
| PBSE_HOPCOUNT | 15024 | Max hop count exceeded |
| PBSE_QUEEXIST | 15025 | Queue already exists |
| PBSE_ATTRTYPE | 15026 | incompatible queue attribute type |
| PBSE_QUEBUSY | 15027 | Queue Busy (not empty) |
| PBSE_QUENBIG | 15028 | Queue name too long |
| PBSE_NOSUP | 15029 | Feature/function not supported |
| PBSE_QUENOEN | 15030 | Cannot enable queue,needs add def |
| PBSE_PROTOCOL | 15031 | Protocol (ASN.1) error |
| PBSE_BADATLST | 15032 | Bad attribute list structure |
| PBSE_NOCONNECTS | 15033 | No free connections |
| PBSE_NOSERVER | 15034 | No server to connect to |
| PBSE_UNKRESC | 15035 | Unknown resource |
| PBSE_EXCQRESC | 15036 | Job exceeds Queue resource limits |
| PBSE_QUENODFLT | 15037 | No Default Queue Defined |
| PBSE_NORERUN | 15038 | Job Not Rerunnable |
| PBSE_ROUTEREJ | 15039 | Route rejected by all destinations |
| PBSE_ROUTEEXPD | 15040 | Time in Route Queue Expired |
| PBSE_MOMREJECT | 15041 | Request to MOM failed |
| PBSE_BADSCRIPT | 15042 | (qsub) cannot access script file |
| PBSE_STAGEIN | 15043 | Stage In of files failed |
| PBSE_RESCUNAV | 15044 | Resources temporarily unavailable |
| PBSE_BADGRP | 15045 | Bad Group specified |
| PBSE_MAXQUED | 15046 | Max number of jobs in queue |
| PBSE_CKPBSY | 15047 | Checkpoint Busy, may be retries |
| PBSE_EXLIMIT | 15048 | Limit exceeds allowable |
| PBSE_BADACCT | 15049 | Bad Account attribute value |
| PBSE_ALRDYEXIT | 15050 | Job already in exit state |
| PBSE_NOCOPYFILE | 15051 | Job files not copied |
| PBSE_CLEANEDOUT | 15052 | unknown job id after clean init |
| PBSE_NOSYNCMSTR | 15053 | No Master in Sync Set |
| PBSE_BADDEPEND | 15054 | Invalid dependency |
| PBSE_DUPLIST | 15055 | Duplicate entry in List |
| PBSE_DISPROTO | 15056 | Bad DIS based Request Protocol |
| PBSE_EXECTHERE | 15057 | cannot execute there |
| PBSE_SISREJECT | 15058 | sister rejected |
| PBSE_SISCOMM | 15059 | sister could not communicate |
| PBSE_SVRDOWN | 15060 | req rejected -server shutting down |
| PBSE_CKPSHORT | 15061 | not all tasks could checkpoint |
| PBSE_UNKNODE | 15062 | Named node is not in the list |
| PBSE_UNKNODEATR | 15063 | node-attribute not recognized |
| PBSE_NONODES | 15064 | Server has no node list |
| PBSE_NODENBIG | 15065 | Node name is too big |
| PBSE_NODEEXIST | 15066 | Node name already exists |
| PBSE_BADNDATVAL | 15067 | Bad node-attribute value |
| PBSE_MUTUALEX | 15068 | State values are mutually exclusive |
| PBSE_GMODERR | 15069 | Error(s) during global modification of nodes |
| PBSE_NORELYMOM | 15070 | could not contact Mom |
| PBSE_NOTSNODE | 15071 | no time-shared nodes |
| Resource monitor specific | | |
| PBSE_RMUNKNOWN | 15201 | resource unknown |
| PBSE_RMBADPARAM | 15202 | parameter could not be used |
| PBSE_RMNOPARAM | 15203 | a parameter needed did not exist |
| PBSE_RMEXIST | 15204 | something specified didn't exist |
| PBSE_RMSYSTEM | 15205 | a system error occurred |
| PBSE_RMPART | 15206 | only part of reservation made |
| RM_ERR_UNKNOWN | 15201 | alias for PBSE_RMUNKNOWN |
| RM_ERR_BADPARAM | 15202 | alias for PBSE_RMBADPARAM |
| RM_ERR_NOPARAM | 15203 | alias for PBSE_RMNOPARAM |
| RM_ERR_EXIST | 15204 | alias for PBSE_RMEXIST |
| RM_ERR_SYSTEM | 15205 | alias for PBSE_RMSYSTEM |
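When a numeric error code turns up in a log or in a command's output, a small lookup built from the table above can make it easier to read. The following is a minimal sketch, not part of PBS: `PBSE_CODES` and `describe_pbs_error` are hypothetical names, and the dictionary holds only a handful of rows from the table for brevity.

```python
# A partial copy of the error table above: code -> (name, description).
PBSE_CODES = {
    0: ("PBSE_NONE", "no error"),
    15001: ("PBSE_UNKJOBID", "Unknown Job Identifier"),
    15007: ("PBSE_PERM", "No permission"),
    15018: ("PBSE_UNKQUE", "Unknown queue name"),
    15034: ("PBSE_NOSERVER", "No server to connect to"),
    15201: ("PBSE_RMUNKNOWN", "resource unknown"),
}

def describe_pbs_error(code):
    """Return a readable description for a PBS error code, if it is known."""
    entry = PBSE_CODES.get(code)
    if entry is None:
        return f"{code}: not in this (partial) table"
    name, text = entry
    return f"{name} ({code}): {text}"

print(describe_pbs_error(15001))
print(describe_pbs_error(15034))
```

Extending the dictionary with the remaining rows of the table makes the lookup complete for the codes listed here.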


