Outdated Resource Warning!
This is documentation for legacy version 0.x releases. If you are using 1.x see the wiki.
Troubleshooting
The easiest way to detect and locate problems is through the system log, available from the admin page. The lower an event's "log level", the more significant the event: a 1 denotes a serious or major event, and a 0 is usually a fatal error.
The "log.level" system variable (set through the admin page) is the highest level at which events are logged. In normal operation, keeping this at 4 or 5 ensures that major events are logged but spurious information is not. It can be set higher for debugging.
Events are logged at levels from 0 to 10, where 0 is a fatal error and 10 is a very minor piece of information.
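As a rough illustration of how the threshold behaves, the sketch below filters a handful of made-up events against a log.level of 5. The function name should_log and the sample events are hypothetical and are not taken from the product's code; the sketch only assumes that an event is recorded when its level is less than or equal to log.level, as described above.

```python
# Minimal sketch of log-level filtering (hypothetical names, not the real implementation).

def should_log(event_level: int, log_level: int) -> bool:
    """Return True when an event is significant enough to be recorded.

    Lower numbers are more significant: 0 is fatal, 10 is very minor.
    """
    return event_level <= log_level


# Made-up sample events as (level, message) pairs.
events = [
    (0, "fatal error"),
    (1, "serious event"),
    (5, "routine notice"),
    (10, "step-by-step detail"),
]

# With log.level set to 5, the level 10 detail is dropped.
logged = [message for level, message in events if should_log(level, 5)]
print(logged)
```

Raising the threshold to 10 in this sketch would keep every event, which matches the advice to raise log.level temporarily when debugging.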
One or More Nodes Are Not Being Tested
Check that the nodes are enabled. Check the system log for "tester already running" events and, if you find any, follow the "hung" process below. You can also set log.level to 10 and follow the process step by step in the system log (or by running it manually; see the advanced section).
A Test Session Has "Hung"
If a test session has totally hung, it may need to be killed from the Linux shell. More likely, the process has died but left the node marked as still being tested. To clear these test sessions (the database locks), use the "test sessions" option on the admin page, click through to the session in question and select manual close.
I Am Getting False Alarms, or Sometimes There Is Link Trouble but the Server Is Up
You can manually alter the number of retries and the timeouts for specific tests in their dialog (for more information see the tests documentation). These settings let you reduce false alerts caused by network problems and find a balance between the speed of an alert and the robustness of the test. The node ping test settings are site-wide variables (as is the site-wide HTTP timeout default); see the administration section of the documentation.
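The trade-off between retries, timeouts and alert speed can be sketched in a few lines. Everything here is illustrative: run_with_retries, make_flaky_check and worst_case_seconds are hypothetical helpers, not part of the product, and the "check" is a stand-in that simulates a link dropping a fixed number of attempts before succeeding.

```python
# Sketch of retry-based false-alarm mitigation (hypothetical helpers, not the real tester).
import time
from typing import Callable


def run_with_retries(check: Callable[[float], bool], attempts: int,
                     timeout: float, delay: float = 0.0) -> bool:
    """Run check up to `attempts` times; pass as soon as one attempt succeeds."""
    for attempt in range(attempts):
        if check(timeout):
            return True
        # Pause between attempts, but not after the final one.
        if attempt < attempts - 1 and delay > 0:
            time.sleep(delay)
    return False


def worst_case_seconds(attempts: int, timeout: float, delay: float = 0.0) -> float:
    """Longest time before a genuine failure is reported."""
    return attempts * timeout + (attempts - 1) * delay


def make_flaky_check(failures_before_success: int) -> Callable[[float], bool]:
    """Simulate a link that drops the first few attempts, then responds."""
    state = {"calls": 0}

    def check(timeout: float) -> bool:
        state["calls"] += 1
        return state["calls"] > failures_before_success

    return check


# One attempt raises a false alarm; three attempts ride out the glitch.
single_try = run_with_retries(make_flaky_check(2), attempts=1, timeout=5.0)
three_tries = run_with_retries(make_flaky_check(2), attempts=3, timeout=5.0)
print(single_try, three_tries, worst_case_seconds(3, 5.0, delay=1.0))
```

More attempts and longer timeouts make a test more tolerant of brief link trouble, at the cost of a longer worst-case delay before a real outage is reported.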
Let us know if you find any other common problems and we'll list them here.