Help! Guide for SuperK Outer Detector Troubles


... always under construction !!
Please, check this space again for updates later.
Last update: 28 May 2003 [HGB]

First rule: Don't panic !


INDEX:
What to do when Sukant hangs up and refuses to start a new run?
Updated 2/2001!   Important note about rebooting Sukant
New 11/2000!   How to disable/re-enable an OD PMT channel with the new OD HV paddle (relay) cards?
New 5/2001!   New paddle card shift instruction manual!
Updated 2/2001!   How to handle continuous "The last XX events DIDN'T have anticounter data" error messages from msgctrl?
Updated 2/2001!   What to do when "FSCC problem" error messages from msgctrl don't stop?
What if the orange LED (Fifo VME Busy) on the OD-DAQ LED box stays on during the entire run?
Ken Young's "Primer for Trouble Shooters of the Oddaq"
OD-DAQ Documentation Hyperlinks

If you feel you're too excited to handle the trouble yourself, and you can't locate any help onsite, simply contact the OD experts:

UW onsite representative (at "Settle" apt., Osawano):
0764-67-5517   -   at Osawano apartment "Settle"
NEW since Jan.'99: 0901-636-4087   -   UW cell phone in Japan (mostly in use at KEK, though)

Hans Berns (OD-DAQ electronics & software, radon hut, GPS system):
[0-001-010]-1-206-685-4725 (work)
[0-001-010]-1-206-365-3960 (home)
NEW since 5/13/99: +1-206-713-5919 (mobile phone)
berns@phys.washington.edu

Jeff Wilkes (expert to find Hans):
[0-001-010]-1-206-543-4232 (work)
[0-001-010]-1-206-282-5715 (home)
wilkes@phys.washington.edu

UW group's fax machine in Seattle:
[0-001-010]-1-206-685-9242
But please check the quick references on this page first. If you can't find anything here, OD troubleshooting documentation is available either in the SuperK control room(s) or here on the Web. Thanks!

What to do when Sukant hangs up and refuses to start a new run?

You've already tried a cycle of abort/initialize/start and you find that Sukant still will not go? Well, the demons (NOVA daemons) may be dead on Sukant. The scripts below clean up and restart them (see also Andy's memo):
  1. Go to sukonh and exit the run control completely.

  2. Open an xterm window on sukonh.

  3. At the prompt type:
     cleanup

  4. When the cleanup script claims that sukant is clean, type:
     initialize
    to restart the run control, just as you always do.
Ok, this procedure will take a few minutes. If all else fails, check the manuals and/or call the online experts.


Important note about rebooting Sukant:

NOTE: sukant is now hooked up to a UPS (Uninterruptible Power Supply). Therefore, it should no longer reboot itself when a power outage is very short, e.g. less than 5 minutes.

Please reboot sukant only if really necessary!

If you find that sukant rebooted during the power outage (e.g. by checking the time since the last reboot with the unix command uptime), or if you rebooted sukant by hand, then do the following steps:

If sukant was rebooted (e.g. after a long power outage), then the OD-DAQ VME crate controller is in a random state, which can crash the anti-collector at the next run start. The crate controller needs to be properly initialized before a new run is started:

  1. log into sukant as shift, guest, or online,
  2. turn off OD-DAQ VME crate (red power switch at bottom),
  3. wait approx. 30 seconds,
  4. turn OD-DAQ VME crate back on again,
  5. wait approx. another 30 seconds,
  6. execute VME initialization command
     initvme
  7. if no error messages appear then everything is OK (a few status lines are ok),
    • if you see error messages (with words like "Error" in there), then try steps 2 - 6 again,
      • if that doesn't help, reboot sukant again (for password see instructions in "emergency" envelope at sukant's monitor) and then start with step 1 again.
      • if you still see trouble at the last step, contact an OD-DAQ expert. See email addresses and phone numbers above.
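
The two checks in this section (did sukant reboot during the outage, and did initvme report errors) can be sketched in a few lines of Python. This is only an illustration: the function names, the idea of capturing the initvme output as text, and the sample output line are assumptions, not part of the OD-DAQ software.

```python
import re
from datetime import datetime, timedelta


def rebooted_during_outage(now, uptime, outage_start):
    """Return True if the last boot happened at or after the outage start.

    now          -- current time (datetime)
    uptime       -- time since the last boot (timedelta), as shown by `uptime`
    outage_start -- time the power outage began (datetime)
    """
    boot_time = now - uptime
    return boot_time >= outage_start


# Assumption: any line containing the word "error" (any case) counts as a
# failure; plain status lines are fine, as step 7 says.
_ERROR_WORD = re.compile(r"error", re.IGNORECASE)


def initvme_error_lines(output):
    """Return the lines of captured initvme output that mention an error."""
    return [line for line in output.splitlines() if _ERROR_WORD.search(line)]


# Example with made-up times and made-up initvme output.
now = datetime(2003, 5, 28, 13, 30)
outage = datetime(2003, 5, 28, 9, 0)
print(rebooted_during_outage(now, timedelta(hours=2), outage))   # True
print(rebooted_during_outage(now, timedelta(days=12), outage))   # False

sample = "VME status line\nError: no response from crate\n"  # hypothetical text
print(initvme_error_lines(sample))  # ['Error: no response from crate']
```

An empty list from initvme_error_lines corresponds to the "everything is OK" case in step 7; a non-empty list means steps 2 - 6 should be tried again.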

How to disable/re-enable an OD PMT channel with the new OD HV paddle (relay) cards?

The OD HV paddle cards were upgraded in November 2000 (hut 1) and April/May 2001 (huts 2-4) with new relay modules. No more jumper pulling! For a detailed description of the new cards click here.

Here is the new shift instruction:

  1. Identify the HV channel that powers the faulty OD PMT, using the tables here or the wire_back program on kingfish.
  2. DISABLE the selected HV channel using the LeCroy HV controller in the appropriate hut.
  3. Change the DIP switch setting for the desired channel on its paddle card.
  4. Finally, ENABLE the selected HV channel. Done!
  5. (Don't forget to send Bill Kropp an email about the disabled channel, including base resistance measurement - see procedure below.)

How to find a problem channel after an HV trip?

  1. Identify paddle/relay card corresponding to tripped HV channel.
  2. Verify that tripped HV channel is still off, using the LeCroy HV supply's display in the quadrant hut (see manual).
  3. Make a note of the DIP switch settings on the appropriate paddle card.
  4. Turn all DIP switch positions on the selected paddle card to OFF position.
  5. Test the current draw of each individual channel (only the ones that were enabled before, of course). It is best to start from the top channel: move its DIP switch to the ON position, leave all other DIP switches OFF, and repeat for each PMT in use.
  6. Re-ENABLE the tripped HV channel again. Observe the current values (Meas_uA) on the display of the LeCroy supply.
  7. A healthy PMT has a base resistance of approx. 26 MOhms, i.e. the current should be approx. 59µA at 1600V (including the 1.2 MOhm resistor on the card in series), ~74µA at 2000V, ~88µA at 2400V, etc. (R=V/I, in case you forgot!).
  8. DISABLE the HV channel again.
  9. Repeat steps 5 - 8 until you find the problem channel.
  10. After locating the problem channel, set its DIP switch to OFF, then set all the others that were originally ON back to ON.
  11. Re-ENABLE the HV channel. Double-check that the current is approximately at the expected value (number_of_enabled_channels x target_voltage / 26 MOhms), e.g. 10 enabled channels at 2000V should draw approx. 770µA. If satisfied, then you're done.
  12. (Don't forget to send Bill Kropp an email about the finding, including base resistance.)
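
The current values quoted in steps 7 and 11 follow directly from R=V/I. As a rough sketch (the function names are made up for illustration; the 26 MOhm and 1.2 MOhm values are the ones given in step 7), they can be reproduced like this:

```python
BASE_RESISTANCE_MOHM = 26.0    # healthy PMT base, from step 7
SERIES_RESISTANCE_MOHM = 1.2   # resistor on the paddle card, from step 7


def single_channel_current_uA(voltage_V):
    """Expected current for one healthy PMT, base plus card resistor in series."""
    # Volts divided by megaohms gives microamps directly.
    return voltage_V / (BASE_RESISTANCE_MOHM + SERIES_RESISTANCE_MOHM)


def total_current_uA(enabled_channels, voltage_V):
    """Step 11 estimate: n x V / 26 MOhms (card resistor neglected, as in the text)."""
    return enabled_channels * voltage_V / BASE_RESISTANCE_MOHM


def base_resistance_MOhm(voltage_V, measured_uA):
    """Base resistance implied by a single-channel reading: V/I minus the card resistor."""
    return voltage_V / measured_uA - SERIES_RESISTANCE_MOHM


for volts in (1600, 2000, 2400):
    print(volts, round(single_channel_current_uA(volts)))  # 59, 74, 88
print(round(total_current_uA(10, 2000)))                   # 769, i.e. approx. 770
print(round(base_resistance_MOhm(2000, 73.5), 1))          # 26.0
```

base_resistance_MOhm assumes only one channel on the card is switched ON during the measurement, as in step 5. A reading well away from the expected value points to the problem channel; how far is "well away" is a judgment call that this sketch does not try to encode.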


How to handle continuous "The last XX events DIDN'T have anticounter data" error messages:

Well, there are lots of possible causes for this type of error message! Here are some examples from the most recent problems:

==> anti-sender and/or anti-sorter dead ?

Stop the run, then Abort, initialize, finally start.

The anti-sender (or any of the sukon1-9 senders) has been observed to die somewhat frequently (once or twice per week on average), mostly (with approx. 99% probability) because of a network or communication problem between the eventbuilder and the sender(s). A dead anti-sender often causes the anti-sorter to run into a Nova buffer memory problem, which usually results in a segmentation fault crash...

==> anti-collector dead ?

The anti-collector is pretty tough to kill. If it dies, it usually means a more severe hardware problem with the VME crate, e.g. a power outage or a DC2 module going wacko. Please check the latest error messages on sukant (log in as shift) by If you see error messages such as This typically happens after a power interruption! It is a sure sign that one (or all) of the DC2 controllers located in the rear of the ODDAQ VME crate is in a weird state and needs to be reset.

Push the top black button on each of the 4 DC2 modules. You should see the 4 green LEDs on each module cycle on briefly, then off, starting from the bottom and moving to the top. The top green LED should then stay on continuously, which means the DC2 is ready to take data again. [The yellow LEDs should always stay on, I believe...]


What to do when "FSCC problem" error messages from msgctrl don't stop?

Often, a Fastbus crate fails to send data to the OD-DAQ VME crate in the center hut. The anticollector has built-in auto-detection-and-recovery routines that respond quickly to most Fastbus problems and fix them automatically, without user intervention. In some cases the anticollector cannot fix the problem, and shift people will see the following symptoms:

The possibilities are:
  1. The Fastbus power supply is off?
  2. The Fastbus power supply has a problem, e.g. a failing voltage (+5V, -5.2V, -2V, +15V, -15V)?
  3. The FSCC module (Fastbus Smart Crate Controller) is hung and needs hardware reset?
  4. One or more of the TDC modules do not respond to FSCC bus requests? (see also TDC Test Instruction)
After finding out from the above symptoms in which hut the problem Fastbus crate is located, please stop the run (abort is not necessary), go to that hut, and cycle the power to the Fastbus crate (see also FB power supply front panel description):

Still the same problem with OD data? Try the above procedure a second time. There's a ~5% chance that the recovery didn't work right away.
If the attempts are not successful, then call an expert! See phone numbers above.
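
To see why two failed attempts are a good reason to call an expert, here is a back-of-the-envelope sketch. It takes the ~5% figure above at face value and assumes the attempts are independent, which this guide does not actually state; the function name is made up for illustration.

```python
def chance_still_failing(attempts, p_fail_per_attempt=0.05):
    """Chance that every one of `attempts` power cycles fails by bad luck alone.

    Assumes independent attempts and a recoverable crate (both assumptions).
    """
    return p_fail_per_attempt ** attempts


print(f"{chance_still_failing(1):.2%}")  # 5.00%
print(f"{chance_still_failing(2):.2%}")  # 0.25%
```

Under these assumptions, a crate that still misbehaves after the second power cycle is much more likely to have a real hardware problem than to be a victim of bad luck.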


What if the orange LED (Fifo VME Busy) on the OD-DAQ LED box stays on during the entire run?

This is a sure indicator that the Anti-Collector has lost control over the OD-DAQ VME crate electronics. Possible reasons:

  1. The NOVA daemon on sukant might be hung up.
  2. The VME crate controller (Bit3 module) might be in an undefined state.

OD-DAQ Documentation Hyperlinks

(to be expanded)

Confused about the strange acronyms in the documentation files?
Click here for the SuperKamiokande Acronym Dictionary.


last edited Wed May 28 13:30:00 PDT 2003 HGB for SuperK US collaboration

If you find any errors in this document or if you have any suggestions regarding ODDAQ troubleshooting, please email H.G.Berns <berns@phys.washington.edu> and/or R.J.Wilkes <wilkes@phys.washington.edu>