Help! Guide for SuperK Outer Detector Troubles
... always under construction !!
Please, check this space again for updates later.
Last update: 28 May 2003 [HGB]
First rule:
Don't panic !
INDEX:
What to do when Sukant hangs up and refuses to
start a new run?
Updated 2/2001!
Important note about rebooting Sukant
New 11/2000!
How to disable/re-enable an OD PMT channel with the
new OD HV paddle (relay) cards?
New 5/2001!
New paddle card shift
instruction manual!
Updated 2/2001!
How to handle continuous "The last XX events DIDN'T
have anticounter data" error messages from msgctrl?
Updated 2/2001!
What to do when "FSCC problem" error messages from
msgctrl don't stop?
What if the orange LED (Fifo VME Busy) on the OD-DAQ
LED box stays on during the entire run?
Ken Young's "Primer for Trouble Shooters of the Oddaq"
OD-DAQ Documentation Hyperlinks
If you feel you're too excited to handle the trouble yourself,
and you can't locate any help onsite, simply contact the OD
experts:
- UW onsite representative (at "Settle" apt., Osawano):
- 0764-67-5517 - at Osawano apartment "Settle"
- NEW since Jan.'99: 0901-636-4087 -
UW cell phone in Japan (mostly in use at KEK, though)
- Hans Berns
(OD-DAQ electronics & software, radon hut, GPS system):
- [0-001-010]-1-206-685-4725 (work)
- [0-001-010]-1-206-365-3960 (home)
- NEW since 5/13/99: +1-206-713-5919 (mobile phone)
- berns@phys.washington.edu
- Jeff Wilkes
(expert to find Hans):
- [0-001-010]-1-206-543-4232 (work)
- [0-001-010]-1-206-282-5715 (home)
- wilkes@phys.washington.edu
- UW group's fax machine in Seattle:
- [0-001-010]-1-206-685-9242
But, please check the quick references on this page first. If you can't find anything here,
then OD troubleshooting documentation is available either in the SuperK control room(s)
or here on the
Web. Thanks!
You've already tried a cycle of abort/initialize/start and you find that
sukant still will not go? Well, the demons (NOVA daemons) may be dead on
sukant. The steps below use scripts to clean up and
restart (see also Andy's memo):
- Go to sukonh and exit the run control
completely.
- All of the large green panels should disappear and
there should be none left on the screen of any type.
- Open an xterm window on sukonh.
- At the prompt type:
cleanup
- This will start a shellscript which has the ability to clean up
every DAQ workstation, if desired. It will prompt the user 12
times with the question:
"Clean up <workstation>. Ok ? [y/n]" for
every workstation from sukon1 to sukon9, plus sukonh, kingfish, and
sukant. It takes about 30 seconds to 1 minute for each host, so don't
waste time answering "y" unless you believe the workstation in
question really has a problem.
- When the cleanup script claims that sukant is
clean, type:
initialize
to restart the run control, just as you always do.
Ok, this procedure will take a few minutes. If all else fails, check
the manuals and/or call the online experts.
NOTE: sukant is now hooked up to a UPS (Uninterruptible Power
Supply). Therefore, it should no longer reboot itself during a very short
power outage, e.g. less than 5 minutes.
Please reboot sukant only if really necessary!
If sukant was rebooted, either by itself during a long power outage (check
the time since the last reboot with the unix command uptime) or by hand,
then the OD-DAQ VME crate controller is in a random state, which can crash
the anti-collector at the next run start. The crate controller must be
properly initialized before a new run is started:
1. Log into sukant as shift, guest, or online.
2. Turn off the OD-DAQ VME crate (red power switch at bottom).
3. Wait approx. 30 seconds.
4. Turn the OD-DAQ VME crate back on.
5. Wait approx. another 30 seconds.
6. Execute the VME initialization command
initvme
7. If no error messages appear, then everything is OK (a few status lines are
ok).
8. If you see error messages (with words like "Error" in there), then try steps 2 - 6
again.
9. If that doesn't help, reboot sukant again (for the password, see the instructions
in the "emergency" envelope at sukant's monitor) and then start with
step 1 again.
10. If you still see trouble at the last step, contact an OD-DAQ
expert. See email addresses and phone numbers above.
The OD HV paddle cards were upgraded in November 2000 (hut 1) and
April/May 2001 (huts 2-4) with new relay modules. No more jumper
pulling! For detailed description of the new cards click here.
Here is the new shift
instruction:
- Identify the HV channel which powers the faulty OD PMT using tables
here, or the wire_back program on kingfish
- E.g. tube 408 = paddle no. 1.2.15.1 (hut.crate.card.channel) = HV channel 2.10 in hut 1.
- DISABLE the selected HV channel using the
LeCroy HV controller in the appropriate hut.
- Change the setting of the dip switch for the desired channel on its paddle card
- top = channel 1, bottom = channel 12; left position = ON, right
position = OFF
- use a small screwdriver or toothpick to move the dip switch - if you
can't move it with one of your fingers.
- For confirmation, the red LED next to the LEMO monitor jack of that
channel shows whether the relay is on or off.
Note: The LEDs monitor the relay ON/OFF status
only, not whether there is HV or not!
- Finally, ENABLE the selected HV channel. Done!
- (Don't forget to send Bill Kropp an email about the disabled
channel, including base resistance measurement - see procedure
below.)
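The hut.crate.card.channel notation used in the example above can be split into its parts with a few lines of Python. This is only an illustrative sketch: the names PaddleAddress and parse_paddle_address are made up for this example and are not part of wire_back or any other OD tool. The only range check applied is the channel number (1 to 12), taken from the DIP switch description above.

```python
from typing import NamedTuple


class PaddleAddress(NamedTuple):
    """Paddle number in hut.crate.card.channel notation (hypothetical helper type)."""
    hut: int
    crate: int
    card: int
    channel: int


def parse_paddle_address(text):
    """Split a string such as '1.2.15.1' into its four integer fields."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError("expected hut.crate.card.channel, got %r" % text)
    hut, crate, card, channel = (int(p) for p in parts)
    # Each paddle card has DIP switches for channels 1 (top) to 12 (bottom).
    if not 1 <= channel <= 12:
        raise ValueError("channel must be between 1 and 12, got %d" % channel)
    return PaddleAddress(hut, crate, card, channel)


# Tube 408 from the example above:
address = parse_paddle_address("1.2.15.1")
print(address.hut, address.crate, address.card, address.channel)
```

The channel field tells you which DIP switch to move on the card, counting from the top.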
How to find a problem channel after HV trip?
- Identify paddle/relay card corresponding to tripped HV channel.
- E.g. use wire_back program on kingfish.
- Verify that tripped HV channel is still off, using the
LeCroy HV supply's display in the
quadrant hut (see
manual).
- Make a note of the DIP switch settings on the appropriate paddle card.
- Turn all DIP switch positions on the selected paddle card to OFF position.
- Test the current draw of each individual channel (only the ones that
were enabled before, of course). It is best to start from the top
channel: move its DIP switch to the ON position and leave all other
DIP switches OFF. Repeat for each PMT in use.
- Re-ENABLE the tripped HV channel. Observe the current values
(Meas_uA) on the display of the LeCroy supply.
- A healthy PMT has a base resistance of approx. 26 MOhms, i.e. the
current should be approx. 59µA at 1600V (including the 1.2 MOhm
resistor on the card in series), ~74µA at 2000V, ~88µA at 2400V, etc.
(R=V/I, in case you forgot!).
- If the current is far above the expected value then it must be
the problem channel! E.g. a current of ~1600µA at 2000V means that
the base is shorted.
- DISABLE the HV channel again.
- Repeat steps 5 - 8 until you find the problem channel.
- After locating the problem channel, set its DIP switch to OFF,
then set all others that were originally ON back to ON.
- Re-ENABLE the HV channel. Double-check that the current is
approximately at the expected value (number_of_enabled_channels x
target_voltage / 26 MOhms), e.g. 10 enabled channels at 2000V should
draw approx. 770µA. If satisfied, then you're done.
- (Don't forget to send Bill Kropp an email about the finding,
including base resistance.)
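The current values quoted in this procedure follow directly from R=V/I, and the arithmetic can be checked with a short Python sketch. The resistances are the approximate figures given above (26 MOhm base, 1.2 MOhm card resistor). The function names and the factor of 5 used to flag a channel as "far above" the expected current are assumptions made for this illustration only, not an official threshold.

```python
BASE_RESISTANCE_MOHM = 26.0  # approx. base resistance of a healthy PMT (from the text)
CARD_RESISTOR_MOHM = 1.2     # series resistor on the paddle card (from the text)


def expected_channel_current_ua(voltage_v):
    """Expected current in µA for one healthy PMT (volts / MOhm = µA)."""
    return voltage_v / (BASE_RESISTANCE_MOHM + CARD_RESISTOR_MOHM)


def expected_card_current_ua(enabled_channels, voltage_v):
    """Rough total for one HV channel, using the rule of thumb N x V / 26 MOhm."""
    return enabled_channels * voltage_v / BASE_RESISTANCE_MOHM


def looks_like_problem_channel(measured_ua, voltage_v, factor=5.0):
    """Flag a current far above the expected value (factor is an assumed cutoff)."""
    return measured_ua > factor * expected_channel_current_ua(voltage_v)


for volts in (1600, 2000, 2400):
    print(volts, "V:", round(expected_channel_current_ua(volts)), "uA")

print("10 channels at 2000 V:", round(expected_card_current_ua(10, 2000)), "uA")
print("1600 uA at 2000 V suspicious?", looks_like_problem_channel(1600, 2000))
```

For comparison, a completely shorted base leaves only the 1.2 MOhm card resistor in the circuit, which at 2000V gives roughly the ~1600µA mentioned above.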
Well, there are lots of possible causes for continuous "The last XX events
DIDN'T have anticounter data" error messages!
Here are some examples from the most recent problems:
==> anti-sender and/or anti-sorter dead ?
Stop the run, then Abort, initialize, finally start.
The anti-sender (or any of the sukon1-9 senders) has been observed to die
somewhat frequently (once or twice per week on average), mostly caused
(with approx. 99% probability) by a network or communication problem
between eventbuilder and sender(s). A dead anti-sender often causes the
anti-sorter to run into a Nova buffer memory problem, which usually
results in a segmentation fault crash...
==> anti-collector dead ?
The anti-collector is pretty tough to kill. If it dies then it usually
means some more severe hardware problems with the VME crate, e.g. a power
outage or a DC2 module going wacko. Please check the latest error
messages on sukant (log in as shift) by
- tail /home/online/log/anticollector.errorlog
or
- tail /home/online/log/anticoll.newlog
If you see error messages such as
This typically happens after a power interruption!
It is a sure sign that one (or all) DC2
controller(s) located in the rear of the ODDAQ VME crate is (are) in a
weird state and needs to be reset. Push the top black button on each of
the 4 DC2 modules. You should see the 4 green LEDs on each of the modules
cycling on briefly, then off, starting from the bottom to the top. The
top green LED should then stay on continuously, then the DC2 is ready for
taking data again. [The yellow LEDs should always stay on, I
believe...]
Often, a Fastbus crate fails to send data to the OD-DAQ VME crate in the
center hut. The anticollector has built-in auto-detection-and-recovery
routines that can quickly respond to most Fastbus problems and
automatically fix them without user intervention. In some cases, the
anticollector cannot fix the problem and shift people will see the
following symptoms:
- One of the "Hut N TDCs BUSY" LEDs on the ODDAQ
hardware monitor box is not in sync with the other 3 LEDs.
- The anti-sorter complains with frequent warning messages via
msgctrl on sukonh, repeated once every minute or two, e.g.
Tue Feb 6 11:36:06 2001/sukant::collector/hut 2 TDC/FSCC problems;
starting automatic 50-sec FSCC reset now!
- On the anticollector@sukant monitor window on sukonh you see frequent
error messages popping up such as
===> TDCs busy for >.1 sec in hut 2 !!
===> No data in OD TDCs, automatic fast FSCC reset executed.
In this example, there's no data coming from hut 2. So, the problem
TDC is in hut 2!
The possibilities are:
- The Fastbus power supply is off?
- The Fastbus power supply has a problem, e.g. a failing voltage
(+5V, -5.2V, -2V, +15V, -15V)?
- The FSCC module (Fastbus Smart Crate Controller) is hung and needs
hardware reset?
- One or more of the TDC modules do not respond to FSCC bus requests?
(see also TDC Test Instruction)
After finding out from the above symptoms in which hut the problem Fastbus crate
is located, please stop the run (abort not necessary), go
to that hut and cycle the power to the Fastbus crate (see also
FB power supply front panel
description):
- flip "CNTRL PWR ON" button to "off" position,
- wait approx. 30 seconds,
- flip "CNTRL PWR ON" button back to "on" position,
- Wait another 30 seconds or so (for letting the FB controller reboot)
before restarting the run. Done!
Still the same problem with OD data? Try above procedure a second time.
There's a ~5% chance that the recovery didn't work right away.
If the attempts are not successful, then call an expert!
See phone numbers above.
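The hut number can be read straight off the messages quoted above. As a small illustration, the following Python sketch pulls it out of a message line. It assumes the messages always contain the word "hut" followed by the number, as in the two examples shown in this section; the function name problem_hut is made up for this example, and the sketch is not part of msgctrl or the anticollector.

```python
import re

# Matches "hut 2" in both message styles quoted in this section.
HUT_PATTERN = re.compile(r"\bhut\s+(\d+)")


def problem_hut(message):
    """Return the hut number named in a message, or None if no hut is mentioned."""
    match = HUT_PATTERN.search(message)
    return int(match.group(1)) if match else None


messages = [
    "Tue Feb 6 11:36:06 2001/sukant::collector/hut 2 TDC/FSCC problems;",
    "===> TDCs busy for >.1 sec in hut 2 !!",
    "===> No data in OD TDCs, automatic fast FSCC reset executed.",
]
for line in messages:
    print(problem_hut(line), "<-", line)
```

Messages that do not name a hut, such as the last one, give None, so the "TDCs busy" or "FSCC problems" lines are the ones to look at.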
An orange LED (Fifo VME Busy) that stays on during the entire run is a sure
indicator that the Anti-Collector has lost control over
the OD-DAQ VME crate electronics. Possible
reasons:
- The NOVA daemon on sukant might be hung up.
- The VME crate controller (Bit3 module) might be in an undefined
state.
(to be expanded)
Confused about the strange acronyms in the documentation files?
Click here for the
SuperKamiokande Acronym Dictionary.
last edited
Wed May 28 13:30:00 PDT 2003
HGB for
SuperK US collaboration
If you find any errors in this document or if you have any suggestions
regarding ODDAQ trouble shooting, please email to
H.G.Berns
<berns@phys.washington.edu> and/or
R.J.Wilkes
<wilkes@phys.washington.edu>