It has been noticed that some information periodically disappears from
the information system. This happens because of the various time-outs:
there is a time-out on the top-level BDII querying the site-level BDII,
and on the site-level BDII querying the GRIS and the information
provider. Time-outs are there for a reason: they protect the system
from queries that take far longer to return than they would under
normal circumstances.
We need to improve the monitoring of the information in the information
system to spot this kind of thing. As a rule of thumb, if the whole
site disappears from the information system, it is a time-out while
querying the site-level BDII. If only one service disappears, the cause
is usually an information provider. If we spot some information
disappearing, it needs to be investigated, as it usually points to some
low-level, fabric-related problem.
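
The rule of thumb above could be sketched as a comparison of two
snapshots of the information system. This is only an illustration, not
an existing monitoring tool: the function name, the snapshot layout
(a dict mapping a site name to the set of service identifiers it
publishes) and the label strings are all assumptions made for the
example. How the snapshots are obtained, for instance from an LDAP
query against the top-level BDII, is left out.

```python
# Hypothetical labels for the two cases in the rule of thumb.
SITE_TIMEOUT = "whole site missing: suspect site-level BDII time-out"
PROVIDER_PROBLEM = "service missing: suspect information provider"


def classify_disappearances(previous, current):
    """Compare two snapshots and classify what has disappeared.

    previous, current: dict mapping site name -> set of service IDs
    (an assumed layout, not a format defined by the information system).
    Returns a dict mapping site name -> (label, sorted missing services).
    """
    findings = {}
    for site, old_services in previous.items():
        new_services = current.get(site, set())
        missing = old_services - new_services
        if not missing:
            continue  # nothing disappeared for this site
        if not new_services:
            # Nothing at all is published for the site any more.
            findings[site] = (SITE_TIMEOUT, sorted(missing))
        else:
            # The site is still visible, but some services have gone.
            findings[site] = (PROVIDER_PROBLEM, sorted(missing))
    return findings
```

Any site reported by such a check would still need to be investigated
by hand, since the label is only a first guess at where the time-out
occurred.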
FZK seems to be the worst site at the moment.
Speaker:
Laurence Field