Reprocessing of 2012 data is ongoing and close to the end; some merging jobs and re-runs of failed jobs remain.
-
avoided data on FZK Tape as input
-
affected by "data loss" at T1s ( FZK, NDGF -- disk server incidents, RAL -- power cut )
-
ATLAS has finally started exercising recovery of a job output by running a single job, rather than re-running a whole task (a long-wished-for function in our prodsys)
Another set of processing of special stream data from tape is to be defined soon (this month).
Follow-up within ATLAS on the Frontier issue raised last week: "Frontier: to avoid default TCP timeout in case of service down (WLCGDailyMeetingsWeek121119#Wednesday)"
-
The Frontier client is configured with a 10-second TCP timeout and quickly tries the next Frontier server on the list if the primary Frontier server is down
-
i.e. it is OK to keep the node down rather than rebooting it, because after a reboot the Frontier server might send a "keep alive" command, potentially making the time to fail and retry longer.
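For illustration only, the failover behaviour described above can be sketched in a few lines of Python. This is not the Frontier client code: the function name `first_reachable`, the injectable `connect` parameter and the host names in the usage note are hypothetical, and only the 10-second timeout comes from the notes above.

```python
import socket


def first_reachable(servers, timeout=10.0, connect=socket.create_connection):
    """Return the first (host, port) that accepts a TCP connection.

    Each server is given at most `timeout` seconds to accept the connection
    (10 s here, matching the value quoted in the notes); on failure the next
    server in the list is tried. Returns None if no server is reachable.
    """
    for host, port in servers:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections, unreachable hosts and timeouts
            # (socket.timeout is a subclass of OSError).
            continue
        conn.close()
        return (host, port)
    return None
```

Usage would look like `first_reachable([("frontier1.example.org", 8000), ("frontier2.example.org", 8000)])`, with placeholder host names and port. A node that stays down fails within the connect timeout, so the client moves on to the next server after at most 10 seconds per unreachable entry.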
Points raised at the last meetings, to be followed up
-
GOCDB: WLCG-ops should review the fall-back system / procedure
-
FTS: largely affects T2 activities. ATLAS requests that WLCG-ops address a fall-back solution
-
VOMS-GGUS synchronization for /atlas/team (WLCGDailyMeetingsWeek121029#Friday)
-
Need for alert when OPN switches to backup (WLCGDailyMeetingsWeek121105#Monday)
-
Twiki WLCGCriticalServices to be updated
ATLAS Distributed Computing Tier-1/Tier-2/Tier-3 Jamboree (10-11 December 2012 CERN)
-
https://indico.cern.ch/conferenceDisplay.py?confId=196649