28-R-15 (CERN conferencing service (joining details below))
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
(ROC CE): Two questions about availability calculation.
a) Could we present what fraction of unavailability periods is considered by sites as non-relevant? Site admins fill in weekly reports and include such information about each individual SAM test failure, so the data is there. In our view this information can help identify areas for improvement in terms of availability.
b) Would it be possible to implement mechanisms for automatic removal of periods in which sites failed due to some monitoring-related problems like this one: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=grid.uibk.ac.at&vo=OPS&testname=CE-host-cert-valid&testtimestamp=1204109361
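The automatic removal asked for in b) can be sketched as an overlap filter: drop any site failure window that coincides with a known monitoring-side outage before computing availability. This is a minimal illustration, not the SAM implementation; the data structures and function name are assumptions.

```python
from datetime import datetime, timedelta

def filter_monitoring_failures(failure_windows, monitoring_outages):
    """Drop failure windows that overlap a known monitoring-side outage,
    so they do not count against the site's availability.
    Windows are (start, end) pairs; names are illustrative only."""
    kept = []
    for start, end in failure_windows:
        overlaps = any(start < o_end and o_start < end
                       for o_start, o_end in monitoring_outages)
        if not overlaps:
            kept.append((start, end))
    return kept

# Example: the first of two failure windows coincides with a monitoring outage.
t = datetime(2008, 2, 27, 12, 0)
failures = [(t, t + timedelta(hours=1)),
            (t + timedelta(hours=3), t + timedelta(hours=4))]
outages = [(t + timedelta(minutes=30), t + timedelta(hours=2))]
print(filter_monitoring_failures(failures, outages))  # keeps only the second window
```

In a real deployment the outage list would come from the monitoring team's own incident records, which is exactly the bookkeeping the question asks for.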
<big> gLite Release News</big>
gLite 3.1 Update 16 was released to production today.
The update contains:
A new index on the attribute GlueServiceEndpoint, used by lcg-utils
UI: bug fixes to the JDL API (bulk submission) and the GFAL clients
dCache SE: GLUE 1.3 clean-ups and bug fixes
DPM SE: version 1.6.7 (32-bit and 64-bit), fixing various configuration bugs, introducing new front-ends for Xroot and HTTP/HTTPS, and upgrading gSOAP from 2.6.2 to 2.7.6b
GFAL version 1.10.8-1: creation of subdirectories with lcg-utils
(France ROC): A lesson learnt from CCRC08 is that some VOs don't check the status published by a CE queue, so they can wrongly submit to a queue with a non-Production status. Indeed, at IN2P3-CC, for the purposes of a combined ATLAS-CMS test, we had set two queues to the status "TEST" in order to restrict access to jobs that had explicitly required this status, but after a while we noticed plenty of regular ("production") jobs on those queues. Please check the queue status before submitting: it must be set to "Production".
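The check requested above amounts to filtering candidate queues on the published GlueCEStateStatus attribute before submission. A minimal sketch, assuming the records are GLUE entries already parsed from the information system into dicts (the client-side parsing itself is not shown; the queue names are placeholders):

```python
def production_queues(queues):
    """Return only the CE queue IDs whose published GlueCEStateStatus is
    'Production'. 'queues' is a list of dicts of parsed GLUE attributes."""
    return [q["GlueCEUniqueID"] for q in queues
            if q.get("GlueCEStateStatus") == "Production"]

# Illustrative entries: one Production queue, one restricted TEST queue.
queues = [
    {"GlueCEUniqueID": "ce.example.org:2119/jobmanager-bqs-long",
     "GlueCEStateStatus": "Production"},
    {"GlueCEUniqueID": "ce.example.org:2119/jobmanager-bqs-test",
     "GlueCEStateStatus": "TEST"},
]
print(production_queues(queues))  # -> ['ce.example.org:2119/jobmanager-bqs-long']
```

A VO that applies such a filter in its submission tools would never have landed regular jobs on the IN2P3-CC "TEST" queues described above.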
<big>WLCG Service Interventions (with dates / times where known) </big>
Request to ATLAS sites to upgrade WNs to SL4
List of sites having CEs that ATLAS can use, by OS:
http://straylen.web.cern.ch/straylen/tmp/atlas-sites-by-os.txt
<big>CMS report</big> (5m)
Data certification, T0 status and reprocessing:
all activities suffered from the LSF incident (full log by CMS at https://twiki.cern.ch/twiki/bin/view/CMS/FacOps-IncidentCERNLSF-Feb28Mar07, discussed with Bernd/Ulrich at the FacOps meeting - see the bottom of http://indico.cern.ch/conferenceDisplay.py?confId=30054). Hard week for RelVal at CERN, too (the LSF issue left CMS behind in release validations). FastSim production was proceeding fast before the problems (6k/15k processing jobs complete) and recovered soon after. --- Good progress on the StorageManager side: identified and configured the nodes to be used in the Global Run in March.
Re-processing:
on CSA07 signal workflows, ~6M GEN-SIM input events have just arrived at the T1s; ~17M processed events last week. Processing is also running at FNAL. FastSim production finalized with CMSSW_1.6.9 (+ 2 additional tags for the config files in CMSSW_1.6.10), ~100M PDAllEvents from the 3 soups (RelVal samples). No site issues at ASGC, CNAF, FZK, PIC, RAL; at FNAL, jobs take too long due to a dCache issue, being investigated; at IN2P3, problems in the pool area left us unable to run merge jobs for several days, now solved, and production is already back on schedule. --- Ran some post-CCRC reprocessing jobs with ATLAS: some lessons learned at IN2P3 and PIC (too long to report here).
MC production:
~85M CSA07 Signal requested events are done and now available for reco. 56 workflows for ~3M requested events are still to be done. Two types of problems (all CMSSW-related, so not worth mentioning here). 4 finished datasets (4M events, 1.45 TB) are subscribed but not yet transferred to any T1 MSS. --- 1 DPG workflow (2 Mevts): GEN-SIM is done; transferring. --- HLT: running (CMSSW_1_7_4, GEN-SIM-DIGI-RAW), 1 big workflow (10 Mevts) in production now, ~2 Mevts are done. --- Detailed summary of current production activities at http://khomich.web.cern.ch/khomich/csa07Signal.html.
Data Transfers and Integrity, DDT-2/LT status:
/Prod transfers: proceeding, 16 TB/week this week, no major problems. /Debug transfers: new links have been commissioned exclusively with the new DDT-2 metric since February 11th. Link exercising is proceeding, generally very successfully: 78% of the previously commissioned links have already PASSED the new metric as of March 6th. We have 285 commissioned links (as of March 6th). The breakdown is: 55/56 T[01]-T1 cross-links (only ASGC->RAL is missing); 142 T1-T2 downlinks and 83 T2-T1 uplinks; 38 T2s have at least 1 downlink and 37 T2s have at least 1 uplink, and the intersection is 35 T2s that have both; 5 T2-T2 links. First round of testing is almost complete. Sites can take advantage of the gap before the second round to commission new links or recommission failed links. Real problems were found and fixed during exercising; the first "success stories" in troubleshooting are being documented. --- Full details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising.
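The T2 bookkeeping in the breakdown above (sites with at least one downlink, sites with at least one uplink, and the intersection of both sets) is a simple set computation over the commissioned-link list. A minimal sketch with placeholder site names, not the actual DDT-2 tooling:

```python
def summarize(downlinks, uplinks):
    """Count commissioned links and the T2 sites appearing on each side.
    downlinks: (T1, T2) pairs; uplinks: (T2, T1) pairs; names are placeholders."""
    t2_down = {t2 for (_t1, t2) in downlinks}   # T2s with >=1 downlink
    t2_up = {t2 for (t2, _t1) in uplinks}       # T2s with >=1 uplink
    return len(downlinks), len(uplinks), len(t2_down & t2_up)

# Toy example: T2_Y is the only site with both a downlink and an uplink.
downlinks = [("T1_A", "T2_X"), ("T1_A", "T2_Y"), ("T1_B", "T2_Y")]
uplinks = [("T2_Y", "T1_A"), ("T2_Z", "T1_B")]
print(summarize(downlinks, uplinks))  # (3, 2, 1)
```

Run over the real commissioned-link list, the same computation yields the figures quoted in the report (142 downlinks, 83 uplinks, 35 T2s with both).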