To enable an iCal export link, your account needs an API key. This key lets other applications access data from within Indico through the provided link even when you are not using or logged into Indico yourself. Once created, you can manage your key at any time under the 'HTTP API' tab of 'My Profile'. Further information about HTTP API keys can be found in the Indico documentation.
In addition to an API key associated with your account, exporting private event information requires a persistent signature. This enables API URLs that do not expire after a few minutes, so while the setting is active, anyone in possession of the link can access the information. It is therefore extremely important to keep these links private and for your use only. If you suspect someone else has acquired a link using this key, immediately create a new key pair on the 'My Profile' page under the 'HTTP API' tab and update the iCalendar links afterwards.
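The signed-URL mechanism described above can be sketched as follows. This is a rough illustration, not the authoritative Indico implementation: the `apikey`/`signature` parameter names and the HMAC-SHA1-over-path-and-query scheme follow the Indico HTTP API documentation, but the exact endpoint path and parameters here are made up for the example and should be checked against the docs.

```python
import hashlib
import hmac
from urllib.parse import urlencode

def build_signed_url(path, params, api_key, secret_key):
    # Attach the API key, sort the parameters, and sign "path?query"
    # with HMAC-SHA1 using the secret key. A persistent signature
    # includes no timestamp, so the resulting link never expires --
    # which is exactly why it must be kept private.
    items = sorted({**params, 'apikey': api_key}.items())
    query = urlencode(items)
    signature = hmac.new(secret_key.encode(),
                         f'{path}?{query}'.encode(),
                         hashlib.sha1).hexdigest()
    return f'{path}?{query}&signature={signature}'

# Hypothetical event ID and keys, for illustration only.
url = build_signed_url('/export/event/12345.ics',
                       {'detail': 'contributions'},
                       'my-api-key', 'my-secret')
```

Anyone holding the full URL, signature included, can fetch the data without logging in, which is the reason for the warnings above.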
28-R-15 (CERN conferencing service (joining details below))
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
NB: Reports were not received in advance of the meeting from:
ROCs: All ROC reports received.
VOs: Alice, BioMed, LHCb
Recording of the meeting
Feedback on last meeting's minutes
<big> Grid-Operator-on-Duty handover </big>
From: Russia / DECH
To: Asia Pacific / SouthEast Europe
Issues from Russian COD:
[Ticket-ID: #26634] SRM problem at YerPhI. Case transferred to political instances.
Issues from DECH COD:
Information for CODs:
Found several tickets where the status on the COD dashboard was set to 'quarantine' even though SAM tests were failing intermittently. This closed the associated GGUS tickets; they had to be reopened and the escalation procedure restarted.
ru-Chernogolovka-IPCP-LCG2 raised alarms for one day, although its status was 'candidate'
host-cert-valid test "violating ftp protocol", patch 'ready for release': https://savannah.cern.ch/bugs/?33257
wn4.epcc.ed.ac.uk is a test DPM endpoint currently not registered in GOCDB; the associated ticket is #33948.
Should we add a link to the operations wiki draft in the doc section of the dashboard?
Information for Operations Meeting:
TAU-LCG2 appears (as usual!) with several COD tickets. Looking at GridView, the monthly availability over the last twelve months exceeded 15% only once. Opened ticket #34012. (The other currently open tickets for the site are #32116, #33357.)
There are still alarms created for nodes in maintenance (srm, lfc, ..). https://savannah.cern.ch/bugs/index.php?32629 ?
Ticket #33927: how to declare an R-GMA registry server in GOCDB; should it be monitored by SAM?
<big> PPS Report & Issues </big>
PPS reports were not received from these ROCs:
AP, IT, NE, SEE, UKI
Issues from EGEE ROCs:
gLite 3.1.0 PPS Update 21 was released to PPS last Friday and is now in an advanced phase of pre-deployment. No major issues found so far.
In particular this update contains
new VOMS-Admin server (2.0.13-1) and client (2.0.6-1): ACL support added to the command-line client; 9 bugs fixed. Find yours at https://savannah.cern.ch/patch/index.php?1629
new vdt_globus_essentials to fix Globus bug 5771: Mainly of interest for CERN-PROD, fixing hanging processes on submission of SAM RB and WMS tests
New version of lcg-tags: warning messages suppressed
DPM 1.6.7-4 32 and 64 bit: SRM v2 and SRMv2.2 new (fixed) behaviour when creating subdirectories with srmMkdir
(ROC CE): Explanatory text related to action 147 from 10.03.2008 on Marcin: "Marcin to produce a list of examples where a site failure is attributed to a central service failure."
Site availability calculation relies on SAM results. We need to be sure that SAM failures correspond to failures on the site side.
In the Central Europe region we noticed there are SAM failures for which the site cannot do anything. Examples of failures "non-relevant" to sites:
Monitoring infrastructure failures
misconfiguration of a standalone sensor: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=grid.uibk.ac.at&vo=OPS&testname=CE-host-cert-valid&testtimestamp=1204109361
Grid Core Service failures
temporary outage of central SE: lxdpm101.cern.ch
failure of regional top level BDII
failures of LFC - no SAM example at hand.
We think it should be possible to mark some SAM failures as non-relevant. Such failures should not be taken into account for site availability calculation.
Marking should be possible for the monitoring team (SAM failures) but also for site admins, with validation by the ROC. We already have an interface for site admins and ROCs to flag each individual SAM failure as "relevant" (the default), "non-relevant" or "unknown", namely the CIC portal site and ROC reports.
The missing part is the interface with SAM DB and taking the "relevance" field into account during availability calculation.
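The proposed change to the availability calculation can be sketched as follows. This is a hypothetical illustration, not the SAM implementation: the flag values follow the "relevant"/"non-relevant"/"unknown" terminology above, while the function and data layout are invented for the example.

```python
def site_availability(results):
    """results: list of (passed: bool, relevance: str) SAM test outcomes.

    Results flagged 'non-relevant' (e.g. a central BDII outage) are
    excluded before computing the availability ratio; 'unknown' is
    treated like 'relevant', matching the proposed default.
    """
    counted = [ok for ok, relevance in results if relevance != 'non-relevant']
    if not counted:
        return None  # no relevant results -> availability undefined
    return sum(counted) / len(counted)

sam_results = [
    (True,  'relevant'),      # normal passing test
    (False, 'non-relevant'),  # failure caused by a central service outage
    (False, 'relevant'),      # genuine site-side failure
    (True,  'unknown'),       # counted by default
]
availability = site_availability(sam_results)  # 2 of 3 counted tests passed
```

Without the relevance flag the same input would score 2/4 instead of 2/3, which is exactly the distortion the proposal aims to remove.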
[Italy] It seems that there was a problem with SAM test results for 3 days (from the 14th to the 16th). In the availability/reliability metrics for last week (10-16 March), the absence of SAM results affects the overall site availability metrics. Could someone report on the SAM problem? Will the availability be corrected?
[Russia] Critical issue with unauthorized access to disk space via the xrootd service. It does not depend on either DPM or dCache. Anyone in the world with an xrootd client can read and write everything; the only action that cannot be performed is deleting files.
This completely violates "The Grid Traceability and Logging Policy" (https://edms.cern.ch/document/428037/). I think this bug is absolutely critical from the security point of view for the WLCG/EGEE infrastructure, and the xrootd service must be stopped until the bug is fixed.
See More: https://twiki.cern.ch/twiki/bin/view/LCG/DpmXrootAccess
<big> gLite Release News</big>
gLite 3.1 Update 17 to production in preparation
The update (to be released very soon) will contain:
the new package glite-LSF_utils (YAIM support for the LSF batch system)
DESY-HH: Short downtime of CMS mass storage scheduled for next Tuesday, March 11th, 9 a.m. to 3 p.m.: dcache-se-cms.desy.de, upgrade to the recent patch level and upgrade of the CMS VO box, including an upgrade to the recent PhEDEx version.
News on Development:
ProdAgent v.0.7.1 released (includes: unmerged files clean-up, improved merge operations). Logfiles archiving: coming soon (maybe v.0.8), chained processing: scheduled for June release; dealing with large MySQL DBs: some will come with v0.8.
Validated release: CMSSW_1.6.10 FastSim (available Feb 27), using the standard RelVal sample to produce FastSim samples; no problem, memory consumption per job: <mem> ~500 MB, max ~1 GB. --- CMSSW_1.7.6 RelVal (available Mar 3): no problem, <mem> ~600 MB, max ~1150 MB. --- CMSSW_1.8.0 RelVal (available Mar 5): pre10 vs pre9, <mem> increased by ~250 MB for single-particle samples (still within statistical errors). --- Lower priority: 1) Pile-up testing: waiting for input from the Simulation Group to repeat the interactive test that crashed due to excessive memory consumption (1_8_X and 2_0_X). 2) Heavy Ion requested that samples be included in the RelVal sample sets; work in progress.
Processing at the T0, CAF processing:
GREN reprocessing completed (just 1 merge job failed), not published yet. FastSim complete (100 Mevts) and transferred to FNAL. GRUMM processing started. Suffering from the lack of a sharp policy on dataset naming (the name currently encapsulates plenty of info but still doesn't track everything we need, e.g. we have "time" taken, offline sw version, etc., but it will get harder as we add e.g. trigger tables, algorithms that change, etc.). Also still lacking some ProdAgent functionality (cannot smoothly process e.g. subsets of a dataset produced with a given CMSSW version). Work in progress by DM/WM developers (urgent: we are taking data now). --- Analysis on CAF ramping up this week. Data transfer to the 'cmscaf' PhEDEx node OK via the new PhEDEx agents. Major issues: 1) hanging LSF CAF jobs (happened to users not registered as LSF CAF users, so 0-priority); 2) long stager callback times for data on cmscaf; 3) increasing number of queued requests (CASTOR team investigating: most likely due to a CASTOR issue between the default and cmscaf pools). CCRC phase-1 on the CAF was short (a few days) but very interesting and promising: post-mortem in progress.
Still running old CSA07 signal workflows; ~18 Mevts GEN-SIM processed last week, not many arrived at the T1's yet. Some samples are too large to be stored at T2s of current capacity: AOD extraction is on the way. FastSim production using CMSSW_1.6.9 finished. Coming next: btag skims using CMSSW_1.6.9, foreseen to be run at CERN+FNAL. gLite WMS bulk submission for processing: used on ReReco workflows with PA_0.7.1; the submission rate was 4 times faster. --- Site issues: CNAF, FNAL, PIC, RAL: nothing to report; ASGC: some access issues, a problem with the Castor pool was fixed; FZK: the unmerged area got full (too much production!), the CleanUpScheduler works with PA_0.7.1 and will be used to prevent this from happening again; IN2P3: merge jobs were failing due to a dCache problem, now fixed.
We are now at 710 CSA07 signal workflows done: ~88.7 Mevts (of the requested CSA07 signal events) are done and available for reco. 37+8 workflows for 2.4 Mevts requested to be done (high rate of job failures due to segmentation violations, 8 workflows affected) (11 workflows DONE wrt last week). 13 finished datasets (5 Mevts, 2.45 TB) are subscribed but not yet transferred to any T1 MSS (9 more datasets wrt last week). 1 DPG workflow (2 Mevts): GEN-SIM is done, still transferring. --- HLT: running (CMSSW_1_7_4, GEN-SIM-DIGI-RAW): 1 big workflow (10 Mevts) in production. Processing is done, now merging. Waiting for 1 more request. --- A detailed and updated summary of current production activities can be found at http://khomich.web.cern.ch/khomich/csa07Signal.html.
Data Transfers and Integrity, DDT-2/LT status:
/Prod transfers: 17 TB/week CERN->T1 (4 T1s) this week. /Debug transfers: >200 TB/week CERN->T1 (5 T1s) this week. New links have been commissioned exclusively with the new DDT-2 metric since February 11th. Link exercising is proceeding this week; 82% of the previously commissioned links have already PASSED the new metric as of March 13th. We have 286 commissioned links (as of March 13th). The breakdown: 55/56 T1-T1 crosslinks (only ASGC->RAL is missing); 143 T1-T2 downlinks and 83 T2-T1 uplinks, with 38 T2s having at least 1 downlink and 37 T2s at least 1 uplink, the intersection being 35 T2s that have both; 5 T2-T2 links. Problems are reported and often fixed in time to avoid decommissioning. Under the supervision of the FacilitiesOps group, the DDT-TF now uses a Savannah tracker to keep track of site-specific troubleshooting for commissioning/exercising. --- Full details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising.
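The per-category link counts reported above are internally consistent with the quoted total; a quick tally confirms it (figures taken directly from the report, variable names invented for the check):

```python
# Commissioned-link breakdown as of March 13th, from the DDT-2 report.
t1_t1_crosslinks = 55   # out of 56 possible; only ASGC->RAL missing
t1_t2_downlinks  = 143
t2_t1_uplinks    = 83
t2_t2_links      = 5

total = t1_t1_crosslinks + t1_t2_downlinks + t2_t1_uplinks + t2_t2_links
# 55 + 143 + 83 + 5 = 286, matching the reported total
```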
AOB of the week:
1) Discussion on the T2 analysis associations has started; a doc is being circulated. 2) A regular review of CMS-specific SAM tests will start today, overseen by FacilitiesOps. 3) Storage space at T1s was reviewed at the FacOps meeting last Friday and will be summarized at tomorrow's DataOps.