-- HarryRenshall - 19 Feb 2008

Week of 080218

Open Actions from last week:

Monday:

See the weekly joint operations meeting

Tuesday:

elog review: Two new problems so far this week. A possible intermittent FTS bug showing up in FZK to CNAF transfers, and CMS transfers to PIC slowing down because ATLAS was filling the SRM request queue with retries of failed transfers.

Experiment report(s):

CMS - continuing T0 to T1 tests. Will include most of the Tier 1 sites this week, and all regions should be exercising their Tier 2 sites. There is a CMS week next week when we will look at post-mortems of some of our activities that did not reach their metrics.

LHCb - plan this week to ramp up to half their full CCRC rate. As soon as data reaches the T1, reconstruction jobs are automatically launched.

ALICE - Transferring again to dCache sites, currently only FZK. SARA will be next but they need a new ALICE VO-box.

ATLAS - Following the Monday operations meeting M.Ernst of BNL was unhappy with the handling of the various transfer problems seen by ATLAS since 14 Feb. We will do an analysis and report back to ATLAS.

Core services (CERN) report: The problem of delegated proxies appearing corrupted on FTS servers (which would last for 8 hours before timing out) has been traced to a race condition where two clients try to create the delegated proxy in the same time window. Two client-side solutions were proposed: 1) always get a fresh my_proxy as part of the request - this does not carry VOMS extensions so would not work for ATLAS; 2) experiments should periodically (perhaps hourly) refresh the delegated proxy - a recipe was given. In addition, at CERN a cron job looking for, and removing, corrupt proxies is being prepared.
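
A minimal sketch of the second, periodic-refresh option is given below. It is not the recipe that was circulated: the glite-delegation-init command, its -s option, the endpoint URL and the exact interval are all assumptions, so the documented recipe on the Twiki page should be preferred.

```python
# Hypothetical sketch of option 2: periodically re-delegate the FTS proxy so
# that a corrupted delegation on the server is replaced before transfers fail.
# The 'glite-delegation-init' command and its '-s' option are assumed from the
# gLite FTS client tools of that era; the endpoint below is a placeholder.
import subprocess
import sys
import time

FTS_ENDPOINTS = [
    "https://fts.example.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer",
]
REFRESH_INTERVAL = 3600  # roughly hourly, as suggested in the minutes


def refresh_delegation(endpoint: str) -> bool:
    """Re-delegate the current VOMS proxy to one FTS service endpoint."""
    result = subprocess.run(
        ["glite-delegation-init", "-s", endpoint],  # assumed CLI, see note above
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"delegation to {endpoint} failed: {result.stderr.strip()}", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    while True:
        for ep in FTS_ENDPOINTS:
            refresh_delegation(ep)
        time.sleep(REFRESH_INTERVAL)
```

This would be run from cron or as a long-lived process under the experiment's production credentials; a valid VOMS proxy must already exist for the delegation to succeed.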

DB services (CERN) report: No problems so far. ATLAS have requested an intervention on 26 Feb to complete the Oracle critical patches and also the fix to the problem affecting streams replication of compressed conditions data from the online to offline databases. We will check with LHCb whether they would also accept a 2 hour intervention on this date (LHCb Tier 1 LFC updates will be delayed during this period but should then only need a few minutes to resynchronise); otherwise we can reschedule for March.

Monitoring / dashboard report:

Release update: There will be a dCache patch release today or tomorrow which will fix all currently known problems.

Questions/comments from sites/experiments:

AOB:

Wednesday:

elog review: New entries from all experiments - see their reports.

Experiment report(s):

CMS - (AS) at 03:30 transfers stopped due to a bug in PhEDEx associated with a bad local configuration file. CMS observed that from FNAL they could not ping the CERN RAC servers, but this is our normal firewall configuration.

LHCb - (RS) yesterday ran a planned 6 hour test of data flowing from the pit to the T0 and on to the T1 sites, where reconstruction was run. They will decide their next steps later today, perhaps a 24 hour run. Several jobs lost their data connections at IN2P3, possibly due to a failing dcap door. At FZK they need to recreate their shared software area, and at CNAF jobs are not running but the reason is not yet known.

ALICE - (PM) Transfers to FZK resumed overnight then stopped again. SARA had deployed a new VO-box but there were no transfers. Corrupted proxies were found and replaced and FZK transfers have resumed, but there are still none to SARA. At CNAF the VO-box was having NFS problems, and NDGF is waiting for the installation of the gLite UI.

ATLAS - (SC) the FTS corrupted proxy problem happened again at 05:30 CET and was fixed at 09:30. The client workaround reported yesterday has now been installed at CERN and the ATLAS Tier 1 sites. Currently transfers are only running at 10-20% as a trigger-DAQ activity has started, but hopefully this will not last long. The nominal export rate for ATLAS has now dropped from 1200 MB/sec to 900 MB/sec as BNL will no longer take the full ESD. M.Ernst of BNL said this was wrong - they will take the full ESD but store some of it on tape instead of putting it all on disk. To be clarified within ATLAS. He also reported still seeing a 50% failure rate on transfers, many with "file does not exist" errors. SC said there was a CASTOR problem reporting disk full at the source, which he thought implied a failed disk-to-disk copy. The CASTOR team is investigating.

Core services (CERN) report: The corrupted FTS proxy problem is now understood and a Savannah bug has been entered. The corrupted proxy workaround recipes mentioned yesterday have been documented under the CCRC'08 service issues Twiki. Experiments should deploy one of the recipes at CERN and all Tier 1 sites. The solution of running a frequent system cron job looking for and removing corrupt proxies is about to be deployed at CERN.

DB services (CERN) report: The intervention on the ATLAS and LHCb databases reported yesterday is now confirmed for 26 Feb. There will be a two hour delay in LFC replication for LHCb, which should then catch up quickly.

Monitoring / dashboard report:

Release update:

Questions/comments from sites/experiments:

AOB: The detailed time-line post-mortem made by the CASTOR team, together with Tier 0 experts, of the various problems between 14 and 19 February is attached in the Twiki page Post-mortem of events starting Feb 14th.

Thursday:

elog review: Four new entries but no new problems.

Experiment report(s):

CMS (AS): We are asking CASTOR operations to increase the T1-export pool size. We would like to sustain a stable rate out of CERN of over 500 MB/sec to the maximum possible number of Tier 1 sites.

ATLAS (SC): Tier 0 export restarted this morning at full steam and we are also replicating ESD data between Tier 1 sites. We will continue like this until Monday and then decide what to do. Note that the cosmics M6 run starts in the first week of March. We would like CASTOR operations to increase the garbage collection time window from 24 to 48 hours.

LHCb (RS): We are struggling to stabilise our machinery by running 6 hour tests. Seeing various problems at IN2P3. Also we find that deleting files from the CASTOR nameserver does not delete them from the disk cache. M.Santos said this was a feature of CASTOR 2.1.4, fixed in 2.1.6. LHCb should do a stager_rm as well as the nsrm while using 2.1.4.
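
As a hedged illustration of that two-step deletion, a small sketch follows; the stager_rm '-M' option is assumed from the CASTOR client tools of that period, and the exact invocation should be checked against the CASTOR documentation.

```python
# Hypothetical helper for CASTOR 2.1.4: drop the disk-cache copy with stager_rm
# before removing the nameserver entry with nsrm, as suggested by M.Santos.
# The '-M' option to stager_rm is assumed; verify against the local CASTOR docs.
import subprocess
import sys


def castor_delete(castor_path: str) -> None:
    """Remove a CASTOR file from both the stager (disk cache) and the nameserver."""
    # 1. Ask the stager to forget any disk copies of the file.
    subprocess.run(["stager_rm", "-M", castor_path], check=False)
    # 2. Remove the nameserver entry itself.
    subprocess.run(["nsrm", castor_path], check=True)


if __name__ == "__main__":
    for path in sys.argv[1:]:
        castor_delete(path)
```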

ALICE (PM):

    1. Yesterday I submitted a ticket via elog to explain that ALICE had no access to the SRM service at SARA; the error message was "The path was not recognized". Looking into the problem I realised that the SRM endpoint was wrongly configured on the ALICE side, so I reported it to ALICE and the message disappeared. I replied to the corresponding ticket this morning to close it.
    2. However, I opened a new ticket this morning, also related to SARA. It seems ALICE does not have space available at the site. This issue should be followed up.
    3. Just after the meeting I will get in touch with the site managers of all the ALICE T1 sites to have the FTS client workaround for the corrupted proxies applied, following Gavin's email and Harry's minutes.
    4. During the TF meeting we will also discuss the situation with the CASTOR2 sites and the NDGF status, so I will update you on these points after the meeting.

Core services (CERN) report:

DB services (CERN) report:

  • This morning around 11:00 we had to restart node 2 of the CMS database cluster because this node was not responding to requests properly. The reason for this is still being investigated based on the server logs.
  • The CMS service stayed up (using the other 7 nodes) and only a few sessions from the CMS dashboard application have been affected by the restart.

Monitoring / dashboard report:

Release update:

Questions/comments from sites/experiments:

AOB:

Friday:

elog review: see the experiment reports.

Experiment report(s):

ALICE (PM): The SARA space problem was fixed yesterday and ALICE have been transferring to them at a steady 25 MB/sec. There is a problem with writing to dCache with xrootd that should be fixed on Monday. CNAF are still trying to solve an NFS problem with the VO-box. There is no ALICE representative at Lyon to act on the corrupted proxy issue - HRR will follow up.

ATLAS: seeing many problems with inter-Tier 1 transfers. From the elog, PIC and BNL have corrupted proxies. HRR will tell them that the workarounds have been documented (by Gavin) under the CCRC'08 service issues page at https://twiki.cern.ch/twiki/bin/view/LCG/ServiceIssuesFtsProxyCorruption

CMS (AS): The PhEDEx problem happened again but with much smaller impact.

LHCb (by mail from N.Brook):

    1. We had problems with dCache at IN2P3: FTS transfers were failing to LHCb_RAW (T1D0). It turned out that the reserved space for the space token was fully allocated: we had deleted files but the reservation was not released. Can other dCache sites be made aware of this.
    2. We are still having issues at IN2P3 with dcap doors and losing connections to data servers when running applications. This causes the application to crash and cannot be caught at the LHCb application level.
    3. In general we have been running transfers successfully. The data processing has gone less smoothly, with issues on our side as well as the aforementioned dCache issue.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

Questions/comments from sites/experiments:

AOB:
