US ATLAS Computing Integration and Operations
Virtual room: your office
13:00 → 13:15  Top of the Meeting (15m)
Speaker: Robert William Gardner Jr (University of Chicago (US))
- OSG AHM reminder: https://indico.fnal.gov/conferenceDisplay.py?confId=10571, and US ATLAS Facilities meeting, March 14-17, 2016, Clemson University.
- ESnet LHCONE site coordinator meetings:
  - OU: Friday, Feb 12, 12 pm CST (done)
  - IU: Thursday, Feb 11, 9 am CST (done)
  - UTA: Wednesday, Feb 17, 2 pm CST (today)
  - BU: being discussed (NOX versus MIT at MANLAN); will check status bi-weekly
  - Duke: likely this Friday
  - UT/TACC: TBD, pending LEARN-ESnet peering
- OSG 3.3 upgrade discussion. OSG 3.2 will be deprecated.
13:15 → 13:25  Capacity News: Procurements & Retirements (10m)
13:25 → 13:35  Production (10m)
Speaker: Mark Sosebee (University of Texas at Arlington (US))
13:35 → 13:40  Data Management (5m)
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
13:40 → 13:45  Data Transfers (5m)
Speaker: Hironori Ito (Brookhaven National Laboratory (US))
13:45 → 13:50  Networks (5m)
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
13:50 → 13:55  FAX and Xrootd Caching (5m)
Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
14:15 → 15:15  Site Reports
14:15  BNL (5m)
Speaker: Michael Ernst
Smooth operations for the last two weeks
- Running at capacity, mostly MCORE (production) jobs
- Heavily dominated by reprocessing jobs (2015 data) for the last ~10 days
- Disk storage is tight; free space is <1 PB
Preparing for the AWS 100k-core scale test with the Event Service, scheduled for next week
min/max RSS implemented according to the ADC request
14:20  AGLT2 (5m)
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
No news is good news? We are running well.
Preparation is ongoing for a full software update at AGLT2. This will include the glibc fixes, which should appear in our RPM repos overnight. The osg-wn-client will be 3.3.8 (but may move to 3.3.9), CVMFS will be 2.1.20, and HTCondor will be 8.4.3. OSG-CE work on our test gatekeeper is ongoing today, using the current 3.3.9 version.
The dcap and lcg-util RPMs, which are still needed by our lsm* utilities, were taken from EPEL.
Next week we plan a likely upgrade of dCache from the 2.10 series to the 2.13 series.
Note that the OSG RPM suites require Java 1.7, but if Java 1.8 is also installed and set as the default, the OSG software still runs fine, so we will make Java 1.8 the default everywhere (a small version-check sketch follows). See OSG ticket https://ticket.opensciencegrid.org/28484
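As a quick sanity check for that default-Java switch, the snippet below reads the active `java -version`; it is an illustrative sketch, not part of AGLT2's actual procedure.

```python
# Hypothetical helper: confirm the default `java` is 1.8 after the switch,
# while a 1.7 JDK remains installed for the OSG RPM dependencies.
import re
import subprocess

# `java -version` prints to stderr, e.g.: java version "1.8.0_71"
out = subprocess.run(["java", "-version"], capture_output=True, text=True)
match = re.search(r'version "(\d+\.\d+)', out.stderr)
print("default java:", match.group(1) if match else "unknown")
```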
14:25  MWT2 (5m)
Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site has been running well
- Full of ATLAS jobs (MCORE, SCORE, Analy, and opportunistic)
- Good efficiency
Illinois down for preventive maintenance (PM) on the campus cluster
Updated glibc pushed to all nodes
New disk at UChicago
- Ceph-based
- Migrating LOCALGROUPDISK
- Lincoln is using gfal-copy over SRM to copy from dCache to Ceph, but it is slow (a transfer sketch follows this list)
- Currently migrated 193 TB out of 368 TB
- Need kernel 4.4 to fix controller problems
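For reference, each migration copy could look roughly like the sketch below, using the gfal2 Python bindings; the endpoint URLs and tuning values are hypothetical placeholders, not MWT2's actual configuration.

```python
# Minimal sketch of one dCache -> Ceph file copy via the gfal2 Python
# bindings. SRC/DST are hypothetical placeholders.
import gfal2

SRC = "srm://dcache.mwt2.example:8443/pnfs/mwt2/localgroupdisk/file1"
DST = "root://ceph-gw.uchicago.example//localgroupdisk/file1"

ctx = gfal2.creat_context()
params = ctx.transfer_parameters()
params.overwrite = True        # replace partial copies left by failed attempts
params.checksum_check = True   # verify source and destination checksums match
params.timeout = 3600          # seconds; large files over SRM can be slow
ctx.filecopy(params, SRC, DST)
```

Running many such copies in parallel workers is one common way to amortize the per-transfer SRM overhead.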
OSG 3.3.9
- All head nodes have been running the 3.3.x stack for a long time without problems:
  - CE (HTCondor-CE)
  - Squid
  - CVMFS servers/clients
  - GUMS
  - Condor 8.4.3
- Still using 3.2.35 on worker nodes
- Testing the new LSM
  - Uses GFAL2 via xrootd, then SRM, then FAX to try to stage in the file (a fallback sketch follows below)
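The xrootd → SRM → FAX ordering amounts to a plain fallback loop; here is a minimal sketch under assumed, hypothetical replica URLs (not the actual LSM code):

```python
# Sketch of an LSM-style stage-in fallback: try a direct xrootd read,
# then SRM, then the FAX federation redirector. All URLs are hypothetical.
import gfal2

SOURCES = [
    "root://xrootd.mwt2.example//pnfs/mwt2/atlas/file1",     # local xrootd
    "srm://dcache.mwt2.example:8443/pnfs/mwt2/atlas/file1",  # SRM
    "root://faxredirector.example//atlas/rucio/file1",       # FAX
]

def stage_in(dest="file:///scratch/file1"):
    """Copy from the first working source; return the URL that succeeded."""
    ctx = gfal2.creat_context()
    for src in SOURCES:
        try:
            ctx.filecopy(src, dest)
            return src
        except gfal2.GError as err:
            print("stage-in from %s failed: %s" % (src, err))
    raise RuntimeError("all stage-in sources failed")
```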
minRSS and maxRSS now set
- MCORE needs 24 GB for reprocessing jobs
- Changed HTCondor-CE to request an RSS of 24 GB (previously 16 GB)
- Many nodes have only 2 GB/core, so this can cause idle cores due to lack of free memory (see the arithmetic sketch below)
- Might create MWT2_MCORE_HIMEM to handle jobs needing more than 2 GB/core
  - Redirect only to nodes with 3 GB or more per core
  - MWT2 has almost 4000 cores that fit this criterion
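The idle-core effect is simple arithmetic; the sketch below illustrates it with example node sizes, which are assumptions rather than an actual MWT2 inventory:

```python
# Why a 24 GB RSS request strands cores on 2 GB/core nodes: an 8-core
# MCORE slot now needs 3 GB/core, so RAM, not cores, becomes the limit.
def mcore_slots(node_cores, node_ram_gb, job_cores=8, job_rss_gb=24):
    """Number of MCORE jobs a node can host, limited by cores and RAM."""
    return min(node_cores // job_cores, int(node_ram_gb // job_rss_gb))

print(mcore_slots(16, 32))  # 2 GB/core node: RAM fits 1 job, 8 cores idle
print(mcore_slots(16, 48))  # 3 GB/core node: both 8-core slots filled
```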
14:30  NET2 (5m)
Speaker: Prof. Saul Youssef (Boston University (US))
We have had smooth operations over the past two weeks, with only one brief SRM incident. We are running lots of reprocessing (maxrss is already 24000 MB for our MCORE queues). The main things we are working on are:
- Bringing the new storage online and into GPFS
- Transitioning to HTCondor-CE on the BU and HU sides
- Re-formulating our WAN update plan
- Preparing to join LHCONE
- Working with MOC to add a pool of MOC worker nodes
There is also a problem with HU availability reporting via SAM, which is broken and still needs to be tracked down.
14:35  SWT2-OU (5m)
Speaker: Dr Horst Severini (University of Oklahoma (US))
14:40  SWT2-UTA (5m)
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
14:45  WT2 (5m)
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
15:15 → 15:20  AOB (5m)