US ATLAS Computing Integration and Operations

Name: US ATLAS Computing Integration and Operations
Start: 2017-02-01T13:00:00-05:00
End: 2017-02-01T16:10:00-05:00
Location: No location set

Wednesday 1 Feb 2017, 13:00 → 16:10 US/Eastern

Description

Notes and other material available in the US ATLAS Integration Program Twiki

- 13:00 → 13:15
  
  Top of the Meeting 15m
  
  Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Robert William Gardner Jr (University of Chicago (US))
- 13:15 → 13:20
  
  ADC news and issues 5m
  
  Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
  
  The big item being pushed by the ADC now is a switch to the new pilot and Site Mover Configuration. For sites using their own LSM, as long as that LSM conforms to the "standard", there should be little to no issue.
  
  Jose has been running a low level of such pilots to all sites for some time and has noted only Lucille has any issues. Horst and Joel are working with Jose to understand this.
  
  With the new mover configuration, attributes in PandaQueue such as seprodpath, copyprefixin, copyprefix, etc. will no longer be needed and will be set to some unusable value to ensure they do not crop up in unexpected ways. Eventually these will be entirely eliminated from AGIS.
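
  A site could audit its own queue definitions for these legacy attributes before they are blanked out. The following is a minimal sketch, assuming the queue configuration has already been loaded into a Python dict. The dict layout, the function name `find_legacy_mover_attrs`, and the sample values are hypothetical and do not reflect the actual AGIS schema; only the three attribute names come from the notes above.

```python
# Attribute names are taken from the meeting notes; everything else is illustrative.
LEGACY_MOVER_ATTRS = ("seprodpath", "copyprefixin", "copyprefix")


def find_legacy_mover_attrs(queue_config):
    """Return the legacy mover attributes that still hold a non-empty value."""
    return {
        name: queue_config[name]
        for name in LEGACY_MOVER_ATTRS
        if queue_config.get(name) not in (None, "")
    }


# Hypothetical queue record, not a real AGIS export.
example_queue = {
    "name": "EXAMPLE_QUEUE",
    "copyprefix": "srm://old.example.org/^root://new.example.org/",
    "copyprefixin": "",
    "maxtime": 172800,
}

# Only "copyprefix" is reported: "copyprefixin" is empty and "seprodpath" is absent.
print(find_legacy_mover_attrs(example_queue))
```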
  
  There are plenty of events in the MC queue to keep sites busy for some time to come.
  
  @TCB on Monday: reported the status of the space reporting JSON. This is less an issue with sites compared to SRM itself. OSG has a pre-beta version of the LVS document.
- 13:20 → 13:30
  
  Production 10m
  
  Speaker: Mark Sosebee (University of Texas at Arlington (US))
  
  shift-summary-1_11_17.pdf
  
  shift-summary-1_18_17.pdf
  
  shift-summary-1_25_17.pdf
  
  shift-summary-2_1_17.pdf
- 13:30 → 13:35
  
  Data Management 5m
  
  Speaker: Armen Vartapetian (University of Texas at Arlington (US))
  
  170201_DataManagement_Armen.pdf
- 13:35 → 13:40
  
  Data transfers 5m
  
  Speaker: Hironori Ito (Brookhaven National Laboratory (US))
- 13:40 → 13:45
  
  Networks 5m
  
  Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
- 13:45 → 13:50
  
  FAX and Xrootd Caching 5m
  
  Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
- 13:50 → 14:10
  
  Site movers 20m
  
  Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  
  USsitesUse_newmover20170131.txt
  
  Ilija/Alexey: US analysis sites with "direct_access_lan" will not move for now.
  
  Ale/Saul: confusion over what LSM refers to: propose to call site level script "LSM", and call the python code in Panda pilot "Site Mover Configuration" or "LSM driver".
  
  David: should check the section on "Associated DDM Storages" to make certain the list of DDM endpoints is selected in the correct order for the Panda queues.
  
  Joel/Horst: LUCILLE moved to the new movers and sees errors. Note: LUCILLE_MCORE uses the xrdcp mover for read and lsm for write. LUCILLE_CE uses xrdcp/lsm for read and lcgcp for write/log. Is this correct?
  
  Jose's summary on site movers and plan on moving APF:
  
  -- we have been running for a long while a special factory setup that uses Alexey's pilot (instead of Paul's pilot) with a new input option -m1. This pilot is, as I understand, using the new movers. Alexey can correct me there if I am wrong. This dedicated setup has been submitting, at a very low rate (average 1 pilot per hour), to every single queue in US ATLAS, including T1 and all T2. If some queue is missing, that was just by mistake, not on purpose.
  
  -- As Alexey pointed out earlier, ALMOST all queues in that setup show jobs finished successfully. I understand that means that those sites are ready to work with the new movers.
  
  -- and yes, to make it as clear as possible, lsm is being respected. Nobody is talking about replacing lsm, or changing it, or whatever
  
  -- there is one site that seems to be failing 100% of the time with Alexey's pilot: LUCILLE. This needs to be sorted out.
  
  -- I have checked the failure rate of every queue with both pilots (Alexey's and Paul's) for an entire day of production. Numbers are quite similar. The fact that we are running very few in one case plays against it, so I am not very concerned about the exact differences. The important thing is that they are similar.
  
  -- My plan now was to move an entire factory (currently 1/3 of total production) to run only Alexey's pilot, in order to have a sense of what happens when scaling up. We have never moved anything from 0 to 100% at the factories for US ATLAS. If having an entire factory running Alexey's pilot works fine, and failure rates are similar to the other factories, queue by queue, then I planned to switch queues in AGIS. This step assumes that running Alexey's pilot is equivalent to running Paul's pilot + AGIS setup for new movers.
  
  -- If sites start playing with AGIS on their own, in parallel, my plan goes to hell. So I am going to leave it to sites to decide if they still want me to take care of this or they prefer to run things on their own.
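
  The queue-by-queue comparison of failure rates described above could be sketched as follows. This is an illustration only: the function names, the 5% tolerance, and the job counts are assumptions made up for the example and are not taken from the factory monitoring. As the summary notes, a pilot that ran very few jobs gives a noisy rate, so flagged queues deserve a second look rather than an automatic verdict.

```python
def failure_rate(failed, total):
    """Fraction of failed jobs, or None when the queue ran no jobs."""
    return failed / total if total else None


def compare_pilots(counts_a, counts_b, tolerance=0.05):
    """Flag queues whose failure rates differ by more than `tolerance`.

    Each argument maps queue name -> (failed_jobs, total_jobs).
    Queues seen with only one pilot are skipped.
    """
    flagged = {}
    for queue in sorted(set(counts_a) & set(counts_b)):
        rate_a = failure_rate(*counts_a[queue])
        rate_b = failure_rate(*counts_b[queue])
        if rate_a is None or rate_b is None:
            continue  # no jobs with one of the pilots, nothing to compare
        if abs(rate_a - rate_b) > tolerance:
            flagged[queue] = (rate_a, rate_b)
    return flagged


# Made-up counts: the low-rate test setup (about 1 pilot per hour) vs. regular production.
new_pilot = {"QUEUE_A": (1, 24), "QUEUE_B": (24, 24)}
old_pilot = {"QUEUE_A": (40, 1000), "QUEUE_B": (30, 1000)}

# QUEUE_A rates are close; QUEUE_B fails every job with the new pilot and is flagged.
print(compare_pilots(new_pilot, old_pilot))
```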
- 14:10 → 14:30
  
  OS performances testing 20m
  
  Speaker: Doug Benjamin (Duke University (US))
- 14:30 → 16:05
  Site Reports
  - 14:30
    
    BNL 5m
    
    Speaker: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR))
  - 14:35
    AGLT2 5m
    
    Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    
    AGLT2 is running smoothly.
    
    Sometime next week we will take a short SE outage to upgrade dCache, and to finish off the H330 firmware updates. This is in response to the Dell announcement that firmware prior to 25.5.0.0019 could potentially corrupt the controlled disks. Two dCache pool servers at both sites, R730xd from the most recent Dell purchases, are impacted and need to be updated. All WN updates have been completed.
    
    The dCache update that will simultaneously take place will update from our current 2.13.50 to the 2.16 Golden Release. There is a problem with the 2.13 release in that certain certificates from RU and CA are not supported prior to 2.13.52, the last of the 2.13 release series.
    
    All 2016 funds are spent. Five new R630 are in production at MSU, and of 4 N2048 switches purchased there, one is in full production, with the remaining 3 set to take the load off the last of the PC6248 at MSU towards the end of February when Mike Nila returns from 2 weeks at CERN. 13 R630 were purchased at UM. Most cabling is in place for these, and the 10Gb network switch (S4048-ON) has been received but is not yet configured. We hope to bring these into production next week. The single N2048 purchased for UM is in production with 18 32-core WN attached in a bonded 2x1Gb configuration. Kibana plots of stage-in rates show a clear improvement for these machines over the rate when attached to the PC6248.
    
    With all of the N2048 in place, the PC6248 at AGLT2 will no longer be in use for public NIC connections.
    
    Sometime in the next few weeks we will be upgrading our gatekeepers to the most recent OSG release, 3.3.20 from our current 3.3.18. This should bring into play several HTCondor-CE reporting updates that will soon be required.
    
    The list of affected controllers includes the following RAID controllers: H330, H730, H730P, H830, SD33-2S, SD33-2D
  - 14:40
    MWT2 5m
    
    Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
    
    Site is full of jobs.
    
    Switch problems at UChicago have a number of nodes offline for ATLAS jobs.

    Cooling problem casualty was the old Cisco 6509.

    Due to stability problems, nodes on this switch are running opportunistic jobs only. A small number of compute nodes are affected.
    
    New Juniper top of rack switches are on order
    
    New Purchases

    - UChicago: 26 R430, 1040 cores (installed, waiting for new switches and cables)
    - Indiana: 15 R430, 600 cores
    - Illinois: 8 C6320, 448 cores

    MWT2 site total will be 18520 cores, 192K HS06.
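
    As a quick consistency check on the purchase figures above, the arithmetic can be sketched as below. The numbers are the ones quoted in this report; reading "192K HS06" as 192,000 and treating the "cores" figures as whatever unit the site counts (possibly logical cores) are assumptions.

```python
# Figures quoted in the MWT2 report; the dict layout is just for illustration.
purchases = {
    "UChicago": {"model": "R430", "units": 26, "cores": 1040},
    "Indiana": {"model": "R430", "units": 15, "cores": 600},
    "Illinois": {"model": "C6320", "units": 8, "cores": 448},
}
site_total_cores = 18520
site_total_hs06 = 192_000  # "192K HS06" read as 192,000 (assumption)

# Cores added by the new purchases across the three campuses.
new_cores = sum(p["cores"] for p in purchases.values())

# Cores per purchased unit, as implied by the quoted totals.
cores_per_unit = {site: p["cores"] // p["units"] for site, p in purchases.items()}

# Average HS06 per core for the whole site after the purchases.
hs06_per_core = site_total_hs06 / site_total_cores

print(new_cores)                # 2088
print(cores_per_unit)           # {'UChicago': 40, 'Indiana': 40, 'Illinois': 56}
print(round(hs06_per_core, 1))  # 10.4
```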
    
    Firmware on all Dell nodes upgraded to avoid data corruption
    
    New Movers
    
    All Panda queues for MWT2 and CONNECT are now using the "new movers" configuration.

    MWT2 LSM is in use without any changes.

    CONNECT uses the "gfal2" mover, except Bluewaters, which uses lsm.
    
    OSG 3.3.20 installed on all gatekeepers
    
    Waiting on instruction from Xin on how to report resources to AGIS
    
    dCache upgraded to 2.13.51
    
    Problem with certs from CA and RU fixed in this point release
    
    Will be looking to move to 2.16 in the near future
    
    MWT2 face to face meeting this week in Urbana
  - 14:45
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    22/23 new worker nodes are installed and in production.
    
    1.5 PB of storage is installed; testing will begin once we fix a bad cable issue.
    
    Will check with Jose and update AGIS config.
    
    Very close to switching over to 100% HTCondor on the BU side.
    
    Still need to update the spreadsheets.
    
    Will add JSON following Wei's instructions.
  - 14:50
    
    SWT2-OU 5m
    
    Speaker: Dr Horst Severini (University of Oklahoma (US))
    
    - all sites running well
    
    - some issues with new site movers at Lucille, looking into it
    
    - updated to latest OSG 3.3 on OU_OSCER_ATLAS, which caused problems with blahp communicating with SLURM 15.8 again, so backed out of that change again. Working with OSG experts on resolving that.
  - 14:55
    
    SWT2-UTA 5m
    
    Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
  - 15:00
    
    WT2 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
- 16:05 → 16:10
  
  AOB 5m