US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.6 targeted for end of February: https://opensciencegrid.atlassian.net/browse/SOFTWARE-4282 . Highlights include:

      • Dropping GridFTP and GSI
      • Upcoming repositories will be available per release series (i.e., GridFTP and XRootD 5 will be available in OSG 3.5)
      • Starts the timer on OSG 3.5 retirement (targeted for Feb '22)
      • OSG 3.6 will follow a rolling release model
    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 15m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))

      Eric reporting (Xin had to leave unexpectedly)

      - Smooth running during the break

      - dCache filled up and the site was blacklisted for a couple of days around 12/31. Prompt reaction from ADC, which deleted data. Full incident analysis is ongoing.

      - FTS transfer nodes were disturbed by a security scan (GGUS:150057). Bug reported to dCache.

    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))

       

      • Relatively quiet period for the tier 2 sites over the holiday:

        The two features are caused by:
        • Restarting a VRF on an enterprise network switch at IU caused the spike of jobs on 12/29.
        • The drop in running jobs over the first weekend in 2021 was caused by the harvester instance on aipanda158 having trouble, so jobs were not submitted quickly enough to keep up with demand. Kudos to Rod and FaHui, who understood and fixed this over a holiday weekend.
             ==> It seems to me that the US may be getting too big for the current submission infrastructure.
          UIUC will bring another ~3k cores online in the next few weeks, which won't help...
      • Just before the holiday period, I believe I extracted enough information from Rod Walker to understand which aipanda hosts service a given site, and whether the harvester instances are keeping up, i.e., whether unused (idle-state) cores get refilled quickly enough. Use this example URL, which has a filter that selects the aipanda VMs that service MWT2 (change the filter for your site):

        https://es-atlas.cern.ch/kibana/app/kibana#/dashboard/a312a030-8b0e-11e8-a7e3-ffbb2f24f6b4?_g=(filters:!(),refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'006980c0-857a-11ea-9233-1dd73e396ea6',key:computingsite,negate:!f,params:(query:MWT2),type:phrase,value:MWT2),query:(match:(computingsite:(query:MWT2,type:phrase))))),fullScreenMode:!f,options:(darkTheme:!f,hidePanelTitles:!f,useMargins:!t),panels:!((gridData:(h:12,i:'2',w:24,x:0,y:0),id:'263969e0-8b0e-11e8-a7e3-ffbb2f24f6b4',panelIndex:'2',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'3',w:24,x:0,y:24),id:'50bf08a0-8b0e-11e8-a7e3-ffbb2f24f6b4',panelIndex:'3',type:visualization,version:'7.1.1'),(gridData:(h:16,i:'5',w:24,x:0,y:48),id:a9207bf0-843d-11e8-a7e3-ffbb2f24f6b4,panelIndex:'5',type:visualization,version:'7.1.1'),(gridData:(h:16,i:'6',w:24,x:24,y:48),id:'6a9a1ee0-bb60-11e8-b0c4-a33eb2fc4911',panelIndex:'6',type:visualization,version:'7.1.1'),(gridData:(h:16,i:'7',w:24,x:0,y:64),id:d5fec150-bb5f-11e8-b0c4-a33eb2fc4911,panelIndex:'7',type:visualization,version:'7.1.1'),(gridData:(h:16,i:'8',w:24,x:24,y:64),id:fd3c1d70-bb60-11e8-b0c4-a33eb2fc4911,panelIndex:'8',type:visualization,version:'7.1.1'),(gridData:(h:28,i:'9',w:48,x:0,y:104),id:ed01a550-bb61-11e8-b0c4-a33eb2fc4911,panelIndex:'9',type:search,version:'7.1.1'),(gridData:(h:12,i:'10',w:24,x:0,y:36),id:ac172890-bb65-11e8-a7e3-ffbb2f24f6b4,panelIndex:'10',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'11',w:24,x:24,y:0),id:'5a658b00-bbe1-11e8-a7e3-ffbb2f24f6b4',panelIndex:'11',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'12',w:24,x:24,y:36),id:'92c0b300-bbe4-11e8-a7e3-ffbb2f24f6b4',panelIndex:'12',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'14',w:24,x:0,y:12),id:'970d6a40-bcc8-11e8-b0c4-a33eb2fc4911',panelIndex:'14',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'15',w:24,x:0,y:80),id:a7b38f30-cd5f-11e8-9c62-c3475f7c9464,panelIndex:'15',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'16',w:24,x:24,y:80),id:'6d062730-fa69-11e8-92bb-b7deb199b33d',panelIndex:'16',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'17',w:24,x:0,y:92),id:'06fbe480-fc98-11e8-92bb-b7deb199b33d',panelIndex:'17',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'18',w:24,x:24,y:24),id:'4831da90-3874-11ea-a9fb-579ba6ca4985',panelIndex:'18',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'19',w:24,x:24,y:12),id:be240500-0793-11ea-bd9d-f99303ede076,panelIndex:'19',type:visualization,version:'7.1.1')),query:(language:lucene,query:''),timeRestore:!t,title:'Harvester%20particular%20computingsite',viewMode:view)

        Look at the complex pie chart labeled harvester_submissionhost_pie:



        The doughnut hole shows the proportion of jobs submitted by each aipanda host, the inner doughnut (toroid) shows the states of the cores at MWT2, and the outer doughnut shows the states of the jobs on the submission hosts. There have to be enough "submitted" jobs on the harvester side to keep the number of idle cores below 2-3%.
        • NB: If there is a problem, there are no or few submitted jobs and that slice of the outer doughnut disappears. You will then see an increasing percentage of idle cores in the inner doughnut.
        • NB: Hovering over the entries in the legend highlights the corresponding slices, making it easier to see what is going on.
      • There is probably a much better way to do this. It was a little hard for me to understand from what Rod told me exactly what to do, and this is just my best guess from the hints given. I consulted with Ofer, and he is looking into it.
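      The "change the filter for your site" step above can be sketched as a tiny helper (the function name is hypothetical; it relies on the fact that the site name appears in the example URL only inside the computingsite filter tokens):

```python
# Hypothetical convenience helper: point the example harvester dashboard
# URL at a different site by swapping the computingsite filter value.
# In the example link, the site name (MWT2) appears only inside the
# filter/query tokens (query:MWT2, value:MWT2), so a plain substring
# replacement is sufficient.

def retarget_dashboard_url(url: str, site: str, old_site: str = "MWT2") -> str:
    return url.replace(old_site, site)

# Shortened stand-in for the full Kibana URL above:
fragment = "key:computingsite,params:(query:MWT2),value:MWT2"
print(retarget_dashboard_url(fragment, "AGLT2"))
# -> key:computingsite,params:(query:AGLT2),value:AGLT2
```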
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        No GGUS tickets.

        Smooth operation during the holidays. Jobs drained between Jan 1st and 3rd and started to ramp up at midnight on the 3rd; other than that, Condor cluster usage remained high (>96%).

        About 15 worker nodes lacked a squashfs RPM, which led to BOINC job failures; reinstalled the RPM and re-enabled BOINC on those nodes.
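        A minimal sketch of how one might scan a node for that missing RPM (the package name squashfs-tools is an assumption; the minutes only say "a squashfs rpm"):

```python
import subprocess

PKG = "squashfs-tools"  # assumed package name; the note only says "a squashfs rpm"

def rpm_missing(rpm_q_output: str) -> bool:
    # `rpm -q <pkg>` prints "package <pkg> is not installed" when absent
    return "is not installed" in rpm_q_output

def node_missing_rpm(pkg: str = PKG) -> bool:
    # Run on each worker node (e.g. via ssh or your config management)
    out = subprocess.run(["rpm", "-q", pkg], capture_output=True, text=True)
    return rpm_missing(out.stdout + out.stderr)
```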

        Only one notable hardware incident: one R740XD2 crashed with multiple complex symptoms (PCIe error from the NIC, memory bit errors, iDRAC not reachable over HTTP). The fix involved swapping two DIMMs (possibly neither the root cause nor the real solution) and updating all firmware, especially BIOS and NIC. Dell recommends updating the BIOS (R740XD2 to >=2.8.2).

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upstream changes in the FTS configuration for MWT2_DATADISK broke our WebDAV door just before the break. Ticket on hold until Alessandra and Paul have time to help debug.

        IU Brocade started having problems again over the holiday break. GGUS ticket now closed

        New UIUC team member coming on board

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Smooth operations over the break.

        Site full except for the dip experienced at all sites.

        Ongoing work on OSG 3.5, Xrootd endpoints, NESE Tape

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        Mostly quiet during the break.

        Updated GridFTP servers to OSG 3.5; Squid servers will be done later today or tomorrow.

         

        OU:

        Nothing to report, everything running well.

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), lincoln bryant
    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Quiet time... no requests over the break. Not sure if everything worked or if people didn't want to disturb us.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind

      Fairly quiet over the break as expected.

      • Harvester node submission problem: looking with Fred and Horst at the best way to detect such issues
      • HTTP-TPC transfer issues at MWT2 are under investigation
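
      A possible starting point for the detection question: per the Tier-2 report, the tell-tale symptom is a submission host with no or few "submitted" jobs for a site. A sketch, with a hypothetical input format (in practice the per-host counts would come from the es-atlas harvester indices):

```python
# Sketch: flag submission hosts whose "submitted" job count for a site
# has dropped to (near) zero, the symptom seen when aipanda158 stalled.
# The counts dict is a hypothetical stand-in for data pulled from the
# harvester Kibana/Elasticsearch dashboard.

def stalled_hosts(submitted_by_host, threshold=1):
    """Return submission hosts with fewer than `threshold` submitted jobs."""
    return sorted(h for h, n in submitted_by_host.items() if n < threshold)

counts = {"aipanda158": 0, "aipanda999": 250}  # example numbers, made up
print(stalled_hosts(counts))  # -> ['aipanda158']
```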
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        All XCaches were working fine.

        VP was working fine. 

        ES was fine. 

        Working on XCache 5.1.0 rc4 image.

        Adding heartbeat to both XCache and VP.

      • 14:30
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        Continuing to gain experience with the UTA K8S test cluster, to better understand the environment, and looking into interfacing it with the ATLAS workload management system.

        Meanwhile, increased the cluster size to a total of 224 worker cores, and upgraded to the latest K8S version, 1.20.1.

        Next step: try to set up and run ATLAS jobs.

    • 14:40 14:45
      AOB 5m