US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.6 targeted for end of February: https://opensciencegrid.atlassian.net/browse/SOFTWARE-4282 . Highlights include:

      • Dropping GridFTP and GSI
      • Upcoming repositories will be available per release series (i.e., GridFTP and XRootD 5 will be available in OSG 3.5)
      • Starts the timer on OSG 3.5 retirement (targeted for Feb '22)
      • OSG 3.6 will follow a rolling release model
    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 15m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))

      Eric reporting (Xin had to leave unexpectedly)

      - Smooth running during the break

      - dCache filled up and the site was blacklisted for a couple of days around 12/31. Prompt reaction from ADC, which deleted data. Full incident analysis is ongoing.

      - FTS transfer nodes were disturbed by a security scan (GGUS:150057). Bug reported to dCache.

    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))

       

      • Relatively quiet period for the tier 2 sites over the holiday:

        The two features are caused by:
        • Restarting a VRF on an enterprise network switch at IU caused the spike of jobs on 12/29.
        • The drop in running jobs over the first weekend in 2021 was caused by the harvester instance on aipanda158 having trouble, so jobs were not submitted quickly enough to keep up with demand. Kudos to Rod and FaHui, who understood and fixed this over a holiday weekend.
             ==> It seems to me that the US may be getting too big for the current submission infrastructure.
          UIUC will bring another ~3k cores online in the next few weeks, which won't help...
      • Just before the holiday period, I believe I extracted enough information from Rod Walker to understand which aipanda hosts service a given site, and whether the harvester instances are keeping up, i.e., whether unused (idle-state) cores get refilled quickly enough. Use this example URL, which has a filter that selects the aipanda VMs that service MWT2 (change the filter for your site):

        https://es-atlas.cern.ch/kibana/app/kibana#/dashboard/a312a030-8b0e-11e8-a7e3-ffbb2f24f6b4?_g=(filters:!(),refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'006980c0-857a-11ea-9233-1dd73e396ea6',key:computingsite,negate:!f,params:(query:MWT2),type:phrase,value:MWT2),query:(match:(computingsite:(query:MWT2,type:phrase))))),fullScreenMode:!f,options:(darkTheme:!f,hidePanelTitles:!f,useMargins:!t),panels:!((gridData:(h:12,i:'2',w:24,x:0,y:0),id:'263969e0-8b0e-11e8-a7e3-ffbb2f24f6b4',panelIndex:'2',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'3',w:24,x:0,y:24),id:'50bf08a0-8b0e-11e8-a7e3-ffbb2f24f6b4',panelIndex:'3',type:visualization,version:'7.1.1'),(gridData:(h:16,i:'5',w:24,x:0,y:48),id:a9207bf0-843d-11e8-a7e3-ffbb2f24f6b4,panelIndex:'5',type:visualization,version:'7.1.1'),(gridData:(h:16,i:'6',w:24,x:24,y:48),id:'6a9a1ee0-bb60-11e8-b0c4-a33eb2fc4911',panelIndex:'6',type:visualization,version:'7.1.1'),(gridData:(h:16,i:'7',w:24,x:0,y:64),id:d5fec150-bb5f-11e8-b0c4-a33eb2fc4911,panelIndex:'7',type:visualization,version:'7.1.1'),(gridData:(h:16,i:'8',w:24,x:24,y:64),id:fd3c1d70-bb60-11e8-b0c4-a33eb2fc4911,panelIndex:'8',type:visualization,version:'7.1.1'),(gridData:(h:28,i:'9',w:48,x:0,y:104),id:ed01a550-bb61-11e8-b0c4-a33eb2fc4911,panelIndex:'9',type:search,version:'7.1.1'),(gridData:(h:12,i:'10',w:24,x:0,y:36),id:ac172890-bb65-11e8-a7e3-ffbb2f24f6b4,panelIndex:'10',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'11',w:24,x:24,y:0),id:'5a658b00-bbe1-11e8-a7e3-ffbb2f24f6b4',panelIndex:'11',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'12',w:24,x:24,y:36),id:'92c0b300-bbe4-11e8-a7e3-ffbb2f24f6b4',panelIndex:'12',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'14',w:24,x:0,y:12),id:'970d6a40-bcc8-11e8-b0c4-a33eb2fc4911',panelIndex:'14',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'15',w:24,x:0,y:80),id:a7b38f30-cd5f-11e8-9c62-c3475f7c9464,panelIndex:'15',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'16',w:24,x:24,y:80),id:'6d062730-fa69-11e8-92bb-b7deb199b33d',panelIndex:'16',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'17',w:24,x:0,y:92),id:'06fbe480-fc98-11e8-92bb-b7deb199b33d',panelIndex:'17',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'18',w:24,x:24,y:24),id:'4831da90-3874-11ea-a9fb-579ba6ca4985',panelIndex:'18',type:visualization,version:'7.1.1'),(gridData:(h:12,i:'19',w:24,x:24,y:12),id:be240500-0793-11ea-bd9d-f99303ede076,panelIndex:'19',type:visualization,version:'7.1.1')),query:(language:lucene,query:''),timeRestore:!t,title:'Harvester%20particular%20computingsite',viewMode:view)

        Look at the complex pie chart labeled harvester_submissionhost_pie:



        The doughnut hole shows the proportion of jobs submitted by each aipanda host, the inner doughnut (toroid) shows the states of the cores at MWT2, and the outer doughnut shows the states of the jobs on the submission hosts. There have to be enough "submitted" jobs on the harvester side to keep the number of idle cores below 2-3%.
        • NB: If there is a problem, there are no or few submitted jobs and that slice of the outer doughnut disappears. You will then see an increasing percentage of idle cores in the inner doughnut.
        • NB: Hovering over the entries in the legend highlights the corresponding slices, making it easier to see what is going on.
      • There is probably a much better way to do this. It was a little hard for me to understand from what Rod told me exactly what to do, and this is just my best guess from the hints given. I consulted with Ofer, and he is looking into it.
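      The "change the filter for your site" step above can be sketched as a tiny helper (the function name is hypothetical; it relies on the fact that the site name appears in the example URL only inside the computingsite filter tokens):

```python
# Hypothetical convenience helper: point the example harvester dashboard
# URL at a different site by swapping the computingsite filter value.
# In the example link, the site name (MWT2) appears only inside the
# filter/query tokens (query:MWT2, value:MWT2), so a plain substring
# replacement is sufficient.

def retarget_dashboard_url(url: str, site: str, old_site: str = "MWT2") -> str:
    return url.replace(old_site, site)

# Shortened stand-in for the full Kibana URL above:
fragment = "key:computingsite,params:(query:MWT2),value:MWT2"
print(retarget_dashboard_url(fragment, "AGLT2"))
# -> key:computingsite,params:(query:AGLT2),value:AGLT2
```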
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        No GGUS tickets.

        Smooth operation during the holidays. Jobs drained between Jan 1st and 3rd and started to ramp up at midnight on the 3rd; other than that, Condor cluster usage remained high (>96%).

        About 15 worker nodes lacked a squashfs RPM, which led to BOINC job failures; reinstalled the RPM and re-enabled BOINC on those nodes.
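        A minimal sketch of how one might scan a node for that missing RPM (the package name squashfs-tools is an assumption; the minutes only say "a squashfs rpm"):

```python
import subprocess

PKG = "squashfs-tools"  # assumed package name; the note only says "a squashfs rpm"

def rpm_missing(rpm_q_output: str) -> bool:
    # `rpm -q <pkg>` prints "package <pkg> is not installed" when absent
    return "is not installed" in rpm_q_output

def node_missing_rpm(pkg: str = PKG) -> bool:
    # Run on each worker node (e.g. via ssh or your config management)
    out = subprocess.run(["rpm", "-q", pkg], capture_output=True, text=True)
    return rpm_missing(out.stdout + out.stderr)
```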

        Only one notable hardware incident: one R740XD2 crashed with multiple complex symptoms (PCIe error from the NIC, memory bit errors, iDRAC not reachable over HTTP). The fix involved swapping two DIMMs (possibly neither the root cause nor the real solution) and updating all firmware, especially BIOS and NIC. Dell recommends updating the BIOS (R740XD2 to >=2.8.2).

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upstream changes in the FTS configuration for MWT2_DATADISK broke our WebDAV door just before the break. Ticket on hold until Alessandra and Paul have time to help debug.

        IU Brocade started having problems again over the holiday break. GGUS ticket now closed

        New UIUC team member coming on board

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Smooth operations over the break.

        Site full except for the dip experienced at all sites.

        Ongoing work on OSG 3.5, Xrootd endpoints, NESE Tape

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        Mostly quiet during the break.

        Updated GridFTP servers to OSG 3.5; Squid servers will be done later today or tomorrow.

         

        OU:

        Nothing to report, everything running well.

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), lincoln bryant
    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Quiet time... no requests over the break. Not sure if everything worked or if people didn't want to disturb us.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind

      Fairly quiet over the break as expected.

      • Harvester node submission problem: looking with Fred and Horst at the best way to detect such issues
      • HTTP-TPC transfer issues at MWT2 are under investigation
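
      A possible starting point for the detection question: per the Tier-2 report, the tell-tale symptom is a submission host with no or few "submitted" jobs for a site. A sketch, with a hypothetical input format (in practice the per-host counts would come from the es-atlas harvester indices):

```python
# Sketch: flag submission hosts whose "submitted" job count for a site
# has dropped to (near) zero, the symptom seen when aipanda158 stalled.
# The counts dict is a hypothetical stand-in for data pulled from the
# harvester Kibana/Elasticsearch dashboard.

def stalled_hosts(submitted_by_host, threshold=1):
    """Return submission hosts with fewer than `threshold` submitted jobs."""
    return sorted(h for h, n in submitted_by_host.items() if n < threshold)

counts = {"aipanda158": 0, "aipanda999": 250}  # example numbers, made up
print(stalled_hosts(counts))  # -> ['aipanda158']
```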
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        All XCaches were working fine.

        VP was working fine. 

        ES was fine. 

        Working on XCache 5.1.0 rc4 image.

        Adding heartbeat to both XCache and VP.

      • 14:30
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        Continuing to gain experience with the UTA K8S test cluster, to better understand the environment, and looking into interfacing it with the ATLAS workload management system.

        Meanwhile, increased the cluster size to a total of 224 worker cores, and upgraded to the latest K8S version, 1.20.1.

        Next step: try to set up and run ATLAS jobs.

    • 14:40 14:45
      AOB 5m