US ATLAS Computing Integration and Operations

US/Eastern
Other Institutes


Description
Notes and other material are available in the US ATLAS Integration Program TWiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Michael Ernst, Robert William Gardner Jr (University of Chicago (US))
      Capacities
    • 13:15 13:25
      Production 10m
      Speakers: Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US))
      summary

      Mark reporting 

      • Production has been busy, with many issues owing to the Rucio migration, fallout from the accidental DDM deletion, etc., leading to failing transfers and similar problems.
      • Activity is picking up; several sites are full. More jobs than before.
      • Prodsys1 has been decommissioned. Note the change in task IDs with Prodsys2. Must use the bigpanda monitor.
      • See Jamboree last week for tutorial on the monitor.
      • New pilot release from Paul.
      • See links to ADC weeklies for relevant talks.
    • 13:25 13:30
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))

      Armen reporting

      • Main issue has been data loss due to the migration.
      • Initial number was 1M.  Revised to 3.3M lost files. 
      • Thus deletion activities have been limited.  Not all has been understood.
      • Automatic deletions are halted. Only a low level of deletions is ongoing.
      • Mostly impacting Tier1s.  E.g. 400 TB to be deleted.
      • New model will come into effect early next year.  See related Jamboree notes from last week.

       

    • 13:30 13:35
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))

      Hiro reporting

      • No transfer issues, other than the space issues.
      • FTS at BNL is being primarily used (plus a couple more clouds)
    • 13:35 13:40
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Shawn reporting

      • perfSONAR sites need to upgrade to 3.4 or later after January 9.
        • The traceroute method has changed (default was B-->A) and requires BWCTL to be running.
        • Two fixes: mesh configuration agent (do a forward rather than reverse).
        • Will likely move traceroute tests to go between BW nodes. Done centrally with mesh URL.  
      • SLAC instances need updates, as does BNL (Hiro is working on it).
      • LHCONE meeting: point-to-point circuits using NSI (new standard for inter-domain implementations).
        • A few sites participate now; others are welcome to join.
        • Goal - demonstrate circuit usage. 
    • 13:40 13:45
      FAX 5m
      Speakers: Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      Ilija reporting

      • Reconciling differences in job efficiency with Kaushik
      • Also looking at data from Hadoop
      • Will not be expanding overflow jobs until resolved.
      • Next pilot release will fix timeout issue for large files.
    • 13:45 14:45
      Site Reports
      • 13:45
        BNL 5m
        Speaker: Michael Ernst (Unknown)

        Michael

        • BNL networking working with ESnet on finalizing config for transatlantic connectivity
        • Probably joint P2P activity in January
        • Expecting delivery of WNs. 
        • Storage is high on the list. The DDN 2000-drive machine (1.8 PB usable) is getting old, and failures are becoming more frequent.
        • Also thinking about storage R&D
        • Talk of increasing ATLAS tape usage; 10,000 slot library.  Volunteered US to work with ADC on the model.
      • 13:50
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        Shawn

        • Open ticket on some step09 files with a size mismatch. Checked: the sizes differ from what Rucio reports, but the checksums match! Suspect a casualty of the Rucio deployment. Saul has observed this at other sites and reported it.
        • Getting some equipment: 35 Dell R620s that were part of a large order that got cancelled. 256 GB memory, dual 10G NICs, redundant PS, E5-2670v2 (10C).
        • Storage: MD3460 storage shelf at MSU. UM: 600 TB in MD3060s, 6 TB, Lustre over ZFS.
      • 13:55
        MWT2 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
        • Connect queues working well (analy and production) to Stampede, HU, ICC, Mietc.
          • well over 1,000 slots between the sites.
        • Procurement in progress.
        • CCC development

         

      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
        • Welcome Dave Caunt.
        • Working on procurement.
        • Nexus 7710 from MIT is being set up at MAN LAN; this will be how we peer with LHCONE
        • Very little production.
        • Problem with APF and HU CE nodes.
        • Worldwide FTS performance studies: US performance looks good, except to SARA and NIKHEF
        • Will be starting CondorCE on BU side
        • ATLAS Connect production
      • 14:05
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        OU is on LHCONE!

        Filled in cloud support list comments.

      • 14:10
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
        • Revamping of internal network has resulted in much better performance.
        • 4032 switch is now throwing errors; contacted Dell. Upgraded firmware and rebooted. Monitoring.
        • Early stages of planning next purchase.
      • 14:15
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:45 14:50
      AOB 5m

      Working on a storage purchase.  Have to make a decision soon.

      HTCondorCE is now working.  Working on the job routing configuration.