From Wei:
Vincent Garonne has moved back to Oslo and is still working on DDM. Mario Lassnig is now in charge of DDM, and Martin Barisits is in charge of RUCIO.
From Bob:
A concrete plan is nearly in place for implementing WLCG diskless sites for production. Such sites would utilize storage at a "nearby" T2 site. See: https://indico.cern.ch/event/642836/contributions/2608398/attachments/1467335/2268911/Diskless_28May.pdf
North American Throughput Meeting
=================================
31-May-2017, 10-11 AM Eastern
Attending: Dave, Ilija, Shawn, Marian, Philippe, Saul, Duncan, Garhan, Andy
https://indico.cern.ch/event/640627/
perfSONAR v4.0
- Update progress and issues
Shawn reported on OSG networking upgrades and data loss
Network Measurement Platform status and updates
Marian reported on ps_etf and meshconfig.grid.iu.edu, with a review of the services monitored (https://etf-ps.cern.ch/etf/check_mk). A sketch of consuming a meshconfig follows below.
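As a rough illustration of how a meshconfig service is consumed, here is a minimal Python sketch that downloads a mesh JSON and prints the measurement hosts it contains. The URL path is hypothetical, and the organizations -> sites -> hosts -> addresses nesting is an assumption about the classic meshconfig layout, not something stated in the meeting.

    # Sketch: list measurement hosts from a perfSONAR meshconfig JSON.
    # MESH_URL is a hypothetical example path; the nesting below assumes
    # the classic meshconfig layout (organizations -> sites -> hosts).
    import json
    import urllib.request

    MESH_URL = "https://meshconfig.grid.iu.edu/example-mesh.json"  # hypothetical

    with urllib.request.urlopen(MESH_URL) as resp:
        mesh = json.load(resp)

    for org in mesh.get("organizations", []):
        for site in org.get("sites", []):
            for host in site.get("hosts", []):
                for addr in host.get("addresses", []):
                    print(addr)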
Update on Analytics
Ilija reported on work to detect changes in packet loss, throughput, etc.; see the paper https://arxiv.org/pdf/1508.01280.pdf and the toy sketch below.
This method is being tried on the CERN-BNL link analysis. Machine learning is also being tried on perfSONAR data to find anomalies in
our data. (Someone working on Titan...need details)
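For intuition only, here is a toy Python sketch of change detection on a packet-loss series. It is a simple two-window mean-shift test, not the method from the paper above, and the window and threshold values are arbitrary assumptions.

    # Toy change detector: flag the first point where the mean of the next
    # `window` samples differs from the mean of the previous `window`
    # samples by more than `threshold` (absolute packet-loss fraction).
    def find_shift(series, window=10, threshold=0.02):
        for i in range(window, len(series) - window + 1):
            before = sum(series[i - window:i]) / window
            after = sum(series[i:i + window]) / window
            if abs(after - before) > threshold:
                return i
        return None

    # Example: packet loss steps from 0% to 5% at index 30.
    loss = [0.0] * 30 + [0.05] * 30
    print(find_shift(loss))  # flags a point near index 30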
Round-table
Saul mentioned that MGHPCC is down for maintenance and this was an opportunity to go to 100G. When the site is back up it will be
using 100G. Shawn asked about using that path for LHCONE; Saul: yes, it should be used.
Andy: minor update to pScheduler in the next few days (intermittent lock-up fix). IPv6 may be having some issues.
Marian: question about Docker support for the full toolkit? Andy: whether this will be done is being discussed and will be taken up
at next week's face-to-face in Ann Arbor.
Lots of Q&A and account setup for meshconfig.
AOB and next meeting
Demo of OpenvSwitch / OpenFlow + OpenStack for our next meeting.
Watch email for next meeting date.
From Wei and Andy:
The inverse RUCIO Name2Name component is ready. It is a plugin that identifies file replicas at different sites as the same file and thus improves the Xrootd proxy cache's hit rate; a toy illustration follows below. It requires support from Xrootd release 4.7, which will be ready soon.
Working with the RUCIO team to report the proxy cache's contents back --- in progress.
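For intuition, here is a minimal Python sketch of what an inverse name-to-name mapping buys a cache. The real component is an Xrootd plugin following RUCIO's deterministic naming; the site prefixes and paths below are purely hypothetical.

    # Sketch: strip a (hypothetical) site-specific prefix from each replica's
    # physical file name so all replicas of a file share one cache key.
    SITE_PREFIXES = [
        "/pnfs/site-a.example/atlas/rucio",    # hypothetical dCache prefix
        "/xrootd/site-b.example/atlas/rucio",  # hypothetical Xrootd prefix
    ]

    def inverse_n2n(pfn):
        for prefix in SITE_PREFIXES:
            if pfn.startswith(prefix):
                return pfn[len(prefix):]
        return pfn  # unknown prefix: pass through unchanged

    a = inverse_n2n("/pnfs/site-a.example/atlas/rucio/mc16/ab/cd/file.root")
    b = inverse_n2n("/xrootd/site-b.example/atlas/rucio/mc16/ab/cd/file.root")
    assert a == b  # both replicas map to the same cache entry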
All completed except BNL_LOCAL-condor, which needs "deprecate_oldmover = True" set. According to Xin:
We can't change it for BNL_LOCAL-condor for the time being, as the pilot
running ES jobs there isn't ready for it. The patch is said to be in
already; we can do the switch after it's released to production.
I guess we won't need this item on the agenda in the future.
Over the Memorial Day weekend dCache suddenly "crashed": everything looked "normal", but writes were timing out. This was tracked to "too many locks" and cleared by running a vacuum on all of the postgres DBs (a sketch of that step follows). A secondary issue then surfaced: the dCacheDomain was running out of memory (at 2 GB). Increasing this to 3 GB resolved the problem, and we have been running stably since then.
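A minimal sketch of that vacuum step, assuming psycopg2 and passwordless local access to the postgres instance; the connection details are placeholders, not our actual configuration.

    # Sketch: VACUUM every non-template PostgreSQL database on the head node.
    import psycopg2

    def vacuum_all(host="localhost"):
        admin = psycopg2.connect(host=host, dbname="postgres")
        admin.autocommit = True
        with admin.cursor() as cur:
            cur.execute("SELECT datname FROM pg_database WHERE NOT datistemplate")
            dbs = [row[0] for row in cur.fetchall()]
        admin.close()
        for db in dbs:
            conn = psycopg2.connect(host=host, dbname=db)
            conn.autocommit = True  # VACUUM cannot run inside a transaction
            with conn.cursor() as cur:
                cur.execute("VACUUM")
            conn.close()
            print("vacuumed", db)

    vacuum_all()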
Reminder that we will be down for a power outage from noon on Friday, June 23 until sometime on Monday, June 26, when all services can be restarted. We will do some software updates and dCache maintenance during this period.
Site is now full of jobs and operating well
Updated site to OSG 3.3.24
Retirement of several storage nodes in dCache
Illinois Campus Cluster lost hypervisors
USERDISK down to only 12 TB in use
SCRATCHDISK deletion is still an issue
We had the annual one-day MGHPCC-wide power shutdown last week. Notable improvements made:
1. Migrated NFS to new servers (mostly a Tier 3 issue)
2. 100G WAN gear was installed and configured. Use of 100G now waits only for NoX to switch us over.
3. USERDISK is almost empty, according to plan. Storage was moved to other tokens as requested by Armen.
4. Lots of NESE activity. A CEPH cluster built from Harvard-contributed equipment is serving as a test ATLAS DDM endpoint.
Smooth operations with only minor problems. High level of LIGO jobs for a few days.
- Nothing to report; all sites are running well
- We're seeing some lost-heartbeat jobs, but we believe they are not site-related: we're seeing them at multiple sites, BU is seeing them as well (right now, Tuesday afternoon), and in the past we've never been able to find a local source for them. We believe they are PanDA-related.