US ATLAS Computing Integration and Operations
- 13:00 → 13:05  Top of the Meeting (5m). Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
- 13:05 → 13:10  HPC integration (5m). Speaker: Doug Benjamin (Duke University (US))
- 13:10 → 13:15
- 13:20 → 13:25  Production (5m). Speaker: Mark Sosebee (University of Texas at Arlington (US))
- 13:25 → 13:35  OSG-LHC (10m). Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
- 13:35 → 13:40  Data Management (5m). Speaker: Armen Vartapetian (University of Texas at Arlington (US))
The dumps for BNL were provided by Hiro. A DDM Ops JIRA ticket was opened: https://its.cern.ch/jira/browse/ATLDDMOPS-5465 . Consistency checks started running last week and found 1M dark files on the DATADISK and 120k files on the SCRATCHDISK. After the deletions there still appears to be a significant leftover, which could be a reporting issue or unreported usage. (A minimal sketch of the underlying comparison is shown after the table below.)
Dark Data situation (numbers with "-" mean the storage reports less than the size in Rucio):

Site        DATADISK   SCRATCHDISK   LOCALGROUPDISK
BNL              390           110               16
AGLT2              9             1                1
MWT2               9             8                2
NET2              12             5                0
OU_OSCER        -185            -2                0
SWT2               3             1                0
WT2              -81            -5                1
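For reference, a minimal sketch of the kind of comparison the consistency check performs, assuming two plain-text dumps with one file path per line (the file names "storage_dump.txt" and "rucio_dump.txt" are hypothetical inputs, not the actual DDM tooling):

    # Sketch only: flag dark files (on storage but unknown to Rucio) and
    # lost files (in Rucio but missing on storage) by set difference.
    # Real dumps would first need to be normalized to a common path convention.

    def load_paths(filename):
        """Read one file path per line, ignoring blank lines."""
        with open(filename) as f:
            return {line.strip() for line in f if line.strip()}

    storage = load_paths("storage_dump.txt")  # what the storage element holds
    rucio = load_paths("rucio_dump.txt")      # what Rucio believes is at the site

    dark = storage - rucio   # on disk but not in Rucio -> deletion candidates
    lost = rucio - storage   # in Rucio but not on disk -> to be declared lost

    print(f"dark files: {len(dark)}, lost files: {len(lost)}")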
The IT monitoring team is still working to repopulate the missing data in the DDM Accounting dashboard (the SNOW ticket I opened two weeks ago, INC1705039). Right now the storage values for the 8 most recent days are still missing.
A hands-on session on the new DDM dashboard, with the possibility to give feedback on issues we would like addressed, will be held on Friday, Nov. 8 at 15:00 CET.
- 13:45 → 13:50  Networking (5m). Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
Lots of meetings:
- LHCONE/LHCOPN https://indico.cern.ch/event/725706/
- IRIS-HEP https://indico.cern.ch/event/755573/timetable/
- OSG Retreat https://indico.fnal.gov/event/18117/timetable/#20181107
All of these have network discussions and presentations. In addition, there was a perfSONAR face-to-face meeting in Orlando two weeks ago (still no URL for the presentations).
There are some known issues with MaDDash, causing our meshes to appear to have less data than they actually do. We are working on getting fixes into the perfSONAR developers' timeline.
Looking for input and collaboration on the HEPiX network function virtualization effort. See details in the presentation at: https://indico.cern.ch/event/725706/contributions/3169183/attachments/1744902/2824548/HEPiX_Network_Functions_Virtualisation_Working_Group_F2F_Meeting.pdf
Shawn
- 13:50 → 13:55  Data delivery and analytics (5m). Speaker: Ilija Vukotic (University of Chicago (US))
- 13:55 → 14:30  Site Reports
- 13:55 → 14:00  AGLT2 (5m). Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
The PostgreSQL slave server for the dCache head node lost its hard drive. We had to rebuild the machine with 2 new hard disks, and we are in the process of restoring the slave database and setting up SRMwatch for dCache. We also took this chance to upgrade the host from SLC6 to CentOS 7 and to use ZFS to host the PostgreSQL database.
Some of our blade worker nodes show unusually high load (over 200) without any jobs running, and the load goes down when we turn off HTCondor on the node. Some of the affected worker nodes have disk issues, some do not. We do not yet understand the situation; we updated a few worker nodes from HTCondor 8.4.11 to 8.6.12 and gave Brian Lin access to these nodes to debug (see the sketch after the ticket link below).
Ref ticket:
https://support.opensciencegrid.org/helpdesk/tickets/7720
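One rough way to watch the symptom, as a sketch only (it assumes the psutil library, which is not part of our standard worker-node setup, and is not an agreed debugging procedure): log the 1-minute load average next to the number of condor-related processes once a minute, to see whether the load spikes correlate with HTCondor activity at all.

    import os
    import time
    import psutil  # assumption: psutil is installed on the node

    def condor_process_count():
        """Count processes whose name starts with 'condor'."""
        count = 0
        for proc in psutil.process_iter(['name']):
            name = proc.info.get('name') or ''
            if name.startswith('condor'):
                count += 1
        return count

    while True:
        load1, _, _ = os.getloadavg()
        print(f"{time.strftime('%H:%M:%S')}  load1={load1:.1f}  "
              f"condor_procs={condor_process_count()}")
        time.sleep(60)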
We upgraded dCache from version 4.2.6 to 4.2.14, and took the chance to upgrade the OS and firmware too. The upgrade on some of the MSU pool nodes did not go well: after upgrading the dCache RPMs, the dCache service would not start on the pool nodes; the head node thought there was already a pool instance running on each pool node, and there were lock files in the dCache pools. We tried various things, including updating/rebooting the ZooKeepers and restarting the dCache services on the head/door nodes; what eventually fixed the problem was reinstalling the dCache RPMs on the pool nodes.
While retiring a storage shelf from one of the UM dCache pool nodes, the wrong virtual disk was accidentally deleted. We could not recover the vdisk and hence lost over 85K files. We opened a JIRA ticket and reported the lost files.
-Wenjing
- 14:05  MWT2 (5m). Speaker: Judith Lorraine Stephen (University of Chicago (US))
- 14:10  NET2 (5m). Speaker: Prof. Saul Youssef (Boston University (US))
- 14:15  SWT2 (5m). Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
Apparently OU compute nodes are failing over from the OU Frontier squid to other squids; we are investigating. We increased the squid cache to 100 GB and don't see any obvious errors, but the failovers continue.
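For reference, the failover order a node follows is encoded in its Frontier client configuration as an ordered list of (proxyurl=...) entries; a small sketch for printing that order is below. The example value is hypothetical and is not OU's actual configuration.

    import os
    import re

    # Hypothetical FRONTIER_SERVER value, for illustration only.
    example = ("(serverurl=http://atlasfrontier.example.org:8000/atlr)"
               "(proxyurl=http://squid.local.example:3128)"
               "(proxyurl=http://squid.backup.example:3128)")

    frontier = os.environ.get("FRONTIER_SERVER", example)
    proxies = re.findall(r"\(proxyurl=([^)]+)\)", frontier)
    print("proxy failover order:")
    for i, url in enumerate(proxies, 1):
        print(f"  {i}. {url}")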
There was a network issue at OneNet in Tulsa starting Nov 2, which caused transfer failures and slower transfer speeds; it was fixed last night and everything is back to normal now.
UTA:
DMZ rework is now complete.
- 14:30 → 14:35  AOB (5m)