Analytics WG meeting

Europe/Zurich
31/S-023 (CERN)


Participants:

Dirk, Christian, Carolina, Raul, Luca, Vag, Rainer, Ulrich, Domenico, Tony, Bernd, Thomas, Valentin, Pedro

Minutes & News:

- The MR/Spark examples from the last meeting have each been reimplemented in the other framework to compare performance. Results will follow once the Hadoop setup is fully operational again.

- Planning the creation of a web interface for simple extracts from the Hadoop AWG repository

- Started to create monthly/quarterly report on EOS statistics

- We plan to have a repository/AFS folder to store packages/libraries (e.g. R, Python) for common use

- Dirk found an R module for GeoIP look-up (e.g. country, city, organisation) without using DNS; contact him if interested

- The R-studio web interface on our large memory server is now (partly) functional. Some remaining problems concerning Kerberos/AFS reauthentication are being discussed with rstudio support.

- CHEP proposal concerning the analysis efforts of IT has been accepted for oral presentation.

Action Item Review

  - perfsonar result collection -> proposed date: end Feb
  - eos: EOS ops and itmon still iterating on technical approach. new date mid Feb
  - batch: not yet a priority for batch team - new date mid-Feb
  - xroot data in HDFS: atlas data available, cms expected - date for completing both mid-Feb
  - Eos & dashboard data are now exported to hadoop and available in the awgrepo tool (action closed)
  - steal time metric has been implemented, next step brief testing
     -> new date: mid-Feb
  - new action: lemon (subset) data in repo - end Feb
  - new action : eos monthly/quarterly summary - end Feb

Hadoop Problems

- Incident on the Hadoop cluster: 364 files (out of 780 thousand) corrupted
   - most are flume tmp files
   => itmon will check if flume successfully retransmitted
   - some already concatenated itmon files lost (lemon/syslog)
      => likely not a problem for analytics as the fraction < 0.4 %
      ==> but analysis code needs to be able to handle missing data
      => corrupted hdfs files may need to be deleted/moved (after a check)
      ==> responsibility of data owner! 

  - Status of root cause analysis (slides Rainer)
    => h/w vs s/w problem? support from cloudera?
    => need to understand root cause and alerting thresholds to prevent
       future risk and should have follow up in future meeting

  In the meantime:
  - new hardware (nodes still on the way)
  - investigating some shorter term HW loans from eos
    => may want to isolate ceilometer hbase (risk of affecting AI)
    ==> ceilometer data is not directly usable by analytics (Ulrich does extracts)
    => but need to make sure we do not prematurely drop raw data which is deemed to be not (yet) useful

  - should we separate the raw data area from processed?
    => this could simplify potential backups to castor
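
Regarding the point above that analysis code needs to be able to handle missing data, a minimal sketch in Python follows. It is only an illustration under stated assumptions: it reads plain local files rather than HDFS, and the function names (corrupted_fraction, read_available) and the skip-and-record policy are hypothetical, not something agreed in the meeting. The first helper just redoes the arithmetic for the overall incident numbers (364 out of 780 thousand files, roughly 0.047 %), which is not the same as the lemon/syslog fraction quoted above.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def corrupted_fraction(corrupted: int, total: int) -> float:
    """Return the share of corrupted files as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * corrupted / total


def read_available(paths: Iterable[Path]) -> Tuple[List[str], List[Path]]:
    """Read every readable file; collect unreadable ones instead of failing."""
    lines: List[str] = []
    skipped: List[Path] = []
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8") as handle:
                lines.extend(handle.read().splitlines())
        except (OSError, UnicodeDecodeError):
            # Hypothetical policy: record the gap and carry on, so one
            # missing or damaged input does not abort the whole analysis.
            skipped.append(path)
    return lines, skipped


if __name__ == "__main__":
    # Overall incident numbers from the minutes.
    print(f"{corrupted_fraction(364, 780_000):.3f} % of files corrupted")
```

Returning the skipped paths alongside the data lets a report state how much input was missing, instead of silently producing results on a reduced sample.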

 

    • 14:00 - 14:05
      Minutes & News 5m
      Speaker: Dirk Duellmann (CERN)
    • 14:05 - 14:15
      Review of open action items 10m
      list of actions
    • 14:15 - 14:25
      Hadoop update 10m
      Speaker: Rainer Toebbicke (CERN)
      Slides
    • 14:25 - 14:55
      Pilot project for CMS computing data-mining 30m
      Speaker: Valentin Y Kuznetsov (Cornell University (US))
      Slides