Analytics WG meeting
Participants:
Dirk, Christian, Carolina, Raul, Luca, Vag, Rainer, Ulrich, Domenico, Tony, Bernd, Thomas, Valentin, Pedro
Minutes & News:
- The MR/Spark examples from last meeting have been reimplemented in the respective other language to compare performance. Results will follow once the hadoop setup is fully operational again.
- Planning creation of a web interface for simple extracts from hadoop awg repository
- Started to create monthly/quarterly report on EOS statistics
- We plan to have a repository/AFS folder to store packages/libraries (e.g. R, Python) for common use
- Dirk found an R module for geoip look-up (e.g. country, city, organisation) without using DNS, contact him if interested
- The RStudio web interface on our large memory server is now (partly) functional. Some remaining problems concerning kerberos/AFS reauthentication are being discussed with RStudio support.
- CHEP proposal concerning the analysis efforts of IT has been accepted for oral presentation.
Action Item Review
- perfsonar result collection -> proposed date: end Feb
- eos: EOS ops and itmon still iterating on technical approach - new date mid-Feb
- batch: not yet a priority for batch team - new date mid-Feb
- xroot data in HDFS: atlas data available, cms expected - date for completing both mid-Feb
- Eos & dashboard data are now exported to hadoop and available in the awgrepo tool (action closed)
- steal time metric has been implemented, next step brief testing
-> new date: mid-Feb
- new action: lemon (subset) data in repo - end Feb
- new action: eos monthly/quarterly summary - end Feb
Hadoop Problems
- Incident on the Hadoop cluster, 364 (out of 780 thousand) files corrupted
- most are flume tmp files
=> itmon will check if flume successfully retransmitted
- some already concatenated itmon files lost (lemon/syslog)
=> likely not a problem for analytics as the fraction < 0.4 %
==> but analysis code needs to be able to handle missing data
=> corrupted hdfs files may need to be deleted/moved (after a check)
==> responsibility of data owner!
- Status of root cause analysis (slides Rainer)
=> h/w vs s/w problem? support from cloudera?
=> need to understand root cause and alerting thresholds to prevent
future risk; should follow up in a future meeting
In the meantime:
- new hardware (nodes still on way)
- investigating some shorter-term HW loans from eos
=> may want to isolate ceilometer hbase (risk of affecting AI)
==> ceilometer data is not directly usable by analytics (Ulrich does extracts)
=> but need to make sure we do not prematurely drop raw data which is deemed not (yet) useful
- should we separate the raw data area from processed?
=> this could simplify potential backups to castor
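
Note on the point above that analysis code needs to be able to handle missing data: a minimal Python sketch of one possible approach, assuming a metric is aggregated per day and that lost files show up as days with no value. The function names, the daily granularity and the sample values are illustrative assumptions and are not taken from the awgrepo tool or any existing analysis code.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

Series = List[Tuple[date, Optional[float]]]


def daily_series(records: Dict[date, float], start: date, end: date) -> Series:
    """Return one (day, value) pair per day in [start, end].

    Days without a record get None instead of being silently skipped,
    so gaps stay visible to later processing steps.
    """
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [(d, records.get(d)) for d in days]


def mean_with_coverage(series: Series) -> Tuple[Optional[float], float]:
    """Average over the days that have data and report the covered fraction."""
    present = [v for _, v in series if v is not None]
    coverage = len(present) / len(series) if series else 0.0
    mean = sum(present) / len(present) if present else None
    return mean, coverage


# Hypothetical sample: one day (2 Jan) is missing, e.g. because its file was lost.
records = {
    date(2015, 1, 1): 10.0,
    date(2015, 1, 3): 14.0,
    date(2015, 1, 4): 12.0,
}
series = daily_series(records, date(2015, 1, 1), date(2015, 1, 4))
mean, coverage = mean_with_coverage(series)
print(f"mean={mean}, coverage={coverage:.2f}")
```

Reporting the coverage next to the result lets a reader judge whether a gap of the size seen in this incident matters for a given summary, rather than having the code fail or silently average over fewer days.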