# WLCG Site Monitoring Migration

MONIT team - 07.05.2020

---

## Overview

* Proposal to move WLCG Site Monitoring (SAM3) to MONIT
* ==No changes== from WLCG Management / Site Admins perspective
  * Same ETF and ALICE tests
  * Same profiles (aggregation logic)
  * Same PDF report output
  * Just a different infrastructure handling the data

---

## Old SAM3 Dashboards

* Based on old dashboards infrastructure
* Test results from ETF + ALICE
* Enriched using VOFeed data
* Dashboards
  * One dedicated instance per VO
  * Latest view: last test results view
  * Historical view: site and service availability, tests history
* Reports
  * Generated monthly and sent to WLCG
  * All_Sites, Tier1_History, Tier1_Summary, Tier1_VO

---

## New MONIT Site Monitoring

* Based on MONIT infrastructure
* Test results from ETF + ALICE
* Enriched using VOFeed data
* Dashboards
  * In MONIT Grafana WLCG organization
  * Latest view
  * Historical view
* Reports
  * Generated monthly and sent to WLCG
  * All_Sites, Tier1_History, Tier1_Summary, Tier1_VO

---

## New Homepage

:::info
http://cern.ch/monit-wlcg-sitemon
:::

![](https://codimd.web.cern.ch/uploads/upload_ba90805c1a0a4fdaea3e17bb90fbc6e7.png)

---

## Dashboards

* Available from MONIT Grafana WLCG organization
* Latest and Historical views as before
* Working on adding info about recomputation requests
* Proposal for data retention (still being agreed):
  * 1 year for site/service availability from dashboards
    * but PDF reports kept forever
  * 1 year for raw test data from dashboards
    * but HDFS archival for several years

---

## Dashboards

![](https://codimd.web.cern.ch/uploads/upload_a507555b149554f88d8269ab8008c655.png)

---

## Monthly Reports (I)

* Output format
  * Today we have: PDF, JSON, CSV, HTML
  * Is the HTML output still needed? (First feedback was that it is not)
* Federation availability
  * In the old infrastructure, federations with multiple sites ignore the sites without data
  * Leading to 100% federation availability even if one site was not available at all

---

## Monthly Reports (II)

* Unknown status
  * In the old infrastructure it is computed on top of OK
  * Leading to availabilities of >100%
* No ETF data
  * In the old infrastructure it is replaced by OK
  * Sites without testing for a while show close to 100% availability
* All this can be solved by issuing recomputation requests

---

## Profiles Definition

* Managed internally by the MONIT team
* Only the VO critical profiles available so far

> ALICE_CRITICAL
> ATLAS_CRITICAL
> CMS_CRITICAL
> LHCB_CRITICAL

* Good opportunity to clean legacy profiles
* Please open a SNOW ticket to request missing profiles

---

## Recomputation Requests

* Managed by Experiment representatives
* Based on GitLab, one simple JSON doc per request
* Built-in tracking of requests history
* Detailed documentation provided in the repository

```json
{
  "dst_site": "T2-BR_SPRACE",
  "periods": [
    {
      "start_time": "2020-01-01 00:00:00",
      "end_time": "2020-01-06 20:00:00",
      "status": "OK"
    }
  ],
  "vo": "cms"
}
```

---

## Current Status

* :white_check_mark: Test results and downtimes integrated in MONIT
* :white_check_mark: Availability and reliability computed per service and site
* :white_check_mark: Equivalent Grafana dashboards available in WLCG org
* :white_check_mark: Exact same PDF reports generated
* :white_check_mark: Data validated against the old infrastructure
* :large_orange_diamond: Add recomputation requests info in dashboards
* :large_orange_diamond: Stop old dashboards and infrastructure

---

## Migration Plan

* May:
  * 01: New dashboards available for testing/feedback
* June:
  * 01: May draft reports from **Old** and **New** infrastructure
  * 16: May final reports from **Old** infrastructure
* July:
  * 01: June draft reports from **Old** and **New** infrastructure
  * 16: June final reports from **New** infrastructure
  * 31: Stop old dashboards but keep infrastructure running
* August:
  * 31: Retire the old infrastructure (dashboards and reports)

---

## Next Steps

* From MONIT Team
  * Add recomputation requests info in dashboards
  * User support on provided feedback
* From WLCG/Experiments
  * Validate new reports for May and June
  * Provide feedback on dashboards, reports, and tools
  * Ask for profiles that might be missing

---

## Thank You

http://cern.ch/monit-support

---
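
## Backup: Checking a Recomputation Request

Since each recomputation request is a small JSON doc, it can be sanity-checked before it is committed. The sketch below is only an illustration, not the MONIT team's tooling: `check_request` is a hypothetical helper, and the required fields and time format are inferred from the single example in this deck. The authoritative schema is the one documented in the repository.

```python
import json
from datetime import datetime

# Assumption: timestamps follow the format seen in the example request.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def check_request(text):
    """Return a list of problems found in a request doc; empty if none."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        # Catches syntax slips such as trailing or missing commas.
        return [f"invalid JSON: {err}"]
    if not isinstance(doc, dict):
        return ["top-level value must be an object"]

    problems = []
    # Assumption: dst_site and vo are required, non-empty strings.
    for key in ("dst_site", "vo"):
        if not isinstance(doc.get(key), str) or not doc[key]:
            problems.append(f"missing or empty field: {key}")

    periods = doc.get("periods")
    if not isinstance(periods, list) or not periods:
        problems.append("periods must be a non-empty list")
        return problems

    for i, period in enumerate(periods):
        try:
            start = datetime.strptime(period["start_time"], TIME_FORMAT)
            end = datetime.strptime(period["end_time"], TIME_FORMAT)
        except (KeyError, TypeError, ValueError):
            problems.append(f"period {i}: bad or missing start_time/end_time")
            continue
        if start >= end:
            problems.append(f"period {i}: start_time is not before end_time")
        if not period.get("status"):
            problems.append(f"period {i}: missing status")
    return problems


example = """
{
  "dst_site": "T2-BR_SPRACE",
  "periods": [
    {
      "start_time": "2020-01-01 00:00:00",
      "end_time": "2020-01-06 20:00:00",
      "status": "OK"
    }
  ],
  "vo": "cms"
}
"""
print(check_request(example))  # prints []
```

A check like this could run locally or in a GitLab CI job, so that malformed requests are caught before review.

---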
{"type":"slide","slideOptions":{"theme":"cern6"}}