HTTP Deployment Task Force

Europe/Zurich
28/R-015 (CERN)


AGENDA

MEETING OBJECTIVE - DEFINE STEPS REQUIRED IN ORDER TO BEGIN CONTACTING SITES

  • Demo of SAM/Nagios test instance and new HTTP probe
  • Remaining steps for the testing system
    • Automatic integration of topology with SAM/Nagios
    • Opening the firewall
    • Feedback on the probe
      • Functional completeness
        • Ancillary tests for operational debugging
        • Multi-range requests ("vector reads")
      • Classification criteria (OK, CRITICAL, WARNING, UNKNOWN)
        • Timeout values, warning thresholds (currently 10s)
    • SAM & visualisation
      • What is needed here?
      • What historical info is required?
      • Allow probe resubmission by sites?
  • Site-oriented documentation
    • explanation of TF
    • statement from experiments about priority and criticality
    • explanation of tests
    • links to docs from the storage providers
    • contact point for questions
    • faq/knowledgebase
    • instructions for running probe locally
  • Ticketing system - GGUS support unit?
  • Validate system using some volunteer sites
    • Can T1s suggest some names?

IF THERE IS TIME REMAINING:

  • Review of outstanding issues in the "How to support HTTP for WLCG" document, especially those pertaining to the probe
    • https://twiki.cern.ch/twiki/bin/view/LCG/HTTPTFStorageRecommendations
  • Transfer monitoring - feedback from the meeting of 15th July

ADDITIONAL MATERIAL

 

HTTP Deployment Task Force Minutes

Wednesday, 7 October 2015 from 16:00 to 18:00 (Europe/Zurich)

 

Present

Enrico Vianello
Dave Dykstra
Hung-Te Lee
Xavier Mol
Georgios Bitzes
Sam Skipsey
Mario Lassnig
Oliver Keeble
Fabrizio Furano
Cédric Serfon
Christophe Haen
Marian Babik
Andrea Ceccanti

Actions

Georgios – Doc on how to install and run the probe locally.
           Probe updates:
             increased verbosity
             failed-PUT logic

Oliver – GGUS support unit

Cédric – Topology feed → Marian

Marian – Move to pre-production and opening of the firewall

Christophe – LHCb statement for sites

Cédric – ATLAS statement for sites

Sam – Contact GridPP sites and request validation volunteers

Summary

> Automatic integration of topology with SAM/Nagios

LHCb – Already agreed with Stefan: the VO feed will be used.

ATLAS – Cédric to pass Marian a code snippet that extracts the relevant list from AGIS.

 

> Opening the firewall

This was not discussed, but further information is added here. As this is a sensitive service (it schedules jobs with powerful credentials), the firewall will only be opened when the service goes to pre-production. This requires a security audit, puppetisation of the service, some scale testing and valid topology feeds from the VOs.

 

> Functional completeness of probe & ancillary tests for operational debugging

The principle to be adopted is to make the probe as verbose as possible right now and then to trim if necessary. The output should include full request and response headers.

A test to show SSL info would be good for debugging purposes.
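
As an illustration of the verbosity discussed above, here is a minimal Python sketch, not the probe's actual code. The function names format_exchange and tls_summary are hypothetical. The TLS helper uses only the standard ssl and socket modules and assumes the endpoint's CA is in the default trust store, which may not hold for grid endpoints.

```python
import socket
import ssl


def format_exchange(method, url, request_headers, status, reason, response_headers):
    """Render one HTTP exchange, with all headers, as plain text for probe output."""
    lines = ["> %s %s" % (method, url)]
    lines += ["> %s: %s" % (k, v) for k, v in request_headers.items()]
    lines.append("< %s %s" % (status, reason))
    lines += ["< %s: %s" % (k, v) for k, v in response_headers.items()]
    return "\n".join(lines)


def tls_summary(host, port=443, timeout_s=10.0):
    """Connect once and report the negotiated TLS version, cipher and peer certificate fields.

    Assumption: the default trust store can verify the endpoint; a real probe
    would likely need to point the context at the grid CA directory instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            cert = tls.getpeercert()
            return {
                "protocol": tls.version(),
                "cipher": tls.cipher()[0],
                "subject": cert.get("subject"),
                "issuer": cert.get("issuer"),
                "not_after": cert.get("notAfter"),
            }
```

Keeping the formatting separate from the network calls makes it easy to trim the output later without touching the tests themselves.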

 

> Multi-range requests ("vector reads")

Unnecessary to test at this stage.

 

> Classification criteria (OK, CRITICAL, WARNING, UNKNOWN), Timeout values, warning thresholds (currently 10s)

All OK here.
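
For reference, a minimal Python sketch of how a test result could be mapped onto the four states. The numeric values are the standard Nagios plugin exit codes. The function classify and its parameters are hypothetical, and the sketch assumes that the 10s figure is the warning threshold; the timeout is left as a parameter because the minutes do not record a value for it.

```python
import enum


class Status(enum.IntEnum):
    """Standard Nagios plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def classify(succeeded, elapsed_s, warn_after_s=10.0, timeout_s=None):
    """Map one test outcome to a Nagios status.

    succeeded: True or False for a test that ran, None if it could not be run.
    warn_after_s: assumed to be the 10s warning threshold from the agenda.
    timeout_s: hypothetical hard limit; no value is given in the minutes.
    """
    if succeeded is None:
        return Status.UNKNOWN
    if not succeeded:
        return Status.CRITICAL
    if timeout_s is not None and elapsed_s >= timeout_s:
        return Status.CRITICAL
    if elapsed_s > warn_after_s:
        return Status.WARNING
    return Status.OK
```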

 

> SAM & visualisation

SAM will provide "standard" visualisation in the future, but the check_mk interface will remain and is frequently the preferred path for debugging. Check_mk is currently in preview but will be puppetised and promoted to pre-production. A SAM3 visualisation will then be available via the "preprod SAM3".

 

> Allow probe resubmission by sites?

Yes, this is and will remain possible.

 

> Site-oriented documentation

Agreed that the probe results should incorporate a link to a TF FAQ to help sites understand what is expected and assist in fixing problems.

This doc should contain:

        explanation of TF
        statement from experiments about priority and criticality
        explanation of tests
        links to docs from the storage providers
        contact point for questions
        faq/knowledgebase
        instructions for running probe locally

Statements from ATLAS and LHCb will be supplied.

 

> Ticketing system - GGUS support unit?

Agreed to create a GGUS Support Unit to handle communication with sites. Action Oliver.

 

> Validate system using some volunteer sites

KIT volunteered.
Sam to put the proposal to the GridPP storage folks.
Enrico will test the script against StoRM.

 

> Code Review

A code review of the probe is required because the proxy used by SAM is a powerful one. The review must be completed before any production use.

 

> What to do if PUT fails

Currently the probe's first action is to PUT a file. If this fails, testing is abandoned and all other test results are UNKNOWN. This behaviour should change so that sites whose data is readable over HTTP remain visible. Thus, if the PUT fails, the probe should first check for the existence of a standard file in the root of the VO space called

"<VO_IN_CAPS>_HTTPTFtest.txt"

and, if that file is absent, attempt to put a test file via SRM.

Note - subsequent discussion indicated that doing a "stat" and "ls" on the top level directory may also be a good solution.
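
The fallback order described above can be sketched in Python as follows. This is an illustration of the decision logic only, not the probe's implementation. The names standard_file_name and obtain_test_file are hypothetical, and the three callables stand in for the real HTTP PUT, HTTP existence check and SRM upload, none of which is specified in these minutes.

```python
def standard_file_name(vo):
    """Build the agreed file name, e.g. 'atlas' gives 'ATLAS_HTTPTFtest.txt'."""
    return "%s_HTTPTFtest.txt" % vo.upper()


def obtain_test_file(vo, http_put, http_exists, srm_put):
    """Decide where a readable test file comes from.

    http_put():        hypothetical callable, True if the HTTP PUT succeeded.
    http_exists(name): hypothetical callable, True if the standard file is
                       present in the root of the VO space.
    srm_put():         hypothetical callable, True if a test file could be
                       written via SRM.
    Returns 'http-put', 'standard-file', 'srm-put', or None if nothing worked.
    """
    if http_put():
        return "http-put"
    if http_exists(standard_file_name(vo)):
        return "standard-file"
    if srm_put():
        return "srm-put"
    return None
```

With a source other than None, the read tests can still run and report real results; only the PUT test itself would be reported as failed.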

 

> Review of outstanding issues in the "How to support HTTP for WLCG" document

Andrea Ceccanti noted that StoRM does not yet support third-party copy with HTTP; this can be discussed in a future TF meeting if the experiments wish.

 

> Transfer monitoring - feedback from meeting 15th July

Andrea Ceccanti reported substantial agreement from the StoRM perspective on the proposed monitoring data path: udp→collector→dashboard. 
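
To make the udp→collector step of that path concrete, here is a minimal Python sketch using only the standard library. The JSON payload and the helper names send_report and receive_report are assumptions made for illustration; the actual message format was not discussed in this meeting.

```python
import json
import socket


def send_report(sock, collector_addr, report):
    """Storage side: send one transfer report as a single JSON datagram (fire and forget)."""
    sock.sendto(json.dumps(report).encode("utf-8"), collector_addr)


def receive_report(sock, max_bytes=65535):
    """Collector side: read one datagram and decode it, ready to forward to a dashboard."""
    payload, _sender = sock.recvfrom(max_bytes)
    return json.loads(payload.decode("utf-8"))
```

UDP keeps the storage service from blocking on the monitoring path, at the cost of possibly losing individual reports.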