Data Knowledge Catalog Meeting (TPU/NRC KI)

Europe/Moscow
Other Institutes

Vidyo room
Marina Golosova (National Research Centre Kurchatov Institute (RU))
Description
Vidyo room 'DKB'

DKB@ATLAS instance

Current status: OK
New issues:

  • "connection refused" from AMI (daily, for a short period of time);
  • (fixed) special symbols in JSONs (data load to ES).

Apache Airflow: technology overview

The technology provides tools for organizing workflows, which are more about sequential tasks than about operations on data. Compared to the DKB's MInT MS (or to what is planned for its future), the main differences are:

  • Python wrappers (Operators) for stages <-> Java wrappers (ExternalProcessor and ExternalConnectors);
  • not designed to pass data between stages (tasks);
  • no "stream" processing, only "bulk" processing (the next task waits for the previous one to finish all of its operations, rather than for a single record or a batch of records to be processed).
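
The difference between "bulk" and "stream" processing noted above can be illustrated with a minimal plain-Python sketch. It uses neither Airflow nor DKB code: the two stage functions and the `log` list are hypothetical and only show the order in which records are handled in each mode.

```python
def stage_a(record, log):
    """Hypothetical first stage: doubles the record."""
    log.append(("A", record))
    return record * 2


def stage_b(record, log):
    """Hypothetical second stage: increments the record."""
    log.append(("B", record))
    return record + 1


def run_bulk(records):
    """Bulk mode: stage B starts only after stage A has handled every record."""
    log = []
    intermediate = [stage_a(r, log) for r in records]
    result = [stage_b(r, log) for r in intermediate]
    return result, log


def run_stream(records):
    """Stream mode: each record passes through both stages before the next is read."""
    log = []
    stage_a_out = (stage_a(r, log) for r in records)  # lazy generator
    result = [stage_b(r, log) for r in stage_a_out]
    return result, log
```

Both functions return the same results; only the recorded call order differs, which is the property that matters when a later stage should not have to wait for the whole input.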

RPM-based DKB (deployment and workflow improvements)

  1. Workflow: replace TravisCI with our own test node (RPMs + containers):
    • unit tests: allow testing of ATLAS-related functionality (which requires being in the CERN network and/or authorization);
    • system ("black box") tests: more realistic than tests run without real ATLAS metadata.
  2. Deployment:
    • scripted scenario for production version upgrade (stop crons, update config files, etc.);
    • more controlled versioning (w/ changelogs etc. instead of Git logs);
    • only really needed files from the repository (instead of full checkout);
    • much simpler instructions for newcomers ;)

Integration w/ BigPanDA Mon

Development status: 75% done (75% in production)

ToDo:

  • update the API method to work with a different storage scheme (the meaning of the n_events value changed significantly while the integration was being developed).

Batch Processing

  1. First tests show significant improvement in processing time for Stage 095 (AMI).
  2. Notes:
    • for simplicity, the order of messages within a batch won't be preserved: messages that are "fully processed" earlier will be sent to output earlier;
    • different scopes -- different requests to AMI (=> 1 batch != 1 request).
  3. ToDo:
    • carefully plan batch mode development (from the current "non-supervised" to properly supervised worker operation);
    • check if AMI has some more convenient response format than the one used now (complicated "technical" JSON).
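
The note that one batch does not map to one AMI request can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_from_ami` is a hypothetical stand-in for the real AMI call, and the `scope` and `name` message fields are assumed rather than taken from the DKB code.

```python
from collections import defaultdict


def fetch_from_ami(scope, names):
    """Hypothetical placeholder for a single AMI request covering one scope."""
    return {name: {"ami": "metadata for " + name} for name in names}


def process_batch(messages, fetch=fetch_from_ami):
    """Process one batch, issuing one request per distinct scope.

    Returns the enriched messages and the number of requests made.
    Output follows the per-scope grouping, so input order is not preserved.
    """
    by_scope = defaultdict(list)
    for msg in messages:
        by_scope[msg["scope"]].append(msg)

    output = []
    requests = 0
    for scope, group in by_scope.items():
        response = fetch(scope, [m["name"] for m in group])
        requests += 1
        for m in group:
            # Messages of a scope are emitted as soon as that scope is done.
            output.append({**m, **response[m["name"]]})
    return output, requests
```

A batch of three messages spread over two scopes would therefore trigger two requests, and the messages of the first scope encountered would reach the output first.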

Dataflow processing logs

  1. Diversification (mark each stage's logs with the stage ID): 50% done.
  2. Known issues:
    • different datetime formats;
    • no "log level" filtering.
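
One possible way to combine the stage marking with a uniform datetime format and log level filtering is Python's standard `logging` module, sketched below. The logger name prefix `dkb.stage.`, the message layout and the timestamp format are assumptions made for illustration, not the DKB's actual conventions.

```python
import io
import logging


def make_stage_logger(stage_id, stream, level=logging.INFO):
    """Build a logger that marks every record with the stage ID.

    All stages share one timestamp format; records below `level` are dropped.
    """
    logger = logging.getLogger("dkb.stage." + stage_id)
    logger.setLevel(level)
    logger.propagate = False      # keep output confined to our handler
    logger.handlers.clear()       # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s (%(levelname)s) (stage " + stage_id + ") %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger.addHandler(handler)
    return logger


# Usage: only WARNING and above are written for this stage.
buffer = io.StringIO()
log = make_stage_logger("095", buffer, level=logging.WARNING)
log.info("batch received")                 # filtered out
log.warning("connection refused from AMI")  # written to the buffer
```

In a real deployment the stream would be the stage's log file or stderr rather than an in-memory buffer.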

DKB installation/development manual

  1. ToDo:
    • pictures for Build a new instance;
    • Introduction;
    • Execution and support (rename?).
    • 1
      DKB services: status report & news

      Current status of the DKB ATLAS instance.

      Speakers: Marina Golosova (National Research Centre Kurchatov Institute (RU)), Viktor Kotliar (National Research Centre Kurchatov Institute (RU))
    • 2
      Apache Airflow

      Brief overview of the technology

      Speaker: Viktor Kotliar (National Research Centre Kurchatov Institute (RU))
    • 3
      RPM-based DKB
      Speaker: Viktor Kotliar (National Research Centre Kurchatov Institute (RU))
    • 4
      Other business
      • Pull Requests;
      • 1-slide reports for current tasks;
      • Paperwork;
      • ToDo's;
      • AOB
      Speakers: Anastasiia Kaida (National Research Centre Kurchatov Institute (RU)), Marina Golosova (National Research Centre Kurchatov Institute (RU)), Vasilii Aulov (National Research Centre Kurchatov Institute (RU)), Viktor Kotliar (National Research Centre Kurchatov Institute (RU))