Batch Operations Weekly

31/S-023 (CERN)



Gavin McCance
    • 14:00 - 15:00
      Agenda 1h
      • [Luis] Discuss the enforcement of code-review in our repositories/infra.
        • Traceability?: why setting X is applied in hostgroup A?
        • Quality?: why these sub hostgroups are a copy/paste of that one?
        • Standards?: why isn’t this patch merged into master after 6 months?
        • Knowledge sharing?: avoid “golden boy” anti-pattern.
        • Technical debt!
        • Related pointers: CODEOWNERS file (useful in bi).
        • CI/CD Testing?: offloading quality/functionality checks to automated pipelines for container images and helm charts.
        • Needs process for out of band changes.
        • Is there a threshold consideration?
        • Agreed: MRs come with a required approver. This can be unchecked in an emergency (can the runner notify on that?). We need a bot for open requests. Need agreement with HPCers for bi. Protect QA & Master. Features should be tickets if you can't describe them in basically the commit message.
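The "bot for open requests" agreed above could start as a simple filter over merge-request metadata. The following is a minimal sketch, assuming the MR records have already been fetched and are available as dicts with hypothetical `title`, `state` and `created_at` fields. The 14-day threshold is an illustrative assumption; no number was agreed in the meeting.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the meeting did not fix a number.
STALE_AFTER = timedelta(days=14)


def stale_merge_requests(merge_requests, now=None, stale_after=STALE_AFTER):
    """Return still-open MRs older than `stale_after`, oldest first."""
    now = now or datetime.now(timezone.utc)
    stale = [
        mr for mr in merge_requests
        if mr["state"] == "opened" and now - mr["created_at"] > stale_after
    ]
    return sorted(stale, key=lambda mr: mr["created_at"])


def reminder_lines(merge_requests, now=None):
    """Format one reminder line per stale MR, e.g. for a chat or mail digest."""
    now = now or datetime.now(timezone.utc)
    return [
        f"{mr['title']}: open for {(now - mr['created_at']).days} days"
        for mr in stale_merge_requests(merge_requests, now)
    ]
```

How the records are fetched and where the reminders are sent is left open here; both depend on the tooling the team settles on.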
      • BBC-2109: Migration to CC7 (50%):
        • Only ~500 cores (CMS T0) pending in gva_project_004 (networking issues in OpenStack).
      • BBC-2028: Provisioning more 24cpu nodes in Geneva Project 041 (BE-ABP).
        • A total of 2400 cores is desired.
        • Using standard naming convention.
        • Testing Terraform on mixed flavor environments.
        • related:
          • build some new 32-core bigmcore.
          • create a new wholemachine, fullnode hostgroup to consolidate sixteen and bigmcore: a hostgroup that only accepts full-node jobs. This also requires some magic to convert user jobs to the right sizes (18->16, 28->24, 48->32).
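The three example conversions above (18->16, 28->24, 48->32) are all consistent with rounding a request down to the largest available whole-node size. The sketch below illustrates that reading only. The function name is hypothetical, and treating 16, 24 and 32 as the complete set of node sizes is an assumption drawn from those examples, not a confirmed design.

```python
# Whole-node sizes taken from the example conversions in the minutes
# (18->16, 28->24, 48->32); treating them as the full set is an assumption.
NODE_SIZES = (16, 24, 32)


def full_node_size(requested_cpus, node_sizes=NODE_SIZES):
    """Return the largest whole-node size that does not exceed the request.

    Returns None when the request is smaller than the smallest node,
    i.e. the job would not qualify as a full-node job under this rule.
    """
    fitting = [size for size in node_sizes if size <= requested_cpus]
    return max(fitting) if fitting else None


for requested in (18, 28, 48):
    print(requested, "->", full_node_size(requested))
```

Where such a rule would live (submit-side, schedd-side or elsewhere) is not covered by this sketch.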
      • Fifemon probes:
        • New version almost ready. The probes have been running for the last few days, successfully sending data to fifecarbon02 (test graphite instance).
        • condorstats-t0, condorstats-prod, condorstats-test. condorstats-vcpool ?
        • Migration: change hieradata to point to fifecarbon01 (prod instance), stop condorstats01 and run puppet on the new instances.
        • next:
      • Exploring how to use the idle resources in the central managers more efficiently (Ben?)
      • AMS public now allows u_va submission
        • AMS can now exit LSF, going to a 50/50 split between VMs in share and whole nodes in t0
        • Move all resources to CC7? Yes, good point; will confirm.
      • Kubernetes:
        • Do we want or already have chaos tools?
          • We have users.
        • Consul: use case for wtfis (egroups and schedds), workers, terraform states. Anything else?
          • replace other k/v use cases
          • maybe roger / drain state
      • Haggis
        • API wrapper:
          • Implementing the backend API call logic
          • Creating unit-tests
        • Website:
          • New version has a new implementation of a right-side drawer that works on every screen resolution and supports independent scrolling (very useful in the Compute tab)
          • Grafana monitoring of errors and access time per page using the Prometheus client for Go - tested and fully functional!
          • New data-table fixed-headers feature is now available in 2.0 Alpha - currently waiting for the next major release
          • Please feel free to suggest any changes or report bugs here!
      • Kubernetes & Condor: CHEP?