Some background on the Ceilometer installation, from Stefano Zilli
We have mainly two use cases for Ceilometer data at the moment: one is accounting and the other is orchestration. For accounting we keep a subset of metrics (instance, cpu, cpu_util, vcpu, memory, network.*) for 3 months. At the moment we don't store anything related to disks, since our accounting team does not need it. In addition we have a non-standard metric, which is cpu normalized to the HEPSPEC06 value of the hypervisor. For orchestration we decided to start by keeping only cpu_util; in this case the data is stored only for a few days.
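To make that concrete, the two meter subsets could be expressed as pipeline.yaml sources roughly like the sketch below (the names are placeholders and the normalized-cpu transformer is left out; the matching sinks and publishers are sketched under Compute and Collectors further down):

    sources:
        - name: accounting_source
          interval: 600                  # 10-minute polling, see below
          meters:
              - "instance"
              - "cpu"
              - "cpu_util"
              - "vcpu"
              - "memory"
              - "network.*"
          sinks:
              - accounting_sink
        - name: orchestration_source
          interval: 600
          meters:
              - "cpu_util"
          sinks:
              - orchestration_sink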
As backends we are using:
- accounting:
  - HBase: I don't have much to say about this database since we don't manage it ourselves, but I think the data volume was about 20 TB (this is outdated; after removing the disk* metrics I think we are at around 10 TB). This is the total for the 3 months. One thing we noticed is that the data is probably stored in a non-optimal way: writes always land on the same region, so we end up with a lot of region splits.
- orchestration:
  - MySQL: we store only the alarms here. This way we can change the metric storage whenever we want without having to move the alarms around (see the config sketch after this list).
  - MongoDB: we have a small 3-node replica set. For the metric collection we use a capped collection; this way we can define a maximum size that is smaller than the RAM of the VM and hopefully keep most of the database in memory. It also lets us avoid TTL-based expiry, so the data on disk does not grow indefinitely (this is something I really don't like about Mongo ... disk space is never released). We are using the new WiredTiger storage engine; so far it looks better than the default one. For resources it's not possible to use a capped collection, since those records are updated and may change size.
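If I remember the option names correctly, the alarms/metering split is done with the per-purpose connection options in ceilometer.conf; treat the exact option names, hosts and credentials below as an assumption on my part rather than our literal config:

    [database]
    # alarms in MySQL, samples in MongoDB (hypothetical hosts and credentials)
    alarm_connection = mysql://ceilometer:secret@mysql-host/ceilometer
    metering_connection = mongodb://mongo1,mongo2,mongo3/ceilometer?replicaSet=ceilometer

The capped collection itself is plain MongoDB, created up front with something like the line below (size is in bytes and is chosen to fit in the VM's RAM; take the collection name "meter" as an example, it depends on the storage driver):

    db.createCollection("meter", { capped: true, size: 10 * 1024 * 1024 * 1024 })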
Compute and Collectors:
Not much to say. The polling interval is 10 minutes. We have two publishers: notifier for accounting and udp for orchestration. We have a separate RabbitMQ deployment for Ceilometer, since we don't want to impact the Nova or Cinder rabbits in case of incidents. For every cell we deploy a set of nova APIs connected to that cell's database; this way we spread the load generated by the compute agent requests across multiple databases instead of hitting only the top cell one.
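To show the two publisher paths concretely, the sinks half of the pipeline.yaml sketch above could look roughly like this (the udp host and port are placeholders and have to match what the orchestration collector listens on):

    sinks:
        - name: accounting_sink
          transformers:
          publishers:
              - notifier://
        - name: orchestration_sink
          transformers:
          publishers:
              - udp://ceilometer-udp-collector:4952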
We still need some collector/agent-notification instances listening on the other rabbits for the notifications coming from the other components. I think in Kilo it will be possible to have the notification agent listen to multiple rabbits and republish to a different one.
API:
Running under Apache mod_wsgi. We have separate APIs for the two use cases.
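Each API is a fairly standard mod_wsgi deployment. A generic sketch is below (the paths, port and process counts are placeholders, not our actual config; the second use case would get its own vhost on another port with its own configuration):

    Listen 8777
    <VirtualHost *:8777>
        WSGIDaemonProcess ceilometer-api processes=2 threads=10 user=ceilometer
        WSGIProcessGroup ceilometer-api
        WSGIScriptAlias / /var/www/cgi-bin/ceilometer/app.wsgi
        <Directory /var/www/cgi-bin/ceilometer>
            Require all granted
        </Directory>
    </VirtualHost>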
Central agent:
We have 2 agents running, coordinated via ZooKeeper. We don't collect metrics for the hypervisors themselves, mainly because we don't need them and ... the burst of requests is insane.
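The coordination is just the tooz backend setting in ceilometer.conf, something like the snippet below (the ZooKeeper host is a placeholder); the same setting covers the alarm evaluator/notifier coordination mentioned next:

    [coordination]
    backend_url = zookeeper://zookeeper-host:2181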
Alarms:
We have multiple instances of the alarm notifier and the alarm evaluator running, coordinated via ZooKeeper as well.