BDII DEPLOYMENT SUMMARY

 

AP:

For our BDII in this region we used to have a round-robin DNS configuration with two BDIIs. However, we found that one of our servers was too slow to provide reliable performance for IS clients to query. So instead of two single-CPU servers, our BDII service now runs on an SMP blade server:

* 2 Xeon 3.0 GHz, 4GB Memory

We are planning to return to a redundant BDII configuration after our next server procurement during the latter half of this year.

 

A few sites in our region also run their own BDII services, namely KEK, Tokyo, KISTI, LCG_KNU and PAKGRID. The majority of the remaining sites use ASGC's BDII services.

 

DECH:

Top Level BDII Situation in ROC DECH

In the Germany/Switzerland region we have 5 top-level BDIIs in place. As ROC we have encouraged our site managers many times to use a BDII other than the CERN one, and we will discuss the regional situation in one of our next regional meetings to improve it.

 

BDIIs at

--------

DESY-HH

grid-bdii0.desy.de

grid-bdii1.desy.de

 

DESY-ZN

lcg-bdii.ifh.de

 

FZK

bdii-fzk.gridka.de

 

Uni Freiburg

bdii.bfg.uni-freiburg.de

 

 

Usage:

------

* CSCS, GSI, ITWM, SCAI

lcg-bdii.cern.ch

 

* MPPMU, Uni Wuppertal

bdii-fzk.gridka.de (lcg-gridka-bdii.fzk.de)

 

* RWTH-Aachen, Uni Dortmund, Uni Karlsruhe

grid-bdii.desy.de

 

CE:

BDII setup in CE:

 

In CE we put the regional BDII into production on 2 October 2006. All 24 production sites, with ca. 1700 CPUs, are using it.

The regional BDII is at bdii.cyf-kr.edu.pl, which in fact resolves to two IP addresses (via DNS "A" records: 149.156.9.24 and 161.53.0.229). The BDII services are thus hosted in Poland and Croatia, which gives us both load balancing and failover.

Load balancing comes from DNS returning the IP addresses in a different order on each query, so queries alternate between the two machines.

Failover comes from the LCG GFAL library: given two IP addresses, it tries to access the first one and, if that fails, transparently tries the second one. So if the first host, e.g. in Poland, is not available, the other one in Croatia is used.
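This failover behaviour can be sketched as follows (a minimal illustration, not the actual GFAL code; the query function and the "down" host are stand-ins):

```python
def query_with_failover(addresses, query):
    """Try `query` against each address in turn; return the first success."""
    last_error = None
    for addr in addresses:
        try:
            return query(addr)
        except ConnectionError as exc:
            last_error = exc  # host unreachable: fall through to the next one
    raise last_error or ConnectionError("no addresses supplied")

def fake_query(addr):
    # Stand-in for a real LDAP query to port 2170; here the first host is "down".
    if addr == "149.156.9.24":
        raise ConnectionError("timed out")
    return f"result from {addr}"

# DNS may return the addresses in either order; the surviving host answers.
print(query_with_failover(["149.156.9.24", "161.53.0.229"], fake_query))
# → result from 161.53.0.229
```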

 

HW/SW Setup:

- Polish machine: dual-CPU Intel(R) Xeon(TM) 2.40 GHz, 1 GB RAM, ATA100 IDE HDD, SLC3 3.0.7, running solely the top-level BDII.

 

- Croatian machine: AMD Opteron(TM) Processor 248, 2.2 GHz, 1024 KB L2 cache, mirrored SCSI hard disks (Sun Fire X4100), Scientific Linux SL release 4.3 (Beryllium), running solely the top-level BDII.

 

- Experience

 

We have experienced no major functionality problems with the machines (a few "connection timeouts" per week in total).

 

Having the second machine allows short maintenance work to be carried out transparently: no notification is needed, sites are not even aware of the maintenance, and the other machine handles all queries.

 

- Some technical details for a single machine:

- network traffic peaks at about 500 kb/s incoming (BDII updates) and 450 kb/s outgoing.

- occasionally the load rises to 4-5; the cause has not been investigated yet.

 

 

Italy:

We have 6-7 top-level BDIIs in Italy, but only one is used by the Italian sites (egee-bdii.cnaf.infn.it). This is a DNS alias for 2 machines: egee-bdii-05.cnaf.infn.it and egee-bdii-06.cnaf.infn.it. They are both SunFire V20z, with AMD Opteron(tm) Processor 252, 4 GB RAM and a 60 GB HD in RAID 1, and they are monitored with Nagios and Lemon. The memory usage is ~1.5 GB and the CPU load is ~45% (for each machine).

All Italian WNs point to this BDII (LCG_GFAL_INFOSYS=egee-bdii.cnaf.infn.it:2170). The BDII site list is autogenerated and available at http://grid-it.cnaf.infn.it/fileadmin/bdii/egee-all-sites.conf (all production sites in GOCDB plus other Italian sites); no FCR is used.

We experience no functionality problems. In case of a scheduled downtime at CNAF, the DNS alias is redirected to a top-level BDII at INFN-PADOVA with the same site list but less powerful hardware. We also have other top-level BDIIs, pointed to only by some RBs/WMS, with different scopes (only Italian sites, certification sites, VO-oriented with FCR, etc.).
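Clients locate the top-level BDII through the LCG_GFAL_INFOSYS variable. A minimal sketch of parsing that host:port value (the helper name is ours, and defaulting to 2170 is an assumption based on the standard BDII port):

```python
import os

def parse_infosys(value):
    """Split a host[:port] LCG_GFAL_INFOSYS value; assume port 2170 if absent."""
    host, _, port = value.partition(":")
    return host, int(port) if port else 2170

# The way a WN in this region would be configured:
os.environ["LCG_GFAL_INFOSYS"] = "egee-bdii.cnaf.infn.it:2170"
print(parse_infosys(os.environ["LCG_GFAL_INFOSYS"]))
# → ('egee-bdii.cnaf.infn.it', 2170)
```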

 

SEE:

In the SEE ROC we have set up round-robin DNS for the alias bdii.egee-see.org, and most of the sites in the region use that one as their top-level BDII. This alias points to bdii.phy.bg.ac.yu, bdii101.grid.ucy.ac.cy and bdii.athena.hellasgrid.gr.
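For illustration, listing every A record behind such an alias can be done with the standard library alone (a sketch; `resolve_all` is our own helper name):

```python
import socket

def resolve_all(hostname, port=2170):
    """Return every IPv4 address a name resolves to (a round-robin DNS
    alias publishes several A records, so clients see more than one)."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Deduplicate while preserving the order the resolver returned.
    seen, addrs = set(), []
    for *_, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in seen:
            seen.add(ip)
            addrs.append(ip)
    return addrs

# e.g. resolve_all("bdii.egee-see.org") would list the hosts behind the
# alias; "localhost" works offline for a quick check.
print(resolve_all("localhost"))
```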

 

SWE:

3 top-level BDIIs:

bdii.pic.es

ii02.lip.pt

bdii-egee.bifi.unizar.es

 

LIP-Lisbon has one top-level BDII with the following setup:

Machine: Sun X2100

CPU: Dual Core AMD Opteron(tm) Processor 175 @ 2211.351 MHz

Memory: 2 GB

 

NE:

We have two top-level BDIIs, one at PDC and one at SARA, both of which we intend to support for general use.

Here are some details about these BDIIs:

 

PDC:

Ours is running on a 2.8 GHz Intel P4 processor with an 800 MHz front-side bus, 2 GB RAM and an 80 GB 7200 rpm IDE hard disk. The motherboard is a Supermicro P4SCE. We do not have any failover provisions. As for the future, I don't think the hardware specifications of SweGrid II are finalized yet.

 

SARA:

At SARA we have a top level BDII running (mu33.matrix.sara.nl). Currently this is a Xeon dual processor system without any failover provisions (in case of failure we will have to move the service to another node).

We are in the process of buying new hardware for various core services, and the top-level BDII will be one of the services hosted on this new hardware. For the reliability of these services we will use both redundant hardware and software solutions (e.g. HA Linux), depending also on the kind of service.

 

UKI:

The top-level BDII is a dual-CPU 2.66 GHz Xeon with 2 GB memory (about 1 GB used); periods of very high (user + system) CPU usage are frequently seen.  The immediate plan is to deploy a second box, use round-robin DNS, and of course monitor the situation.

 

Our observation over the last month or so is that timeouts are affecting the SAM CE tests (and we have also received complaints from users).

 

Sometimes the load is predominantly local (in particular RBs, VO Box, UIs).  Sometimes the load is mostly from the Tier-2s.  I've attached a file that is the output from a script that parses the bdii-fwd.log files for the last 30 days and groups the connections by host and by site.  In summary, the sites/hosts with the most connections per day for the last 30 days are:

 

Connects  Most active site            Connects  Most active host
--------  --------------------------  --------  -------------------------
   23472  gridpp.rl.ac.uk                15855  dgc-grid-44.brunel.ac.uk
   98441  tier2.hep.manchester.ac.uk     18165  dgc-grid-44.brunel.ac.uk
   40805  gridpp.rl.ac.uk                22404  lcgui0360.gridpp.rl.ac.uk
   19934  tier2.hep.manchester.ac.uk     13912  lcgrb02.gridpp.rl.ac.uk
   15736  gridpp.rl.ac.uk                13666  lcgrb02.gridpp.rl.ac.uk
   40002  gridpp.rl.ac.uk                13972  lcgrb02.gridpp.rl.ac.uk
   48734  gridpp.rl.ac.uk                16480  lcgui0361.gridpp.rl.ac.uk
   44535  gridpp.rl.ac.uk                17820  fal-pygrid-19.lancs.ac.uk
   52279  tier2.hep.manchester.ac.uk     22289  fe01.esc.qmul.ac.uk
   35565  gridpp.rl.ac.uk                23370  fe01.esc.qmul.ac.uk
  140349  tier2.hep.manchester.ac.uk     37503  fal-pygrid-19.lancs.ac.uk
  116356  tier2.hep.manchester.ac.uk     35214  fe01.esc.qmul.ac.uk
   61043  gridpp.rl.ac.uk                50382  fe01.esc.qmul.ac.uk
   53576  gridpp.rl.ac.uk                28610  lcgvo0339.gridpp.rl.ac.uk
   18538  gridpp.rl.ac.uk                12250  lcgvo0339.gridpp.rl.ac.uk
   21435  gridpp.rl.ac.uk                12543  lcgui0357.gridpp.rl.ac.uk
   29208  gridpp.rl.ac.uk                 9580  fe01.esc.qmul.ac.uk
   25929  gridpp.rl.ac.uk                 2985  dgc-grid-44.brunel.ac.uk
   37031  gridpp.rl.ac.uk                 8366  gfm01.pp.rhul.ac.uk
   23084  tier2.hep.manchester.ac.uk     18649  gfm01.pp.rhul.ac.uk
   26313  gridpp.rl.ac.uk                17404  gfm01.pp.rhul.ac.uk
   40069  gridpp.rl.ac.uk                15918  lcgrb02.gridpp.rl.ac.uk
   57020  gridpp.rl.ac.uk                13742  fe01.esc.qmul.ac.uk
   61542  gridpp.rl.ac.uk                16961  lcgrb02.gridpp.rl.ac.uk
   41385  gridpp.rl.ac.uk                17193  lcgrb02.gridpp.rl.ac.uk
   63603  gridpp.rl.ac.uk                20712  svr031.gla.scotgrid.ac.uk
   76584  gridpp.rl.ac.uk                24600  lcgrb02.gridpp.rl.ac.uk
   60533  gridpp.rl.ac.uk                19646  lcgvo0339.gridpp.rl.ac.uk
   55622  gridpp.rl.ac.uk                21346  fe01.esc.qmul.ac.uk
  131099  gridpp.rl.ac.uk                56356  lcgvo0339.gridpp.rl.ac.uk
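A sketch of the kind of script mentioned above, counting connections per client host (the log format shown is hypothetical; the real bdii-fwd.log layout may differ):

```python
from collections import Counter

def top_connections(log_lines):
    """Count connections per client host from forwarder log lines and
    return them most-active first. Assumes one connection per line with
    the client host in the third whitespace-separated field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts.most_common()

sample = [
    "20070218 12:00:01 fe01.esc.qmul.ac.uk CONNECT",
    "20070218 12:00:02 lcgrb02.gridpp.rl.ac.uk CONNECT",
    "20070218 12:00:03 fe01.esc.qmul.ac.uk CONNECT",
]
print(top_connections(sample))
# → [('fe01.esc.qmul.ac.uk', 2), ('lcgrb02.gridpp.rl.ac.uk', 1)]
# Grouping by site is the same idea with the host mapped to its domain suffix.
```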

 

 

 

 

 

CERN:

Total number of BDIIs: 12 (cluster gridbdii).

 

8 top-level BDIIs behind the alias lcg-bdii.cern.ch (subcluster lcg-bdii). All of them are using FCR.

 

2 site-level BDIIs behind the alias prod-bdii.cern.ch (subcluster prod-bdii). FCR is not used.

 

2 top-level BDIIs behind the alias sam-bdii.cern.ch (subcluster sam-bdii). All of them are using FCR.

 

 

Present status: http://lxb2007.cern.ch/lcgbdii/stats_lcg-bdii.html

Statistics collected from lcg-bdii.cern.ch (Sunday 18 February 2007)

(bdii101 - bdii102 - bdii105 - bdii106 - bdii107 - bdii108 - bdii111 - bdii112)

Connections/day - connections/s:

18/02 : 1579106 18.2
17/02 : 1544938 17.8
16/02 : 1901918 22.0
15/02 : 1739994 20.1
14/02 : 1386148 16.0
13/02 : 1382117 15.9
12/02 : 1650772 19.1
11/02 : 1715956 19.8
10/02 : 2048746 23.7
09/02 : 2047570 23.6

Top-10 connections/host (today):

61902 : lcgvm.triumf.ca
59155 : egee-rb-01.mi.infn.it
50606 : fe01.esc.qmul.ac.uk
49843 : bigmac-fw.physics.utoronto.ca
48037 : rb101.cern.ch
27916 : nat-1-out-1.lnl.infn.it
25948 : rb114.cern.ch
24813 : rb123.cern.ch
16522 : nat-outside-fzk.gridka.de
15357 : gfm01.pp.rhul.ac.uk

Top-10 connections/domain (today):

326068 : .ch
238222 : .it
208314 : .fr
150360 : .uk
135981 : .ca
107034 : .nl
85466 : .de
78306 : .es
58187 : .gr
28076 : .jp