\documentclass{CHEP2006}
%\documentclass{article}

\usepackage[dvips]{graphicx}
\usepackage{amssymb}
\usepackage{url}  % used to fix hyphenation of URLs

\begin{document}
\title{GRID DATA MANAGEMENT: SIMULATIONS OF LCG 2008}

\author{A. T. Doyle, C. Nicholson\footnote{Presenting author.}\\
Dept. of Physics and Astronomy, University of Glasgow, Glasgow, G12
8QQ, Scotland \\
}

\maketitle

\begin{abstract}
Simulations have been performed with the grid simulator OptorSim using
the expected analysis patterns from the LHC experiments and a
realistic model of the LCG at LHC startup, with thousands of user
analysis jobs running at over a hundred grid sites. It is shown,
first, that dynamic data replication significantly improves
overall analysis performance, optimising job throughput and
reducing network usage;
second, that simple file deletion algorithms
such as LRU and LFU are as effective as economic models;
third, that site policies which allow all experiments to share
resources in a global Grid are more effective in terms of data access
time and network usage; and lastly, that dynamic data management
applied to user data access patterns where particular files are
accessed more often (characterised by a Zipf power-law function) leads
to much improved performance compared to sequential access.
\end{abstract}

\section{Introduction}
\label{section:intro}
Particle physicists are currently preparing for the Large
Hadron Collider (LHC) at CERN, the European Organization for Nuclear
Research, to start data-taking. The year 2008 will be the first full
year of running, and to handle the expected
15 PB/year of raw data, 
%plus secondary data 
%produced in
%reconstruction, analysis and simulation, 
the LHC experiments have
adopted grid-based solutions. The LHC Computing Grid
(LCG) project has been established to provide and maintain the
data storage and analysis infrastructure.
% for all the LHC experiment collaborations. 

LCG has adopted a tiered architecture, with CERN as a
Tier-0 site where all raw data are produced and
archived. Tier-1
sites are responsible for a share of permanent data storage and
computational power for reprocessing and analysis. 
%There will be approximately ten Tier-1 sites, which will
%act as Regional Centres for the region where they are situated. 
Each Tier-1 will have a number of associated Tier-2 sites,
each providing computing power. 
The LHC experiments will use the different tiers
in slightly different ways according to their own computing models; 
while these computing models are well-developed, the actual
behaviour of LCG during LHC running remains
unknown. Simulation may therefore be a useful tool to examine
this. In particular, the data management components
may be simulated and ways of improving grid performance
investigated. The grid simulator OptorSim~\cite{optorsim-release}, originally
developed as part of the European DataGrid (EDG) project~\cite{edg},
has been designed to explore the
effects of dynamic data replication: replicating files between sites
in response to jobs as they run. It has been used to simulate a
model of LCG in 2008 and this paper presents some of the results. 
%First, a
%brief description of OptorSim is given, followed by the experimental
%setup. The results of experiments investigating different data
%replication algorithms, site policies and data access patterns are
%then given before drawing some conclusions. 

\section{OptorSim}
OptorSim's emphasis is on simulation
of the replica management infrastructure, as dynamic data
replication involves automated decisions about replica placement and
deletion. The architecture and implementation are described
in~\cite{CCM+04} and only a brief description is given here. 

%\subsection{Architecture}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=5cm]{figures/architecture}
\caption{Grid architecture used in OptorSim.}
\label{figure:architecture}
\end{center}
\end{figure}
In the OptorSim model (Figure~\ref{figure:architecture}), the grid consists of sites connected
by network links; a site may have a Computing Element (CE), a Storage
Element (SE) or both. Each site also has a Replica Optimiser (RO)
which makes decisions on replications to that site. A Resource Broker
(RB) schedules jobs to sites. Jobs process
files, which can be replicated between sites
according to the decisions made by the RO. A Replica Catalogue holds
mappings of logical to physical filenames and a Replica
Manager handles replications and registers them in the Catalogue.
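The bookkeeping role of the Replica Catalogue can be illustrated with a
minimal sketch (our Python illustration of the idea only, not OptorSim's
actual implementation; the site and file names are hypothetical):

```python
# Illustrative sketch (not OptorSim code): a minimal Replica Catalogue
# mapping logical filenames to the sites holding physical copies.
from collections import defaultdict

class ReplicaCatalogue:
    def __init__(self):
        self._replicas = defaultdict(set)  # logical name -> set of sites

    def register(self, lfn, site):
        """Called by the Replica Manager after a replication completes."""
        self._replicas[lfn].add(site)

    def locate(self, lfn):
        """Return all sites holding a physical copy of the file."""
        return sorted(self._replicas[lfn])

# Hypothetical example: one logical file replicated to two sites.
rc = ReplicaCatalogue()
rc.register("atlas_file_001", "CERN")
rc.register("atlas_file_001", "RAL")
print(rc.locate("atlas_file_001"))  # ['CERN', 'RAL']
```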

\subsubsection{Grid Topology.}
To input a grid topology, a user specifies the storage capacity
and computing power at each site, and the 
network links between them. SEs are defined to have a certain
capacity, in MB, and CEs to have a certain number of ``worker nodes''
with a given processing power. 

\subsubsection{Jobs and Files.}
A list of jobs and the files that they need is defined; a job
will process some or all of the files in its
dataset, according to the \emph{access pattern}.
The time a file takes to process depends on its size and on
the worker nodes at the CE. 
%It is assumed that the output files from a physics analysis would be small
%enough to ignore compared to the input, and that these
%will be stored at local sites rather than on the grid, and so no
%simulation of output files is required.

\subsubsection{Site Policies.}
Different sites are likely to prioritise
different jobs; a site with strong involvement in 
ATLAS, for example, may prefer to accept ATLAS jobs. 
In OptorSim, each site is given a list of job types which it
will accept. 

\subsubsection{Optimisation Algorithms.}
There are two kinds of optimisation algorithm which may be
investigated using OptorSim: the job scheduling algorithms used 
to decide which sites jobs should be sent to, and the data
replication algorithms used by the RO at each site.
The focus of this paper is on the data
replication algorithms, for which there are three options available.
Firstly, one can choose to perform no replication. Secondly,
one can use a ``traditional'' algorithm which always tries to
replicate files, deleting existing files if necessary. Algorithms in this
category are the LRU (Least Recently Used), which deletes those
files which have been used least recently, and the LFU (Least
Frequently Used), which deletes those which have been used least
frequently. Thirdly, one can use an economic model
in which sites ``buy'' and ``sell'' files using an auction mechanism,
and will only delete files if they are less valuable than the new
file. Details can be found in~\cite{BCC+03a}. There are currently two versions:
 the binomial economic model, where file values are
predicted by ranking the files in a binomial distribution according to
their popularity, and the Zipf economic
model, where a Zipf-like distribution is used.
% (a Zipf
%distribution is one in which a few events occur very frequently, while
%most events occur infrequently).  
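The two ``traditional'' deletion policies can be sketched as follows (an
illustrative Python sketch, not OptorSim code; the access history shown
is a hypothetical example):

```python
# When a new replica arrives at a full SE, LRU evicts the file with the
# oldest last-access time; LFU evicts the file with the fewest accesses.
def evict_lru(files):
    """files: dict name -> (last_access_time, access_count). Returns victim."""
    return min(files, key=lambda f: files[f][0])

def evict_lfu(files):
    return min(files, key=lambda f: files[f][1])

# Hypothetical access history: (last access time, number of accesses).
se = {"f1": (10, 5), "f2": (30, 1), "f3": (20, 7)}
print(evict_lru(se))  # f1: least recently used
print(evict_lfu(se))  # f2: least frequently used
```

The economic models replace these fixed rules with predicted file values,
deleting a file only when its predicted value falls below that of the
incoming replica.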

%\subsection{Evaluation Metrics}
%To compare the results of different simulations, a number of metrics
%can be defined. First, in characterising different grids, the
%\emph{storage metric}, $D$, is the ratio of average SE size to total
%dataset size; the \emph{network metric}, $C$, is the average
%connectivity of a grid site; and the \emph{compute metric}, $P$, is
%the average processing power at a site. 

%For evaluating grid performance, different users may have different
%criteria. An ordinary user will most likely be interested in the time
%a job takes to complete. The owners of grid resources, on the other
%hand, will want to see their resources being used efficiently. The
%evaluation metrics used in this paper are \emph{mean job time}, which is the
%average time a job takes to run, from the time of scheduling to
%completion, and
%; \emph{CE usage}, which is the average percentage of simulation
%time for which a CE is active; 
%\emph{effective network usage (ENU)},
%which is the ratio of file requests which use network resources to the
%total number number of file requests.
%; and \emph{hit rate}, which is
%the fraction of tiles that a file request by a job is satisfied by a
%file which is already present on that site's SE. 

\section{Experimental Setup}
\label{section:setup}
OptorSim was set up using the predicted LCG resources for 2008 as a
basis. While some simplifications were necessary for the simulation to
run, the aim was to have a simulation which yielded useful information
about grid behaviour. 

\subsubsection{Analysis Jobs and Files.}
The experiment computing model documents~\cite{alice-comp-model,atlas-comp-model,cms-comp-model,lhcb-comp-model} describe their analysis
models. All experiments
except LHCb plan to do most of the analysis at Tier-2 sites, with
data storage at Tier-1s. This was modelled by assigning each experiment a dataset, which was placed at
each Tier-1 site and at CERN at the start of the simulation. Six
job types were defined, and each dataset divided into 2 GB
files, with  
%Assuming that a ``typical'' analysis job runs over $10^6$
%events (with the \texttt{lhcb-big} jobs running over $10^7$ events)
%and taking the AOD event sizes from the computing models gave the
parameters for each job type as presented in Table~\ref{table:jobs}. 
\begin{table}[h!tbp]
\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
Job & Event & Total no. & Files\\
& size (kB) & of files & per job \\
\hline
\hline
\texttt{alice-pp} & 50 & 25000 & 25 \\
\hline
\texttt{alice-hi} & 250 & 12500 & 125 \\
\hline
\texttt{atlas} & 100 & 100000 & 50 \\
\hline
\texttt{cms} & 50 & 37500 & 25 \\
\hline
\texttt{lhcb-small} & 75 & 37500 & 38 \\
\hline
\texttt{lhcb-big} & 75 & 37500 & 375 \\
\hline
\end{tabular}
\caption{Job configuration parameters used.}
\label{table:jobs}
\end{center}
\end{table}
%The total size of the dataset for each is about the
%size of a single copy of the AOD for a year's worth of data
%taking. 
The jobs processed a subset of files from the dataset  according to the access
pattern. 
%When a job ran on a site, it 
%retrieved its files and processed them according to the computing resources
%available at that site. 
Processing times per file were calculated
according to the expected time per event during
analysis.
%, and the probability of a particular job being run on the grid was
%modelled by the relative number of expected users for the different
%experiments, taken from the computing model documents.
%Estimates of the numbers of active users for each
%experiment are given in the computing models and are shown in
%Table~\ref{table:users}. The approximate number of ALICE users was
%not published and hence was estimated at about 500 users. The split
%of probabilities between \texttt{alice-pp} and \texttt{alice-hi}
%jobs is 87.5\% to 12.5\%, while \texttt{lhcb-small} and
%\texttt{lhcb-big} are given 80\% and 20\% respectively of the LHCb
%share.
%\begin{table}[h!tbp]
%\begin{center}
%\begin{tabular}{|l|c|}
%\hline
%Experiment & Number of users \\
%\hline \hline
%ALICE & 500 \\
%\hline
%ATLAS & 600\\
%\hline
%CMS & 1000\\
%\hline
%LHCb & 140 \\
%\hline
%\end{tabular}
%\caption{Estimated number of active users for the LHC experiments.}
%\label{table:users}
%\end{center}
%\end{table}

\subsubsection{Storage Resources.}
The Tier-0 and Tier-1 sites were designated as ``master
sites'', with SEs according to their
planned capacities in~\cite{lcg-tdr}, 
as presented in Table~\ref{table:tier1-se}. 
%Table~\ref{table:tier1-se} also shows the experiments
% served by each site.
\begin{table}[h!tbp]
\begin{center}
\begin{tabular}{|l|c|c|}
\hline
Site & Storage  & Experiments served\\
& (PB) & \\
\hline \hline
CERN Tier-0 & 12.5 & All\\
\hline
CAF & 6.4 & All\\
\hline
TRIUMF & 1.5 & ATLAS\\
\hline
IN2P3 & 7.7 & All\\
\hline
GridKa & 4.0 & All\\
\hline
CNAF & 7.5 & All\\
\hline
NIKHEF/SARA & 5.2 & ALICE, ATLAS, LHCb\\
\hline
Nordic & 2.8 & ALICE, ATLAS, CMS\\
\hline
PIC & 3.5 & ATLAS, CMS, LHCb\\
\hline
ASCC & 2.5 & ATLAS, CMS\\
\hline
RAL & 3.6 & All\\
\hline
BNL & 5.1 & ATLAS\\
\hline
FNAL & 5.2 & CMS\\
\hline
\end{tabular}
\caption{LCG Tier-0 and Tier-1 storage resources for 2008.}
\label{table:tier1-se}
\end{center}
\end{table}
Detailed resource estimates are not available for all the Tier-2s,
so each Tier-2 site was given the same averaged capacity. Averaging
the total Tier-2 requirements over the number of sites
gave an average SE size of 197 TB. Defining a \emph{storage metric},
$D$, as the ratio of average SE size to total dataset size allows
characterisation of a grid in terms of the proportion of the dataset
individual SEs can hold. 
If $D > 1$, an average SE can
hold all the files, so the choice of replication strategy
will have no effect. For $D < 1$, the replication strategy becomes
more important, but
if $D \ll 1$ due to a very large dataset, replication will begin to
lose its advantage, as each job is likely to request new files. 
%An SE size of 197 TB gives a value for $D$ of 0.47. 

The available memory, however, limited the
simulations to about 1000 jobs, so the
Tier-2 SE sizes were scaled down to 500 GB,
allowing file replacement to start quickly. 
The disadvantage is that 
$D$ is then very small, so the file prediction algorithms will not
perform to their best advantage. The effect of changing $D$ by
changing the size of the dataset, however, is among the tests presented.
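Under the assumption (our reading of the setup) that the two LHCb job
types share a single 37500-file dataset, both values of $D$ follow
directly from the file counts in Table~\ref{table:jobs} and the 2~GB
file size:

```python
# Storage metric D = (average Tier-2 SE size) / (total dataset size).
# File counts from Table 1; we assume the lhcb-small and lhcb-big job
# types share one 37500-file dataset.
files_per_dataset = {"alice-pp": 25000, "alice-hi": 12500,
                     "atlas": 100000, "cms": 37500, "lhcb": 37500}
dataset_tb = sum(files_per_dataset.values()) * 2e9 / 1e12  # 2 GB files -> 425 TB
D_full = 197 / dataset_tb    # full-scale 197 TB Tier-2 SE: D ~ 0.46
D_scaled = 0.5 / dataset_tb  # scaled-down 500 GB SE: D ~ 1.2e-3
```

The scaled-down value reproduces the $1.2\times10^{-3}$ used as the
starting point of the tests in the results section.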

\subsubsection{Computing Resources.}
As most analysis jobs run at Tier-2 sites, the Tier-1 sites
were not given CEs, except those which run LHCb jobs; these
were given a CE equal to those at the Tier-2s. 
%In reality,
%of course, the Tier-1s have large computing resources, but as the
%focus here is on analysis, they are assumed to be reserved for
%reconstruction and thus unavailable for analysis. 
The CERN Analysis
Facility (CAF) is a special case, and was allocated a CE of 7840
kSI2k. The Tier-2s were given an averaged CE
of 645 kSI2k, to meet the total requirement of 61.3 MSI2k. 

\subsubsection{Network Topology.}
                                                                               
The network topology (Figure~\ref{figure:lcg2008_topology}) 
 was developed using the
topologies of the main research networks, simplified
slightly. Sites were connected with the published bandwidths if
these were available and 155 Mbps otherwise. 
%As the simulation was geared towards the user analysis view of
%the grid, where resources are available via the standard research
%networks rather than the dedicated paths which will be available for
%initially transporting data from CERN to Tier-1 sites, this is not
%inappropriate. Sites with both a Tier-1 and a Tier-2 facility
%had the Tier-2 attached directly to the Tier-1 by a 1 Gbps
%link. 
\begin{figure*}[h!tbp]
\begin{center}
\includegraphics[width=13cm]{figures/lcg2008}
\caption{Simulated topology of LCG in 2008. 
%CERN, as the Tier-0
%site, is shown in yellow, while Tier-1 sites are green, Tier-2s are
%red and router nodes are black. Network links have the values shown in
%the key.
}
\label{figure:lcg2008_topology}
\end{center}
\end{figure*}

\section{Results}
For each of the results presented, a
scheduler was used which combines information on data location and
lengths of queues at sites.
%; this has been shown in previous studies to give the best grid
%usage~\cite{CCM+04}. 
In each test, 1000 jobs were submitted. The metric used in evaluation
is the \emph{mean job time}, which is the
average time a job takes from scheduling to completion.
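The combined scheduling criterion can be sketched as follows (our
illustrative Python sketch, not OptorSim's implementation; the site
data and scoring weights are hypothetical):

```python
# A scheduler combining queue length with data location: each candidate
# site is scored by its queue length plus the number of input files it
# is missing (i.e. files that would need network transfer), and the job
# is sent to the lowest-scoring site.
def schedule(job_files, sites):
    """sites: dict name -> (queue_length, set_of_local_files)."""
    def cost(site):
        queue, local = sites[site]
        missing = len(set(job_files) - local)  # files needing transfer
        return queue + missing
    return min(sites, key=cost)

# Hypothetical grid state: (queue length, files already on the SE).
sites = {"RAL": (3, {"f1", "f2"}), "CNAF": (1, {"f1"}), "PIC": (1, set())}
print(schedule(["f1", "f2", "f3"], sites))  # CNAF: short queue, some data
```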

\subsubsection{Effects of Data Replication}
The first test presented examines the performance of the replication
algorithms with different values of the storage
metric, $D$. The overall dataset size was successively halved, varying
$D$ from $1.2\times10^{-3}$ to $7.5\times10^{-2}$ and bringing it closer to
the more realistic level of $\mathcal O(10^{-1})$. The results of
this test are shown in Figure~\ref{figure:dataset}.
\begin{figure}[h!tbp]
\begin{center}
\includegraphics[width=6cm]{figures/dataset-jobtime}
%\includegraphics[width=6.5cm]{figures/dataset-enu}
\caption{Mean job time with varying $D$.}
\label{figure:dataset}
\end{center}
\end{figure}
This shows, first, that for low $D$, dynamic data replication gives
little benefit. As $D$ increases, however, replication gives
up to a 20--25\% gain in performance, with the simpler LRU and LFU
strategies giving better performance than the economic models. 
Increasing the number of jobs gives a
linear increase in job time, so the relative improvement in
performance is expected to hold at realistic job counts.

%Although these results were gained with 1000 grid jobs,
%Figure~\ref{figure:numjobs} shows the variation in job time with an
%increasing number of jobs.
%\begin{figure}[h!tbp]
%\begin{center}
%\includegraphics[width=6.5cm]{figures/numjobs-jobtime}
%\caption{Mean job time with varying number of jobs.}
%\label{figure:numjobs}
%\end{center}
%\end{figure}
%This shows a linear increase in job time with number of
%jobs. Extrapolating to $\mathcal O(10000)$ jobs, which is a more
%realistic number, this linear relationship is expected to hold.
%This means that with
%realistic values of $D$, and higher numbers of jobs, the relative
%improvement in performance would hold. Replication is therefore an
%important way of reducing job times and network usage, and the
%relatively simple LRU and LFU strategies are the most effective. 

\subsubsection{Effects of Site Policies}
In the last section, site policies were set according to their planned
usage. Here, this is compared with two extremes of
policy. In the first, called \emph{All Job Types}, all sites accepted all
job types. In the second, designated \emph{One Job Type}, each site
would accept only one job type, with an even distribution of sites
for each job type. The CAF still accepted all
job types. The default set of policies is between
these two extremes, and is designated in the results below as
\emph{Mixed}. The results are shown in
Figure~\ref{figure:policies}. 
\begin{figure}[h!tbp]
\begin{center}
\includegraphics[width=6cm]{figures/policies-jobtime}
%\includegraphics[width=6.5cm]{figures/policies-enu}
\caption{Mean job time for varying site
policies.}
\label{figure:policies}
\end{center}
\end{figure}
These results show that the pattern of site policies on the
grid has a powerful effect on performance. The mean job time with
the \emph{All Job Types} policy is about 60\% lower than with the
\emph{One Job Type} policy. This is true across all the replication
strategies, although the effect is strongest with no replication and
with the LRU. 
%\emph{All Job Types} also gives a lower ENU
%(about 25\% lower than the others) 
It seems clear that an egalitarian
approach, in which resources are shared as much as possible, yields
benefits to all grid users. 
%It would therefore be recommended that
%experiment collaborations make every effort to share their resources
%widely.

\subsubsection{Effects of Data Access Pattern}
In the previous sections, file access was
sequential. Other access patterns are also possible, and
perhaps the most likely of these in a chaotic analysis situation is a
Zipf-like access pattern, with a probability distribution
$P_i\propto i^{-\alpha}$, where $i$ is the file
rank and $\alpha$ is the power value. 
%An example could be files containing data from a set of
%possible Higgs events, which would attract a great deal of attention
%from physicists. 
Iamnitchi and Ripeanu~\cite{IR02} showed that in the D0 experiment at FNAL, the least popular files
followed a Zipf-like pattern, while there were many popular
files which were all accessed with the same frequency. This
corresponds to the use of the sequential access pattern in OptorSim. This
%observation may be specific to the D0 sample studied, or may apply
%to HEP experiments in general, but 
gives strong motivation to examine the relative effects of sequential
and Zipf access patterns with OptorSim.
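The shape of such an access pattern can be illustrated with a short
sketch (our Python illustration, not OptorSim code; the 50-file fileset
is a hypothetical example, with $\alpha = 0.85$ as in the results that
follow):

```python
# Zipf-like access: the file of rank i (starting at 1) is requested with
# probability P_i proportional to i**(-alpha). Sequential access, by
# contrast, touches each file of the fileset once in order.
import random

def zipf_weights(n_files, alpha=0.85):
    w = [i ** -alpha for i in range(1, n_files + 1)]
    total = sum(w)
    return [x / total for x in w]

p = zipf_weights(50)
# The top-ranked file is requested 50**0.85 (about 28) times more often
# than the last-ranked one, so a few replicas satisfy most requests.
print(round(p[0] / p[-1], 1))
requests = random.choices(range(50), weights=p, k=1000)  # simulated accesses
```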

Figure~\ref{figure:zipf} shows the results
of using a Zipf-like access pattern with $\alpha = 0.85$.
\begin{figure}[h!tbp]
\begin{center}
\includegraphics[width=6cm]{figures/zipf-jobtime}
%\includegraphics[width=6.5cm]{figures/zipf-enu}
\caption{Mean job time for varying access patterns.}  
\label{figure:zipf}
\end{center}
\end{figure}
Although the four replication algorithms still have very similar
performances, they are now about 75\% faster than without
replication. 
%The ENU is correspondingly lower. 
This is due to the way in which a few files
from each job's fileset are accessed many times during the jobs,
while others are accessed infrequently. This allows the access
histories to predict file values more accurately than with the
sequential pattern, where they may see a file only once. 
%As the
%number of jobs and the proportion of the whole dataset seen by an
%individual SE increases, however, the results with sequential access
%should tend towards a similar pattern as for the Zipf access. This
%is borne out by the results from varying $D$ with sequential access.
The presence of any Zipf-like element, even if combined with a
sequential pattern, would make dynamic replication
highly desirable. 

%\section{Related Work}

\section{Conclusions}
The grid simulator OptorSim has been used to simulate a model of LCG
during the first year of LHC running, exploring several different
aspects. First, it was shown that dynamically replicating data between
sites using a sequential file access pattern decreased the running
time of grid jobs by about 20\%, 
%and reduced usage of the network by about 25\%, 
especially as sites' Replica Optimisers gained more knowledge
of the overall dataset. While the performances of different
replication strategies were similar, the simpler LRU and LFU
strategies were found to perform up to 20\% and 30\% better,
respectively, than the economic models. Examining site policies, it
was found that a policy which allowed all experiments to share
resources on all sites was most effective in reducing data access time.
Finally, it was shown that if user data access
patterns include a Zipf-like element, 
dynamic replication has a much stronger effect
than with sequential access, with gains in performance of about 75\%.
It is quite likely that an analysis situation would involve Zipf-like
elements in data access patterns, and so implementing such an
automated file replication and deletion tool for analysis would give
significant gains. In future, if sub-file level replication were
implemented, these gains could be even greater.

%In future, there is much scope for refining the simulation model and
%extending this work, such as more detailed modelling of computing
%resources at sites, output files from jobs, and fluctuation of
%available resources. It would also be interesting to examine the
%effects of replication at the event, rather than the file, level, and
%to include reconstruction and simulation jobs as well as analysis-type
%jobs. 



\section{Acknowledgments}
This work was funded by PPARC. Thanks to all the members of the EDG
WP2 Optimisation Team, whose work allowed this research to be conducted. 

\begin{thebibliography}{99} % Use for 10-99 references
 
\bibitem{edg}
The European DataGrid Project, \url{http://www.edg.org}
 
\bibitem{optorsim-release}
{OptorSim Release 2.0}, November 2004.
\newblock
  \url{http://edg-wp2.web.cern.ch/edg-wp2/optimization/optorsim.html}.

\bibitem{CCM+04}
D. Cameron et al., ``Evaluating Scheduling
and Replica Optimisation Strategies in OptorSim'', Journal of Grid
Computing 2(1):57--69, March 2004.

\bibitem{BCC+03a}
W. Bell et al., ``Evaluation of an Economy-Based Replication Strategy for a
Data Grid'', Int. Workshop on Agent Based Cluster and Grid
Computing, Tokyo, 2003.

\bibitem{alice-comp-model}
{ALICE Computing Model}.
\newblock Technical Report CERN-LHCC-2004-038/G-086, CERN, January 2005.

\bibitem{atlas-comp-model}
{The ATLAS Computing Model}.
\newblock Technical Report CERN-LHCC-2004-037/G-085, CERN, January 2005.

\bibitem{cms-comp-model}
{The CMS Computing Model}.
\newblock Technical Report CERN-LHCC-2004-035/G-083, CERN, January 2005.

\bibitem{lhcb-comp-model}
{LHCb Computing Model}.
\newblock Technical Report CERN-LHCC-2004-036/G-084, CERN, January 2005.

\bibitem{lcg-tdr}
{LHC Computing Grid Technical Design Report}.
\newblock Technical Report CERN-LHCC-2005-024, CERN, June 2005.

%\bibitem{MoU}
%{Memorandum of Understanding for Collaboration in the Deployment and
%  Exploitation of the Worldwide LHC Computing Grid}.
%\newblock Technical Report CERN-C-RRB-2005-01, CERN, September 2005.

\bibitem{IR02}
A.~Iamnitchi and M.~Ripeanu.
\newblock Myth and reality: Usage behavior in a large data-intensive physics
  project.
\newblock Technical Report TR2003-4, GriPhyN, 2003.

\end{thebibliography}

\end{document}

