4. Conclusions / Future plans
The objective of reasonably organising scientific data on the grid is not a new one; many existing approaches, especially in file replication, already show good improvements. We argue, though, that once we approach the petascale, low-level file reorganisation is no longer sufficient and a global view of grid dataflow must be taken into account. We provide a preliminary model, together with accompanying tools, for understanding erratic and unpredictable dataflows, and show their usefulness on the production EGEE grid.
1. Short overview
Scientific applications on the grid are in most cases heavily data-dependent. Improving scheduling decisions through the co-allocation of data and jobs therefore becomes a primary issue, and it is crucial to analyse the behaviour of existing data management systems in order to provide accurate information to decision-making middleware in a scalable way. We present current research issues in understanding the behaviour of data management systems at the petascale with the aim of improving grid performance.
Keywords:
Data Management, Dataflow, Grid Behaviour, Petascale
We characterise three areas for improvement of dataflow. First, controlled data transfers issued by experiment operators or grid-site operators: a constant data export from the experiment to distributed computing facilities, largely defined by the experiment's computing model. Second, dynamic data transfers issued by jobs on a grid site: these production jobs may need to access data that is only available on remote sites. Third, uncontrolled data transfers issued by end-users, i.e. scientists fetching data for direct analysis. We argue that at the petascale the complete replication of files is no longer a suitable option, as there is simply too much data, and that erratic and unpredictable data movements are the norm. Furthermore, it is important to weigh the relevance of data over time in order to find useful data on the grid; our model derives these usage patterns implicitly. Global data movement and usage patterns must therefore be taken into account when performing job/data co-allocation.
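The three transfer categories and the time-weighted relevance of data can be illustrated with a minimal sketch. The record fields, the initiator labels, and the exponential half-life decay are all illustrative assumptions, not part of the EGEE tooling or the model described above:

```python
from dataclasses import dataclass
import time

# Hypothetical transfer record; field names are illustrative,
# not an actual EGEE monitoring schema.
@dataclass
class Transfer:
    dataset: str
    initiator: str    # assumed labels: "operator", "job", or "user"
    timestamp: float  # seconds since epoch

def classify(t: Transfer) -> str:
    """Map a transfer to one of the three dataflow categories."""
    return {
        "operator": "controlled",    # experiment/site operator exports
        "job": "dynamic",            # remote reads by production jobs
        "user": "uncontrolled",      # end-user analysis fetches
    }[t.initiator]

def relevance(transfers, dataset, now=None, half_life=7 * 86400.0):
    """Time-decayed access count for a dataset: recent accesses weigh
    more, so relevance fades as accesses age (7-day half-life here)."""
    now = time.time() if now is None else now
    return sum(
        0.5 ** ((now - t.timestamp) / half_life)
        for t in transfers
        if t.dataset == dataset
    )
```

Under this sketch, a dataset accessed frequently in the recent past scores higher than one with the same total access count spread far in the past, which is one simple way to operationalise "relevance with respect to time" for co-allocation decisions.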
URL for further information: