=============================================
Big Data Workshop Imperial College 2013-06-27
=============================================

David's talk
------------

Richard Mount
-------------

High Energy Physics, including the LHC
Speaker: Dr. Richard Philip Mount (SLAC National Accelerator Laboratory, US)

- There is so much data because the universe isn't deterministic
- Growing confidence in the Higgs discovery as more data is used over time and LHC performance improves
- No general access to raw data - several reconstruction steps are needed to make the data usable by thousands of physicists
- Optimise use of technologies to do the science
- Graph about real physics output
- Disk access rate is falling behind over time, because disks are getting bigger while access performance is not improving to match
  - compared with CPU improvements
- Becoming more reliant on networks to overcome disk access limitations
  - reaching the point where network providers will start asking for money because of the amount of network capacity being used

The SKA - the world's largest big-data project - Paul Calleja
--------------------------------------------------------------

- How much data processing can we afford to do over time? Similar philosophy to HEP
- SKA aims to push the boundary of radio astronomy - 6 orders of magnitude faster
- An IT project; the astronomy is an excuse to buy lots of computers
- N-squared problem with the number of antennae (a real-time problem); see the sketch after this section
- Gridding and FFT, to handle signals arriving at receivers in different locations
- Continent-sized networks that must be able to deal with these challenging data rates
  - a large proportion of the budget is going on new networks
- Most of the compute is at the experiment sites, 10% at sites around the world
- There will be a data explosion when students get their hands on the data
- SKA 1 needs to be operational in 2019, so orders will be placed in 2018 - it will be using 2018 technology
- Large FFTs need to reference earlier data, so a large buffer store (12 hours of observation) is needed: a 135 PB observation buffer
- Will be a leading Top 500 machine at that time (2018)
- 300 PB persistent data store
- SKA 2 is scheduled for 2023; not practical in Moore's-law terms as of today
- Astronomers are worried about software complexity, which dwarfs the hardware problems
  - the costs are in the software, which isn't scalable - would need all the software developers in the world
- Exascale computing in the desert is a problem (compared with normal, well-established laboratories)
- Even extrapolating squashed designs, with power-efficiency gains, would mean 800 cabinets and 30 megawatts
- Problems are soluble if they collaborate with the wider community; don't reinvent things
- System management software is a big problem at exascale
- Radio astronomy development is driving supercomputer development now, as it did with EDSAC
- Question from Shaun de Witt about distributing data around the world, as CERN does
  - Paul: there are work packages to deal with this, distributing data to the edge
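A minimal back-of-envelope sketch of two of the figures above: the O(N^2) correlation load (antenna pairs grow as N(N-1)/2) and the sustained ingest rate implied by a 135 PB buffer holding 12 hours of observation. The antenna counts are placeholders for illustration only, not actual SKA numbers::

    # Illustrative only: the antenna counts below are placeholders, not SKA figures.
    # The 135 PB / 12-hour buffer is the number quoted in the talk.

    def baselines(n_antennas):
        """Antenna pairs a correlator must handle: n(n-1)/2, i.e. O(n^2)."""
        return n_antennas * (n_antennas - 1) // 2

    for n in (100, 1000, 3000):                      # placeholder antenna counts
        print(f"{n:5d} antennas -> {baselines(n):>12,} baselines")

    # Sustained ingest rate implied by the 12-hour observation buffer.
    buffer_bytes = 135e15                            # 135 PB (decimal PB assumed)
    window_s = 12 * 3600                             # 12-hour observation window
    print(f"buffer ingest rate ~ {buffer_bytes / window_s / 1e12:.1f} TB/s")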
Cloud Computing & Data-Intensive Research - Kenji Takeda
---------------------------------------------------------

- Works in Tony Hey's group
- Worldwide Telescope - going from a tiled mosaic to a seamless image
- Connecting directly to JANET, to avoid congestion/delays over the public internet
- Working with Bath and the DCC on data-intensive research
- Community Capability Model for data-intensive research
  - a tool to go from symptoms to diagnosis to action
  - based on the Cornell three-stool model
  - http://communitymodel.sharepoint.com
- Question from Paul Lewis (JANET) about peering arrangements to the Dublin data centre

Big Data and the Earth Observation and Climate Modelling Communities - Philip Kershaw
---------------------------------------------------------------------------------------

- Difficulties in software development (cf. SKA): software development always lags behind hardware, always a problem
- Some data needs to be kept but isn't retrieved very often
- Emphasis on storage, not compute, and on making it available to a range of different communities
- Question of whether to choose full cloud, just virtualisation, or bare-metal compute for different use-cases
- Earth observation: radical performance improvements when code was parallelised
- The expectation of a self-service cloud environment may be misleading - very important to provide good user documentation
- A researcher said "this is a game-changer for us"
- Question from Ewan MacMahon about GridFTP - did they use anything to manage it, e.g. FTS?
  - it was mainly ad hoc; they sometimes use Globus Online; better management is needed

Big Data needs at ECMWF - Baudouin Raoult
-----------------------------------------

- An operational centre - some of their needs are more time-critical than e.g. SKA/LHC
- 16 km 2-D grid over the globe, at 91 levels
- Example of the 1987 storm, which was in the data but not captured by the main model
- Data volume is tied 1:1 to the power of their compute capability
- Roughly 60% growth per annum; see the growth sketch after this section
- If data size grows exponentially, data input per day also grows exponentially
- Cannot keep up with the size of the tape media library; it is growing too quickly
- Architecture designed so files can be moved around without having to alter the metadata (slide 16)
- Can migrate data without stopping the service (slide 18)
- Created huge files to improve manageability of the system (slide 19), by reducing the number of files
- Produce 4 TB per cycle, twice a day (slide 22)
- Creative solutions will be needed to cope with new use-cases and services
- Question from Wahid: what technology is used to store data in the index?
  - an object-oriented database, built specifically for them, not general-purpose
  - the company they were using went bust, which caused a problem
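A rough sketch of what the figures above imply, using only the numbers quoted in the talk (roughly 4 TB per cycle, two cycles a day, about 60% growth per annum); the ten-year horizon is arbitrary and purely illustrative::

    # Projection using the quoted figures: ~4 TB per cycle, 2 cycles/day,
    # ~60% growth per annum. The ten-year horizon is arbitrary, for illustration.

    daily_tb = 4 * 2          # TB written per day today
    growth = 1.60             # 60% growth per annum
    archive_pb = 0.0          # cumulative volume added over the projection

    for year in range(1, 11):
        archive_pb += daily_tb * 365 / 1000          # PB added during this year
        print(f"year {year:2d}: {daily_tb:9.1f} TB/day,"
              f" cumulative +{archive_pb:8.1f} PB")
        daily_tb *= growth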
PanData and the Research Data Alliance - Juan Bicarregui
---------------------------------------------------------

- A lot of policy activity in this area
- Sharing technologies across different projects/experiments
  - economy-of-scale benefits for technologies
  - cross-disciplinary benefits: users' expertise in one field can move to another easily
- Open Science agenda:
  - separate infrastructure from the process of science
  - researchers don't want to have to think about the location of data, how it's curated, etc.
- ICAT: dotted lines where they lose central control of the data (users running their own analyses etc.); the main focus of current work
  - tracking provenance
  - keeping track of which software is applicable to which data (cataloguing)
- Tomographic reconstruction: compute takes an order of magnitude longer than the scanning
  - want to reduce this to a time comparable with scanning (using a GPU cluster)
  - would then do more work, e.g. see the results of a scan before doing the next one, to allow tuning of the scan
  - the number of scans might not change, but the quality would improve
- Research Data Alliance
  - metaphor with bridges: now use minimal vertical structures to support the horizontal structures

Economic and Social Science Research Data Landscape - Fiona Armstrong
----------------------------------------------------------------------

- Traditional social science is survey data
- Expect a seismic shift in how social science is done, reusing existing data
  - data not collected for social science purposes but which can be used for them
- Admin data: information collected for the purposes of administration, e.g. tax records
  - privacy concerns
  - pods in which research using these datasets is carried out, placed at various universities
- Lots of interest in this community in research methods - NCRM
- CLOSER: potentially a 100-year project, 100,000 babies
- Lots of value in talking across different disciplines
- fiona.armstrong@esrc.ac.uk
- Question from Shaun de Witt: asking particle physicists to talk to anthropologists
  - referred to the EUDAT project, which is aiming to do this; workshop in September, 25 different disciplines

Afternoon session
-----------------

Bioinformatics - Guy Coates
---------------------------

- Cost of sequencing is decreasing faster than Moore's law
- Currently doing 10 terabases a week; 17,000x more than 10 years ago, but they don't have 17,000x the budget
- USB-attachable sequencers, i.e. people can just turn up and try to use them
- Strong emphasis on metadata for files, so people can find the things they need
- A lot of data, rather than big data per se - big data is computing across all the data
- The field changes so quickly that time to development matters much more than computational efficiency
- Question: has he considered private clouds in existing infrastructure?
  - would help with regulatory problems
  - can it be done as reliably and cost-effectively as the public cloud providers?

ELIXIR project - Andrew Lyall
-----------------------------

- Moving to distributed data access, instead of centralised - can't keep growing at current rates
- The old usage model of downloading the dataset you want and processing it locally is breaking down
- Embassy Cloud is being tested now, to run compute at the data site
- Data delivery in London, data production at Cambridge (roughly)
- Many different modalities of usage as well as lots of different types of data
- Slide 22: the EU provides the orange, the national level provides the grey
- "Biology: the big challenges of big data", Nature 498, 255-260
- Questions: I/O problems - is it an access-rate or a streaming problem?
  - he is currently modelling these bottlenecks in software pipelines, e.g. IOPS
  - it looks like pipelines generate a comparatively large number of IOPS relative to CPU cycles

Big Data Requirements in Arts and Humanities - Andrew Prescott
--------------------------------------------------------------

- Engagement with big data is one of the drivers of transforming scholarly practices in the humanities
- Good engagement of digital artists with big data (possibly more than academics)
- New types of dialogue are needed to achieve the transformations that are required
- Questions: Jens - are there cultural barriers to these developments?
  - there have been commercial barriers, e.g. licensing
  - there is now enthusiasm about the possibilities of free use, but ignorance of how to do it
- David: curation of social media
- Andrew Lyall: ESFRI projects have been good for biology and may work for the humanities (cf. CLARIN)
  - BioMedBridges is a new EU project and there's an equivalent one for the social sciences

DDN GPFS - Vic Cornell
----------------------

HDFS - Steve Loughran - Hortonworks
-----------------------------------

- Organised for workflows of streaming and processing
  - relax some traditional filesystem constraints to achieve this
- Accept that failures are inevitable
  - operations teams are concerned about changes in the rates of failure, not individual failures per se
- Topology-aware filesystem, so it can ensure that file replicas are stored on different racks, different switches, different power supplies (given the correct failure domains); see the placement sketch after this section
- With the latest Intel CPU parts, checksums can be computed in a single CPU opcode, so they are very efficient
- Moving towards topology-aware applications
- A big driver is where hardware is going and where desktops are going
  - laptops with SSDs will affect the server world
  - less capacity, more bandwidth in servers by using laptop hard disks
  - this probably won't be possible in two years' time because all laptop disks will be SSDs
- Question from Guy Coates: people had problems with the shuffle phase when running scientific applications
  - the merge phase is network-heavy; can do in-machine shuffles
  - the best optimisation is to generate less data
- Question: what happens if a node dies?
  - missing blocks will be re-replicated
  - it's up to the local site whether to e.g. fix the disk or just ignore it for now
  - they care more about statistical failures than anything else - the equivalent of worrying about a single sector failing in a disk
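A toy sketch of the rack-aware replica placement idea mentioned above, in the spirit of HDFS's default policy (first replica on the writer's node, second on a node in a different rack, third on a different node in that second rack). This is not Hadoop code and the cluster layout is made up; it only illustrates why replicas end up in separate failure domains::

    # Toy illustration of rack-aware placement; not Hadoop code.
    import random

    def place_replicas(writer, nodes_by_rack):
        """Pick three (rack, node) locations for one block written from `writer`."""
        writer_rack, writer_node = writer
        first = (writer_rack, writer_node)                    # local to the writer

        other_racks = [r for r in nodes_by_rack if r != writer_rack]
        remote_rack = random.choice(other_racks)              # cross-rack copy
        second = (remote_rack, random.choice(nodes_by_rack[remote_rack]))

        remaining = [n for n in nodes_by_rack[remote_rack] if n != second[1]]
        third = (remote_rack, random.choice(remaining))       # same rack, new node
        return [first, second, third]

    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
    print(place_replicas(("rack1", "n1"), cluster))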
Lustre - John Swinburne - Intel
-------------------------------

Ceph and Big Data - Patrick McGarry, Inktank
--------------------------------------------

- Question: when will CephFS be production-ready?
- Question from Guy Coates: can I use different access methods at the same time?

CERN experiences with EOS, S3 and Ceph - Dirk Duellmann
-------------------------------------------------------

- Split the disk system off from the archive system, because they are somewhat conflicting sets of requirements
- RAIN: redundancy across nodes (cf. RAID across disks)
- Erasure encoding, to bring storage overhead down for the parts of the data that are not frequently accessed; see the overhead sketch at the end of these notes
- Apache Dynamo, Huawei system
- Doesn't think a single WAN system can be performant and reliable at the same time

Discussion
----------

- What about hotspotting of data?
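A short sketch of why erasure coding brings storage overhead down relative to plain replication, as mentioned in the EOS notes above; the (k, m) values are placeholders for illustration, not the actual EOS/Ceph configuration::

    # Extra bytes stored, as a percentage of the user data.
    def replication_overhead(copies):
        return (copies - 1) * 100.0      # e.g. 3 copies -> 200% overhead

    def erasure_overhead(k, m):
        return m / k * 100.0             # m parity fragments per k data fragments

    print(f"3x replication: {replication_overhead(3):5.0f}% overhead, survives 2 losses")
    for k, m in ((4, 2), (10, 4)):       # placeholder (k, m) choices
        print(f"EC({k:2d},{m}):      {erasure_overhead(k, m):5.0f}% overhead,"
              f" survives {m} losses")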