================================
Big Data Workshop June 27th 2013
================================

ROOT for Big Data Analysis - Fons Rademakers
--------------------------------------------
- important for data preservation that ROOT files and the format are self-describing
- Questions:
  - How easy is it to use ROOT in other science domains? How closely is it coupled to HEP?
  - How well do your data types map to classes in ROOT?
    - e.g. statistical mapping over time series
    - Astronomers are using it, some work at Imperial
  - How does the SQL interface compare in performance to native C++?
    - Uses vector acceleration of current CPUs (AVX2 etc.)
    - SQL people used that and performance was very good

Hadoop Data Processing - Steve Loughran - Hortonworks
-----------------------------------------------------
- People don't need to know the details of parcelling out jobs, cluster management etc.
- HDFS: break big files into blocks, replicate those blocks and run the work scheduler on those blocks
- Job scheduler tries to find the machines where the blocks are, to avoid network traffic and achieve maximum computational efficiency
- Pig: works backwards from the result to decide what you want, thereby discarding things in which you aren't interested
- Take stuff written for social media networks, steal it and repurpose it, e.g. Giraph, which has been used in Bristol for modelling the human heart
- Planned improvements:
  - treat MapReduce as a job you run on a Hadoop cluster, rather than a core part of the system; facilitates upgrades etc.
  - YARN
  - Pig/Hive on Tez - deals with iteration; cf. Dryad
  - Hamster: attempt to do MPI
- What should we do with these tools and how can we democratise it?
  - Not stuck with MapReduce
  - Steal other people's code, start with Pig
- Questions: what if users say "I want to run my tool on my data format, using Hadoop"?
  - Can deploy native code using YARN, if your application doesn't care where it starts
  - HBase works well because it uses ZooKeeper to keep track of where things are
  - Which stores?: HBase, Dremel, Impala (coming from Cloudera)

Contrast between big data processing in academia and industry - Simon Metson - Cloudant
----------------------------------------------------------------------------------------
- companies have much shorter timelines, so they accumulate technical debt by releasing a Minimum Viable Product quickly, then adding things later
- heavily using cloud providers, so they don't need to invest upfront in hardware
- in academia, more likely to define a question to be investigated, rather than just saying "we don't know what to do with our data"
- in industry, don't tend to run test jobs in the same way as academia - would have development/test, pre-production and production systems
- need to build a team, maintain it and specialise, rather than using a constant stream of students

Optimising bioinformatics pipelines for clinical genomics
Speaker: Dr. Michael Mueller (Imperial College)
------------------------------------------------------
http://www.imperial.ac.uk/clinicalgenomeinformatics
- data explosion in genomics in the last five years requires completely new approaches to data processing
- Imperial trying to create a similar genetic disease pipeline
- split the dataset before staging to improve performance
- massive performance improvements by using scatter-gather techniques (see the sketch below)
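A minimal sketch of the scatter-gather idea, not the Imperial pipeline itself: split the input into independent chunks (per-chromosome here, purely as an illustration), process the chunks in parallel, then merge the per-chunk results. The region list and call_variants_on_region function are hypothetical placeholders.

    # Toy scatter-gather skeleton (illustrative only, not the actual pipeline).
    # Assumes the work can be partitioned by genomic region, e.g. per chromosome.
    from multiprocessing import Pool

    REGIONS = ["chr%d" % n for n in range(1, 23)] + ["chrX", "chrY"]

    def call_variants_on_region(region):
        # Placeholder for the real per-region work (alignment post-processing,
        # variant calling, ...); here it just returns a labelled dummy result.
        return (region, "variants_for_%s" % region)

    if __name__ == "__main__":
        with Pool(processes=8) as pool:           # scatter: one task per region
            partial_results = pool.map(call_variants_on_region, REGIONS)
        merged = dict(partial_results)            # gather: merge per-region output
        print(len(merged), "regions processed")

The hard part, as the questions below note, is making the chunks evenly sized so that no single worker dominates the wall-clock time.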
- Questions: difficult to split things into evenly sized chunks
  - what about splitting into many, much smaller chunks?
    - would be possible, but they don't have the resources to develop a more complex pipeline like that
  - How much more optimisation does he think is possible?
    - thinks he's reached the limit with current resources and software limitations
  - How fast does it need to be in the clinic?
    - ideally within a day
  - What about starting analysis while sequencing is ongoing?
    - for most parts of this pipeline (e.g. variant calling), need to see the entire dataset

Marmal-aid: a tool for genomics processing
Speaker: Dr. Rob Lowe (QMUL)
-----------------------------------
- Epigenetics - methylation data
- Relatively new field, so datasets are growing very rapidly
- Many different places generating these datasets, not a single source of data
- Public repositories now, stored as flat text files
  - hard to compare from one experiment to another - requires a lot of scripting
- Taken data from these repositories and put it into a standardised format, using R
  - efficient, cross-platform binary format
- Consistency is a problem with such a variety of data sources
- r.lowe@qmul.ac.uk - http://marmal-aid.org
- Questions: don't worry about filesystems too much because the imputation code is very CPU-bound
  - Are people happy with R?
    - Very easy to build a tool quickly - well built for bioinformatics work - really efficient

Astronomy toolkits and data structures
Speaker: Dr. Adrian Jenkins (Durham University)
------------------------------------------------------
- database meeting: Millennium Workshop 2012 - http://galformod.mpa-garching.mpg.de/portal/workshop2012/
- Questions:
  - Any simulation is only as good as the underlying data - how do they check the validity of their data?
    - Have very good measurements of the microwave background
    - Need phenomenological models when working with data below the resolution of their instruments
  - Visualization: in HEP it is seen as being for the public - is it more important for cosmologists?
    - used for debugging, but generally working with maths/equations
    - more accessible visually to the public
  - Is the centralisation of DiRAC needed, or can it be more distributed, cf. WLCG?
    - the simulations themselves need to run on tightly-coupled systems
    - once that data is produced, it could be analysed elsewhere
  - What about SQL queries?
    - Just using single servers running MySQL at the moment

Discussion
----------
- Does anyone feel a need to change their technology and/or re-use technology from a different field?
- Earth observation archival - how do they change ways of thinking, formats and tools to enable parallelisation?
- How many people will be using Hadoop in the next few years?
  - a small number
  - may not be Hadoop, but will be massively-parallel, high-throughput stuff (Earth observation representative), because the original codes are not easily susceptible to parallelisation
- Proofs-of-concept, re-using others' work (Jens), then disseminating
  - would be good to provide example applications, for re-use by different communities
- David: architectures used to analyse data will change quite a lot over the next few years, and the ecosystem will become more varied - how well will codes be ported to these architectures?
- Rob Lowe: genomics pipelines - the new aspect will be adding things to the sequence, e.g. clinical data, epigenetics, other experiments (cell experiments) - how to store all this information in a framework or toolkit, to pull in information from different domains easily and process it easily
- Ewan: Hadoop may be better because it can do something for everyone, while not being perfect for anyone (a minimal sketch of the MapReduce model follows below)
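Relating to the Hadoop discussion above (and Steve Loughran's talk in the morning), a minimal sketch of what a MapReduce job looks like when written as a Hadoop Streaming mapper and reducer in Python. The word-count task, file names and invocation shown in the comment are purely illustrative.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming word count (illustrative). Run roughly as:
    #   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
    #       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word)              # emit key<TAB>count

    def reducer():
        current, total = None, 0
        for line in sys.stdin:                     # input arrives sorted by key
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

The framework handles splitting the input blocks, scheduling the mappers near the data, and sorting/shuffling keys to the reducers; the user code only sees lines on stdin/stdout.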
Data Storage - Hardware
-----------------------

Hardware for big data: lessons learned
Speaker: Marcel van Drunen (Dell)
-----------------------------------------

Hadoop hardware sizing - lessons learnt
---------------------------------------
- distinction between two sets of users: academics vs. Facebook-style
- can't really re-purpose clusters designed for HPC for e.g. Hadoop - don't try to combine the two
- Hadoop is targeted at multi-petabyte workloads, so you don't want to be bothered about single-node failures; therefore have lots of fairly small nodes
- Hadoop not suited to lots of small files; prefer to aggregate into larger files
- Questions: how to network racks together?
  - determining factor is the input data stream - how fast is it? - and the outgoing data stream, if there is one
  - hasn't encountered a Hadoop-style customer not satisfied with 1 GigE or 2 x 1 GigE connections to individual nodes
  - How you use the data is another determinant of bandwidth - e.g. would need more if serving data to the public

Data Evolution 101 - James Coomer
----------------------------------
- access methods to object stores tend to be diverse, e.g. from many devices, APIs and methods
- Erasure coding - not confined to particular RAID sets, with their concomitant problems

Shaun De Witt
-------------
- tests using the RESTful interface ("Modified test" slide)
- impressed by the standards-based RESTful interface (in contrast to the talk yesterday arguing that standards hold things back)
- Questions: 10 object creations a second sounds low
  - thousands would be normal (may have been fixed by a firmware update)
- Dirk Duellmann: should people who are trying to reimplement a hierarchical namespace use it (even if you don't try to do that)?
  - hand over the inodes/metadata problem to the users
  - DDN uses the Riak database for metadata management
  - UNIX tool approach - does a single thing well
- iRODS driver?
  - both trying to do worldwide replication, so need to change the driver to allow WOS to handle it
  - this new driver should be released soon (composable driver) - waiting for legal niceties

Afternoon session
-----------------

Network Developments - Paul Lewis - JANET
-----------------------------------------
- e.g. has private peering to Google, so a Google query from a UK university wouldn't touch the internet
- Lightpath network is effectively a standalone network - goes through research and education networks, not the internet
- significant power consumption - e.g. T4000 equipment requires 17 kVA
- roughly 40 Gbit/s into Google and Geant (peering) - for scale, see the transfer-time estimate below
- using e-Infrastructure funding to "build for the future"
- High Throughput Networking SIG will be run for at least 3 years
- need local expertise to help achieve the thought-experiment speeds
- Questions:
  - Ewan: peering with Amazon Cloud, and would this change transfer pricing?
  - JANET has historically been a black box as far as sites are concerned - an end-to-end perspective is needed
  - UK is not network-limited (Pete Clarke)
  - Pete Clarke reiterated the Amazon transfer pricing question
  - Cloud brokerage
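To get a rough feel for the bandwidth figures quoted above (1 GigE per node in the Dell talk, ~40 Gbit/s peering links in the JANET talk), a back-of-the-envelope transfer-time calculation, ignoring protocol overheads and assuming the link can actually be filled; the 100 TB volume is just an example.

    # Idealised transfer-time estimate: volume / bandwidth, no overheads.
    def transfer_time_hours(volume_tb, link_gbit_per_s, efficiency=1.0):
        bits = volume_tb * 1e12 * 8                   # decimal TB -> bits
        seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
        return seconds / 3600.0

    for link in (1, 10, 40):                          # Gbit/s
        print("100 TB over %2d Gbit/s: %6.1f hours"
              % (link, transfer_time_hours(100, link)))

That is roughly nine days at 1 Gbit/s versus about five and a half hours at 40 Gbit/s, which is why the rate of the incoming and outgoing data streams is the sizing question.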
The Evolution of the FTS File Transfer Service Tool
Speaker: Michail Salichos (CERN)
----------------------------------------
- a transfer is not only the copying - checking files at both ends, comparing sizes and checksums, many integrity steps (a small verification sketch appears after the Discussion below)
- Questions:
  - David: found debugging FTS issues rather difficult
  - Would such a service be useful in other communities? When will it be available?
    - expect it by the end of 2013
  - Do they keep metrics of the endpoints?
    - stored persistently in the database - keep historical data to be able to decide whether to initiate a transfer or not
  - perfSONAR metrics?
    - perfSONAR information, to be useful for FTS3, would have to be real-time
    - the existing FTS metrics have proven very useful in network troubleshooting

Federated Data Stores - Volume, Velocity & Variety
Speaker: Andrew Hanushevsky (Stanford Linear Accelerator Center)
--------------------------------------------------------
- Increasing data volume inevitably means more sites
- In the academic world, where data is placed (and why it is placed there) is often determined by politics and economics
- Volume: can carve the tree into subtrees however you want
- Not looking at the location of a file, but at how to get to a file, which is a very different model
- Administrative autonomy is key for academic environments
- Questions:
  - Climate scientists had problems even agreeing a hierarchy
  - CMS example: a dynamic way of traversing the hierarchy, based on monitoring performance

EUDAT technology choices
Speaker: Mark van de Sanden (SURFsara)
----------------------------------------------
- Used iRODS for replication because it's policy-based
- Questions:
  - Pete Clarke: Whom is this aimed at? Is it only for PRACE?
    - No. E.g. emerging communities
  - Is in fact re-using existing software: Globus Online, iRODS
  - They tried FTS and it didn't fit for them
  - Problems caused by the lack of support for persistent identifiers in existing projects - another reason for EUDAT

Discussion
----------
- Software Defined Networks - what effects will they have?
- In HEP, the limitations are now the storage element, not the network
- Much more network-centric view of things now
- David: federated data is a way of giving ubiquitous access to data, e.g. from a laptop
- Richard Mount: what worries him for the future is not the functionality of the system, but rather how we maintain and operate systems for 20 years - they run around the clock. Should we be trying to foist these things on other communities that don't have equivalent resources, if such monitoring is needed to keep the system reliable?
- David: each community has to decide which parts of other people's tools are useful for their own needs - need to honestly communicate the pros and cons of different approaches to each other
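Following up the FTS point above that a transfer is more than a copy: a minimal sketch of the kind of post-transfer verification involved, comparing file size and checksum at source and destination. Local paths and MD5 are used here purely for illustration; the real service works against remote storage endpoints rather than local files.

    # Illustrative post-transfer verification: compare size and checksum of the
    # source and destination copies.
    import hashlib, os

    def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_transfer(src, dst):
        if os.path.getsize(src) != os.path.getsize(dst):
            return False, "size mismatch"
        if file_checksum(src) != file_checksum(dst):
            return False, "checksum mismatch"
        return True, "ok"

    # Hypothetical paths, for illustration only.
    ok, reason = verify_transfer("/data/source/file.root", "/data/dest/file.root")
    print(ok, reason)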
Data Management: meta-data, data discovery and preservation
Convener: Richard Bantges
--------------------------------
- 50% of the data in the ECMWF archive is only 18 months old
- BADC is one of the main sources of data

Environmental Data Archival
Speaker: Stephen Pascoe (CEDA)
------------------------------

Earth System Science: Adapting metadata and search to the Big Data era
----------------------------------------------------------------------
- represents users and the archival community more than technologists (to an extent)
- rethinking current approaches, now that they need to deal with Big Data
- many users haven't tried to analyse >1 PB datasets yet, partly because they don't realise they would be able to
- BADC catalogue diagram shows that some datasets are well curated, others less so - reflects the sociology of how the data is produced, more than anything else
- migrating from XML documents to a Django/SQL-based database
- Operating at the intersection of Big Data and Long Tail Data
- CMIP5: tried to use existing off-the-shelf technologies but ended up not using them
- commodity search solution, working across the whole federation (SOLR)
- federated search was a big step forward for their users
- Gone a long way towards solving the top-down and bottom-up data problems (in CMIP5)
- Thinks use metadata should drive applications in the future, not context metadata
- Questions:
  - what are they using for metadata search?
    - would like to evaluate CKAN; currently a bespoke system
  - How will users gain access to data?
    - intention is to allow remote mounting of the filesystems - group workspaces
    - considering building an API layer, e.g. so users can put their own interface on top, using their VMs, using the speed of the parallel filesystem
    - access control is a problem - wary of users with root on their VMs

The Application of Raimes' Rules to Long-Term Data Preservation
Speaker: Dr. Jamie Shiers (CERN)
---------------------------------------------------------------
- Stan Raimes differential - Google search
- there will always be some bit-level errors, though you can mitigate this
- he doesn't think EUDAT solves all the listed problems
- should pursue links with the Research Data Alliance
- important to build long-term data preservation requirements into future work
- need metrics to be able to understand how well we're doing (data preservation maturity model)
- Questions:
  - David: plans for making data public are quite far advanced
  - End-user code needs to be retained as well, e.g. a CMS VM with a golden version of the analysis code
    - policy that analysis code is kept in a repository
  - Richard Mount: wary of releasing data to the public - will this create a large support burden? wary of a backlash? Outsider input to decide whether a data release is actually useful for the public?
    - it's going to happen anyway, so we should prepare for it
  - Making data intelligible for people
    - there should be control over data formats, to mitigate this problem

Data management 2: Open access and preservation
Convener: Jens Jensen (CLRC-RAL)
-----------------------------------------------

DOIs for tracking data
Speaker: Mr. Matthew James Viljoen (STFC)
-----------------------------------------
- Questions:
  - Kati: do we need DOIs for the CMS release of 2010 data?
  - Ewan: cultural impact of DOIs (creates an expectation of data curation as part of looking after data)
    - couldn't that be achieved by using URLs?
    - in fact, DOIs are stable, and they can point to a URL
  - Steve Loughran: could I use a DOI in a program, i.e. it could use the DOI to pick up a specific dataset?
    - no reason why not (see the sketch below)
  - Kati: what size? - huge range - MBs to GBs
  - Jens: what about payment for DOIs?
    - cost might be prohibitive for millions of datasets
    - the cost of maintaining the data objects for the long term is much higher
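On Steve Loughran's question above about using a DOI from a program: a DOI can be resolved mechanically through the doi.org resolver, so a script can turn it into the landing URL for the dataset. A minimal sketch; the DOI shown is a placeholder, and what machine-readable metadata is available behind the redirect depends on the registration agency.

    # Resolve a DOI programmatically via the doi.org proxy (illustrative).
    import urllib.request

    def resolve_doi(doi):
        req = urllib.request.Request("https://doi.org/" + doi, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            return resp.geturl()      # redirects are followed to the landing page

    print(resolve_doi("10.1234/example-dataset"))   # hypothetical DOI

Because the DOI stays stable while the URL behind it can change, this is what makes the "pick up a specific dataset from a program" use case workable.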
Digital preservation
Speaker: Jonathan Tilbury (Tessella)
-------------------------------------------
- National Archives' open source tool DROID to check bit-level integrity: http://www.nationalarchives.gov.uk/information-management/our-services/dc-file-profiling-tool.htm
- Roughly 30% of PDFs do not conform to the standard
- Questions:
  - Chris Walker: what happens if you go bust or are taken over by an evil corporation?
    - storage layer constructed so that you have backdoor access, if you're using local hardware
    - can create a storage adapter for data export from cloud storage
    - system is escrowed, e.g. to the National Archives
  - Ewan: how does it handle the file format issue?
    - identify files as they come in
    - use tools to characterise that content and extract metadata
    - go to the PRONOM database of file formats maintained by the National Archives: http://www.nationalarchives.gov.uk/PRONOM/Default.aspx
    - check handling of the file format, then migrate to a file format usable today, checking the integrity (validation)
    - RAL is using HDF5
  - Shaun de Witt:
    - Preservica tool, using S3 - a lot of data must stay within national borders
    - can restrict data to a particular region on Amazon

Final discussion
----------------