================================
Big Data Workshop June 27th 2013
================================

ROOT for Big Data Analysis - Fons Rademakers
--------------------------------------------
- important for data preservation that ROOT files and the format are self-describing
- Questions:
  - How easy is it to use ROOT in other science domains? How closely is it coupled to HEP?
  - How well do your data types map to classes in ROOT?
    - e.g. statistical mapping over time series
    - Astronomers are using it, some work at Imperial
  - How does the SQL interface compare in performance to native C++?
    - Uses vector acceleration of current CPUs (AVX2 etc.)
    - SQL people used that and performance was very good

Hadoop Data Processing - Steve Loughran - Hortonworks
-----------------------------------------------------
- People don't need to know the details of parcelling out jobs, cluster management etc.
- HDFS: break big files into blocks, replicate those blocks and run the work scheduler on those blocks
- Job scheduler tries to find the machines where the blocks are, to avoid network traffic and achieve maximum computational efficiency
- Pig: works backwards from the result to decide what you want, thereby discarding things in which you aren't interested
- Take stuff written for social media networks, steal it and repurpose it, e.g. Giraph, which has been used in Bristol for modelling the human heart
- Planned improvements:
  - treat MapReduce as a job you run on a Hadoop cluster, rather than a core part of the system; facilitates upgrades etc.
  - YARN
  - Pig/Hive on Tez - deals with iteration; cf. Dryad
  - Hamster: attempt to do MPI
- What should we do with these tools and how can we democratise it?
  - Not stuck with MapReduce
  - Steal other people's code, start with Pig
- Questions: what if users say "I want to run my tool on my data format, using Hadoop"?
  - Can deploy native code using YARN, if your application doesn't care where it starts
  - HBase works well because it uses ZooKeeper to keep track of where things are
  - Which stores?: HBase, Dremel, Impala (coming from Cloudera)

Contrast between big data processing in academia and industry - Simon Metson - Cloudant
----------------------------------------------------------------------------------------
- companies have much shorter timelines, so they accumulate technical debt by releasing a Minimum Viable Product quickly, then adding things later
- heavily using cloud providers, so they don't need to invest upfront in hardware
- in academia, more likely to define a question to be investigated, rather than just saying "we don't know what to do with our data"
- in industry, don't tend to run test jobs in the same way as academia - would have development/test, pre-production and production systems
- need to build a team, maintain it and specialise, rather than using a constant stream of students

Optimising bioinformatics pipelines for clinical genomics
Speaker: Dr. Michael Mueller (Imperial College)
------------------------------------------------------
http://www.imperial.ac.uk/clinicalgenomeinformatics
- data explosion in genomics in the last five years requires completely new approaches to data processing
- Imperial trying to create a similar genetic disease pipeline
- split the dataset before staging to improve performance
- massive performance improvements by using scatter-gather techniques (see the sketch below)
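A minimal sketch of the scatter-gather idea, not the Imperial pipeline itself: split the input into independent chunks (per-chromosome here, purely as an illustration), process the chunks in parallel, then merge the per-chunk results. The region list and call_variants_on_region function are hypothetical placeholders.

    # Toy scatter-gather skeleton (illustrative only, not the actual pipeline).
    # Assumes the work can be partitioned by genomic region, e.g. per chromosome.
    from multiprocessing import Pool

    REGIONS = ["chr%d" % n for n in range(1, 23)] + ["chrX", "chrY"]

    def call_variants_on_region(region):
        # Placeholder for the real per-region work (alignment post-processing,
        # variant calling, ...); here it just returns a labelled dummy result.
        return (region, "variants_for_%s" % region)

    if __name__ == "__main__":
        with Pool(processes=8) as pool:           # scatter: one task per region
            partial_results = pool.map(call_variants_on_region, REGIONS)
        merged = dict(partial_results)            # gather: merge per-region output
        print(len(merged), "regions processed")

The hard part, as the questions below note, is making the chunks evenly sized so that no single worker dominates the wall-clock time.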
- Questions: difficult to split things into evenly sized chunks
  - what about splitting into many, much smaller chunks?
    - would be possible, but they don't have the resources to develop a more complex pipeline like that
  - How much more optimisation does he think is possible?
    - thinks he's reached the limit with current resources and software limitations
  - How fast does it need to be in the clinic?
    - ideally within a day
  - What about starting analysis while sequencing is ongoing?
    - for most parts of this pipeline (e.g. variant calling), need to see the entire dataset

Marmal-aid: a tool for genomics processing
Speaker: Dr. Rob Lowe (QMUL)
-----------------------------------
- Epigenetics - methylation data
- Relatively new field, so datasets are growing very rapidly
- Many different places generating these datasets, not a single source of data
- Public repositories now, stored as flat text files
  - hard to compare from one experiment to another - requires a lot of scripting
- Taken data from these repositories and put it into a standardised format, using R
  - efficient, cross-platform binary format
- Consistency is a problem with such a variety of data sources
- r.lowe@qmul.ac.uk - http://marmal-aid.org
- Questions: don't worry about filesystems too much because the imputation code is very CPU-bound
  - Are people happy with R?
    - Very easy to build a tool quickly - well built for bioinformatics work - really efficient

Astronomy toolkits and data structures
Speaker: Dr. Adrian Jenkins (Durham University)
------------------------------------------------------
- database meeting: Millennium Workshop 2012 - http://galformod.mpa-garching.mpg.de/portal/workshop2012/
- Questions:
  - Any simulation is only as good as the underlying data - how do they check the validity of their data?
    - Have very good measurements of the microwave background
    - Need phenomenological models when working with data below the resolution of their instruments
  - Visualization: in HEP it is seen as being for the public - is it more important for cosmologists?
    - used for debugging, but generally working with maths/equations
    - more accessible visually to the public
  - Is the centralisation of DiRAC needed, or can it be more distributed, cf. WLCG?
    - the simulations themselves need to run on tightly-coupled systems
    - once that data is produced, it could be analysed elsewhere
  - What about SQL queries?
    - Just using single servers running MySQL at the moment

Discussion
----------
- Does anyone feel a need to change their technology and/or re-use technology from a different field?
- Earth observation archival - how do they change ways of thinking, formats and tools to enable parallelisation?
- How many people will be using Hadoop in the next few years?
  - a small number
  - may not be Hadoop, but will be massively-parallel, high-throughput stuff (Earth observation representative), because the original codes are not easily susceptible to parallelisation
- Proofs-of-concept, re-using others' work (Jens), then disseminating
  - would be good to provide example applications, for re-use by different communities
- David: architectures used to analyse data will change quite a lot over the next few years, and the ecosystem will become more varied - how well will codes be ported to these architectures?
- Rob Lowe: genomics pipelines - the new aspect will be adding things to the sequence, e.g. clinical data, epigenetics, other experiments (cell experiments) - how to store all this information in a framework or toolkit, to pull in information from different domains easily and process it easily
- Ewan: Hadoop may be better because it can do something for everyone, while not being perfect for anyone (a minimal sketch of the MapReduce model follows below)
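Relating to the Hadoop discussion above (and Steve Loughran's talk in the morning), a minimal sketch of what a MapReduce job looks like when written as a Hadoop Streaming mapper and reducer in Python. The word-count task, file names and invocation shown in the comment are purely illustrative.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming word count (illustrative). Run roughly as:
    #   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
    #       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word)              # emit key<TAB>count

    def reducer():
        current, total = None, 0
        for line in sys.stdin:                     # input arrives sorted by key
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

The framework handles splitting the input blocks, scheduling the mappers near the data, and sorting/shuffling keys to the reducers; the user code only sees lines on stdin/stdout.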
Data Storage - Hardware
-----------------------

Hardware for big data: lessons learned
Speaker: Marcel van Drunen (Dell)
-----------------------------------------

Hadoop hardware sizing - lessons learnt
---------------------------------------
- distinction between two sets of users: academics vs. Facebook-style
- can't really re-purpose clusters designed for HPC for e.g. Hadoop - don't try to combine the two
- Hadoop is targeted at multi-petabyte workloads, so you don't want to be bothered about single-node failures; therefore have lots of fairly small nodes
- Hadoop not suited to lots of small files; prefer to aggregate into larger files
- Questions: how to network racks together?
  - determining factor is the input data stream - how fast is it? - and the outgoing data stream, if there is one
  - hasn't encountered a Hadoop-style customer not satisfied with 1 GigE or 2 x 1 GigE connections to individual nodes
  - How you use the data is another determinant of bandwidth - e.g. would need more if serving data to the public

Data Evolution 101 - James Coomer
----------------------------------
- access methods to object stores tend to be diverse, e.g. from many devices, APIs and methods
- Erasure coding - not confined to particular RAID sets, with their concomitant problems

Shaun De Witt
-------------
- tests using the RESTful interface ("Modified test" slide)
- impressed by the standards-based RESTful interface (in contrast to the talk yesterday arguing that standards hold things back)
- Questions: 10 object creations a second sounds low
  - thousands would be normal (may have been fixed by a firmware update)
- Dirk Duellmann: should people who are trying to reimplement a hierarchical namespace use it (even if you don't try to do that)?
  - hand over the inodes/metadata problem to the users
  - DDN uses the Riak database for metadata management
  - UNIX tool approach - does a single thing well
- iRODS driver?
  - both trying to do worldwide replication, so need to change the driver to allow WOS to handle it
  - this new driver should be released soon (composable driver) - waiting for legal niceties

Afternoon session
-----------------

Network Developments - Paul Lewis - JANET
-----------------------------------------
- e.g. has private peering to Google, so a Google query from a UK university wouldn't touch the internet
- Lightpath network is effectively a standalone network - goes through research and education networks, not the internet
- significant power consumption - e.g. T4000 equipment requires 17 kVA
- roughly 40 Gbit/s into Google and Geant (peering) - for scale, see the transfer-time estimate below
- using e-Infrastructure funding to "build for the future"
- High Throughput Networking SIG will be run for at least 3 years
- need local expertise to help achieve the thought-experiment speeds
- Questions:
  - Ewan: peering with Amazon Cloud, and would this change transfer pricing?
  - JANET has historically been a black box as far as sites are concerned - an end-to-end perspective is needed
  - UK is not network-limited (Pete Clarke)
  - Pete Clarke reiterated the Amazon transfer pricing question
  - Cloud brokerage
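To get a rough feel for the bandwidth figures quoted above (1 GigE per node in the Dell talk, ~40 Gbit/s peering links in the JANET talk), a back-of-the-envelope transfer-time calculation, ignoring protocol overheads and assuming the link can actually be filled; the 100 TB volume is just an example.

    # Idealised transfer-time estimate: volume / bandwidth, no overheads.
    def transfer_time_hours(volume_tb, link_gbit_per_s, efficiency=1.0):
        bits = volume_tb * 1e12 * 8                   # decimal TB -> bits
        seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
        return seconds / 3600.0

    for link in (1, 10, 40):                          # Gbit/s
        print("100 TB over %2d Gbit/s: %6.1f hours"
              % (link, transfer_time_hours(100, link)))

That is roughly nine days at 1 Gbit/s versus about five and a half hours at 40 Gbit/s, which is why the rate of the incoming and outgoing data streams is the sizing question.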
The Evolution of the FTS File Transfer Service Tool
Speaker: Michail Salichos (CERN)
----------------------------------------
- a transfer is not only the copying - checking files at both ends, comparing sizes and checksums, many integrity steps (a small verification sketch appears after the Discussion below)
- Questions:
  - David: found debugging FTS issues rather difficult
  - Would such a service be useful in other communities? When will it be available?
    - expect it by the end of 2013
  - Do they keep metrics of the endpoints?
    - stored persistently in the database - keep historical data to be able to decide whether to initiate a transfer or not
  - perfSONAR metrics?
    - perfSONAR information, to be useful for FTS3, would have to be real-time
    - the existing FTS metrics have proven very useful in network troubleshooting

Federated Data Stores - Volume, Velocity & Variety
Speaker: Andrew Hanushevsky (Stanford Linear Accelerator Center)
--------------------------------------------------------
- Increasing data volume inevitably means more sites
- In the academic world, where data is placed (and why it is placed there) is often determined by politics and economics
- Volume: can carve the tree into subtrees however you want
- Not looking at the location of a file, but at how to get to a file, which is a very different model
- Administrative autonomy is key for academic environments
- Questions:
  - Climate scientists had problems even agreeing a hierarchy
  - CMS example: a dynamic way of traversing the hierarchy, based on monitoring performance

EUDAT technology choices
Speaker: Mark van de Sanden (SURFsara)
----------------------------------------------
- Used iRODS for replication because it's policy-based
- Questions:
  - Pete Clarke: Whom is this aimed at? Is it only for PRACE?
    - No. E.g. emerging communities
  - Is in fact re-using existing software: Globus Online, iRODS
  - They tried FTS and it didn't fit for them
  - Problems caused by the lack of support for persistent identifiers in existing projects - another reason for EUDAT

Discussion
----------
- Software Defined Networks - what effects will they have?
- In HEP, the limitations are now the storage element, not the network
- Much more network-centric view of things now
- David: federated data is a way of giving ubiquitous access to data, e.g. from a laptop
- Richard Mount: what worries him for the future is not the functionality of the system, but rather how we maintain and operate systems for 20 years - they run around the clock. Should we be trying to foist these things on other communities that don't have equivalent resources, if such monitoring is needed to keep the system reliable?
- David: each community has to decide which parts of other people's tools are useful for their own needs - need to honestly communicate the pros and cons of different approaches to each other
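Following up the FTS point above that a transfer is more than a copy: a minimal sketch of the kind of post-transfer verification involved, comparing file size and checksum at source and destination. Local paths and MD5 are used here purely for illustration; the real service works against remote storage endpoints rather than local files.

    # Illustrative post-transfer verification: compare size and checksum of the
    # source and destination copies.
    import hashlib, os

    def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_transfer(src, dst):
        if os.path.getsize(src) != os.path.getsize(dst):
            return False, "size mismatch"
        if file_checksum(src) != file_checksum(dst):
            return False, "checksum mismatch"
        return True, "ok"

    # Hypothetical paths, for illustration only.
    ok, reason = verify_transfer("/data/source/file.root", "/data/dest/file.root")
    print(ok, reason)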
Data Management: meta-data, data discovery and preservation
Convener: Richard Bantges
--------------------------------
- 50% of the data in the ECMWF archive is only 18 months old
- BADC is one of the main sources of data

Environmental Data Archival
Speaker: Stephen Pascoe (CEDA)
------------------------------

Earth System Science: Adapting metadata and search to the Big Data era
----------------------------------------------------------------------
- represents users and the archival community more than technologists (to an extent)
- rethinking current approaches, now that they need to deal with Big Data
- many users haven't tried to analyse >1 PB datasets yet, partly because they don't realise they would be able to
- BADC catalogue diagram shows that some datasets are well curated, others less so - reflects the sociology of how the data is produced, more than anything else
- migrating from XML documents to a Django/SQL-based database
- Operating at the intersection of Big Data and Long Tail Data
- CMIP5: tried to use existing off-the-shelf technologies but ended up not using them
- commodity search solution, working across the whole federation (SOLR)
- federated search was a big step forward for their users
- Gone a long way towards solving the top-down and bottom-up data problems (in CMIP5)
- Thinks use metadata should drive applications in the future, not context metadata
- Questions:
  - what are they using for metadata search?
    - would like to evaluate CKAN; currently a bespoke system
  - How will users gain access to data?
    - intention is to allow remote mounting of the filesystems - group workspaces
    - considering building an API layer, e.g. so users can put their own interface on top, using their VMs, using the speed of the parallel filesystem
    - access control is a problem - wary of users with root on their VMs

The Application of Raimes' Rules to Long-Term Data Preservation
Speaker: Dr. Jamie Shiers (CERN)
---------------------------------------------------------------
- Stan Raimes differential - Google search
- there will always be some bit-level errors, though you can mitigate this
- he doesn't think EUDAT solves all the listed problems
- should pursue links with the Research Data Alliance
- important to build long-term data preservation requirements into future work
- need metrics to be able to understand how well we're doing (data preservation maturity model)
- Questions:
  - David: plans for making data public are quite far advanced
  - End-user code needs to be retained as well, e.g. a CMS VM with a golden version of the analysis code
    - policy that analysis code is kept in a repository
  - Richard Mount: wary of releasing data to the public - will this create a large support burden? wary of a backlash? Outsider input to decide whether a data release is actually useful for the public?
    - it's going to happen anyway, so we should prepare for it
  - Making data intelligible for people
    - there should be control over data formats, to mitigate this problem

Data management 2: Open access and preservation
Convener: Jens Jensen (CLRC-RAL)
-----------------------------------------------

DOIs for tracking data
Speaker: Mr. Matthew James Viljoen (STFC)
-----------------------------------------
- Questions:
  - Kati: do we need DOIs for the CMS release of 2010 data?
  - Ewan: cultural impact of DOIs (creates an expectation of data curation as part of looking after data)
    - couldn't that be achieved by using URLs?
    - in fact, DOIs are stable, and they can point to a URL
  - Steve Loughran: could I use a DOI in a program, i.e. it could use the DOI to pick up a specific dataset?
    - no reason why not (see the sketch below)
  - Kati: what size? - huge range - MBs to GBs
  - Jens: what about payment for DOIs?
    - cost might be prohibitive for millions of datasets
    - the cost of maintaining the data objects for the long term is much higher
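On Steve Loughran's question above about using a DOI from a program: a DOI can be resolved mechanically through the doi.org resolver, so a script can turn it into the landing URL for the dataset. A minimal sketch; the DOI shown is a placeholder, and what machine-readable metadata is available behind the redirect depends on the registration agency.

    # Resolve a DOI programmatically via the doi.org proxy (illustrative).
    import urllib.request

    def resolve_doi(doi):
        req = urllib.request.Request("https://doi.org/" + doi, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            return resp.geturl()      # redirects are followed to the landing page

    print(resolve_doi("10.1234/example-dataset"))   # hypothetical DOI

Because the DOI stays stable while the URL behind it can change, this is what makes the "pick up a specific dataset from a program" use case workable.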
Digital preservation
Speaker: Jonathan Tilbury (Tessella)
-------------------------------------------
- National Archives' open source tool DROID to check bit-level integrity: http://www.nationalarchives.gov.uk/information-management/our-services/dc-file-profiling-tool.htm
- Roughly 30% of PDFs do not conform to the standard
- Questions:
  - Chris Walker: what happens if you go bust or are taken over by an evil corporation?
    - storage layer constructed so that you have backdoor access, if you're using local hardware
    - can create a storage adapter for data export from cloud storage
    - system is escrowed, e.g. to the National Archives
  - Ewan: how does it handle the file format issue?
    - identify files as they come in
    - use tools to characterise that content and extract metadata
    - go to the PRONOM database of file formats maintained by the National Archives: http://www.nationalarchives.gov.uk/PRONOM/Default.aspx
    - check handling of the file format, then migrate to a file format usable today, checking the integrity (validation)
    - RAL is using HDF5
  - Shaun de Witt:
    - Preservica tool, using S3 - a lot of data must stay within national borders
    - can restrict data to a particular region on Amazon

Final discussion
----------------