=============================================
Big Data Workshop Imperial College 2013-06-27
=============================================

David's talk
------------

Richard Mount
-------------

High Energy Physics, including the LHC
Speaker: Dr. Richard Philip Mount (SLAC National Accelerator Laboratory, US)

- There is so much data because the universe isn't deterministic
- Growing confidence in the Higgs discovery as more data is used over time and LHC performance improves
- No general access to raw data - several reconstruction steps are needed to make the data usable by thousands of physicists
- Optimise use of technologies to do the science
- Graph about real physics output
- Disk access rate is falling behind over time, because disks are getting bigger while access performance is not improving to match
  - compared with CPU improvements
- Becoming more reliant on networks to overcome disk access limitations
  - reaching the point where network providers will start asking for money because of the amount of network capacity being used

The SKA - the world's largest big-data project - Paul Calleja
--------------------------------------------------------------

- How much data processing can we afford to do over time? Similar philosophy to HEP
- SKA aims to push the boundary of radio astronomy - 6 orders of magnitude faster
- An IT project; the astronomy is an excuse to buy lots of computers
- N-squared problem with the number of antennae (a real-time problem); see the sketch after this section
- Gridding and FFT, to handle signals arriving at receivers in different locations
- Continent-sized networks that must be able to deal with these challenging data rates
  - a large proportion of the budget is going on new networks
- Most of the compute is at the experiment sites, 10% at sites around the world
- There will be a data explosion when students get their hands on the data
- SKA 1 needs to be operational in 2019, so orders will be placed in 2018 - it will be using 2018 technology
- Large FFTs need to reference earlier data, so a large buffer store (12 hours of observation) is needed: a 135 PB observation buffer
- Will be a leading Top 500 machine at that time (2018)
- 300 PB persistent data store
- SKA 2 is scheduled for 2023; not practical in Moore's-law terms as of today
- Astronomers are worried about software complexity, which dwarfs the hardware problems
  - the costs are in the software, which isn't scalable - would need all the software developers in the world
- Exascale computing in the desert is a problem (compared with normal, well-established laboratories)
- Even extrapolating squashed designs, with power-efficiency gains, would mean 800 cabinets and 30 megawatts
- Problems are soluble if they collaborate with the wider community; don't reinvent things
- System management software is a big problem at exascale
- Radio astronomy development is driving supercomputer development now, as it did with EDSAC
- Question from Shaun de Witt about distributing data around the world, as CERN does
  - Paul: there are work packages to deal with this, distributing data to the edge
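A minimal back-of-envelope sketch of two of the figures above: the O(N^2) correlation load (antenna pairs grow as N(N-1)/2) and the sustained ingest rate implied by a 135 PB buffer holding 12 hours of observation. The antenna counts are placeholders for illustration only, not actual SKA numbers::

    # Illustrative only: the antenna counts below are placeholders, not SKA figures.
    # The 135 PB / 12-hour buffer is the number quoted in the talk.

    def baselines(n_antennas):
        """Antenna pairs a correlator must handle: n(n-1)/2, i.e. O(n^2)."""
        return n_antennas * (n_antennas - 1) // 2

    for n in (100, 1000, 3000):                      # placeholder antenna counts
        print(f"{n:5d} antennas -> {baselines(n):>12,} baselines")

    # Sustained ingest rate implied by the 12-hour observation buffer.
    buffer_bytes = 135e15                            # 135 PB (decimal PB assumed)
    window_s = 12 * 3600                             # 12-hour observation window
    print(f"buffer ingest rate ~ {buffer_bytes / window_s / 1e12:.1f} TB/s")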
Cloud Computing & Data-Intensive Research - Kenji Takeda
---------------------------------------------------------

- Works in Tony Hey's group
- Worldwide Telescope - going from a tiled mosaic to a seamless image
- Connecting directly to JANET, to avoid congestion/delays over the public internet
- Working with Bath and the DCC on data-intensive research
- Community Capability Model for data-intensive research
  - a tool to go from symptoms to diagnosis to action
  - based on the Cornell three-stool model
  - http://communitymodel.sharepoint.com
- Question from Paul Lewis (JANET) about peering arrangements to the Dublin data centre

Big Data and the Earth Observation and Climate Modelling Communities - Philip Kershaw
---------------------------------------------------------------------------------------

- Difficulties in software development (cf. SKA): software development always lags behind hardware, always a problem
- Some data needs to be kept but isn't retrieved very often
- Emphasis on storage, not compute, and on making it available to a range of different communities
- Question of whether to choose full cloud, just virtualisation, or bare-metal compute for different use-cases
- Earth observation: radical performance improvements when code was parallelised
- The expectation of a self-service cloud environment may be misleading - very important to provide good user documentation
- A researcher said "this is a game-changer for us"
- Question from Ewan MacMahon about GridFTP - did they use anything to manage it, e.g. FTS?
  - it was mainly ad hoc; they sometimes use Globus Online; better management is needed

Big Data needs at ECMWF - Baudouin Raoult
-----------------------------------------

- An operational centre - some of their needs are more time-critical than e.g. SKA/LHC
- 16 km 2-D grid over the globe, at 91 levels
- Example of the 1987 storm, which was in the data but not captured by the main model
- Data volume is tied 1:1 to the power of their compute capability
- Roughly 60% growth per annum; see the growth sketch after this section
- If data size grows exponentially, data input per day also grows exponentially
- Cannot keep up with the size of the tape media library; it is growing too quickly
- Architecture designed so files can be moved around without having to alter the metadata (slide 16)
- Can migrate data without stopping the service (slide 18)
- Created huge files to improve manageability of the system (slide 19), by reducing the number of files
- Produce 4 TB per cycle, twice a day (slide 22)
- Creative solutions will be needed to cope with new use-cases and services
- Question from Wahid: what technology is used to store data in the index?
  - an object-oriented database, built specifically for them, not general-purpose
  - the company they were using went bust, which caused a problem
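A rough sketch of what the figures above imply, using only the numbers quoted in the talk (roughly 4 TB per cycle, two cycles a day, about 60% growth per annum); the ten-year horizon is arbitrary and purely illustrative::

    # Projection using the quoted figures: ~4 TB per cycle, 2 cycles/day,
    # ~60% growth per annum. The ten-year horizon is arbitrary, for illustration.

    daily_tb = 4 * 2          # TB written per day today
    growth = 1.60             # 60% growth per annum
    archive_pb = 0.0          # cumulative volume added over the projection

    for year in range(1, 11):
        archive_pb += daily_tb * 365 / 1000          # PB added during this year
        print(f"year {year:2d}: {daily_tb:9.1f} TB/day,"
              f" cumulative +{archive_pb:8.1f} PB")
        daily_tb *= growth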
PanData and the Research Data Alliance - Juan Bicarregui
---------------------------------------------------------

- A lot of policy activity in this area
- Sharing technologies across different projects/experiments
  - economy-of-scale benefits for technologies
  - cross-disciplinary benefits: users' expertise in one field can move to another easily
- Open Science agenda:
  - separate infrastructure from the process of science
  - researchers don't want to have to think about the location of data, how it's curated, etc.
- ICAT: dotted lines where they lose central control of the data (users running their own analyses etc.); the main focus of current work
  - tracking provenance
  - keeping track of which software is applicable to which data (cataloguing)
- Tomographic reconstruction: compute takes an order of magnitude longer than the scanning
  - want to reduce this to a time comparable with scanning (using a GPU cluster)
  - would then do more work, e.g. see the results of a scan before doing the next one, to allow tuning of the scan
  - the number of scans might not change, but the quality would improve
- Research Data Alliance
  - metaphor with bridges: now use minimal vertical structures to support the horizontal structures

Economic and Social Science Research Data Landscape - Fiona Armstrong
----------------------------------------------------------------------

- Traditional social science is survey data
- Expect a seismic shift in how social science is done, reusing existing data
  - data not collected for social science purposes but which can be used for them
- Admin data: information collected for the purposes of administration, e.g. tax records
  - privacy concerns
  - pods in which research using these datasets is carried out, placed at various universities
- Lots of interest in this community in research methods - NCRM
- CLOSER: potentially a 100-year project, 100,000 babies
- Lots of value in talking across different disciplines
- fiona.armstrong@esrc.ac.uk
- Question from Shaun de Witt: asking particle physicists to talk to anthropologists
  - referred to the EUDAT project, which is aiming to do this; workshop in September, 25 different disciplines

Afternoon session
-----------------

Bioinformatics - Guy Coates
---------------------------

- Cost of sequencing is decreasing faster than Moore's law
- Currently doing 10 terabases a week; 17,000x more than 10 years ago, but they don't have 17,000x the budget
- USB-attachable sequencers, i.e. people can just turn up and try to use them
- Strong emphasis on metadata for files, so people can find the things they need
- A lot of data, rather than big data per se - big data is computing across all the data
- The field changes so quickly that time to development matters much more than computational efficiency
- Question: has he considered private clouds in existing infrastructure?
  - would help with regulatory problems
  - can it be done as reliably and cost-effectively as the public cloud providers?

ELIXIR project - Andrew Lyall
-----------------------------

- Moving to distributed data access, instead of centralised - can't keep growing at current rates
- The old usage model of downloading the dataset you want and processing it locally is breaking down
- Embassy Cloud is being tested now, to run compute at the data site
- Data delivery in London, data production at Cambridge (roughly)
- Many different modalities of usage as well as lots of different types of data
- Slide 22: the EU provides the orange, the national level provides the grey
- "Biology: the big challenges of big data", Nature 498, 255-260
- Questions: I/O problems - is it an access-rate or a streaming problem?
  - he is currently modelling these bottlenecks in software pipelines, e.g. IOPS
  - it looks like pipelines generate a comparatively large number of IOPS relative to CPU cycles

Big Data Requirements in Arts and Humanities - Andrew Prescott
--------------------------------------------------------------

- Engagement with big data is one of the drivers of transforming scholarly practices in the humanities
- Good engagement of digital artists with big data (possibly more than academics)
- New types of dialogue are needed to achieve the transformations that are required
- Questions: Jens - are there cultural barriers to these developments?
  - there have been commercial barriers, e.g. licensing
  - there is now enthusiasm about the possibilities of free use, but ignorance of how to do it
- David: curation of social media
- Andrew Lyall: ESFRI projects have been good for biology and may work for the humanities (cf. CLARIN)
  - BioMedBridges is a new EU project and there's an equivalent one for the social sciences

DDN GPFS - Vic Cornell
----------------------

HDFS - Steve Loughran - Hortonworks
-----------------------------------

- Organised for workflows of streaming and processing
  - relax some traditional filesystem constraints to achieve this
- Accept that failures are inevitable
  - operations teams are concerned about changes in the rates of failure, not individual failures per se
- Topology-aware filesystem, so it can ensure that file replicas are stored on different racks, different switches, different power supplies (given the correct failure domains); see the placement sketch after this section
- With the latest Intel CPU parts, checksums can be computed in a single CPU opcode, so they are very efficient
- Moving towards topology-aware applications
- A big driver is where hardware is going and where desktops are going
  - laptops with SSDs will affect the server world
  - less capacity, more bandwidth in servers by using laptop hard disks
  - this probably won't be possible in two years' time because all laptop disks will be SSDs
- Question from Guy Coates: people had problems with the shuffle phase when running scientific applications
  - the merge phase is network-heavy; can do in-machine shuffles
  - the best optimisation is to generate less data
- Question: what happens if a node dies?
  - missing blocks will be re-replicated
  - it's up to the local site whether to e.g. fix the disk or just ignore it for now
  - they care more about statistical failures than anything else - the equivalent of worrying about a single sector failing in a disk
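A toy sketch of the rack-aware replica placement idea mentioned above, in the spirit of HDFS's default policy (first replica on the writer's node, second on a node in a different rack, third on a different node in that second rack). This is not Hadoop code and the cluster layout is made up; it only illustrates why replicas end up in separate failure domains::

    # Toy illustration of rack-aware placement; not Hadoop code.
    import random

    def place_replicas(writer, nodes_by_rack):
        """Pick three (rack, node) locations for one block written from `writer`."""
        writer_rack, writer_node = writer
        first = (writer_rack, writer_node)                    # local to the writer

        other_racks = [r for r in nodes_by_rack if r != writer_rack]
        remote_rack = random.choice(other_racks)              # cross-rack copy
        second = (remote_rack, random.choice(nodes_by_rack[remote_rack]))

        remaining = [n for n in nodes_by_rack[remote_rack] if n != second[1]]
        third = (remote_rack, random.choice(remaining))       # same rack, new node
        return [first, second, third]

    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
    print(place_replicas(("rack1", "n1"), cluster))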
Lustre - John Swinburne - Intel
-------------------------------

Ceph and Big Data - Patrick McGarry, Inktank
--------------------------------------------

- Question: when will CephFS be production-ready?
- Question from Guy Coates: can I use different access methods at the same time?

CERN experiences with EOS, S3 and Ceph - Dirk Duellmann
-------------------------------------------------------

- Split the disk system off from the archive system, because they are somewhat conflicting sets of requirements
- RAIN: redundancy across nodes (cf. RAID across disks)
- Erasure encoding, to bring storage overhead down for the parts of the data that are not frequently accessed; see the overhead sketch at the end of these notes
- Apache Dynamo, Huawei system
- Doesn't think a single WAN system can be performant and reliable at the same time

Discussion
----------

- What about hotspotting of data?
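A short sketch of why erasure coding brings storage overhead down relative to plain replication, as mentioned in the EOS notes above; the (k, m) values are placeholders for illustration, not the actual EOS/Ceph configuration::

    # Extra bytes stored, as a percentage of the user data.
    def replication_overhead(copies):
        return (copies - 1) * 100.0      # e.g. 3 copies -> 200% overhead

    def erasure_overhead(k, m):
        return m / k * 100.0             # m parity fragments per k data fragments

    print(f"3x replication: {replication_overhead(3):5.0f}% overhead, survives 2 losses")
    for k, m in ((4, 2), (10, 4)):       # placeholder (k, m) choices
        print(f"EC({k:2d},{m}):      {erasure_overhead(k, m):5.0f}% overhead,"
              f" survives {m} losses")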