Extremely Large Database (XLDB) Europe 2013 continues the series of invitational workshops started in 2007 with a satellite event at CERN.
The two-day workshop will bring together key experts from data-intensive sciences in Europe and beyond for a discussion of real-world petabyte-scale data management and analysis use cases, with the aim of fostering exchange among the many emerging "big data" communities.
Borut Paul Kersevan (Jozef Stefan Institute (SI))
Borut Paul Kersevan (Jozef Stefan Institute (SI)), Chris Roderick (CERN), Fons Rademakers (CERN), Ian Bird (CERN), Ian Fisk (Fermi National Accelerator Lab. (US)), Dr Markus Schulz (CERN), Philippe Charpentier (CERN), Predrag Buncic (CERN), Wahid Bhimji (University of Edinburgh (GB))
Coffee, Tea & Demos
60/6-015 - Room Georges Charpak (Room F)
Data management challenges of in-silico neuroscience
(EPFL - Blue Brain)
Data-driven Neuroscience: Enabling Breakthroughs Via Innovative Data Management
Scientists in all disciplines increasingly rely on simulations to develop a better understanding of the subject they are studying. For example, the neuroscientists we collaborate with in the Blue Brain project have started to simulate the brain on a supercomputer. The level of detail of their models is unprecedented, as they model details on the subcellular level (e.g., the neurotransmitter). This level of detail, however, also leads to a true data deluge, and the neuroscientists have only a few tools to analyze the data efficiently.
This demonstration showcases three innovative spatial management solutions that have substantial impact on computational neuroscience and other disciplines in that they make it possible to build, analyze and simulate bigger and more detailed models. In particular, we visualize the novel query execution strategy of FLAT, an index for the scalable and efficient execution of range queries on increasingly detailed spatial models. FLAT is used to build and analyze models of the brain. We furthermore demonstrate how SCOUT uses previous query results to prefetch spatial data with high accuracy and therefore speeds up the analysis of spatial models. Finally, we demonstrate TOUCH, a novel in-memory spatial join that speeds up the model building process.
Genome, Genomics and Metagenomics : Sequencing the Universe
With advances in DNA sequencing technology, life and medical scientists are generating data faster than they can analyse them. Small universities and groups now face the problem that large sequencing centres faced 15 years ago. This requires a major paradigm shift in the way we represent, mine and analyse genomic (-omics) data. Although DNA is encoded with a code of merely four letters, the variability, diversity and complexity of its assemblage are far beyond our current understanding. We also need to rethink the view of a genome as one long string, and must now cope with variation at the level of a single individual (for a human genome), a bacterium (cultured) and the environment (metagenomics). The Vital-IT group of the SIB Swiss Institute of Bioinformatics is tackling these major challenges and provides both the competencies in bioinformatics and the infrastructure required to structure, curate, maintain and distribute data and knowledge for the life and medical science community.
Surfing the Tsunami: Biology data and EBI's infrastructure
(EMBL-EBI), Manuela Menchi
Lunch
CERN Restaurant 1
Earth Science - Big Data Analytics
60/6-015 - Room Georges Charpak (Room F)
This presentation will describe the current state of numerical weather prediction, along with the challenges and opportunities that result from the goal of producing even better weather forecasts for even longer periods. Factors affecting the accuracy of numerical weather prediction models include the density and quality of the observations used as input to the models, along with deficiencies in the models themselves. Extremely small errors in temperature, winds, or other initial inputs given to numerical weather prediction models will amplify and double after a couple of days. This makes it nearly impossible to seriously predict the state of the atmosphere for a period longer than two weeks. Furthermore, existing observation networks have poor coverage in some regions or over large bodies of water such as the Pacific Ocean, which introduces uncertainty into the true initial state of the atmosphere. Therefore, more and more sensor data from weather satellites are used for numerical weather prediction.
On the one hand, the increasing power of supercomputers makes it possible to feed numerical forecast models with more and more input data. On the other hand, this results in the production of more and more output data due to the higher spatial and temporal resolution of the models, longer forecast periods, and new algorithms and computation methods. For example, the ensemble forecasting method involves analyzing multiple forecasts created with an individual forecast model by using different physical parametrisations or varying initial conditions. This results in very large volumes of data.
The ICON (ICOsahedral Non-hydrostatic) model is one of DWD's numerical weather prediction models and currently runs 4 times a day. It computes a global forecast on a 20 km grid. The output size of one model run is greater than 135 GB. In 2015 the spatial resolution of this model will be changed from a 20 km grid to a 10 km grid. Hence, the model output will be 4 times bigger. The ICON model will then produce more than 2 TB of geospatial data per day. Whether this data can be used efficiently for the creation of new products depends on the existence of adequate tools and techniques for managing and analyzing big geospatial data.
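The daily volume quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that halving the horizontal grid spacing quadruples the number of grid cells while the number of vertical levels, output fields and time steps stays unchanged, and that 1 TB = 1000 GB. The function names are hypothetical.

```python
# Figures quoted in the abstract (both are stated there as lower bounds).
RUNS_PER_DAY = 4
OUTPUT_PER_RUN_GB = 135  # "greater than 135 GB" per run on the 20 km grid


def horizontal_scale_factor(old_km: float, new_km: float) -> float:
    """Growth in the number of grid cells when the horizontal spacing shrinks.

    Halving the spacing doubles the cell count along each of the two
    horizontal directions, hence the square.
    """
    return (old_km / new_km) ** 2


def daily_output_gb(per_run_gb: float, runs_per_day: int, scale: float = 1.0) -> float:
    """Daily output volume in GB for a given per-run size and scale factor."""
    return per_run_gb * runs_per_day * scale


factor = horizontal_scale_factor(20, 10)
today_gb = daily_output_gb(OUTPUT_PER_RUN_GB, RUNS_PER_DAY)
future_gb = daily_output_gb(OUTPUT_PER_RUN_GB, RUNS_PER_DAY, factor)

print(f"scale factor: {factor:.0f}x")
print(f"20 km grid: {today_gb:.0f} GB/day")
print(f"10 km grid: {future_gb / 1000:.2f} TB/day")
```

Under these assumptions, the lower bound of 135 GB per run gives 540 GB per day on the 20 km grid and 2.16 TB per day on the 10 km grid, which is consistent with the "more than 2 TB" stated in the abstract.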
Turning the Ship
[Getting] Big data from Planetary Science and Exploration and the EarthServer/PlanetServer approach
Planetary data deriving from multiple missions and space agencies are approaching the order of magnitude of their Earth Remote Sensing counterparts. By 2015 we estimate there will be over a petabyte of planetary data. Most, if not all, of those data are freely available to the community, although the availability of, or access to, the processing routines required to prepare the data for scientific analyses varies per instrument team or mission.
Not only is the volume of data extremely challenging; the complexity of the data also continues to require more complex processing methods. And while the raw or low-level data sets are normally archived using defined standards [e.g. 1,2,3], the availability of web services and server-based processing through client-based analysis could ease current and future scientific use.
The Planetary Service (PlanetServer) of EarthServer aims at enabling and easing planetary science orbital data exploitation by using the Open Geospatial Consortium Web Coverage Processing Service. Its content is Mars-centric, but a similar approach can be extended to other planetary bodies, in addition to platforms other than orbital ones.
Use cases are currently oriented towards surface imaging and remote compositional studies, but broader and more diverse use cases including surface, subsurface and atmospheric focus are being evaluated.
Recent, current and future non-orbital planetary exploration platforms, such as landers, rovers, unmanned aerial vehicles and balloons, are delivering and will deliver data even more complex to handle than satellite or airborne multi-dimensional imagery. While archives will evolve to match future challenges [e.g. 6], there will be a need for higher-level processed data, to be analyzed in large amounts and with limited time and human resources.
The planetary community and the larger scientific community can benefit from the wider availability of exploitation tools enabling (publishable) science through the use of public planetary mission data.
Figure 1: Logarithmic comparison of estimated total data volumes per mission for Mars (red) and the Moon (blue). Only a selected set of missions is shown.
[1] McMahon, S. K. (1996). Overview of the planetary data system. Planetary and Space Science, 44(1), 3-12.
[2] IPDA, International Planetary Data Alliance, accessed April 2013, http://planetarydata.org
[3] Oosthoek, J. H. P., Flahaut, J., Rossi, A. P., Baumann, P., Misev, D., Campalani, P., & Unnithan, V. (2013). PlanetServer: Towards Online Analysis of Integrated Planetary Data. Lunar Planet. Sci. XLIV, 1719, 2523.
[4] PlanetServer, Planetary Service of EarthServer, accessed April 2013, http://planetserver.eu
[5] EarthServer Project Portal, accessed April 2013, http://earthserver.eu
[6] Crichton, D. (2012). PDS4: Developing the Next Generation Planetary Data System. 2nd Planetary Data Workshop, Flagstaff, Arizona, USA, June 25-29, 2012.
Dr Angelo Pio Rossi
(Jacobs University Bremen)
Oceanography and Earth Observation
Coffee, Tea & Demos
60/6-015 - Room Georges Charpak (Room F)
A typical workflow for data analysis in R consists of the following steps: first load the raw data from file, then select and transform the raw data into a form suitable for statistics, and finally apply a statistical algorithm and visualization. However, the amount of data that can be analyzed using this process is limited by the amount of memory on the system on which R is run, which is typically a desktop computer. A logical next step to remedy this problem is to store the raw data in a relational database system. The standard process is then modified: the raw data is loaded into a database instead of into R. One can then "outsource" the selection of data relevant to the analysis, as well as basic calculations and aggregations, to a highly optimized database system.
R's database interface (DBI) provides a generic way of communicating with a relational database. Packages such as RPostgreSQL implement a specific driver for a particular database. However, not all relational databases are equally well suited to support statistical calculations. The need for transformation procedures and simple calculations makes a relational database optimized for "On-line analytical processing" (OLAP) a rather obvious recommendation. Furthermore, R's calculations on statistical observations are typically performed column-wise. Hence, only a fraction of the columns is actually processed at any given time. These factors together suggest a column-oriented database design. MonetDB, an open-source column-oriented database system, implements this design. We have created the MonetDB.R package, which implements a native DBI driver to connect R with MonetDB.
However, in order to tell the database which data is to be transferred to R, a user is still required to write queries in the standardized Structured Query Language (SQL), which breaks workflows and increases training requirements. We went one step further and implemented a virtual data object. This monet.frame object is designed to behave like a regular R data.frame, but does not actually load data from MonetDB unless absolutely required. For example, consider the following interaction: mean(subset(mf,c1 > 42)$c2). We select a subset of the mf object based on a filter condition on the c1 column. Then, we take the average of the c2 column. However, in this case the mf variable points to an instance of our virtual data object backed by a MonetDB table t1. Our implementation automatically generates and executes a SQL query: SELECT AVG(c2) FROM t1 WHERE (c1>42);. Instead of loading the potentially large table, we only transfer a single scalar value. Also, through the columnar storage layout of MonetDB, only the files that contain the data for columns c1 and c2 actually have to be accessed.
Our approach has two major advantages: users are not exposed to SQL queries at all, and only data relevant to the analysis are loaded into R, which results in huge performance improvements. monet.frame is part of MonetDB.R, and we invite all those interested to take part in its evolution.
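The idea of a virtual data object that turns a filter-then-aggregate interaction into a single SQL query can be sketched in a few lines. The following is not the MonetDB.R implementation: it is a minimal Python illustration in which the LazyFrame class and its methods are hypothetical, and Python's built-in sqlite3 module stands in for MonetDB.

```python
import sqlite3


class LazyFrame:
    """Hypothetical virtual data object: records a filter, loads no rows."""

    def __init__(self, conn, table, where=None):
        self.conn = conn
        self.table = table
        self.where = where  # SQL predicate accumulated so far, or None

    def subset(self, condition):
        """Return a new lazy frame with an extra filter; nothing is executed."""
        if self.where is None:
            combined = condition
        else:
            combined = f"({self.where}) AND ({condition})"
        return LazyFrame(self.conn, self.table, combined)

    def sql_for_mean(self, column):
        """Build the aggregate query that replaces loading the table."""
        # Plain string formatting is acceptable for a sketch only; real code
        # must validate identifiers and predicates before building SQL.
        query = f"SELECT AVG({column}) FROM {self.table}"
        if self.where is not None:
            query += f" WHERE ({self.where})"
        return query

    def mean(self, column):
        """Execute the generated query and transfer a single scalar."""
        return self.conn.execute(self.sql_for_mean(column)).fetchone()[0]


# In-memory stand-in for the database table t1 from the abstract.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (c1 INTEGER, c2 REAL)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(10, 1.0), (50, 2.0), (60, 4.0)])

mf = LazyFrame(conn, "t1")
filtered = mf.subset("c1 > 42")
print(filtered.sql_for_mean("c2"))  # SELECT AVG(c2) FROM t1 WHERE (c1 > 42)
print(filtered.mean("c2"))          # 3.0
```

Only the final call to mean() touches the database, and only one scalar crosses the interface, which mirrors the behaviour described for monet.frame.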
SciQL: Array Data Processing Inside an RDBMS
Scientific discoveries increasingly rely on the ability to efficiently grind massive amounts of experimental data using database technologies. To bridge the gap between the needs of the Data-Intensive Research fields and the current DBMS technologies, we have introduced SciQL (pronounced ‘cycle’). SciQL is the first SQL-based declarative query language for scientific applications with both tables and arrays as first-class citizens. It provides a seamless symbiosis of array, set and sequence interpretations. A key innovation is the extension of the value-based grouping of SQL:2003 with structural grouping, i.e., grouping array elements based on their positions. This leads to a generalisation of window-based query processing with wide applicability in science domains.
In this demo, we showcase a proof-of-concept implementation of SciQL in the relational database system MonetDB. First, with Conway’s Game of Life implemented purely in SciQL queries, we demonstrate the storage of arrays in MonetDB as first-class citizens and the execution of a comprehensive set of basic operations on arrays. Then, to show the usefulness of SciQL for real-world array data processing use cases, we demonstrate how various common image processing and remote sensing operations are executed as SciQL queries. The audience is invited to challenge SciQL with their use cases.
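To make the notion of structural grouping concrete, the sketch below computes one Game of Life generation by aggregating, for every cell, over the 3x3 window of positions around it. It is written in plain Python rather than SciQL and makes no claim about SciQL syntax; the function names and the choice of treating cells outside the array as dead are assumptions of this illustration.

```python
def neighbour_counts(grid):
    """For each cell, sum the live cells in the 3x3 window around it.

    The group of each cell is defined purely by array positions
    (row-1..row+1, col-1..col+1), not by cell values.
    """
    rows, cols = len(grid), len(grid[0])
    counts = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) == (0, 0):
                        continue  # the cell itself is not its own neighbour
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += grid[rr][cc]
            counts[r][c] = total
    return counts


def step(grid):
    """Apply the Game of Life rules to produce the next generation."""
    counts = neighbour_counts(grid)
    return [
        [1 if n == 3 or (cell == 1 and n == 2) else 0
         for cell, n in zip(row, count_row)]
        for row, count_row in zip(grid, counts)
    ]


# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = [
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
for row in step(blinker):
    print(row)
```

The per-cell window aggregate is the part that a positional GROUP BY would express declaratively; the rule applied afterwards is an ordinary value-based expression.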
Demo: The rasdaman Array Analytics Engine
Rasdaman ("raster data manager") is an Array DBMS supporting the model of large multi-dimensional arrays as well as declarative operations on them. Over its lifetime of more than 15 years, rasdaman has evolved, based on its Array Algebra, into a fully implemented system offering array query support, storage and processing optimization, parallel evaluation, and further Big Data enabling features.
Being the reference implementation for the OGC WCS geo service, it particularly supports spatio-temporal semantics as needed for large-scale geo services. Based on real-life data and use cases, we present the current state of rasdaman, which Rona Machlin at ACM SIGMOD has characterized as the "most comprehensively implemented system" of its kind. For many use cases, visual clients are available and can readily be used by Internet-connected participants.
Array Analytics - Concepts, Codes, and Challenges
(This talk is proposed for the Array Analytics session)
After a long period of neglect by database research, arrays are now recognized as a core data structure in science and engineering domains, and indeed as a main representative of Big Data there.
However, it is not only about array data and accessing them: today, requirements on server-side processing capabilities are high, often transcending the classical query language concepts. Therefore, Array Database research is not bound to traditional database conceptualizations and is tightly intertwining itself with related domains like image and signal processing, statistics, supercomputing, and visualization, thereby justifying the more general characterization of Array Analytics. This recognition has sparked research and implementations such as rasdaman, SciQL, SciDB, and PostGIS Raster.
In our talk, we give an overview of the field of Array Analytics from a database perspective. We address formalisms, discuss different conceptualizations like "array-as-table" and "array-as-attribute", exemplify array querying and optimization, and present architectural approaches to array storage and processing.
Real-life use cases illustrate relevance. Finally, standardization efforts on Array Analytics are inspected. In doing so, we spot open issues and research directions.
Lightning Talks
60/6-015 - Room Georges Charpak (Room F)
Big Data processing has become more and more of a commodity in recent years. Specialised systems for different purposes have emerged and proven to be of great benefit in their areas.
These range from key-value stores in all their different flavours (pure key-value, document, tablet ...) to graph databases, peta-scale storage systems, CEP systems and all of the above. Most of these systems solve the volume problem among the Big Data dimensions but fall short of delivering analytical results in real time. To make it even more complicated, these systems have emerged from open-source projects as well as from commercial providers. In this lightning talk we would like to present an outlook on what an integrated architecture could look like, one that allows for the integration of all these systems and at the same time adds the ability to integrate real-time analytical systems like ParStream.
Architectures for Massive Parallel Data Base Clusters providing Linear Scale-Out and Fault Tolerance on Commodity Hardware for OLTP Workloads
Apache Hadoop seems to be evolving into the de facto standard for large-scale OLAP processing. Even ignoring the overhead of threefold data block redundancy and of inter-node communication between the Shuffle and Reduce phases, Hadoop has significant drawbacks when it comes to random data access patterns and concurrency. Although Brewer's theorem implies that Consistency, Availability and Partition Tolerance can never all be achieved at the same time in a Distributed Information System, this talk shows how to build an Information System that combines Linear Scale-Out and Fault Tolerance on the one side with Concurrency and Consistency on the other. After basic information management concepts such as data locality, concurrency and recovery are introduced in the context of shared-disk clusters, the talk focuses on applying these concepts to real-world scenarios based on IBM's GPFS-based pureScale technology.
Large File System Optimisation and User Education
The storage volume in the CERN Computing Center is growing constantly and exceeded 100 PB in February 2013. To increase the efficiency of analysis tasks, the EOS storage system has been developed, which is optimized to handle random access to physics data. The current setup is running with a disk volume of 40 PB and more than 1000 users.
We analyse file-related metrics such as throughput, read-ratio or reopen-ratio, which can be obtained server-side. Based on first measurements of user-system interaction, we see multifaceted access characteristics. Inefficient file access can be identified, such as up to thousands of re-opens and re-reads of a file. Inspired by database query improvement research, we have analysed the typical user access patterns to improve the system. This includes the modification of the software and of configuration parameters such as the buffer size. In addition, it includes the improvement of the user access patterns themselves. To that end, we want to provide additional usage and performance metrics that will help each user to validate the efficiency of their own code.
Real-Time Analytics on Massive Time Series
Large-scale critical infrastructures such as transportation, energy, or water distribution networks are increasingly equipped with smart sensor technologies. Low-latency analytics on the resulting time series would open the door to many exciting opportunities to improve our grasp on complex urban systems. However, sensor-generated time series often turn out to be noisy, non-uniformly sampled, and misaligned in practice, making them ill-suited for traditional data processing. In this paper, we introduce TRISTAN (massive TRIckletS Time series ANalysis), a new data management system for efficient storage and real-time processing of fine-grained time series data. TRISTAN relies on a dedicated, compressed sparse representation of the time series using a dictionary. In contrast to previous approaches, TRISTAN is able to execute most analytical queries on the compressed data directly, and supports efficient and approximate query answering based on the most significant atoms of the dictionary only. We present the overall architecture of our system and discuss its performance on several smarter city datasets, showing that TRISTAN can achieve up to 1:100 compression ratios and 250x speedup compared to a state-of-the-art system.
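The general idea of answering an aggregate query directly on a sparse dictionary representation can be illustrated as follows. This is not TRISTAN's algorithm or data layout: the three hand-written atoms, the coefficient values and the function names are assumptions made for this sketch, and a real system would learn its dictionary from data. The sketch relies only on linearity: if a segment is a weighted sum of atoms, its sum is the same weighted sum of the precomputed atom sums.

```python
import math

# Hypothetical dictionary of atoms, each covering a segment of four samples.
ATOMS = [
    [1.0, 1.0, 1.0, 1.0],     # constant level
    [-1.5, -0.5, 0.5, 1.5],   # linear trend
    [0.0, 0.0, 1.0, 0.0],     # isolated spike
]
# Precomputed once per atom, reused by every query.
ATOM_SUMS = [sum(atom) for atom in ATOMS]


def decompress(coeffs):
    """Rebuild the samples of a segment from its sparse coefficients."""
    length = len(ATOMS[0])
    return [sum(c * ATOMS[k][i] for k, c in coeffs.items()) for i in range(length)]


def sum_on_compressed(coeffs, top=None):
    """Sum of a segment computed from coefficients, without decompression.

    With top=m, only the m coefficients of largest magnitude are used,
    which gives a cheaper, approximate answer.
    """
    items = sorted(coeffs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    if top is not None:
        items = items[:top]
    return sum(c * ATOM_SUMS[k] for k, c in items)


# Sparse representation of one segment: atom index -> coefficient.
segment = {0: 20.0, 1: 2.0, 2: 0.5}

exact_via_decompression = sum(decompress(segment))
exact_on_compressed = sum_on_compressed(segment)
approximate = sum_on_compressed(segment, top=1)

print(exact_via_decompression, exact_on_compressed, approximate)
```

Restricting the query to the most significant coefficient changes the answer only by the contribution of the dropped atoms, which is the trade-off between accuracy and work that approximate query answering exploits.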
(U. of Fribourg)
Big Data Analytics with Stratosphere: A Sneak Preview
In this talk I will give a sneak preview of Stratosphere, an open-source software stack for parallel analysis of "Big Data". Stratosphere combines features from relational DBMSs and MapReduce: it enables "in situ" data analysis using user-defined functions, declarative program specification, and automatic program optimization, covering a wide range of use cases, from data warehousing to information extraction and integration. Further, Stratosphere covers use cases such as graph analytics and Machine Learning by integrating support for iterative programs in the system's optimizer and runtime engine. In particular, I will highlight the need for declarative languages and automatic parallelization and optimization of complex data analysis programs that involve iterative computation, and show how to achieve this using a combination of database query optimization and compiler techniques.
The Parallel Universe Effect between Data-Intensive Physics and Banking
The concept of the parallel universe is known both in philosophy and physics and is sometimes used to describe concurrent physical phenomena that occur in isolation without communication. In this talk we investigate the concept from a completely different angle. We introduce the parallel universe effect of two data-intensive disciplines, namely physics and banking.
At first glance this comparison might seem absurd. Apart from the fact that both high-energy physics (in particular high-energy physics at CERN) and banking are major players in beautiful regions in the foothills of the Swiss Alps, the differences between these two fields seem to be significantly larger than their similarities. So why should we be interested in entangling these universes?
Common to both communities is that they face the challenge of large-scale data exploration. However, the approaches of these parallel universes could not be more different. While the banking industry has the credo “buy commodity solutions and build only if necessary”, the physics community has often exactly the opposite stance, namely “build tailor-made systems that scale far beyond currently available enterprise solutions”. One example is the distributed data store called “ROOT” which originated at CERN and is now developed at high-energy physics labs across the globe ranging from Japan to California. This data store currently holds several Petabytes of data and is one of the largest data stores in the world. However, due to the parallel universe effect, that system is hardly known to the data-intensive banking community in particular and the enterprise computing community in general. Banking is so much more familiar with Oracle, IBM-DB2 or SQL Server – to name only a few – rather than with extreme-scale systems being built in the physics community.
In this talk we shine a light on these parallel universes and explain their fundamental building blocks. The idea is not necessarily to merge these data-intensive communities but to engage them in a discussion where both sides could harvest fruits from each other and thus bring them to a higher orbital state.
Workshop Dinner
Auberge de Satigny
The bus leaves at 18:30 from the car park behind the main building.
The Venice Time Machine, a joint project between EPFL and the University of Venice, aims at building a multidimensional model of Venice and its evolution covering a period of more than 1000 years. For centuries Venice was Europe's economic hub and a door to the Orient, and it dominated the seas. The digitisation of hundreds of kilometers of Venetian archives, itself a daunting task already underway, creates challenges for data management and for image and character recognition. But the central scientific challenge of this big data project is to qualify, quantify and represent uncertainty at each step of this digitisation and modelling process.
FuturICT - A CERN for the Social Sciences?
FuturICT is a global initiative pursuing a participatory approach, integrated across the fields of ICT, the social sciences and complexity science, to design socio-inspired technology and develop a science of global, socially interactive systems. The initiative wants to bring together, on a global level, Big Data, new modelling techniques and new forms of interaction, leading to a new understanding of society and its co-evolution with technology. The goal is to create a major scientific drive to understand, explore and manage our complex, connected world in a more sustainable and resilient manner.
FuturICT is motivated by the fact that ubiquitous communication and sensing blur the boundaries between the physical and digital worlds, creating unparalleled opportunities for understanding the socio-economic fabric of our world, and for empowering humanity to make informed, responsible decisions for its future. The intimate, complex and dynamic relationship between global, networked ICT systems and human society directly influences the complexity and manageability of both. This also opens up the possibility to fundamentally change the way ICT will be designed, built and operated, reflecting the need for socially interactive, ethically sensitive, trustworthy, self-organised and reliable systems.
FuturICT wants to build a new public resource - value-oriented tools and models to aggregate, access, query and understand vast amounts of data. Information from open sources, real-time devices and mobile sensors would be integrated with multi-scale models of the behaviour of social, technological, environmental and economic systems, which could be interrogated by policy-makers, business people and citizens alike. Together, these would build an eco-system leading to new business models, scientific paradigm shifts and more rapid and effective ways to create and disseminate new knowledge and social benefits – thereby forming an innovation accelerator.
FuturICT would create a “Planetary Nervous System” (PNS) to orchestrate a high-level, goal-driven, self-organised collection and evaluation of Big Data generated from sources such as social media, public infrastructures, smart phones or sensor networks. The aim is to create an increasingly detailed “measurement” and a better understanding of the state of the world. For this, the sensing concept used in the physical and environmental sciences would be combined with machine learning and semantic technologies and extended to social and economic contexts. The information provided by the Planetary Nervous System would fuel the development of more realistic and, eventually, global-scale models that bring data and theories together, to form a “Living Earth Simulator” (LES) enabling the simulation of “what if …” scenarios. The LES would reveal causal interdependencies and visualise possible short-term scenarios, highlight possible side effects and test critical model assumptions. The “Global Participatory Platform” (GPP) would open up FuturICT’s data, models, and methods for everyone. It would also support interactivity, participation, and collaboration, and furthermore provide experimental and educational platforms. The current activities to develop a “Global System Science” (GSS) will lay the theoretical foundations for these platforms, while the focus on socio-inspired ICT will use the insights gained to identify suitable designs for socially interactive systems and the use of mechanisms that have proven effective in society as operational principles for ICT systems. FuturICT’s “Exploratories” will integrate functionalities of the PNS, LES, and GPP, and produce real-life impacts in areas such as Health, Finance, Future Cities, Smart Energy Systems, and Environment. Furthermore, the “Innovation Accelerator” (IA) will develop new approaches to accelerate inventions and innovations. A strong focus on ethics will cut across all activities and develop value-sensitive ICT.
Targeted integration efforts will push towards the creation of a powerful and integrated ICT platform that puts humans in the centre of attention.
Panel: Big Data in Industry
60/6-015 - Room Georges Charpak (Room F)
In the demo I will present the new-generation SkyQuery, a tool designed for astronomers working on multi-wavelength projects that require cross-matching celestial objects across multiple multi-TB catalogs. Cross-match problems are formulated in a slightly extended version of SQL. Stars and galaxies are associated based on spherical coordinates affected by measurement errors. A Bayesian approach is used to determine the appropriate cuts on spherical distances when evaluating possible matches. Cross-matching of more than two catalogs is done iteratively, one catalog at a time, which makes our algorithm scalable. We built our system on a cluster of servers running Microsoft SQL Server. A general-purpose data warehouse API was developed to manage the cluster servers and execute distributed and partitioned queries in parallel on multiple machines for load balancing. The API includes modules for SQL parsing, workflow management, job queuing, distributed query execution and web-based user interfaces for system managers and end-users.
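The core geometric step of a cross-match, associating objects from two catalogs by their distance on the sphere, can be sketched as below. This is not SkyQuery's algorithm: the Bayesian determination of the distance cut is replaced here by a fixed matching radius, the catalogs are tiny hypothetical lists of (identifier, right ascension, declination) in degrees, and the brute-force double loop ignores the partitioning and indexing that a multi-TB system needs.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees between two sky positions.

    Uses the haversine formula, which stays numerically stable for the
    very small separations typical of cross-matching.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalog_a, catalog_b, radius_arcsec):
    """Return pairs of identifiers whose separation is within the radius."""
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        for id_b, ra_b, dec_b in catalog_b:
            if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                matches.append((id_a, id_b))
    return matches


# Hypothetical catalogs: (identifier, right ascension, declination) in degrees.
catalog_a = [("a1", 10.0, 20.0), ("a2", 150.0, -30.0)]
catalog_b = [("b1", 10.0002, 20.0001), ("b2", 200.0, 5.0)]

matches = cross_match(catalog_a, catalog_b, radius_arcsec=1.0)
print(matches)
```

A third catalog could be handled in a further pass over the pairs found here, in the spirit of the one-catalog-at-a-time iteration described in the abstract.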
Extending the ATLAS PanDA Workload Management System for New Big Data Applications
The LHC experiments are today at the leading edge of large-scale distributed data-intensive computational science. The LHC's ATLAS experiment processes data volumes which are particularly extreme, over 130 PB to date, distributed worldwide at over 120 sites. An important element in the success of the exciting physics results from ATLAS is the highly scalable integrated workflow and dataflow management afforded by the PanDA workload management system, used for all the distributed computing needs of the experiment. The PanDA design is not experiment-specific and PanDA is now being extended to support other data-intensive scientific applications. The Alpha Magnetic Spectrometer, an astro-particle experiment on the International Space Station, and the Compact Muon Solenoid, an LHC experiment, have successfully evaluated PanDA and are pursuing its adoption. PanDA was cited as an example of "a high performance, fault tolerant software for fast, scalable access to data repositories of many kinds" during the "Big Data Research and Development Initiative" announcement, a $200 million U.S. government investment in tools to handle huge volumes of digital data needed to spur science and engineering discoveries. In this talk, a description of the new program of work to develop a generic version of PanDA will be given, as well as the progress in extending PanDA's capabilities to support supercomputers and clouds and to leverage intelligent networking, while accommodating the ever-growing needs of current users. PanDA has already demonstrated at a very large scale the value of automated data-aware dynamic brokering of diverse workloads across distributed computing resources. The next generation of PanDA will allow many data-intensive sciences employing a variety of computing platforms to benefit from ATLAS' experience and proven tools in highly scalable processing.
(Brookhaven National Laboratory (US))
Evaluation of some LSST Queries: Preliminary Results
In many scientific fields, such as physics, astronomy, biology or environmental science, the rapid evolution of data acquisition tools (e.g., sensors, satellites, cameras, telescopes) as well as the extensive use of computer simulations have led in recent years to a substantial production of data. Modern scientific applications are therefore faced with new problems that are primarily related to the management and use of such data. In addition to the growing volume of data to handle, their complex nature (e.g., images, uncertain data, multi-scale, ...), the heterogeneity of their formats and the various kinds of processing to which they are subject are the main sources of difficulty. The problems are such that scientific data management has been recognized as a real bottleneck that slows down scientific research, since research relies more and more on the analysis of massive data. In this context, the role of the computer as a direct way to improve the discovery process in science is important.
Within CNRS, in the framework of the PetaSky project, we conducted experiments on the PT1.1 data set in order to compare the performance of centralized and distributed database management systems. Regarding centralized systems, we have deployed three different DBMSs: MySQL, PostgreSQL and DBMS-X (a commercial relational database). Regarding distributed systems, we have deployed HadoopDB. The goal of these experiments is to report on the ability of these systems to support LSST requirements from a data management perspective. We mainly analyzed issues related to performance, speed-up, fault tolerance and latency.
All Roads lead to Rome: How Roman Generals would do Large-Scale Machine Learning
Collecting use cases
XLDB events attempt to expose how different communities manage and analyze their data sets, to share lessons learned and pain points, to find commonalities, and more. Due to time constraints, the discussions at XLDB events can only scratch the surface. Additionally, it would be extremely useful to document all the use cases, publicize them, expose them to those not present at our events, and more. The solution? We should all be documenting our use cases. The lightning talk will walk you through the new use case wiki,
Novel Data Compression Methods for Scientific Data
A data compression method that is specifically tailored to scientific data can simultaneously tackle the storage, networking and I/O bottlenecks. This talk will introduce three novel techniques for data compression of specific relevance to CERN-style scientific data, and show how the correct choice of data compression technique can even help the scientific method. This is intended to be a light-hearted talk, and will not involve heavy mathematics.
Delite: Domain Specific Languages for Big Data and Heterogeneous Parallelism
Delite is a compiler framework and runtime for building high-performance DSLs, developed at Stanford and EPFL. Delite DSLs are embedded in Scala to provide a high-level programming model, but generate fast, low-level code for heterogeneous targets using runtime code generation techniques (lightweight modular staging, LMS). With a recently added cluster backend that also includes GPU execution, machine learning kernels written in the OptiML DSL show speedups of up to 7x over Spark and up to two orders of magnitude over Hadoop.
(Oracle Labs / EPFL)
As data collections become larger and larger, users are faced with growing bottlenecks in their data analysis. One such bottleneck is the time to prepare and load data into a database system, which is required before any queries can be executed. For many applications, this data-to-query time, i.e. the time between first getting the data and retrieving its first meaningful results, is a crucial barrier, and a major reason why many applications already avoid using traditional database systems altogether. As data collections grow, however, the data-to-query time will only grow.
In this demonstration, we will showcase a new philosophy for designing database systems called NoDB. NoDB aims at minimizing the data-to-query time, most prominently by removing the need to load data before launching queries. We will present our prototype implementation, PostgresRaw, built on top of PostgreSQL, which allows for efficient query execution over raw data files with zero initialization overhead. We will visually demonstrate how PostgresRaw incrementally and adaptively touches, parses, caches and indexes raw data files autonomously and exclusively as a side-effect of user queries. Moreover, we will demonstrate with "live races" how PostgresRaw outperforms traditional database systems across a variety of workloads.
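The idea of answering queries directly over raw files, and building up parsed state only as a side effect of those queries, can be sketched in a few lines of Python. This is not PostgresRaw and reflects none of its internals: the `RawTable` class, its method names and the per-column cache are assumptions chosen only to illustrate "no load step, parse on first touch, reuse afterwards".

```python
import csv
import io


class RawTable:
    """Hypothetical query layer over a raw CSV source with no up-front load."""

    def __init__(self, open_source):
        self._open = open_source   # callable returning a fresh text stream
        self._cache = {}           # column name -> list of parsed float values
        self.parsed_columns = 0    # number of column parses performed so far

    def column(self, name):
        """Parse a column from the raw data on first use, then serve it from cache."""
        if name not in self._cache:
            with self._open() as stream:
                reader = csv.DictReader(stream)
                self._cache[name] = [float(row[name]) for row in reader]
            self.parsed_columns += 1
        return self._cache[name]

    def select_where(self, target, predicate_col, predicate):
        """Return values of `target` for rows where `predicate_col` satisfies `predicate`."""
        values = self.column(target)
        keys = self.column(predicate_col)
        return [v for v, k in zip(values, keys) if predicate(k)]


# Usage: the "file" is an in-memory string here so the sketch is self-contained.
RAW = "id,energy,eta\n1,10.5,0.1\n2,20.0,1.5\n3,5.25,2.4\n"
table = RawTable(lambda: io.StringIO(RAW))
central = table.select_where("energy", "eta", lambda eta: eta < 2.0)
```

Only the `energy` and `eta` columns are ever parsed in this usage; the `id` column is never touched, and a second query over the same columns would be served from the cache. A system like the one described in the abstract would go much further, for instance with indexes over the raw file, which this sketch does not attempt.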
Big Data Processing on Modern Architectures
60/6-015 - Room Georges Charpak (Room F)
Technology trends are rocking the storage industry. Fundamental changes in basic technology, combined with massive scale, new paradigms, and fundamental economics, lead to predictions of a new storage programming paradigm. The growth of low cost/GB disk is continuing with technologies such as Shingled Magnetic Recording, along with jumps in tape density. Flash and RAM are continuing to scale, with roadmaps that, some argue, extend down to the atomic scale.
These technologies do not come without a cost.
It is time to reevaluate the interfaces that we use to all kinds of storage: RAM, flash and disk. The discussion starts with the unique economics of storage (as compared to processing and networking), discusses technology changes, posits a set of open questions and ends with predictions of fundamental shifts across the entire storage hierarchy.
Big And Fast Data in the City
Cities are increasingly seen as the crucibles where the success or failure of our society is determined. The Smarter Cities vision is to bring a new level of intelligence to how the world works: how every person, business, organization, government, natural system, and man-made system interacts. IBM Research’s Smarter Cities Technology Centre is conducting research to make cities more efficient, productive, and enjoyable by leveraging the big and fast data generated by cities, their citizens, and their utilities. In this talk, I will discuss our research into creating technology to continuously assimilate diverse and noisy data sources for better awareness and prediction, to model how humans use city infrastructure and infer demand, to simultaneously simulate hundreds of thousands of energy demand forecast models for Smart Grid, to factor uncertainty and risk into optimized planning and operations, and to organise open data and knowledge to engage citizens, empower universities, and enable business.
Susara van den Heever
The ATLAS Distributed Data Management System & Databases
The ATLAS Distributed Data Management (DDM) System is responsible for the global management of petabytes of high energy physics data. The current system, DQ2, has a critical dependency on Relational Database Management Systems (RDBMS), like Oracle. RDBMS are well suited to enforcing data integrity in online transaction processing applications; however, concerns have been raised about how well they scale for data warehouse-like workloads. In particular, analysis of archived data or aggregation of transactional data for summary purposes is problematic. Therefore, we have evaluated new approaches to handle vast amounts of data. We have investigated a class of database technologies commonly referred to as NoSQL databases. This includes distributed filesystems, like HDFS, that support parallel execution of computational tasks on distributed data, as well as schema-less approaches via key-value stores, like HBase.
In this talk we will describe our use cases in ATLAS, share our experiences with various databases used in production and present the database technologies envisaged for the next-generation DDM system, Rucio. Rucio is an evolution of the ATLAS DDM system which addresses the scalability issues observed in DQ2.
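The "aggregation of transactional data for summary purposes" that the abstract identifies as problematic for an RDBMS is the kind of workload that parallel frameworks split into independent per-partition steps followed by a merge. The Python sketch below illustrates that pattern only. It is not DQ2, Rucio or Hadoop code, and the record layout `(site, bytes_transferred)` and both function names are assumptions made for illustration.

```python
from collections import Counter


def map_partition(records):
    """Sum transferred bytes per site within one partition of the log."""
    partial = Counter()
    for site, nbytes in records:
        partial[site] += nbytes
    return partial


def reduce_partials(partials):
    """Merge the per-partition sums into one summary per site."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


# Usage: each inner list stands in for a block of records stored on a different node.
partitions = [
    [("CERN", 100), ("BNL", 50)],
    [("CERN", 25), ("RAL", 10)],
]
summary = reduce_partials(map_partition(p) for p in partitions)
```

Because each `map_partition` call depends only on its own partition, the calls could run in parallel where the data resides, which is the property the abstract attributes to distributed filesystems such as HDFS.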
Summary & Adjourn
60/6-015 - Room Georges Charpak (Room F)