Speaker
Benjamin Galewsky
(Univ. Illinois at Urbana Champaign (US))
Description
Cloud data lake technologies have been used successfully in industry for the analysis of exabyte-scale datasets. The technologies that underlie this architecture are:
- Object Store
- Parquet file format
- Kubernetes
- Distributed SQL
We will describe our work using a Trino distributed SQL engine to join selected event data with inference results. We will show how this architecture can eliminate the need to maintain analysis-specific copies of datasets.
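As a rough illustration of the kind of join described above, the sketch below pairs event rows with inference scores at query time instead of writing out a filtered copy. It is not the talk's implementation: Python's built-in `sqlite3` stands in for a Trino connection so the example runs locally, and the table names (`events`, `inference`), column names (`run`, `event`, `met`, `score`), the sample rows, and the helper `select_events` are all hypothetical. In a data lake deployment the two tables would presumably be Parquet files in an object store, exposed through a Trino catalog.

```python
import sqlite3

# Stand-in for a Trino connection; sqlite3 is used only so the sketch runs locally.
conn = sqlite3.connect(":memory:")

# Hypothetical schemas: these table and column names are assumptions,
# not taken from the talk.
conn.executescript("""
CREATE TABLE events (run INTEGER, event INTEGER, met REAL);
CREATE TABLE inference (run INTEGER, event INTEGER, score REAL);
""")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 100, 35.2), (1, 101, 12.7), (2, 200, 88.1)],
)
conn.executemany(
    "INSERT INTO inference VALUES (?, ?, ?)",
    [(1, 100, 0.91), (1, 101, 0.15), (2, 200, 0.77)],
)

# Join event data to inference results on a shared (run, event) key and
# apply the selection in the query, so no analysis-specific copy is stored.
JOIN_SQL = """
SELECT e.run, e.event, e.met, i.score
FROM events AS e
JOIN inference AS i
  ON e.run = i.run AND e.event = i.event
WHERE i.score > ?
ORDER BY e.run, e.event
"""


def select_events(connection, min_score):
    """Return event rows joined to their inference score, filtered at query time."""
    return connection.execute(JOIN_SQL, (min_score,)).fetchall()


rows = select_events(conn, 0.5)
for row in rows:
    print(row)
```

Because the selection lives in the query rather than in a derived dataset, a different analysis can change the threshold or the joined columns without producing another copy of the event data.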
Requested talk length
20
Author
Benjamin Galewsky
(Univ. Illinois at Urbana Champaign (US))
Co-author
Nick Manganelli
(University of Colorado Boulder (US))