Speaker
Description
The rise of industrial "AI Factories" exposes a critical bottleneck that transcends raw storage performance: managing the data itself.
As AI/ML pipelines ingest and process vast, heterogeneous datasets, the complexity of data discovery, lineage, and governance becomes a primary inhibitor to scaling operations. Traditional storage systems fail to answer vital questions: "Where is the verified, compliant dataset for training?", "What is the exact data lineage of this deployed model?", and "How do we optimize data placement across a distributed infrastructure?"
This presentation details the design and implementation of the intelligent metadata catalog at the heart of the European project DaFab. We demonstrate that a metadata-driven approach is the key to unlocking efficient, reproducible, and sovereign AI. We will cover: (1) the use of semantic search and active metadata for federated data discovery; (2) automated data lineage and versioning to ensure model reproducibility and compliance; and (3) how the catalog integrates with the storage layer to orchestrate data movement, respecting data gravity to optimize for performance and cost.
By treating metadata as a primary asset, the DaFab catalog transforms the storage infrastructure from a passive repository into an active, intelligent component of the modern AI factory. The catalog is built on Rucio, an open-source technology that originated at CERN. We will conclude by putting this approach in perspective against industrial solutions.
| Suggested Contribution Type | Regular Talk (15-30 min) |
|---|---|