The LHCb experiment stores around 10^11 collision events per year. A typical physics analysis deals with a final sample of up to 10^7 events. Event preselection algorithms (lines) are used for data reduction. They are run centrally and check whether an event is useful for a particular physics analysis. The lines are grouped into streams. An event is copied to every stream that contains at least one of the lines that selected it, so the same event may be stored more than once. Because the storage format allows only sequential access, analysis jobs read every event in a stream and discard the ones they do not need.
The efficiency of this scheme depends heavily on the composition of the streams. Putting similar lines together and balancing the stream sizes reduces the overhead. An additional constraint is that some lines are meant to be used together, so they must go to the same stream. The total number of streams is also limited by the file management infrastructure.
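To illustrate how the stream composition drives the overhead, here is a minimal sketch in plain Python. The cost model is an assumption made for illustration only: every line has exactly one analysis job, all events have the same size, and the cost is the total number of events read. The names `stream_sizes`, `read_cost` and the toy lines `A` to `D` are hypothetical.

```python
def stream_sizes(events, stream_of_line, n_streams):
    """Count the events stored in each stream.

    events: list of sets, the lines that selected each event.
    stream_of_line: dict mapping a line name to its stream index.
    """
    sizes = [0] * n_streams
    for lines in events:
        # An event is written once to every stream holding one of its lines.
        for s in {stream_of_line[line] for line in lines}:
            sizes[s] += 1
    return sizes


def read_cost(events, stream_of_line, n_streams):
    """Total events read if each line has one job that reads its whole stream
    (assumed cost model, not taken from the experiment)."""
    sizes = stream_sizes(events, stream_of_line, n_streams)
    return sum(sizes[s] for s in stream_of_line.values())


# Toy sample: lines A and B select overlapping events, as do C and D.
events = [{"A", "B"}, {"A", "B"}, {"A"}, {"C", "D"}, {"C", "D"}, {"D"}]

similar = {"A": 0, "B": 0, "C": 1, "D": 1}  # similar lines share a stream
mixed = {"A": 0, "B": 1, "C": 0, "D": 1}    # similar lines are split up

cost_similar = read_cost(events, similar, 2)  # 12
cost_mixed = read_cost(events, mixed, 2)      # 20
```

In this toy case, splitting similar lines duplicates four of the six events and raises the number of events read from 12 to 20.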
We developed a method for finding an optimal stream composition. It can be used with different cost functions, takes the number of streams as an input parameter, and accommodates the grouping constraint. It has been implemented using Theano, and the results are being incorporated into the streaming of the LHCb Turbo output, with a projected decrease in analysis job I/O time of 20-50%.
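The inputs of such a search (a pluggable cost function, a fixed number of streams, and groups of lines that must stay together) can be sketched as follows. This is not the Theano-based method, whose details are not given here. It is a plain-Python greedy local search, swapped in purely for illustration, and it may stop in a local minimum. The function names, the default cost model (one job per line, equal event sizes) and the toy data are all assumptions.

```python
def read_cost(events, stream_of_line, n_streams):
    """Assumed cost: each line has one job that reads its whole stream."""
    sizes = [0] * n_streams
    for lines in events:
        for s in {stream_of_line[line] for line in lines}:
            sizes[s] += 1
    return sum(sizes[s] for s in stream_of_line.values())


def optimize_streams(events, groups, n_streams, cost=read_cost):
    """Greedy local search over group-to-stream assignments.

    groups: list of lists of line names; the lines of one group always
            share a stream, which enforces the grouping constraint.
    cost:   any function (events, stream_of_line, n_streams) -> number.
    """
    # Start from a round-robin assignment of groups to streams.
    stream_of_group = [g % n_streams for g in range(len(groups))]

    def expand(assignment):
        # Turn a per-group assignment into a per-line assignment.
        return {line: s for group, s in zip(groups, assignment) for line in group}

    best = cost(events, expand(stream_of_group), n_streams)
    improved = True
    while improved:
        improved = False
        for g in range(len(groups)):
            for s in range(n_streams):
                if s == stream_of_group[g]:
                    continue
                trial = list(stream_of_group)
                trial[g] = s
                c = cost(events, expand(trial), n_streams)
                if c < best:  # keep a move only if it lowers the cost
                    stream_of_group, best = trial, c
                    improved = True
    return expand(stream_of_group), best


# Toy sample: lines A and B select overlapping events, as do C and D.
events = [{"A", "B"}, {"A", "B"}, {"A"}, {"C", "D"}, {"C", "D"}, {"D"}]
assignment, cost_value = optimize_streams(
    events, [["A"], ["B"], ["C"], ["D"]], n_streams=2
)
```

On this toy sample the search starts from a composition that splits the similar lines and ends with A and B in one stream and C and D in the other. Passing a different `cost` function changes the objective without touching the search loop.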
Theano: A Python framework for fast computation of mathematical expressions, The Theano Development Team
Separate file streams, Henry Schreiner et al., https://gitlab.cern.ch/hschrein/Hlt2StreamStudy
The LHCb Turbo Stream, Sean Benson et al., CHEP-2015
|Primary Keyword (Mandatory)|Distributed data handling|
|Secondary Keyword (Optional)|Distributed workload management|
|Tertiary Keyword (Optional)|Data processing workflows and frameworks/pipelines|