166th ROOT Parallelism, Performance and Programming Model Meeting

Europe/Zurich
32/S-C22 (CERN)

Marta Czurylo (CERN), Vincenzo Eduardo Padulano (CERN)

PPP 16.05.2024

HighLO

Last presentation 2 years ago. This presentation gives an update and does not require previous knowledge about the project.

Goal: research into market manipulation

Limit Order

A limit order is placed when a trader wants to buy something for a maximum price or sell something for a minimum price. The bid and the ask never overlap: as soon as they do, a transaction happens and the gap between bid and ask reopens.

Every message updates the state of the limit order book at one particular limit order.
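
The mechanism above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the HighLO data model: the class `LimitOrderBook`, its `add` method and the (side, price, quantity) message layout are hypothetical. Each message updates one price level, and an order that would cross the book trades immediately, so the best bid always stays below the best ask.

```python
class LimitOrderBook:
    """Toy limit order book storing the total quantity per price level.

    Hypothetical illustration only; real exchange messages carry more fields.
    """

    def __init__(self):
        self.bids = {}  # price -> total quantity buyers want at that price
        self.asks = {}  # price -> total quantity sellers offer at that price

    def best_bid(self):
        return max(self.bids) if self.bids else None

    def best_ask(self):
        return min(self.asks) if self.asks else None

    def add(self, side, price, qty):
        """Process one limit order message; return the (price, qty) trades it caused."""
        is_buy = side == "buy"
        own, opposite = (self.bids, self.asks) if is_buy else (self.asks, self.bids)
        trades = []
        while qty > 0 and opposite:
            # Best price on the other side of the book.
            level = min(opposite) if is_buy else max(opposite)
            # Stop as soon as bid and ask no longer overlap.
            if (level > price) if is_buy else (level < price):
                break
            traded = min(qty, opposite[level])
            trades.append((level, traded))
            qty -= traded
            opposite[level] -= traded
            if opposite[level] == 0:
                del opposite[level]
        # Whatever is left rests in the book at the order's own limit price.
        if qty > 0:
            own[price] = own.get(price, 0) + qty
        return trades


book = LimitOrderBook()
book.add("buy", 99, 10)           # rests in the book: bid at 99
book.add("sell", 101, 5)          # rests in the book: ask at 101
trades = book.add("buy", 101, 3)  # overlaps with the ask, so it trades at once
```

After the last message, `trades` holds one trade of 3 units at price 101, and the book again shows a gap between the best bid (99) and the best ask (101).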

What kind of manipulation are we searching for?

* The financial market is governed by supply and demand

Spoofing is a quick scheme of placing a huge limit order in the market and then expecting the price to react. For example, someone trading in the gold market says "I will buy 10 M units of gold in the next 10 seconds": it is a self-fulfilling prophecy, and the person who makes the statement can capitalize on it. This is quite hard to detect because placing and cancelling orders is not strictly illegal in itself; we have to prove that the orders were cancelled just to manipulate the market.

Research goal:
* Describe how spoofing works
* Detect spoofing
* Help regulators and lawmakers such that they can actually predict the market

IMS

International Expert Group on Market Surveillance

Effort to create an international collaboration between European market regulators.

Into the microseconds

First research in finance that is able to find statistically significant results on the order of microseconds.

  1. How do you measure relationships at the microsecond level?

E.g. the wheat and corn markets are correlated. Can we understand whether their relationship can be modeled at the microsecond level?

Event based impact profile

Can we measure changes in a time series at fixed intervals before or after an event?

What is the average price change in the wheat market 50 ms after a price change in the corn market?

Price change in the wheat market = impact value
Price change in the corn market = trigger event
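
As a rough illustration of the question above, the following Python sketch averages the price change of the impact series at fixed lags after each trigger event. It is not the HighLO implementation: the function names, the toy data, the integer microsecond timestamps and the "last known price" convention are all assumptions made for the example.

```python
from bisect import bisect_right


def price_at(times, prices, t):
    """Last known price at time t, or None if t is before the first update."""
    i = bisect_right(times, t)
    return prices[i - 1] if i else None


def impact_profile(trigger_times, impact_times, impact_prices, lags):
    """Average change of the impact price at each lag relative to the trigger time.

    Lags may be negative to look at the interval before the event.
    """
    profile = {}
    for lag in lags:
        changes = []
        for t0 in trigger_times:
            before = price_at(impact_times, impact_prices, t0)
            after = price_at(impact_times, impact_prices, t0 + lag)
            if before is not None and after is not None:
                changes.append(after - before)
        profile[lag] = sum(changes) / len(changes) if changes else None
    return profile


# Toy data (hypothetical): times in microseconds, prices in ticks.
corn_change_times = [100, 1100]   # trigger events: corn price changes
wheat_times = [0, 300, 1300]      # times of wheat price updates
wheat_prices = [500, 502, 506]    # wheat price after each update

profile = impact_profile(corn_change_times, wheat_times, wheat_prices, lags=[100, 250])
# profile == {100: 0.0, 250: 3.0}
```

In this toy data wheat has not yet moved 100 microseconds after a corn price change, but has moved by 3 ticks on average after 250 microseconds. Swapping the roles of the two series gives the profile in the opposite direction.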

Results

From analyzing the correlation between the corn and wheat markets at the 10-100 microsecond time scale, we can see a two-step response curve between the trigger and the impact.

The second impulse is particularly interesting because it indicates the response time of the market participants to the first one. This appears to be in the ~200 microsecond region.

Q: How can traders be so quick?
A: We don't have confirmation, but it is common knowledge that high-frequency traders rent machines right next to the exchange servers and go to great lengths to make sure their limit orders arrive as fast as possible.

By inverting the trigger and the impact, we also discovered that there is an asymmetry. It is apparently the wheat market which always follows the corn market. Actually, we repeated this experiment with other items of the agricultural market, and it turns out that corn is really the driver for this market.

Interactive dashboard to explore high-dimensional histograms

We currently generate our results as a (huge) PDF that contains ROOT histograms as well as tables for all the markets and the different features and properties we analyse.

But there are so many things we want to add that the PDF is limiting us. So the idea is to provide an online dashboard that allows users to extract custom statistics interactively.

What are the advantages?
* Share results outside HighLO (e.g. with the IMS group). Sharing a PDF of O(1K) pages is impractical.
* Better guide the user in exploring our data and market features. This gives us an advantage w.r.t. native tools like the TTreeViewer because we can customize it to our needs.

RDataFrame support for time series

Q&A

Q: How deeply did you need to optimize the TTree storage?
A: The data we receive from the exchange is in a convoluted ASCII format. We do some restructuring when converting it to ROOT files: really long ASCII strings are converted to 10 branches in a TTree. I am not working at the level of TTree clusters, but we do split at the weekly level: one TTree per commodity per week.

Q: How interactive is the interactive dashboard?
A: It is really new. Everything is stored in sparse histograms, so we can slice and dice in the postprocessing of the dashboard. Starting from the sparse histograms, we have a backend (currently Flask + PyROOT) which can interactively slice the histogram and project it. Then in the front end you can zoom in or select which dimensions to project.
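
The slice-and-project step described in this answer can be sketched in plain Python. This is a stand-in for illustration only, not the Flask + PyROOT backend: the dictionary layout, the function name `project` and the axis meanings in the toy data are assumptions.

```python
from collections import defaultdict


def project(sparse_hist, keep_axes, cuts=None):
    """Project a sparse histogram onto keep_axes, optionally slicing first.

    sparse_hist: dict mapping a tuple of bin indices to the bin content
                 (only filled bins are stored).
    keep_axes:   indices of the axes that survive the projection.
    cuts:        dict axis -> (low_bin, high_bin), inclusive bin range to keep.
    """
    cuts = cuts or {}
    out = defaultdict(float)
    for bins, content in sparse_hist.items():
        # Slice: skip bins outside the selected range on any cut axis.
        if all(lo <= bins[ax] <= hi for ax, (lo, hi) in cuts.items()):
            # Project: sum over every axis that is not kept.
            out[tuple(bins[ax] for ax in keep_axes)] += content
    return dict(out)


# Toy histogram with hypothetical axes (market, lag bin, price-change bin).
hist = {(0, 1, 2): 3.0, (0, 2, 2): 1.0, (1, 1, 0): 4.0, (1, 1, 2): 2.0}

# Price-change distribution for market 0 only.
market0 = project(hist, keep_axes=(2,), cuts={0: (0, 0)})

# Two-dimensional (market, lag) view over all price changes.
market_lag = project(hist, keep_axes=(0, 1))
```

Because only filled bins are stored, the cost of a projection scales with the number of filled bins rather than with the full product of axis sizes, which is what makes this kind of on-demand slicing practical for high-dimensional histograms.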

Q: What is the experience with ROOT? Is all you need there? Do you feel some things should be upstreamed?
A:
* Time series in RDataFrame would be nice to have in ROOT, but I don't have time to put the effort there
* In the past I tried ML inference with ROOT (based on Keras models) and it wasn't great.

Q: What is the goal of project 2, about RDataFrame?
A: To do analysis of time series data with RDF. At the moment this is not possible.
Q: That's fine, but why then wasn't it used in the other projects of market analysis?
A: The HighLO project has its own framework on top of TTree that has been around for more than 5 years.

Q: How easy would it be to have a plugin system to have your custom actions?
A: The problem with RDataFrame is inserting more data entries.

Q: You mentioned that the running time of the analysis is around 5 days. Is that how long it takes for one histogram or for the entire analysis?
A: Actually just one histogram, because we process multiple TBs of data with a single thread.

Q: You mentioned TTree files and Keras. What were the difficulties?
A: I needed a VAE, so my target variable was an array, which was not supported at the time. There was another issue: I needed to reshuffle my dataset for every epoch, which was not possible either.
