Parsl live-coding speaker notes - PyHEP 2019
http://parsl-project.org - "Use Parsl to create parallel programs comprised of Python functions and external components. Execute Parsl programs on any compute resource from laptops to supercomputers."
please interrupt with questions at any point
Me:
I'm not a HEP person. But I do work with one and she's the one that sent me here.
I'm a software engineer - I work both in industry and academia - in addition to Python, my other big language is Haskell, which leads to me being something of a category theorist too...
My driving application for my parsl work at the moment is LSST - the Large Synoptic Survey Telescope under construction in Chile.
Other people here have talks about what parsl gets used for in HEP which is good because that's an area that I'm weak on.
I'm going to do some live coding here to show the basic programming model, and use that as a base for further discussion.
# first I did `pip install parsl`
import parsl
parsl.load()
help(parsl) # or online: https://parsl.readthedocs.io/en/latest/
import time
def pi_estim_A():
    time.sleep(5)
    return 4
pi_estim_A()
@parsl.python_app
def pi_estim_B():
    time.sleep(5)
    return 4
future = pi_estim_B()
type(future)
future.result()
# so here we get some concurrency: this will take 5 seconds, not 10 seconds...
# after we've launched the first call which returns immediately, we can then go on to do other stuff
# such as launch the second call.
# and then we'll block only when we try to get the result - that first result() call will take 5s,
# but the second call is probably ready already
f1 = pi_estim_B()
f2 = pi_estim_B()
(f1.result() + f2.result())/2
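The same launch-then-block pattern can be seen with only the standard library, since Parsl's AppFuture is a subclass of concurrent.futures.Future. A rough sketch, not Parsl itself: slow_estimate, the 0.5s delay and the two-worker pool are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_estimate(delay=0.5):
    """Stand-in for pi_estim_B with a shorter sleep."""
    time.sleep(delay)
    return 4

with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.perf_counter()
    f1 = pool.submit(slow_estimate)   # returns a Future immediately
    f2 = pool.submit(slow_estimate)   # so the second launch overlaps the first
    launched = time.perf_counter() - start
    # The first result() blocks for about one delay; the second is usually ready by then.
    mean = (f1.result() + f2.result()) / 2
    elapsed = time.perf_counter() - start

print(f"launching took {launched:.3f}s, both results after {elapsed:.3f}s, mean={mean}")
```

With two workers the total should be close to one delay, not two.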
# now move towards better pi estimation...
# circle inscribed in a square... pick points.
import random
coords = [(random.random(), random.random()) for _ in range(1,10)]
coords
@parsl.python_app
def pi_estim_D( coords ):
    time.sleep(2)
    (x,y) = coords
    return 4
fs = list(map(pi_estim_D, coords))
# can run this repeatedly and watch the list slowly go from running state to finished state over about 10 seconds
fs
rs = [f.result() for f in fs]
rs
# now implement circle inside square
import math
@parsl.python_app
def pi_estim_E( coords ):
    (x,y) = coords
    if math.sqrt(x*x + y*y)>=1:
        return 0
    else:
        return 4
fs = list(map(pi_estim_E, coords))
rs = [f.result() for f in fs]
rs
sum(rs) / len(rs)
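Why averaging 0s and 4s estimates pi: the points are uniform in the unit square, and the quarter circle of radius 1 covers pi/4 of that square, so a point scores 4 with probability pi/4 and the expected score is pi. A plain-Python sketch without Parsl; the name point_score, the fixed seed and the sample count are choices made here for illustration.

```python
import math
import random

def point_score(coords):
    """Score one point: 4 if inside the quarter circle of radius 1, else 0."""
    x, y = coords
    return 0 if math.sqrt(x * x + y * y) >= 1 else 4

rng = random.Random(2019)  # fixed seed so reruns give the same estimate
n = 100_000                # many more points than the handful used live
scores = [point_score((rng.random(), rng.random())) for _ in range(n)]
estimate = sum(scores) / n

print(estimate, abs(estimate - math.pi))
```

With only a handful of points, as in the live demo, the estimate is very coarse; the error shrinks roughly as one over the square root of the number of points.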
@parsl.python_app
def avg(*args):
    return sum(args)/len(args)
# i'll pass in the *futures* here, not the results
# and parsl will block until all those futures are done (which they are)
# and then run the code
af = avg(*fs)
type(af)
af.result()
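To make the dependency idea concrete, here is a rough standard-library sketch (concurrent.futures, not Parsl). The difference worth pointing out: here the averaging task has to call .result() on its inputs itself, and it occupies a worker thread while it waits, whereas Parsl notices the futures in the argument list and only launches avg once they are all done. The names point_score and avg_of_futures, the seed and the pool size are made up for this sketch.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def point_score(coords):
    x, y = coords
    return 0 if x * x + y * y >= 1 else 4

def avg_of_futures(*futures):
    # Unlike a Parsl app, this receives the futures themselves and must wait on each one.
    values = [f.result() for f in futures]
    return sum(values) / len(values)

rng = random.Random(0)
coords = [(rng.random(), rng.random()) for _ in range(500)]

with ThreadPoolExecutor(max_workers=4) as pool:
    fs = [pool.submit(point_score, c) for c in coords]
    # Submitted last, so every task it waits on is already ahead of it in the queue.
    af = pool.submit(avg_of_futures, *fs)
    estimate = af.result()

print(estimate)
```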
# so put this together: we can launch 500 pi_estim_E and the avg all at once, and it will put things into the
# right order but parallelised.
# this is the bit to really explain hard: we get a future out the end, but that future won't get its result
# until all of the first 500 futures have completed and then the avg code runs.
# there's a particular kind of concurrency here which is more constrained than (for example) threads or MPI,
# but (usually) easier to reason about/debug
def estimate_pi():
    # range(500) gives the 500 points mentioned above; range(1,500) would give only 499
    coords = [(random.random(), random.random()) for _ in range(500)]
    fs = list(map(pi_estim_E, coords))
    return avg(*fs)
f = estimate_pi()
f
f.result()
estimate_pi().result()
All of the above ran on my laptop: at the start I ran parsl.load() with no arguments. I could instead pass in a configuration that describes how to run on other systems - how to connect to that machine, how to submit a batch job, where the working directories are, ... The calls above would be unchanged, but we'd get remote execution on a cluster.
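For a flavour of what such a configuration looks like, here is a sketch for a Slurm cluster using Parsl's HighThroughputExecutor and SlurmProvider. The label, partition name, block sizes, walltime and worker_init line are placeholders made up for this example; every site needs its own values, so treat this as a starting point rather than a working config.

```python
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="cluster",                           # hypothetical label
            provider=SlurmProvider(
                partition="debug",                     # hypothetical partition name
                nodes_per_block=2,                     # nodes requested per batch job
                init_blocks=1,                         # batch jobs to submit at start
                max_blocks=1,
                walltime="00:30:00",
                launcher=SrunLauncher(),               # start workers on each node with srun
                worker_init="source activate my_env",  # hypothetical environment setup
            ),
        )
    ]
)

# parsl.load(config)  # would replace the bare parsl.load() used at the start
```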
Parsl supports classic batch systems like Slurm, Torque and Cobalt, and also more cloud-like things such as Kubernetes clusters or Amazon Web Services.
You can also launch @bash_apps, which are shell commands - that's a pretty common way to launch tools that are already packaged as a CLI rather than as a Python library. An example is an astronomy simulation running on 2000 nodes, with jobs that run for 12 or more hours, on and off over a number of months.
Someone has allegedly run a million tasks in a single parsl run.
parsl can also manage those decorated calls in other ways that aren't just parallelisation:
There's also work we've done with packaging up environments for your remote execution:
There are a number of APIs for plugging in different kinds of backends below parsl, if you'd rather extend the back end than write workflows on the front: execution platforms, file transfer systems, batch systems, ...
Community: a couple of weeks ago we had ParslFest in Chicago! About 36 people were there. People are using parsl in a pretty broad range of science areas - this is not a HEP-specific tool. http://parsl-project.org/parsl_meeting
We're on GitHub at https://github.com/Parsl/parsl and there is a Slack (invite on the main website) with a #parsl-help channel.
Also, we don't support Python 2.
Also, stickers!