B-hive Conductor is a solution that gives enterprises visibility into, and control over, web-enabled transactions. Although our clients consider web application errors and underperformance a fact of life, such problems turn out to be disproportionately hard to simulate in a lab: we do not have to upgrade the web applications once they are installed, generating sufficiently high loads is costly and tricky, and we work in a relatively clean setting (no insane client behaviour).
On the other hand, in order to properly test the transaction analysis, a simple web server with a couple of missing pages is not enough: we need to install real-life web applications, and simulate realistic error conditions. Lab setups which are good enough to test a web load balancer just don't cut it when it comes to the high-end web transaction monitoring and error recovery. However, it remains our hope that other web monitoring and control tools, as well as labs trying to integrate such products, can use SCREWS.
Naive methods of generating errors turned out to be unreliable or inflexible, and most often both. Generating application underperformance by simulating real CPU load was unreliable (unfortunately, even modern Microsoft OSs deal with scheduling under load remarkably well), and generating errors required adding custom code to the web applications (which meant using the Byzantine methods web application frameworks use to refresh code whenever we wanted to simulate system administrators introducing and fixing errors). Even aside from unreliability, there remained automation concerns: our first prototype for simulated tests included a distributed command architecture in order to break, and fix, web applications from a centralized location.
The first SCREWS prototype was known as delayer, and
was used to simulate a scenario triggering the web application QoS
support: we needed a BEA WebLogic application to delay an
ASP.NET application.
    from twisted.internet import reactor
    from twisted.web import resource, server

    # A "rolling counter", which allows a sort of "decaying sum":
    # We have two counters, "now" and "prev". Every 5 seconds, we
    # delete prev, shift now to prev, zero out "now" and start
    # counting anew. Although there is a "traumatic event" every five
    # seconds, it will always show between 5 and 10 times the "hits per second"
    class Counter:
        def __init__(self):
            self.now, self.prev = 0, 0
        def tick(self):
            self.now, self.prev = 0, self.now
        def inc(self):
            self.now += 1
        def get(self):
            return self.now + self.prev

    # This resource creates a load: whenever it processes a request, it
    # .inc()s the counter.
    class LoaderResource:
        __implements__ = resource.IResource
        isLeaf = True
        def __init__(self, counter):
            self.counter = counter
        def render(self, request):
            self.counter.inc()
            return "OK!"

    # This resource suffers because of a load: whenever it processes a request,
    # it waits for as many seconds as the counter currently shows before
    # returning a result.
    # In a situation of constant load on Loader and Suffer, which is distributed
    # randomly, the average latency will be 7.5*(loader hits/sec)
    class SufferResource:
        __implements__ = resource.IResource
        isLeaf = True
        def __init__(self, counter):
            self.counter = counter
        def render(self, request):
            def _():
                request.write("OK!")
                request.finish()
            reactor.callLater(self.counter.get(), _)
            return server.NOT_DONE_YET
Add a little bit of glue (set up a server to serve /loader
and /suffer correctly, set up a timer which calls .tick),
and we're all set! As soon as the application which creates the load
accesses /loader, and the application that should slow down
accesses /suffer and waits for a result, everything magically
works!
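The "between 5 and 10 times the hits per second" claim for the rolling counter can be checked without Twisted. The following is a minimal sketch, written for Python 3 (the prototype itself predates it), that drives a copy of Counter with a constant load. The simulate function, its parameters, and the one-second sampling are assumptions made for illustration; they are not part of the prototype.

```python
class Counter:
    """Copy of the prototype's rolling counter, repeated to keep the sketch self-contained."""
    def __init__(self):
        self.now, self.prev = 0, 0
    def tick(self):
        self.now, self.prev = 0, self.now
    def inc(self):
        self.now += 1
    def get(self):
        return self.now + self.prev


def simulate(hits_per_second, seconds, window=5):
    """Hypothetical driver: constant load, tick every `window` seconds.

    Returns the counter readings taken at the end of each second,
    skipping the first window while the counter is still warming up.
    """
    counter = Counter()
    readings = []
    for second in range(seconds):
        if second and second % window == 0:
            counter.tick()
        for _ in range(hits_per_second):
            counter.inc()
        if second >= window:
            readings.append(counter.get())
    return readings


rate = 3
readings = simulate(hits_per_second=rate, seconds=30)
# Every reading stays within 5x and 10x the hit rate.
assert all(5 * rate <= r <= 10 * rate for r in readings)
```

Since SufferResource waits counter.get() seconds, these readings are also the latencies a client of /suffer would see under that load.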
Unfortunately, that method does not scale -- each new error scenario needs custom server code, each web application must be modified... in short, we needed something better.
We use Apache as a unified front-end to all applications: as a reverse
proxy to IIS-based applications and Zope, using the BEA module for interfacing
to WebLogic, and using in-process PHP for PHP applications. We added an
agent.py
which looks like this:
    import urllib2
    from mod_python import apache

    # We needed to do this in "postreadrequesthandler" and not in the
    # default handler because the apache "Proxy" command bypasses
    # the normal content handlers (which is where the default handler
    # registers). We connect to the delayer, get a command and act
    # on it.
    # Note: the SCREWS protocol gets the TCP-layer arguments and
    # the URL. There is no reason for it not to get the headers
    # and other data: it could be done in a completely backwards-compatible
    # fashion.
    def postreadrequesthandler(req):
        my_req = req.the_request
        my_url = my_req.split()[1]
        c = req.connection
        conn = ':'.join(map(str, c.local_addr + c.remote_addr))
        if not my_url.startswith('/wbtree'):
            a = urllib2.urlopen("http://delayer:2005/" + conn + my_url)
            # The reply is "<code> <content>"; only the code is used so far,
            # and the content may be empty or contain spaces
            code = a.read().split(None, 1)[0]
        else:
            code = '000'
        if code == '000':
            return apache.DECLINED
        else:
            # XXX - Use actual code/content
            return apache.HTTP_INTERNAL_SERVER_ERROR
We use /wbtree
for performance tests, and we want
a minimal performance penalty (going to the network, waiting
on the server) for that. I left the hacks and incompleteness in
there on purpose -- I assume any project that will want to integrate
SCREWS will want to customize the agent -- that part of the code
will always be highly fluid.
The SCREWS server has a Decider
class which returns
what should be done for each request, according to the TCP/HTTP
parameters. Here is the code for the implementation of the decision:
    def render(self, request):
        if len(request.postpath) < 2:
            return ''
        # This ugly bit of code unwraps the silly protocol we use
        # to pass connection parameters as part of the URL
        conn = Connection(request.postpath[0])
        url = '/' + request.uri[1:].split('/', 1)[1]
        time, code, content = (self.decider.decide(conn, url) or
                               (0, '000', 'Nothing'))
        def _():
            request.write('%s %s' % (code, content))
            request.finish()
        reactor.callLater(time, _)
        return twserver.NOT_DONE_YET
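To make the "silly protocol" concrete: the agent prepends the four TCP parameters, joined by colons, as the first path segment, and the original URL follows. Below is a minimal Python 3 sketch of unwrapping such a path. ConnectionInfo and parse_screws_path are hypothetical names invented here (the real Connection class is not shown in this text), and the sketch assumes IPv4 addresses, since an IPv6 address would itself contain colons.

```python
from collections import namedtuple

# Hypothetical stand-in for the server's Connection class.
ConnectionInfo = namedtuple(
    "ConnectionInfo", "local_host local_port remote_host remote_port")


def parse_screws_path(path):
    """Split '/lhost:lport:rhost:rport/original/url' into (ConnectionInfo, url)."""
    conn_part, _, rest = path.lstrip("/").partition("/")
    fields = conn_part.split(":")
    if len(fields) != 4:
        raise ValueError("expected 4 colon-separated fields, got %r" % conn_part)
    local_host, local_port, remote_host, remote_port = fields
    conn = ConnectionInfo(local_host, int(local_port),
                          remote_host, int(remote_port))
    # Everything after the first segment is the URL the client asked for.
    return conn, "/" + rest


conn, url = parse_screws_path("/10.0.0.1:80:10.0.0.7:51234/shop/cart?id=3")
```

Here conn.remote_host is the client address as seen by Apache, and url is the request path exactly as the agent received it.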
The client uses a netstring-based protocol to pass SCREWSlets to the
server: "opinion<space><screwslet>" for adding screwslets,
and "clearopinion" for clearing them. A SCREWSlet is Python code with an opinion
variable defined in the global scope.
The value of the variable must comply with the interface:

    # Must return None or (int, string, string)
    def decide(conn, url):
        pass
    def tick():
        pass
The server calls the tick
methods on all opinions,
once per second.
When facing a decision, it calls the opinions' decide
in order of registration, until one returns a non-None value.
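The dispatch rule described above (tick everything once per second, ask each opinion in registration order, first non-None answer wins) can be sketched as follows. This is an illustration in Python 3, not the server's actual code: OpinionRegistry, PrefixError and PrefixDelay are hypothetical names, and the real server would drive tick() from a timer rather than by hand.

```python
class OpinionRegistry:
    """Hypothetical container holding opinions in registration order."""
    def __init__(self):
        self.opinions = []

    def add(self, opinion):
        self.opinions.append(opinion)

    def clear(self):
        del self.opinions[:]

    def tick(self):
        # Meant to be called once per second.
        for opinion in self.opinions:
            opinion.tick()

    def decide(self, conn, url):
        # The first opinion returning a non-None value wins.
        for opinion in self.opinions:
            verdict = opinion.decide(conn, url)
            if verdict is not None:
                return verdict
        return None


class PrefixError:
    """Example opinion: fail every URL under a prefix."""
    def __init__(self, prefix):
        self.prefix = prefix
    def tick(self):
        pass
    def decide(self, conn, url):
        if url.startswith(self.prefix):
            return 0, '500', 'HAHA!'


class PrefixDelay(PrefixError):
    """Example opinion: delay every URL under a prefix by one second."""
    def decide(self, conn, url):
        if url.startswith(self.prefix):
            return 1, '000', ''


registry = OpinionRegistry()
registry.add(PrefixError('/broken/'))
registry.add(PrefixDelay('/'))
first = registry.decide(None, '/broken/page')   # both match, PrefixError was registered first
second = registry.decide(None, '/fine/page')    # only PrefixDelay matches
```

When no opinion answers, decide returns None, which is where the server's render method substitutes its (0, '000', 'Nothing') default.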
The client that we wrote accepts the name of a file containing Python code, and a list of substitutions to perform. (We have a simpler client, which does no substitution).
    class Error:
        def __init__(self, url):
            self.url = url
        def tick(self):
            pass
        def decide(self, conn, url):
            if url.startswith(self.url):
                return 0, '500', "HAHA!"
    opinion = Error("%(url)s")
    # run as screwsclient.py error.py url=/gonna-not-work/
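What the substituting client does with its arguments can be sketched in a few lines of Python 3. The function names below are hypothetical, and two details are assumptions rather than facts from this text: that substitution is plain %-formatting against the key=value arguments (suggested by the %(url)s placeholder), and that the whole "opinion <screwslet>" command travels as a single netstring. The netstring framing itself is the standard "<length>:<bytes>," form.

```python
def parse_substitutions(args):
    """Turn ['url=/gonna-not-work/'] into {'url': '/gonna-not-work/'}."""
    return dict(arg.split("=", 1) for arg in args)


def render_screwslet(template, substitutions):
    """Fill %(name)s placeholders; a literal % in the template must be written %%."""
    return template % substitutions


def netstring(payload):
    """Frame text as a netstring: decimal byte length, ':', the bytes, ','."""
    data = payload.encode("utf-8")
    return str(len(data)).encode("ascii") + b":" + data + b","


template = 'opinion = Error("%(url)s")\n'
code = render_screwslet(template, parse_substitutions(["url=/gonna-not-work/"]))
message = netstring("opinion " + code)
```

The length prefix is what lets the server read a multi-line SCREWSlet without caring about the newlines and quotes inside it.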
    class Delay:
        def __init__(self, url):
            self.url = url
        def tick(self):
            pass
        def decide(self, conn, url):
            if url.startswith(self.url):
                return 1, '000', ''
    # ...

    class Delayer:
        def __init__(self, loadurl, sufferurl):
            self.loadurl, self.sufferurl = loadurl, sufferurl
            self.hits = [0]*10
        def tick(self):
            # Performance? I mock performance
            self.hits = self.hits[1:]+[0]
        def decide(self, conn, url):
            if url.startswith(self.loadurl):
                self.hits[-1] += 1
            if url.startswith(self.sufferurl):
                return sum(self.hits), '000', ''

    class Delayer:
        def __init__(self, loadip, sufferip):
            self.loadip, self.sufferip = loadip, sufferip
            self.hits = [0]*10
        def tick(self):
            # Performance? I mock performance
            self.hits = self.hits[1:]+[0]
        def decide(self, conn, url):
            ip = conn.split(':')[2]
            if self.loadip == ip:
                self.hits[-1] += 1
            if self.sufferip == ip:
                return sum(self.hits), '000', ''

    import random

    class Delayer:
        def __init__(self, loadip, sufferip):
            self.loadip, self.sufferip = loadip, sufferip
            self.hits = [0]*10
        def tick(self):
            # Performance? I mock performance
            self.hits = self.hits[1:]+[0]
        def decide(self, conn, url):
            ip = conn.split(':')[2]
            if self.loadip == ip:
                self.hits[-1] += 1
            if self.sufferip == ip and random.random()>(0.99**sum(self.hits)):
                return 0, '500', 'HAHA!'
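The last variant fails a request whenever random.random() exceeds 0.99**n, where n is the number of load hits in the ten-second window, so the error probability is roughly 1 - 0.99**n. A small Python 3 sketch of how that grows with load follows; error_probability is a name introduced here for illustration only.

```python
def error_probability(recent_hits, survive=0.99):
    """Chance that a 'suffer' request fails, given the hits in the window.

    Each recent hit multiplies the chance of success by `survive`.
    """
    return 1.0 - survive ** recent_hits


# Tabulate the failure chance for a few load levels.
table = [(n, round(error_probability(n), 3)) for n in (0, 10, 69, 300)]
```

With no load nothing fails, around 69 hits in the window about half the requests fail, and at a few hundred hits nearly all of them do.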
opinions module
containing decorators and helper functions to remove
the redundant work from SCREWSlets (including "Filter", "Strictest",
"Lenient", "Random", "Percentage" and "Delayer" [which takes two
"Filters"], ERROR=(0, '500', 'HAHA!'), DELAY(n)=(n, '000', 'Nothing'))

So, remember the part where I said agent.py was the least
important part, and the most easily replaced part, of the architecture?
Well, it turns out I had to write a Twisted "web" server that accepts
seriously malformed HTTP requests and responds well to all of them,
unless SCREWed.
    # (excerpt from a protocol method; it needs
    # "from twisted.web import client" and "from twisted.python import log")
    # some hacks to get some url from a request, horrible though it might
    # be
    local, remote = self.transport.getHost(), self.transport.getPeer()
    conn = ':'.join(map(str, [local.host, local.port,
                              remote.host, remote.port]))
    d = client.getPage('http://delayer/' + conn + url)
    d.addErrback(log.err)
    def _(content):
        if not content or content.startswith('000'):
            code = '200'
        else:
            code = '500'
        self.sendLine('HTTP/1.0 %s OK' % code)
        # ...