SCREWing Web Servers

Why Screw?

B-hive Conductor is a solution that gives enterprises visibility into, and control over, web-enabled transactions. As much as our clients consider web application errors and underperformance a fact of life, it turns out to be disproportionately hard to simulate such problems in a lab: we never upgrade the web applications once they are installed, generating sufficiently high loads is costly and tricky, and we work in a relatively clean setting (no insane client behaviour).

On the other hand, in order to properly test the transaction analysis, a simple web server with a couple of missing pages is not enough: we need to install real-life web applications and simulate realistic error conditions. Lab setups good enough to test a web load balancer just don't cut it when it comes to high-end web transaction monitoring and error recovery. However, it remains our hope that other web monitoring and control tools, as well as labs trying to integrate such products, can use SCREWS.

Why SCREWS?

Naive methods of generating errors turned out to be unreliable or inflexible, and most often both. Generating application underperformance by simulating real CPU load proved unreliable (unfortunately, even modern Microsoft OSs handle scheduling under load remarkably well), and generating errors required adding custom code to web applications (which meant wrestling with the Byzantine methods web application frameworks use to refresh code whenever we wanted to simulate an administrator breaking and then fixing an error). Even aside from reliability, there were automation concerns: our first prototype for simulated tests included a distributed command architecture just to break, and fix, web applications from a centralized location.

The first SCREWS prototype was known as the delayer, and was used to simulate a scenario triggering the web application QoS support: we needed a BEA WebLogic application to delay an ASP.NET application.

# A "rolling counter", which allows a sort of "decaying sum":
# We have two counters, "now" and "prev". Every 5 seconds, we
# delete prev, shift prev to now, zero out "now" and start
# counting a new. Although there is a "traumatic event" every five
# seconds, it will always show between 5 and 10 times the "hits per second"
class Counter:
    def __init__(self):
        self.now, self.prev = 0, 0
    def tick(self):
        self.now, self.prev = 0, self.now
    def inc(self):
        self.now += 1
    def get(self):
        return self.now + self.prev

# This resource creates a load: whenever it processes a request, it
# .inc()s the counter.
class LoaderResource:
    __implements__ = resource.IResource
    isLeaf = True
    def __init__(self, counter):
        self.counter = counter
    def render(self, request):
        self.counter.inc()
        return "OK!"

# This resource suffers because of a load: whenever it processes a request,
# it waits for as many seconds as requested before returning a result.
# Under a constant, randomly distributed load on Loader and Suffer, the
# counter holds between 5 and 10 seconds' worth of hits, so the average
# latency will be 7.5*(loader hits/sec).
class SufferResource:
    __implements__ = resource.IResource
    isLeaf = True
    def __init__(self, counter):
        self.counter = counter
    def render(self, request):
        def _():
            request.write("OK!")
            request.finish()
        reactor.callLater(self.counter.get(), _)
        return server.NOT_DONE_YET

Add a little bit of glue (set up a server which serves /loader and /suffer correctly, and a timer which calls .tick every 5 seconds), and we're all set! As soon as the application which creates the load accesses /loader and the application that should slow down accesses /suffer and waits for a result, everything magically works!
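
For concreteness, here is a minimal sketch of that glue, using the classes above (the port number and variable names here are our own):

from twisted.internet import reactor, task
from twisted.web import resource, server

counter = Counter()
root = resource.Resource()
root.putChild("loader", LoaderResource(counter))
root.putChild("suffer", SufferResource(counter))

# Shift the rolling counter every 5 seconds, as described above.
task.LoopingCall(counter.tick).start(5.0)

reactor.listenTCP(8080, server.Site(root))
reactor.run()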

Unfortunately, that method does not scale -- each new error scenario needs custom server code, and each web application must be modified... in short, we needed something better.

How SCREWS

Proxy/LAMP Architecture

We use Apache as a unified front-end to all applications: as a reverse proxy to IIS-based applications and Zope, using the BEA module for interfacing to WebLogic, and using in-process PHP for PHP applications. We added an agent.py which looks like this:

from mod_python import apache
import urllib2

# We needed to do this in "postreadrequesthandler" and not in the
# default handler because the apache "Proxy" command bypasses
# the normal content handlers (which is where the default handler
# registers). We connect to the delayer, get a command and act
# on it.
# Note: the SCREWS protocol gets the TCP-layer arguments and
# the URL. There is no reason for it not to get the headers
# and other data: it could be done in a completely backwards-compatible
# fashion.
def postreadrequesthandler(req):
    my_req = req.the_request
    my_url = my_req.split()[1]
    c = req.connection
    conn = ':'.join(map(str, c.local_addr + c.remote_addr))
    if not my_url.startswith('/wbtree'):
        a = urllib2.urlopen("http://delayer:2005/" + conn + my_url)
        code, data = a.read().split()
    else:
        code = '000'
    if code == '000':
        return apache.DECLINED
    else:
        # XXX - Use actual code/content
        return apache.HTTP_INTERNAL_SERVER_ERROR

We use /wbtree for performance tests, and we want a minimal performance penalty (going to the network, waiting on the server) for those. I left the hacks and incompleteness in there on purpose -- I assume any project that wants to integrate SCREWS will want to customize the agent -- that part of the code will always be highly fluid.
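
For reference, the Apache side of the hookup is small. A sketch, assuming mod_python (the install path is a hypothetical placeholder; mod_python looks up the postreadrequesthandler function in the named module):

# httpd.conf -- server-level, since PostReadRequest runs before URI translation
PythonPath "sys.path + ['/usr/local/screws']"
PythonPostReadRequestHandler agent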

SCREWS Server

The SCREWS server has a Decider class which returns what should be done for each request, according to the TCP/HTTP parameters. Here is the code which implements the decision:

    def render(self, request):
        if len(request.postpath) < 2:
            return ''
        # This ugly bit of code unwraps the silly protocol we use
        # to pass connection parameters as part of the URL
        conn = Connection(request.postpath[0])
        url = '/' + request.uri[1:].split('/', 1)[1]
        delay, code, content = (self.decider.decide(conn, url) or
                                (0, '000', 'Nothing'))
        def _():
            request.write('%s %s' % (code, content))
            request.finish()
        reactor.callLater(delay, _)
        return twserver.NOT_DONE_YET
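
The Connection class itself is not shown here. Since the SCREWSlets below only ever call conn.split(':'), a minimal stand-in consistent with that usage (purely our assumption) is:

# Hypothetical stand-in: the agent already encodes the connection as
# "local_ip:local_port:remote_ip:remote_port", so a string suffices.
class Connection(str):
    pass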

SCREWS client

The client uses a netstring-based protocol to pass SCREWSlets to the server: "opinion<space><screwslet>" adds a SCREWSlet, and "clear" clears all existing SCREWSlets. SCREWSlets are Python code snippets which define an opinion variable in the global scope. The value of that variable must comply with the interface:

    # Must return None or (int, string, string)
    def decide(conn, url):
        pass
    def tick():
        pass

The server calls the tick method of every opinion once per second. When facing a decision, it calls the opinions' decide methods in order of registration, until one returns a non-None value.
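
A sketch of what the Decider's dispatch might look like, based purely on the description above (the internals are our assumption; only decide and tick are given by the protocol):

class Decider:
    def __init__(self):
        self.opinions = []
    def clear(self):
        # The "clear" command drops all registered SCREWSlets.
        self.opinions = []
    def tick(self):
        # Called once per second: every opinion gets its heartbeat.
        for opinion in self.opinions:
            opinion.tick()
    def decide(self, conn, url):
        # Opinions are consulted in order of registration; the first
        # non-None answer wins.
        for opinion in self.opinions:
            result = opinion.decide(conn, url)
            if result is not None:
                return result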

The client that we wrote accepts the name of a file containing Python code, and a list of substitutions to perform. (We also have a simpler client, which does no substitution.)
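
A minimal sketch of such a client (the host and port are invented placeholders; substitutions use Python's %(name)s formatting, as the examples below suggest):

import socket
import sys

def send(payload, host='delayer', port=2004):
    # netstring encoding: "<length>:<data>,"
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    s.sendall('%d:%s,' % (len(payload), payload))
    s.close()

if __name__ == '__main__':
    filename = sys.argv[1]
    substitutions = dict(arg.split('=', 1) for arg in sys.argv[2:])
    screwslet = open(filename).read() % substitutions
    send('opinion ' + screwslet)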

Examples

Error out a URL

class Error:
    def __init__(self, url):
        self.url = url
    def tick(self): pass
    def decide(self, conn, url):
        if url.startswith(self.url):
            return 0, '500', "HAHA!"
opinion = Error("%(url)s")
# run as screwsclient.py error.py url=/gonna-not-work/

Delay a URL

class Delay:
    def __init__(self, url):
        self.url = url
    def tick(self): pass
    def decide(self, conn, url):
        if url.startswith(self.url):
            return 1, '000', ''
# ...

The original delayer

class Delayer:
    def __init__(self, loadurl, sufferurl):
        self.loadurl, self.sufferurl = loadurl, sufferurl
        self.hits = [0]*10
    def tick(self):
        # Performance? I mock performance
        self.hits = self.hits[1:]+[0]
    def decide(self, conn, url):
        if url.startswith(self.loadurl):
            self.hits[-1] += 1
        if url.startswith(self.sufferurl):
            return sum(self.hits), '000', ''

Hate the users from abroad?

class Delayer:
    def __init__(self, loadip, sufferip):
        self.loadip, self.sufferip = loadip, sufferip
        self.hits = [0]*10
    def tick(self):
        # Performance? I mock performance
        self.hits = self.hits[1:]+[0]
    def decide(self, conn, url):
        ip = conn.split(':')[2]
        if self.loadip == ip:
            self.hits[-1] += 1
        if self.sufferip == ip:
            return sum(self.hits), '000', ''

Error out more and more often

import random

class Delayer:
    def __init__(self, loadip, sufferip):
        self.loadip, self.sufferip = loadip, sufferip
        self.hits = [0]*10
    def tick(self):
        # Performance? I mock performance
        self.hits = self.hits[1:]+[0]
    def decide(self, conn, url):
        ip = conn.split(':')[2]
        if self.loadip == ip:
            self.hits[-1] += 1
        if self.sufferip == ip and random.random() > (0.99 ** sum(self.hits)):
            return 0, '500', 'HAHA!'

Future directions

Recent anecdote

So, remember the part where I said agent.py was the least important, and the most easily replaced, part of the architecture? Well, it turns out I had to write a Twisted "web" server which accepts seriously malformed HTTP requests and responds well to all of them, unless SCREWed.

        # some hacks to get some url from a request, horrible though it might
        # be
        local, remote = self.transport.getHost(), self.transport.getPeer()
        conn = ':'.join(map(str, [local.host, local.port,
                                  remote.host, remote.port]))
        d = client.getPage('http://delayer/'+conn+url)
        d.addErrback(log.err)
        def _(content):
            if not content or content.startswith('000'):
                code = '200'
            else:
                code = '500'
            self.sendLine('HTTP/1.0 %s OK' % code)
            # ...
        d.addCallback(_)