thoughts, ideas, code and other things...

Saturday, July 17, 2010

Poorly implementing caching in python - eye opener

So after looking at some slides on caching function returns in javascript, I was keen on trying out so in Python. And LOL I came up with this logic -
if val in cache.keys():
return cache[val]
do the right thing...

But it seems, though "if val in cache.keys()" sounds very human friendly, it definitely would suck for a very big cache, and so it does in the following test.

I guess I'm not using timeit in the classical way where I would pass some statements in string and ask it to do them for N number of times, but it seems passing some variables from existing code to statement string is a pain, tried global etc, didn't work. Hence a simple time diff test.
# -*- coding: utf-8 -*-
import random
from timeit import Timer

class PlainColorParser:
def parse(self,value):
rgb = int(value[1:], 16)
r = rgb >> 16 & 0xff
g = rgb >> 8 & 0xff
b = rgb & 0xff
return (r,g,b)

def __call__(self,value):
return self.parse(value)

class NoBitColorParser:
def __call__(self,value):
return (int(value[1:3],16), int(value[3:5],16), int(value[5:7],16))

class PoorCachedColorParser(PlainColorParser):
def __init__(self):
self.cache = {}

def __call__(self,value):
if value in self.cache.keys():
return self.cache[value]
self.cache[value] = self.parse(value)
return self.cache[value]

class CachedExceptionColorParser(PlainColorParser):
def __init__(self):
self.cache = {}

def __call__(self,value):
return self.cache[value]
except KeyError:
self.cache[value] = self.parse(value)
return self.cache[value]

class CachedColorParser(PlainColorParser):
def __init__(self):
self.cache = {}

def __call__(self,value):
if value in self.cache:
return self.cache[value]
self.cache[value] = self.parse(value)
return self.cache[value]

if __name__ == "__main__":
t = Timer()
pccParse = PoorCachedColorParser()
cecParse = CachedExceptionColorParser()
ccParse = CachedColorParser()
pcParse = PlainColorParser()
nbParse = NoBitColorParser()

# setup some random data to test
colors = []
for i in xrange(100000):
colors.append("#" + hex(random.randint(0xfe0000, 0xff0aff))[2:])

def timeDiff(obj):
start = t.timer()
for c in colors:
stop = t.timer()
return ((1000000*stop - 1000000*start)/1000000)

# ---- test poorly cached
print "Poorly Cached - %.2fs" % timeDiff(pccParse)

# ---- test exception cached
print "Exception Cached - %.2fs" % timeDiff(cecParse)

# ---- test cached
print "Cached - %.2fs" % timeDiff(ccParse)

# ---- test uncached
print "Non Cached - %.2fs" % timeDiff(pcParse)

# ---- test no bitwise, uncached
print "Not Bitwise, Non Cached - %.2fs" % timeDiff(nbParse)

So we've got 4 5 classes to represent different ways of parsing an html hex code for color e.g. "#f00f00" into a tuple of (r,g,b) integers. PlainColorParser and NoBitColorParser could easily be functions with no need of classes over them as they do not cache, but to bring them a little equal to other two cached ones, I've bound them in classes.

NoBitColorParser does string manipulations and parses 3 times before returning a tuple. PlainColorParser does better than that, it uses bit shifts and AND masks to filter out content after 1 round of parsing integer from string. PoorCachedColorParser does caching in an obvious way "if its there in cache keys... else ...", and CachedColorParser CachedExceptionColorParser complies to the philosophy of "Fail early, fail often", which is quite interesting :D
but recent findings reveal that CachedColorParser is the right, fast, pythonic way.

What the test does is - generate 100000 random color codes, pick them from a range of 2816 colors (0xff0aff - 0xff0000 + 1). Obviously many colors are bound to get repeated says pigeon hole.

Here's what goes on in an average run on my machine -
abhishekmishra@mbp [~/code]> python
Poorly Cached - 340.23s
Exception Cached - 0.30s
Cached - 0.18s
Non Cached - 0.15s
Not Bitwise, Non Cached - 0.16s

"if value in self.cache.keys():" in PoorCachedColorParser gives you a thumbs down with a sucky performance, obviously not the right thing to do!!! (I was mistaken)

CachedExceptionColorParser gives a sweet 0.30s, "Fail early, Fail often" works :)
But wait, CachedColorParser goes further even a bit more with just 0.18s.

NoBitColorParser is suckier than the Non Cached PlainColorParser, which points out that string ops, parsing integers is one costly affair.

So much for food to my sleeplessness. Oh I remember doing something similar in PyMos too :D.

More updates -

Looking at Dhananjay's code on BangPypers, I think I was too excited to throw in the idion of fail fast in this place, it can albeit be done in a cleaner way. So instead of -
return self.cache[value]
except KeyError:
self.cache[value] = self.parse(value)
return self.cache[value]

You could just write in much cleaner way -
if value in self.cache:
return self.cache[value]
self.cache[value] = self.parse(value)
return self.cache[value]

So the issue was with cache.keys(), which now seems like an obvious slow and shitty way.
New stats reveal that even Try: Except:... fail fast is even not the right way.

Notice the almost double time difference between "try.. except.." way and "if value in self.cache".

Lesson learnt :)

Labels: , , ,


At August 8, 2010 at 3:13 PM , Anonymous Anonymous said...

I don't get it.

>>> a = {}
>>> type(a.keys())

>>> type(a)

Isn't understanding membership test costs as simple as understanding what data structure lies beneath?

Or are you pointing out that you didn't know python-provided sets and dicts can be directly queried?

At August 8, 2010 at 3:20 PM , Blogger Abhishek Mishra said...

Yes towards the end, the updates reveal that I did not know that dict is directly queriable. Hence an eye opener to me, obviousness to others :)


Post a Comment

Subscribe to Post Comments [Atom]

<< Home