IdeaMonk

Scraping in Python for fun and profit

Missed out on this one lately.

Q1 of BORQ - Mad Libs

My solution for Mad Libs. I need to practice regular expressions some more, and Rubular is a great help for people like me. Regexps are more free-form in Ruby than in Python, which is quite good for quick experiments in irb.

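	# solution to simpler part
	# puts gets.chop.gsub!( /\(\([^)]*\)\)/ ) { |w| w = gets.chop }

	# part 2 - along with substitution
	dict = {}   # answers remembered for ((name:prompt)) placeholders
	puts IO.read(ARGV.shift).chop.gsub!( /\(\([^)]*\)\)/ ) { |w|
	  w = w.scan(/[^()]+/)[0]          # drop the surrounding (( ))
	  combo = w.split(":")
	  if (combo.size==2)
	    # ((name:prompt)) - ask once and remember the answer under name
	    print "Enter #{combo[1]} : "
	    w = dict[combo[0]] = gets.chop
	  else
	    if (dict[combo[0]] == nil)
	      # plain placeholder - ask, but don't remember
	      print "Enter #{combo[0]} : "
	      w=gets.chop
	    else
	      # reuse of a remembered name
	      w = dict[combo[0]]
	    end
	  end
	}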


I wish I could get rid of the "if"s and make it sexier, but I can't think of anything more for this one right now.
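For comparison, here is a rough Python sketch of the same substitution with the nesting flattened a little. It is not a translation of the Ruby above line for line: fill_madlib and the ask callback are names made up for this sketch, and answers come from whatever function you pass in rather than straight from stdin.

```python
import re

# Matches ((...)) placeholders, capturing the text between the double parentheses.
PLACEHOLDER = re.compile(r"\(\(([^)]*)\)\)")


def fill_madlib(template, ask):
    """Replace every ((...)) placeholder, calling ask(prompt) to get answers."""
    remembered = {}

    def substitute(match):
        # "gem:a gemstone" -> name "gem", prompt "a gemstone"
        # "a noun"         -> name "a noun", prompt ""
        name, _, prompt = match.group(1).partition(":")
        if prompt:
            # ((name:prompt)): ask once and remember the answer under name
            remembered[name] = ask(prompt)
            return remembered[name]
        # ((name)): reuse a remembered answer, otherwise ask without storing
        return remembered[name] if name in remembered else ask(name)

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    story = "I found a ((gem:a gemstone)). The ((gem)) was ((an adjective))."
    print(fill_madlib(story, lambda prompt: input("Enter %s : " % prompt)))
```

Passing the prompt function in keeps the substitution logic separate from the terminal, which also makes it easy to feed canned answers when trying things out.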

Labels: Ruby

Q2. of BORQ - LCD display

This is an old one, famously seen on UVa and elsewhere. I found Choice to be nicer than OptParse; as the Choice docs put it, it's like writing poems for command-line parsing :D

	# Author: Abhishek Mishra <ideamonk@gmail.com>

	require 'rubygems'
	require 'choice'

	Choice.options do
	  banner 'q2_lcd_numbers.rb [-shv] string'

	  separator 'Optional:'
	  option :size do
	    short '-s'
	    desc 'digit size'
	    cast Integer
	    default 1
	  end

	  separator 'Common:'
	  option :help do
	    short '-h'
	    long '--help'
	    desc 'Show this message.'
	  end

	  option :version do
	    short '-v'
	    long '--version'
	    desc 'Show version.'
	    action do
	      puts 'LCD generator 0.1'
	      exit
	    end
	  end
	end

	def paint_map(map,digit,size,start)
	  # upper, mid, bottom
	  for i in (start+1)...(start+size+1)
	    if $digits[digit][0] == 1
	      map[[0,i]] = 2
	    end
	    if $digits[digit][3] == 1
	      map[[0 + size + 1,i]] = 2
	    end
	    if $digits[digit][6] == 1
	      map[[0 + size*2 + 2,i]] = 2
	    end
	  end
	  # verticals
	  for i in 1...size+1
	    if $digits[digit][1] == 1
	      map[[i,0+start]] = 1
	    end
	    if $digits[digit][2] == 1
	      map[[i,size+1+start]] = 1
	    end
	    if $digits[digit][4] == 1
	      map[[i+size+1,0+start]] = 1
	    end
	    if $digits[digit][5] == 1
	      map[[i+size+1,size+1+start]] = 1
	    end
	  end
	end

	def print_map(map, size, total_digits)
	  for j in 0...(size*2+3)
	    for i in 0...(total_digits*(size+3))
	      if map[[j,i]] == 2
	        print '-'
	      elsif map[[j,i]] == 1
	        print '|'
	      else
	        print ' '
	      end
	    end
	    puts ''
	  end
	end

	def drive_paint(map, string, size)
	  i = 0
	  j = 0
	  while (i<string.size*(size+3))
	    paint_map(map,string[j..j].to_i,size,i)
	    i += size+3
	    j += 1
	  end
	end

	#  _        0
	# | |     1   2
	#  -        3
	# | |     4   5
	#  -        6

	$digits = [
	  [1,1,1,0,1,1,1], [0,0,1,0,0,1,0], # 0 1
	  [1,0,1,1,1,0,1], [1,0,1,1,0,1,1], # 2 3
	  [0,1,1,1,0,1,0], [1,1,0,1,0,1,1], # 4 5
	  [1,1,0,1,1,1,1], [1,0,1,0,0,1,0], # 6 7
	  [1,1,1,1,1,1,1], [1,1,1,1,0,1,0], # 8 9
	]

	string = ARGV[-1]
	size = Choice.choices[:size]
	digitmap = {}

	drive_paint(digitmap, string, size)
	print_map(digitmap, size, string.size)


Initially I was doing string[j].chr.to_i, which felt like unnecessary boilerplate. I wish it could have been string[j].to_i, but string[j] gives back the character's integer code rather than the character itself. One could talk about the C-style '5'-'0' conversion to a number, but that is machine dependent (think ASCII, EBCDIC, etc.). The slightly more Ruby-ish way is string[j..j].to_i :)
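The same two conversions look like this in Python, for comparison. This is only a small sketch; digits_via_int and digits_via_ord are names made up here.

```python
def digits_via_int(text):
    """Convert each digit character with int(); no character codes involved."""
    return [int(ch) for ch in text]


def digits_via_ord(text):
    """C-style conversion: subtract the code of '0' from each character's code."""
    return [ord(ch) - ord("0") for ch in text]


if __name__ == "__main__":
    print(digits_via_int("9535009187"))
    print(digits_via_ord("9535009187"))
```

The int() version also rejects non-digit characters with a ValueError, whereas the ord() arithmetic silently produces a meaningless number for them.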

Here's my number -

abhishekmishra@mbp [~/code/BORQ]> ruby q2_lcd_numbers.rb -s 3 9535009187
 ---   ---   ---   ---   ---   ---   ---         ---   ---
|   | |         | |     |   | |   | |   |     | |   |     |
|   | |         | |     |   | |   | |   |     | |   |     |
|   | |         | |     |   | |   | |   |     | |   |     |
 ---   ---   ---   ---               ---         ---
    |     |     |     | |   | |   |     |     | |   |     |
    |     |     |     | |   | |   |     |     | |   |     |
    |     |     |     | |   | |   |     |     | |   |     |
       ---   ---   ---   ---   ---               ---

Surprisingly, the PyMos core code is 180 lines while this reaches up to 108. Python => more results per line? The absence of 'end' in Python is one big reason for this. Besides, I don't find my Ruby code very Ruby-ish at this stage.
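To put the line-count question to a small test, here is a rough Python sketch of the same LCD renderer. It takes a different route from the Ruby script: instead of painting into a hash keyed by coordinates, it builds each output row as a string. The segment table is copied from the script above; lcd_rows is a name invented for this sketch, and unlike the Ruby version it does not print a trailing space after the last digit.

```python
# Segment order: 0 top, 1 upper-left, 2 upper-right, 3 middle,
# 4 lower-left, 5 lower-right, 6 bottom (same table as the Ruby script).
SEGMENTS = [
    (1, 1, 1, 0, 1, 1, 1), (0, 0, 1, 0, 0, 1, 0),  # 0 1
    (1, 0, 1, 1, 1, 0, 1), (1, 0, 1, 1, 0, 1, 1),  # 2 3
    (0, 1, 1, 1, 0, 1, 0), (1, 1, 0, 1, 0, 1, 1),  # 4 5
    (1, 1, 0, 1, 1, 1, 1), (1, 0, 1, 0, 0, 1, 0),  # 6 7
    (1, 1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 0, 1, 0),  # 8 9
]


def lcd_rows(text, size=1):
    """Return the LCD rendering of a digit string as a list of text rows."""

    def horizontal(seg):
        # One horizontal bar per digit, padded by a blank column on each side.
        return " ".join(
            " " + ("-" if SEGMENTS[int(ch)][seg] else " ") * size + " "
            for ch in text
        )

    def vertical(left, right):
        # Left and right uprights per digit with `size` blanks between them.
        return " ".join(
            ("|" if SEGMENTS[int(ch)][left] else " ")
            + " " * size
            + ("|" if SEGMENTS[int(ch)][right] else " ")
            for ch in text
        )

    rows = [horizontal(0)]
    rows += [vertical(1, 2)] * size
    rows.append(horizontal(3))
    rows += [vertical(4, 5)] * size
    rows.append(horizontal(6))
    return rows


if __name__ == "__main__":
    print("\n".join(lcd_rows("9535009187", size=3)))
```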

Labels: Ruby

Speaking at PyCon this weekend :)

Here are the slides for my talk at the upcoming PyCon India 2010. See you this weekend.

Scraping with Python for Fun and Profit - PyCon India 2010

View more presentations from Abhishek Mishra.

See you there!

Labels: pycon, python

Poorly implementing caching in python - eye opener

So after looking at some slides on caching function return values in JavaScript, I was keen to try the same in Python. And LOL, I came up with this logic -

def fun(val):
  if val in cache.keys():
    return cache[val]
  else:
    pass  # do the right thing...

But it seems that although "if val in cache.keys()" sounds very human-friendly, it would definitely suck for a very big cache, and so it does in the following test.

I guess I'm not using timeit in the classical way, where I would pass statements as a string and ask it to run them N times. Passing variables from existing code into the statement string is a pain (I tried global etc. and it didn't work), hence a simple time-difference test.
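One way around the variable-passing pain: timeit.Timer also accepts a zero-argument callable in place of a statement string (since Python 2.6), so a lambda can close over whatever is already in scope. A minimal sketch, where time_calls is a hypothetical helper and not part of timeit:

```python
import timeit


def time_calls(func, inputs, repeat=3):
    """Best-of-`repeat` wall time for calling func once on every input."""
    # The lambda closes over func and inputs, so no setup string is needed.
    timer = timeit.Timer(lambda: [func(x) for x in inputs])
    # number=1 means each measurement is one full pass over the inputs.
    return min(timer.repeat(repeat=repeat, number=1))


if __name__ == "__main__":
    samples = ["#ff0a%02x" % i for i in range(256)]
    seconds = time_calls(lambda value: int(value[1:], 16), samples)
    print("parsed %d colors in %.6fs" % (len(samples), seconds))
```

Taking the minimum over a few repeats is the usual timeit advice for reducing noise from whatever else the machine is doing.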

# -*- coding: utf-8 -*-
import random
from timeit import Timer

class PlainColorParser:
    def parse(self,value):
        rgb = int(value[1:], 16)
        r = rgb >> 16 & 0xff
        g = rgb >>  8 & 0xff
        b = rgb & 0xff
        return (r,g,b)

    def __call__(self,value):
        return self.parse(value)

class NoBitColorParser:
    def __call__(self,value):
        return (int(value[1:3],16), int(value[3:5],16), int(value[5:7],16))

class PoorCachedColorParser(PlainColorParser):
    def __init__(self):
        self.cache = {}

    def __call__(self,value):
        if value in self.cache.keys():
            return self.cache[value]
        self.cache[value] = self.parse(value)
        return self.cache[value]

class CachedExceptionColorParser(PlainColorParser):
    def __init__(self):
        self.cache = {}

    def __call__(self,value):
        try:
            return self.cache[value]
        except KeyError:
            self.cache[value] = self.parse(value)
            return self.cache[value]

class CachedColorParser(PlainColorParser):
    def __init__(self):
        self.cache = {}

    def __call__(self,value):
        if value in self.cache:
            return self.cache[value]
        else:
            self.cache[value] = self.parse(value)
            return self.cache[value]

if __name__ == "__main__":
    t = Timer()
    pccParse = PoorCachedColorParser()
    cecParse = CachedExceptionColorParser()
    ccParse = CachedColorParser()
    pcParse = PlainColorParser()
    nbParse = NoBitColorParser()

    # setup some random data to test
    colors = []
    for i in xrange(100000):
        colors.append("#" + hex(random.randint(0xfe0000, 0xff0aff))[2:])


    def timeDiff(obj):
        start = t.timer()
        for c in colors:
            obj(c)
        stop = t.timer()
        return ((1000000*stop - 1000000*start)/1000000)

    # ---- test poorly cached
    print "Poorly Cached - %.2fs" % timeDiff(pccParse)

    # ---- test exception cached
    print "Exception Cached - %.2fs" % timeDiff(cecParse)

    # ---- test cached
    print "Cached - %.2fs" % timeDiff(ccParse)

    # ---- test uncached
    print "Non Cached - %.2fs" % timeDiff(pcParse)

    # ---- test no bitwise, uncached
    print "Not Bitwise, Non Cached - %.2fs" % timeDiff(nbParse)

So we've got 5 classes that represent different ways of parsing an HTML hex color code, e.g. "#f00f00", into a tuple of (r,g,b) integers. PlainColorParser and NoBitColorParser could easily be plain functions, since they do not cache, but to put them on an equal footing with the three cached ones I've wrapped them in classes.

NoBitColorParser slices the string and parses three times before returning a tuple. PlainColorParser does better: it parses the string into an integer once, then uses bit shifts and AND masks to pull out each channel. PoorCachedColorParser does caching in the obvious way, "if it's there in cache keys... else ...", and CachedExceptionColorParser follows the philosophy of "Fail early, fail often", which is quite interesting :D
But recent findings reveal that CachedColorParser is the right, fast, Pythonic way.
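As a worked example of the two parsing strategies on the "#f00f00" code mentioned earlier, here they are as plain functions. The function names are made up for this sketch; the bodies mirror PlainColorParser.parse and NoBitColorParser.

```python
def parse_bits(value):
    """Parse once, then shift and mask out each 8-bit channel."""
    rgb = int(value[1:], 16)      # "#f00f00" -> 0xf00f00
    r = rgb >> 16 & 0xff          # top byte
    g = rgb >> 8 & 0xff           # middle byte
    b = rgb & 0xff                # low byte
    return (r, g, b)


def parse_slices(value):
    """Slice out each two-character channel and parse it separately."""
    return (int(value[1:3], 16), int(value[3:5], 16), int(value[5:7], 16))


if __name__ == "__main__":
    print(parse_bits("#f00f00"))    # (240, 15, 0)
    print(parse_slices("#f00f00"))  # (240, 15, 0)
```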

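What the test does is generate 100000 random color codes, picked from a range of 2816 colors (0xff0aff - 0xff0000 + 1). Note that the listing above actually starts the range at 0xfe0000, which makes it 68352 colors. Either way there are fewer colors than draws, so many colors are bound to get repeated, says the pigeonhole principle.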
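The arithmetic behind that can be checked quickly. The sketch below computes both range sizes (the one quoted in the text and the one the listing passes to random.randint) and estimates how many of the 100000 draws land on a color that was already seen, assuming uniform picks; expected_distinct is a helper written for this sketch.

```python
# Size of the range quoted in the text: 0xff0000 .. 0xff0aff inclusive.
span_in_prose = 0xff0aff - 0xff0000 + 1      # 2816

# Size of the range the listing passes to random.randint: 0xfe0000 .. 0xff0aff.
span_in_listing = 0xff0aff - 0xfe0000 + 1    # 68352


def expected_distinct(span, draws):
    """Expected number of distinct values after `draws` uniform picks from `span`."""
    return span * (1 - (1 - 1.0 / span) ** draws)


draws = 100000
for span in (span_in_prose, span_in_listing):
    repeats = draws - expected_distinct(span, draws)
    print("%d colors: about %d of %d draws repeat an earlier color"
          % (span, repeats, draws))
```

The fewer the repeats, the less a cache can help, which is worth keeping in mind when comparing the cached and non-cached timings.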

Here's what goes on in an average run on my machine -

abhishekmishra@mbp [~/code]> python pycaching.py
Poorly Cached - 340.23s
Exception Cached - 0.30s
Cached - 0.18s
Non Cached - 0.15s
Not Bitwise, Non Cached - 0.16s

"if value in self.cache.keys():" in PoorCachedColorParser earns a thumbs down for its terrible performance; obviously not the right thing to do!!! (I was mistaken)

CachedExceptionColorParser gives a sweet 0.30s, so "Fail early, fail often" works :)
But wait, CachedColorParser goes a bit further still, with just 0.18s.

NoBitColorParser is slightly slower than the non-cached PlainColorParser (0.16s vs 0.15s), which suggests that slicing strings and parsing integers three times over is a costly affair.

So much for food to my sleeplessness. Oh I remember doing something similar in PyMos too :D.

More updates -

Looking at Dhananjay's code on BangPypers, I think I was too eager to throw in the fail-fast idiom here; it can be done in a cleaner way. So instead of -

try:
    return self.cache[value]
except KeyError:
    self.cache[value] = self.parse(value)
    return self.cache[value]

You could just write it in a much cleaner way -

if value in self.cache:
    return self.cache[value]
else:
    self.cache[value] = self.parse(value)
    return self.cache[value]

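So the issue was with cache.keys(), which now seems like an obviously slow way to do it: in Python 2, keys() builds a fresh list of every key on each call and "in" then scans that list linearly, whereas "value in self.cache" is a single hash lookup.
The new stats reveal that even the try/except fail-fast approach is not the fastest way.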
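The gap between scanning a list of keys and asking the dict directly is easy to reproduce. The sketch below runs on Python 3, where keys() no longer returns a list, so it uses list(cache) to stand in for what Python 2's cache.keys() did on every call. membership_timings is a name invented here, and the numbers will vary by machine.

```python
import timeit


def membership_timings(n=20000, lookups=200):
    """Time `lookups` membership tests against a list of keys and against the dict."""
    cache = dict.fromkeys(range(n))
    missing = -1  # worst case for a list scan: every element gets compared

    # Mimics Python 2's cache.keys(): build a new list, then scan it linearly.
    via_list = timeit.timeit(lambda: missing in list(cache), number=lookups)
    # Asking the dict itself is a single hash lookup.
    via_dict = timeit.timeit(lambda: missing in cache, number=lookups)
    return via_list, via_dict


if __name__ == "__main__":
    via_list, via_dict = membership_timings()
    print("list of keys: %.4fs, dict: %.6fs" % (via_list, via_dict))
```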

Notice the time difference between the "try.. except.." way (0.30s) and "if value in self.cache" (0.18s): the former takes almost twice as long.

Lesson learnt :)
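For what it's worth, on Python 3.2 and later the standard library's functools.lru_cache wraps up the same dict-based memoization as a decorator. This is a different route from the hand-rolled classes above, sketched here with a made-up parse_color function:

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # maxsize=None: unbounded cache, nothing is evicted
def parse_color(value):
    """Parse '#rrggbb' into an (r, g, b) tuple; repeated inputs hit the cache."""
    rgb = int(value[1:], 16)
    return (rgb >> 16 & 0xff, rgb >> 8 & 0xff, rgb & 0xff)


if __name__ == "__main__":
    parse_color("#ff0a10")
    parse_color("#ff0a10")
    print(parse_color.cache_info())
```

cache_info() reports hits and misses, which makes it easy to see whether the cache is earning its keep on a given data set.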

Labels: caching, experiments, memoization, python

Thieves in our own house (अपने घर के चोर)

You find it to be a sorry ass pussy system 
that fails to respond to situations in time, 
you read up, watch stuff and blame on them for "not doing anything", 
but hardly would it strike to your conditioned imagination 
that they might just be so pre-occupied in doing something else 
that you could have never imagined. 

And yet you passive cunt, you take it all for granted
and sleep for the next day to come by,
expecting everything to be normal.

Who knows for everything,
that you quoted the system to be pussy about,
over your cup of tea and a morning newspaper,
there were these "अपने घर के चोर",
thieves, filling their pockets.


On a side note,
Now that things have become so well connected,
the systems are so well informed, fed by numerous channels,
and fingers just like the ones that typed out this piece of brainf*ck,
clicking out zillions of preferences,
we're not far from the point when
it would be possible to crunch these into
a whole new understanding, meaning, life.

Just that in this spiderweb,
analyzing nodes, reaching from one to another,
predicting their minds and way of thinking has become easier.

Labels: rants

DWM-ified

Arch + dwm + Conky + dzen2 + links + vim + irssi. Apart from the RAM eaten by Chrome, Firefox, Thunderbird and eden, it sits at an awesome ~100MB: an ideal light & fast setup for my netbook :)

Labels: archlinux

Release early, release often, release immediately?

Okay nothing much for this post, but I was just thinking about the RERO philosophy apparently popularized by ESR in CATB (I guess I won't need to read the 256 pages).

Anyway, I was wondering whether releasing immediately is a good thing to do. Picture yourself working the whole night on a cool project that you intend to release to the public. It's 6 AM, you're done with your last commit, nothing seems to be broken, yet is it the right time to release? The problem has nothing to do with technicalities; it is a psychological one, based entirely on my experiences with such quick releases.

The question is, are you left with enough energy after working for 2-3 days straight to take criticism, suggestions and comments seriously and positively? At least I, as the lone developer of something I put before the public, haven't been able to harness the feedback people gave me. I've pondered the reasons, and I'm not sure why that is. But mostly it went something like this -

"Could you possibly include feature XYZ into it?"
and I think - oh come on, I just got done with this work and now I need a break.
"This stuff seems broken, I can't move X to Y."
and I think - who said I made it for you :/
"It would be nice if you include ABC and XYZ also"
and I just write a nice reply to the person, add it to my non-existent list of todos in my mind or to Tasks in Gmail, and it stays there forever.

This has happened many times; even valuable suggestions have ended up in a todo list rather than getting implemented, and the projects hardly get touched again.

How do you go about it? Or is it just me who keeps hopping between ideas and leaves the old ones half done? Or is it the procrastination that sets in at the dawn of winter and summer vacations?

An interesting observation about procrastination: during these end-sem exams we were given a 5-day prep leave, and LOL, I spent the whole time writing some piece of code instead of doing maths. Same story for the breaks between exams: reading something, watching some anime, writing some code, but zero preparation for any exam. And yes, a hell of a lot of ideas cooking in my mind for the coming vacations.

Surprisingly, it happens not once but many times that when exams end, I go completely blank on the things I had planned. And with an "Oh come on! Exams just got over!" mentality, I start spending days being useless to myself and everyone else.

And even when some new work arrives, I still end up behaving as if I'm super busy with things. Hoping to get rid of this soon, and maybe hack this "not doing what I am supposed to do at this time" way of life for the greater good.

</procrastination>

Okay, the Blogger WYSIWYG editor is really not good on Chrome. It doesn't render <, > properly and ends up putting in loads of DIVs instead of the beautiful Ps it produces in Firefox. Why can't they have one standard across two browsers for the same webapp?

Labels: foo, musings

Wednesday, January 05, 2011

Scraping in Python for fun and profit

Sunday, October 24, 2010

Q1 of BORQ - Mad Libs

Q2. of BORQ - LCD display

Wednesday, September 22, 2010

Speaking at PyCon this weekend :)

Saturday, July 17, 2010

Poorly implementing caching in python - eye opener

Wednesday, July 14, 2010

Thieves in our own house (अपने घर के चोर)

Monday, June 21, 2010

DWM-ified

Friday, May 21, 2010

Release early, release often, release immediately?
