Exploring Scraping Methods
By Thomas Krehbiel
· Krehbiel Tech · Thursday, May 3, 2007, 7:16 PM · 556 words · 2 comments
I've got a working prototype of the UvMoney system now. It's pretty rough around the edges (especially in the user-interface department), but it works. I'm at a natural stopping point, so this seems like a good time to take a step back and ruminate on it.
While the system works for me, I'm having a hard time envisioning a way to make it work for anyone else, which would obviously be a major impediment to any sort of public distribution. Clearly, AutoHotkey scripts are not going to cut it.
I came across another bit of automation software called iMacros (from iOpus), which claims to be a solution specifically for scraping data from web sites through automating form entry. It sounds like it would work much better than AutoHotkey, but naturally the fully scriptable version I probably need is the most expensive one ($499). And of course there would be licensing and distribution issues with that.
I also wondered about a JavaScript solution. Theoretically, if I could just insert a little bit of JavaScript into the incoming bank pages, I could populate the login forms and submit them that way. But that means writing some kind of proxy. Then I wondered if Firefox had any kind of add-on where you could run JavaScript against a page after it's loaded. I first looked at the Venkman JavaScript Debugger, which is neat, but it doesn't seem to debug the scripts on the page you're looking at. Next I looked at Firebug, which holds the functionality I want within the clutches of its JavaScript console, but again it looks like I'll need something like AutoHotkey to drive it automatically.
Of course, ideally I wouldn't use a web browser at all. But that means figuring out how to get data directly from bank sites, and that means sending HttpWebRequests. I've played with this option a fair amount, and so far, it's not looking very easy. Banks seem eager to obfuscate the process of logging into their web sites, so it's not just a simple matter of sending an HTTP POST request. (Well, it probably boils down to that, but combing through all the frames, cookies, JavaScript, and hidden form fields to reverse engineer the correct procedures and parameters is not terribly intuitive or fun.)
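To make the moving parts concrete, here is a minimal sketch of the "it probably boils down to a POST" idea. It is written in Python rather than with .NET's HttpWebRequest, and everything specific in it is an assumption for illustration: the `bank.example` URL, the sample page, the field names (`token`, `user`, `password`), and the helper names. It scrapes the login form for its action URL and hidden fields, merges in the credentials, and builds the POST request a browser would have sent. No real bank is this simple.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode, urljoin
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Hypothetical login page; a real one is usually buried in frames and scripts.
SAMPLE_HTML = """
<form action="do_login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="user">
  <input type="password" name="password">
  <input type="submit" name="go" value="Log in">
</form>
"""

# Input types that are not ordinary data fields (simplification: a browser
# would also send the one submit button that was actually clicked).
SKIPPED_TYPES = {"submit", "button", "reset", "image"}


class LoginFormParser(HTMLParser):
    """Collects the action URL and named inputs of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
        elif tag == "input" and self._in_form and attrs.get("name"):
            if (attrs.get("type") or "text").lower() not in SKIPPED_TYPES:
                # Hidden fields keep their server-supplied value.
                self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_login_request(page_url, html, credentials):
    """Return a POST Request carrying the form's fields plus the credentials."""
    parser = LoginFormParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no <form> found on page")
    data = dict(parser.fields)
    data.update(credentials)  # fill in the visible fields
    body = urlencode(data).encode("ascii")
    # Supplying a body makes urllib issue a POST instead of a GET.
    return Request(urljoin(page_url, parser.action), data=body)


def make_session():
    """Opener that stores cookies from each response and sends them back."""
    return build_opener(HTTPCookieProcessor(CookieJar()))
```

In use, the same opener from `make_session()` would first fetch the login page (so that any session cookie is captured) and then `open()` the request returned by `build_login_request()`. The frustrating part is working out which fields and cookies a given site actually requires; the sketch only covers the mechanical step once those are known.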
I did some searching to see if anyone else had solved the problem of scraping data from bank sites. I found references to banking "aggregators" -- one from Yodlee.com, and one from Quicken ("My Accounts"). Neither of those solutions is open for anyone to play with, which makes them pretty useless to me. I also saw a reference to ScrapeGoat, but I read that they not only don't give you any code, they only make custom solutions that run on their servers. Other than that I've come up empty.
So from those examples I can conclude that it's possible to scrape bank data but, without cooperation from the banks themselves, it's a somewhat frustrating trial-and-error process. I'll keep plugging away at it.
P.S. Scraping is sometimes frowned upon in the Internet community, because it can be (and is) used by nefarious webmasters to plagiarize content from other sites. However, I think most people would agree that the type of scraping I'm talking about here is perfectly valid, since it's my own banking information that I'm trying to scrape.
Reader Comments
1. Sean/Red said,
I agree a web browser is not the best solution, but since you've already investigated Firefox plug-ins, you've neglected the Greasemonkey plug-in. I've never used it, but I think it may do what you want. Also, Mozilla has a COM interface for the Gecko rendering engine; you may want to try that. You may also want to check out the open source project GnuCash (you probably already found this while searching for OFX info).
Thursday, May 3, 2007, 9:45 PM
2. Tom said,
I didn't see a Windows version of GnuCash?
I had a breakthrough on HttpWebRequests so I'm hopeful I can ditch the web browser automation entirely. (Turns out I had to set the UserAgent string to make the Ameriprise site work.)
Friday, May 4, 2007, 3:24 PM