craigslist spider project

code snip of parsing in php

Working on parsing the various result screens that are possible. This is the nucleus of the entire project: capturing specific results.

I'm trying to isolate the individual results at this point. Part of that is weeding out the "NEARBY" results craigslist provides if there aren't many local results. The ultimate goal, of course, is to aggregate NATIONWIDE results.
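
One rough way to weed out the nearby results from tag-stripped text is to cut the string at whatever marker separates the two groups. This is only a sketch: the marker wording ('NEARBY RESULTS') and the sample text are made up for illustration, and the real page would have to be inspected to find the actual separator.

```php
<?php
// Hypothetical stripped text; the marker wording is an assumption, not craigslist's actual output.
$text = "Red bicycle \$50\nBlue kayak \$200\nNEARBY RESULTS\nGreen canoe \$300";
$marker = 'NEARBY RESULTS';

// Keep only what comes before the marker; if the marker is absent, keep everything.
$pos = stripos($text, $marker);
$local = ($pos === false) ? $text : substr($text, 0, $pos);

// Split into one entry per line and drop empty lines.
$localLines = array_values(array_filter(array_map('trim', explode("\n", $local)), 'strlen'));
print_r($localLines);
```

With the sample text above, only the two entries before the marker survive.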

Comments

Just realized I'm mucking around in mud. I've used PHP's strip_tags(), which gave me a very poorly structured string to work on, especially since craigslist searches return variously formatted results. Be that as it may, I'm continuing this way. strip_tags() at least removed a great deal of complexity, and you can tell it to keep specific tags (e.g. <a> tags, which here represent links to actual listings).
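
A minimal sketch of that idea: strip_tags() takes an optional second argument listing the tags to keep, and the surviving <a> tags can then be picked apart with preg_match_all(). The HTML fragment here is invented for illustration (real craigslist markup will differ), and the regex assumes double-quoted href attributes.

```php
<?php
// Hypothetical fragment of a results page; real craigslist markup will differ.
$html = '<ul><li class="row"><a href="/abc/123.html">Red bicycle</a> <span>$50</span></li>'
      . '<li class="row"><a href="/abc/456.html">Blue kayak</a> <span>$200</span></li></ul>';

// The second argument lists the tags to keep; every other tag is removed.
$stripped = strip_tags($html, '<a>');

// Pull the href and the link text out of each surviving <a> tag.
preg_match_all('/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/is', $stripped, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['url' => $m[1], 'title' => trim($m[2])];
}
print_r($links);
```

A regex is brittle against arbitrary HTML; PHP's DOMDocument would be a sturdier choice if the markup turns out to be too irregular.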

Issues:

JavaScript would be a better language for most of this project because:
- we don't want all this activity coming from one IP address: craigslist will just shut it down. Client-side JavaScript runs in each user's browser, so the queries would go out from each user's own IP address.
- even if craigslist blacklists a user's IP address, they can often just restart their router and get another one.
- more interestingly, I believe (though I've never used it) JavaScript is what enables a page to keep loading more content as one scrolls down. ***That*** way, additional queries could be done as the user scrolls, reducing the "weird activity" problem (craigslist detecting too much querying at once).

Meanwhile, I could play with SQLite instead of a full-blown (and heavy) set of MySQL tables.
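
For a sense of how light SQLite is from PHP, here is a small sketch using PDO. It assumes the pdo_sqlite extension is enabled; the table and column names are hypothetical, and the in-memory database is only for illustration.

```php
<?php
// In-memory database for illustration; a DSN like 'sqlite:listings.db' would persist to a file.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema: the column names are assumptions.
$db->exec('CREATE TABLE listings (url TEXT PRIMARY KEY, title TEXT, city TEXT)');

// INSERT OR IGNORE skips rows whose url is already stored.
$insert = $db->prepare('INSERT OR IGNORE INTO listings (url, title, city) VALUES (?, ?, ?)');
$insert->execute(['/abc/123.html', 'Red bicycle', 'denver']);
$insert->execute(['/abc/123.html', 'Red bicycle', 'denver']); // duplicate, ignored

$rows = $db->query('SELECT url, title, city FROM listings')->fetchAll(PDO::FETCH_ASSOC);
print_r($rows);
```

Using the URL as the primary key means a listing that shows up in more than one query is only stored once.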

Or, better yet, I probably won't need more than a 3-dimensional array to hold each listing's information. Ultimately, I just need the link to the listing, so the user can click on it and explore more at "the real craigslist". Prolly should make that open in a pop-up window, so the user doesn't lose the results of the query.
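
Roughly what such an array might look like, as a sketch: city, then listing index, then field. The city names, URLs, and field names are placeholders, and target="_blank" stands in for the pop-up idea (it opens the listing in a new window or tab, so the results page stays put).

```php
<?php
// Hypothetical data: city names, URLs, and field names are illustrative only.
$results = [
    'denver' => [
        ['url' => 'https://example.org/denver/123.html', 'title' => 'Red bicycle'],
    ],
    'boston' => [
        ['url' => 'https://example.org/boston/456.html', 'title' => 'Blue kayak'],
    ],
];

$html = '';
foreach ($results as $city => $listings) {
    foreach ($listings as $listing) {
        // target="_blank" opens the listing in a new window or tab, keeping the results page open.
        $html .= sprintf(
            '<a href="%s" target="_blank">%s (%s)</a>' . "\n",
            htmlspecialchars($listing['url']),
            htmlspecialchars($listing['title']),
            htmlspecialchars($city)
        );
    }
}
echo $html;
```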