Food, Web Development, Music, and the funny crap

The freaking crawler

  • So a little about the current project, which is on its last stretch before completion: the exxxcavate crawler. The crawler is intended to crawl data from sites (entered into the database) and get listings from sites like redtube and youporn. Right now I have only set it up to crawl exactly eight sites. There are 40 or so sites that this project will include.

    The crawler is built using cURL as the main engine for grabbing site data. I have it set up so that you type in a site’s basic info, like the base URL and where the categories are listed. The categories are useful only if the category is easily grabbed. Some sites don’t have the plain text of the category in view, so sometimes it’s easier to crawl the site by ‘most recent’. The crawler first gets info from the database as to where it has been (which site was crawled last), selects the next site on the list, recalls the last listing grabbed for that site, and then searches for the next one. It starts by loading a page with the video listings on it, grabbing info such as duration and views if available. It then reads the listing’s actual page for the title and embed code. It places this info into WordPress as a post, under the site’s name as a category, and then under a subcategory based on the category of the source listing or the first tag found. Then it adds the tags, and finally the extra info as Custom Fields (such as thumbnail, listing link, duration, views, and embed code).
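
To make the post mapping a bit more concrete, here is a rough sketch in Python (the real crawler is not written in Python, and every name below, such as `Listing`, `pick_subcategory`, and `build_post`, is made up for illustration). It only shows the rule from the paragraph above: site name as the category, source category or first tag as the subcategory, and the extra info as custom fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Listing:
    """One scraped video listing (all field names are hypothetical)."""
    site: str
    title: str
    embed_code: str
    link: str
    thumbnail: str = ""
    duration: Optional[str] = None
    views: Optional[int] = None
    category: Optional[str] = None
    tags: list = field(default_factory=list)


def pick_subcategory(listing):
    """Use the source category, else fall back to the first tag found."""
    if listing.category:
        return listing.category
    if listing.tags:
        return listing.tags[0]
    return None


def build_post(listing):
    """Map a listing onto a post-like dict: categories, tags, custom fields."""
    custom_fields = {
        "thumbnail": listing.thumbnail,
        "listing_link": listing.link,
        "embed_code": listing.embed_code,
    }
    # Duration and views are only stored when the site exposed them.
    if listing.duration is not None:
        custom_fields["duration"] = listing.duration
    if listing.views is not None:
        custom_fields["views"] = listing.views
    return {
        "title": listing.title,
        "category": listing.site,
        "subcategory": pick_subcategory(listing),
        "tags": list(listing.tags),
        "custom_fields": custom_fields,
    }
```

The resulting dict is just a stand-in for whatever actually gets handed to WordPress; the point is that the fallback to the first tag only kicks in when the site gave no readable category.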

    At this point, the crawler’s trigger is a single file that runs and grabs one listing. The reason for the single-listing grab is that the site is currently on cirtex hosting, which is shared hosting that allows porn. BUT it’s still shared hosting, and mysql’s memory limit is still something like 8 megs. I did have it set to grab a whole page of listings, but it would run out of memory by the fifth entry. The only way to really clear mysql’s cache was to have the script terminate (no flush(), ob_flush(), sleep(x), or mysql_free_result() would help with this).
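
The one-listing-per-run idea, together with the "where was I" lookup, could be sketched like this. This is Python with SQLite purely for illustration, not the actual PHP/MySQL code. The table name `crawl_state`, the integer listing ids, and the `fetch_next_listing` callback are all assumptions of mine; the callback stands in for the real page grabbing.

```python
import sqlite3


def init_state(conn, sites):
    """Create the state table and register sites in crawl order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawl_state ("
        "position INTEGER PRIMARY KEY, site TEXT UNIQUE, "
        "last_listing INTEGER DEFAULT 0, last_run INTEGER DEFAULT 0)"
    )
    for position, site in enumerate(sites):
        conn.execute(
            "INSERT OR IGNORE INTO crawl_state (position, site) VALUES (?, ?)",
            (position, site),
        )
    conn.commit()


def run_once(conn, fetch_next_listing):
    """Grab exactly one listing for the least recently crawled site."""
    # Pick the site that was crawled longest ago; ties go by list order.
    site, last_listing = conn.execute(
        "SELECT site, last_listing FROM crawl_state "
        "ORDER BY last_run, position LIMIT 1"
    ).fetchone()
    # Hypothetical callback: returns the id of the listing it grabbed.
    new_listing = fetch_next_listing(site, last_listing)
    # A simple run counter marks this site as the most recently crawled.
    next_run = conn.execute(
        "SELECT MAX(last_run) + 1 FROM crawl_state"
    ).fetchone()[0]
    conn.execute(
        "UPDATE crawl_state SET last_listing = ?, last_run = ? WHERE site = ?",
        (new_listing, next_run, site),
    )
    conn.commit()
    return site, new_listing
```

Because all the state lives in the database, each run can start cold, do one listing, and exit, which is what lets the script terminate and release its memory between grabs.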

    But the crawler is efficient and is barely screwing up. It’s solid and looking really good at this point.

    The hardest part (for me) is the cron jobs. Calling a single file is what triggers the crawler; it just needs to be run, and it figures out the rest. When I first applied a cron job last week, it wouldn’t go. I kept waiting and waiting, but it wouldn’t run. I looked it up a million times, and at one point it started running. That time, I wanted it to run every 15 minutes. It seemed like it was doing well, and then it stopped. By this point it was like 7am and I was tired (go figure), so I went to bed. When I awoke some time later, I noticed the cron had started running again. It didn’t do very well and stopped turning over at some point, so I ended it. Now I feel it’s ready again, and I just added the cron jobs, but nothing is happening. It’s very nerve-racking, considering I know the moment I leave my computer the damn thing will start up and probably run into a string of errors LOL
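
One thing that might take some of the guessing out of "is cron even firing?" is having the trigger write a line to a log file every time it starts, plus the outcome. A minimal sketch of that idea in Python, assuming a hypothetical `run_logged` wrapper and a log path of your choosing (none of this is in the actual crawler):

```python
import datetime
import traceback

# A crontab schedule of "*/15 * * * *" means "every 15 minutes".


def run_logged(job, log_path):
    """Run job(), appending a start line and the outcome to log_path."""
    started = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        # Written before the job runs, so even a crash leaves a trace.
        log.write(f"{started} started\n")
        try:
            result = job()
        except Exception:
            log.write(f"{started} FAILED\n{traceback.format_exc()}")
            return False
        log.write(f"{started} ok: {result}\n")
        return True
```

If the log has no new "started" lines, cron never called the script; if it has "started" lines followed by "FAILED", cron is fine and the crawler itself is the problem.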

    All that is left is design and adding options to the search, and since I built it on WordPress, that will be easy peasy.

    I’ll update y’all soon.