Yesterday for me was the second day of Summer of Code for Drupal. Me and chx was trying to realize an algorithm that permit to more than one crawler to simultaneusly catch all links inside a page and to save them into the database. We predict fot that purpose two tables: one called “crawler” including only an “id” integer autoincrement field, and one called “crawler_links”, including “id”,”path”,”status” and “has_form”, that in the future will became useful to marke the presence of a form into the page.
The idea was: make a scan of the page and save their links. Then extract the first link that has status field “0” (default status). Scan that url with a crawler and save its links again into the database. This means that database table “crawler_links” is such a queue of urls ready to be processed. After processing the page update the record of that page changing its status (move it to “1”).
If you put this into code:
<?php
db_query("SELECT link_id, path FROM {crawler_links} WHERE status = 0 LIMIT 1");
... process page
db_query("UPDATE {crawler_links} SET status = 1 WHERE link_id = %d", $link_id);
?>
This is a classic example of a "race condition": if another crawler is started while the "process page" part runs, then the second crawler will process the very same page. Not good. We can extend the meaning of the status field so that 1 means "processing began" and 2 means "processing finished", so:
<?php
db_query("SELECT link_id, path FROM {crawler_links} WHERE status = 0 LIMIT 1");
db_query("UPDATE {crawler_links} SET status = 1 WHERE link_id = %d AND status = 0", $link_id);
... process page
db_query("UPDATE {crawler_links} SET status = 2 WHERE link_id = %d", $link_id);
?>
Much better but if you are starting your crawlers from cron then it is still possible that two parallel SELECTs will run before one crawler has a chance to run an UPDATE. We need to make sure that this can't happen. However, the most popular database (MySQL with MyISAM is) does not support transactions. With transactions, the UPDATE would fail, so we would need to re-run our two queries until the UPDATE succeeds.
Instead, we label our crawlers uniquely and do
db_query("UPDATE {crawler_links} SET crawler_id = %d, status = 1 WHERE crawler_id = 0 LIMIT 1", $crawler_id);
$page_to_visit = db_fetch_array(db_query("SELECT id, path FROM {crawler_links} WHERE crawler_id = %d AND status = 1 LIMIT 1", $crawler_id));
... process page
db_query("UPDATE {crawler_links} SET status = 2 WHERE link_id = %d", $link_id);
?>
How can we make sure the crawler_id is unique for sure? MySQL has autoincrement fields: specifically, we have a crawler table with a single “id” field. We INSERT into this table and the last insert id is going to be the $crawler_id.
This permits the simultaneously cooperation of two or more crawlers, without overlap.
And this is the beautiful environment of Summer of Code, learning to think, solving problems in a state-of-art way