Yesterday was my second day of Summer of Code for Drupal. chx and I were trying to implement an algorithm that lets more than one crawler simultaneously collect all the links inside a page and save them into the database. For that purpose we planned two tables: one called “crawler”, containing only an auto-increment integer field “id”, and one called “crawler_links”, containing “id”, “path”, “status”, the “crawler_id” used in the queries below, and “has_form”, which will become useful in the future to mark the presence of a form in the page.

The idea was: scan a page and save its links. Then extract the first link whose status field is “0” and whose crawler_id is “0” (the default state). Scan that URL with a crawler and save its links into the database again. This means the “crawler_links” table acts as a queue of URLs ready to be processed. After processing the page, update its record, changing its status (to “1”) and its “crawler_id” value. For this to work, crawler names must be assigned atomically, so that more than one crawler can execute this code without hampering the others. We took advantage of the atomicity of auto-increment fields (specifically the “id” field of the crawler table) to give every crawler a unique name. At this point the algorithm could work well, but we still need to think through the whole scenario in which several crawlers run it at the same time.
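To make the naming trick concrete, here is a minimal sketch of how a crawler could obtain a unique id from an auto-increment column. It is written in Python with an in-memory SQLite database standing in for Drupal and MySQL, so it runs on its own. The function names open_db and register_crawler are hypothetical, not part of our module.

```python
import sqlite3


def open_db():
    """Create an in-memory database with the one-column "crawler" table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crawler (id INTEGER PRIMARY KEY AUTOINCREMENT)")
    return conn


def register_crawler(conn):
    """Insert an empty row and return the id the database generated.

    The database hands out each auto-increment value only once, so two
    crawlers registering at the same time still get different ids.
    """
    cur = conn.execute("INSERT INTO crawler DEFAULT VALUES")
    conn.commit()
    return cur.lastrowid
```

Each crawler would call register_crawler once at startup and use the returned number as its crawler_id in every later query.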

The concurrency problem comes from MyISAM. InnoDB supports transactions, so if we were able to use InnoDB we would not need any special algorithm. Unfortunately we can only use MyISAM, so we have to pay attention, because operations from different crawlers can overlap.
The following example shows the problem:
db_query("SELECT id,path FROM {crawler_links} WHERE crawler_id = 0 AND status = 0 LIMIT 1");
db_query("UPDATE {crawler_links} SET crawler_id = %d WHERE crawler_id = 0 LIMIT 1", $crawler_id);
As you can see, if two crawlers start together, the SELECT could hand the same link to both of them. Specifically, this happens when the second crawler finishes its SELECT while the first crawler's UPDATE is still running and has not yet completed. To solve this problem we changed the order of the operations, executing the UPDATE first:
db_query("UPDATE {crawler_links} SET crawler_id = %d WHERE crawler_id = 0 LIMIT 1", $crawler_id);
then the SELECT to extract the link from the table:
$page_to_visit = db_fetch_array(db_query("SELECT id,path FROM {crawler_links} WHERE crawler_id = %d AND status = 0 LIMIT 1", $crawler_id));
and then another UPDATE to change the status, assigning a value (“2”) that removes the link from the set of rows the SELECT query can return.
This allows two or more crawlers to cooperate simultaneously, without overlap.
And this is the beautiful environment of Summer of Code: learning to think and solving problems in a state-of-the-art way.