Yesterday for me was the second day of Summer of Code for Drupal. My
project is <a href="http://drupal.org/project/security_scanner">Security
scanner component for SimpleTest module</a>. chx, my mentor, and I
were working on a way to scan through all links on the site, checking
for XSS vulnerabilities. In order to do this, we have to build up a crawler that
catch all the links of the website and save them into a database. Then we
have to use these links to inject xss or sql seed into the forms, in order to
view if there are vulnerabilities or not. But now we're only at the first part of
the project, so we're catching links.
The basic idea is: make a scan of the page and save their links. Then
extract the first link that has status field "0" (default status).
Scan that url with a crawler and save its links again into the
database. This means that database table "crawler_links" is such a
queue of urls ready to be processed. After processing the page update
the record of that page changing its status (move it to "1").
If you put this into code:
<?php
db_query("SELECT link_id, path FROM {crawler_links} WHERE status = 0 LIMIT 1");
... process page
db_query("UPDATE {crawler_links} SET status = 1 WHERE link_id = %d", $link_id);
?>
However, if we want more than one crawler working at the same time,
then this is a classic example of a "race condition": if another
crawler is started while the "process page" part runs, then the second
crawler will process the very same page. Not good. We can extend the
meaning of the status field so that 1 means "processing began" and 2
means "processing finished", so:
<?php
db_query("SELECT link_id, path FROM {crawler_links} WHERE status = 0 LIMIT 1");
db_query("UPDATE {crawler_links} SET status = 1 WHERE link_id = %d AND
status = 0", $link_id);
... process page
db_query("UPDATE {crawler_links} SET status = 2 WHERE link_id = %d", $link_id);
?>
Much better but if you are starting your crawlers from cron then it is
still possible that two parallel SELECTs will run before one crawler
has a chance to run an UPDATE. We need to make sure that this can't
happen. However, the most popular database (MySQL with MyISAM is) does
not support transactions. With transactions, the UPDATE would fail, so
we would need to re-run our two queries until the UPDATE succeeds.
Instead, we label our crawlers uniquely and do
db_query("UPDATE {crawler_links} SET crawler_id = %d, status = 1 WHERE
crawler_id = 0 LIMIT 1", $crawler_id);
$page_to_visit = db_fetch_array(db_query(
"SELECT id, path FROM
{crawler_links} WHERE crawler_id = %d AND status = 1 LIMIT 1",
$crawler_id));
... process page
db_query("UPDATE {crawler_links} SET status = 2 WHERE link_id = %d", $link_id);
?>
How can we make sure the crawler_id is unique for sure? MySQL has
autoincrement fields: specifically, we have a crawler table with a
single "id" field. We INSERT into this table and the last insert id is
going to be the $crawler_id.
This permits the simultaneously cooperation of two or more crawlers,
without overlap.
And this is the beautiful environment of Summer of Code, learning to
think, solving problems in a state-of-art way. Only for opensorce and for cotton =) !