***This is not code*** Article for Blog and Planet-Soc

Yesterday was my second day of Summer of Code for Drupal. My
project is <a href="http://drupal.org/project/security_scanner">Security
scanner component for SimpleTest module</a>. chx, my mentor, and I
were working on a way to scan through all the links on a site, checking
for XSS vulnerabilities. To do this, we have to build a crawler that
catches all the links of the website and saves them into a database. Then we
have to use these links to inject XSS or SQL seeds into the forms, to
see whether there are vulnerabilities or not. But right now we're only at the
first part of the project, so we're catching links.

The basic idea is: scan a page and save its links. Then
extract the first link whose status field is "0" (the default status).
Scan that URL with the crawler and save its links into the
database as well. This means that the database table "crawler_links" is a
queue of URLs ready to be processed. After processing a page, update
its record by changing its status to "1".

If you put this into code:

<?php
// Take the first link that has not been processed yet.
$link = db_fetch_array(db_query("SELECT link_id, path FROM {crawler_links} WHERE status = 0 LIMIT 1"));
// ... process page
// Mark the link as processed.
db_query("UPDATE {crawler_links} SET status = 1 WHERE link_id = %d", $link['link_id']);
?>

However, if we want more than one crawler working at the same time,
then this is a classic example of a "race condition": if another
crawler is started while the "process page" part runs, then the second
crawler will process the very same page. Not good. We can extend the
meaning of the status field so that 1 means "processing began" and 2
means "processing finished", so:

<?php
$link = db_fetch_array(db_query("SELECT link_id, path FROM {crawler_links} WHERE status = 0 LIMIT 1"));
// Mark the link as "processing began", but only if it is still unclaimed.
db_query("UPDATE {crawler_links} SET status = 1 WHERE link_id = %d AND status = 0", $link['link_id']);
// ... process page
// Mark the link as "processing finished".
db_query("UPDATE {crawler_links} SET status = 2 WHERE link_id = %d", $link['link_id']);
?>

Much better, but if you are starting your crawlers from cron then it is
still possible that two parallel SELECTs will run before either crawler
has a chance to run its UPDATE. We need to make sure that this can't
happen. However, the most popular database setup (MySQL with MyISAM) does
not support transactions. And with transactions, the UPDATE would fail for
the losing crawler, so we would need to re-run our two queries until the
UPDATE succeeds.

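The "re-run until the UPDATE succeeds" loop could look roughly like the sketch below. It is only an illustration, not the project's code: it runs outside Drupal, with PDO and an in-memory SQLite database standing in for db_query() and MySQL, and claim_next_link() is a made-up helper name. The point is to check how many rows the conditional UPDATE changed: zero rows means another crawler got there first, so we select again.

```php
<?php
// Sketch only: PDO + in-memory SQLite stand in for Drupal's db_query() and MySQL.
// claim_next_link() is a hypothetical helper, not part of the project.
function claim_next_link(PDO $db) {
  while (TRUE) {
    // Pick a candidate that nobody has started yet.
    $row = $db->query("SELECT link_id, path FROM crawler_links WHERE status = 0 LIMIT 1")
              ->fetch(PDO::FETCH_ASSOC);
    if ($row === FALSE) {
      return NULL; // The queue is empty.
    }
    // Claim it only if it is still unclaimed.
    $update = $db->prepare("UPDATE crawler_links SET status = 1 WHERE link_id = ? AND status = 0");
    $update->execute(array($row['link_id']));
    if ($update->rowCount() == 1) {
      return $row; // We won the race for this link.
    }
    // Another crawler claimed it between our SELECT and UPDATE: try again.
  }
}

// Tiny demo queue with two links.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE crawler_links (link_id INTEGER PRIMARY KEY, path TEXT, status INTEGER DEFAULT 0)");
$db->exec("INSERT INTO crawler_links (path) VALUES ('node/1'), ('node/2')");

$first = claim_next_link($db);
$second = claim_next_link($db);
$third = claim_next_link($db); // Nothing left to claim.
```

This works, but every crawler that loses the race has to go around the loop again.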
Instead, we label our crawlers uniquely and do

<?php
// In a single statement, claim one link that no crawler has labelled yet.
db_query("UPDATE {crawler_links} SET crawler_id = %d, status = 1 WHERE crawler_id = 0 LIMIT 1", $crawler_id);
// Fetch the link we have just claimed.
$page_to_visit = db_fetch_array(db_query("SELECT link_id, path FROM {crawler_links} WHERE crawler_id = %d AND status = 1 LIMIT 1", $crawler_id));
// ... process page
// Mark the link as "processing finished".
db_query("UPDATE {crawler_links} SET status = 2 WHERE link_id = %d", $page_to_visit['link_id']);
?>

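To see the labelling step work with two crawlers, here is a small stand-alone sketch. It makes several assumptions: it uses PDO with in-memory SQLite rather than Drupal and MySQL, claim_link() is a hypothetical helper name, and the crawler ids 1 and 2 are hard-coded for the demo. SQLite does not accept "UPDATE ... LIMIT" by default, so a subquery picks the single row instead.

```php
<?php
// Sketch only: PDO + in-memory SQLite instead of Drupal and MySQL.
// claim_link() is a hypothetical helper, not part of the project.
function claim_link(PDO $db, $crawler_id) {
  // Label one unclaimed link (crawler_id = 0) with our id in a single statement.
  $claim = $db->prepare("UPDATE crawler_links SET crawler_id = ?, status = 1
    WHERE link_id = (SELECT link_id FROM crawler_links WHERE crawler_id = 0 LIMIT 1)");
  $claim->execute(array($crawler_id));
  // Read back the link that now carries our label.
  $select = $db->prepare("SELECT link_id, path FROM crawler_links WHERE crawler_id = ? AND status = 1 LIMIT 1");
  $select->execute(array($crawler_id));
  $row = $select->fetch(PDO::FETCH_ASSOC);
  return $row === FALSE ? NULL : $row;
}

// Tiny demo queue with two links.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE crawler_links (link_id INTEGER PRIMARY KEY, path TEXT, status INTEGER DEFAULT 0, crawler_id INTEGER DEFAULT 0)");
$db->exec("INSERT INTO crawler_links (path) VALUES ('node/1'), ('node/2')");

$page_a = claim_link($db, 1); // Crawler 1 claims a link.
$page_b = claim_link($db, 2); // Crawler 2 gets a different one.
```

Each crawler only ever reads back rows carrying its own label, so the two never end up with the same page.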
How can we be sure the crawler_id is unique? MySQL has
auto-increment fields: specifically, we have a crawler table with a
single "id" field. We INSERT into this table, and the last insert id
becomes the $crawler_id.

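The id allocation could be sketched like this. Again it is a stand-alone illustration on PDO with in-memory SQLite, not the project's code: in MySQL the column would be declared AUTO_INCREMENT, inside Drupal the value would come from Drupal's own database layer rather than from PDO::lastInsertId(), and new_crawler_id() is a hypothetical helper name.

```php
<?php
// Sketch only: SQLite's INTEGER PRIMARY KEY AUTOINCREMENT plays the role of
// a MySQL AUTO_INCREMENT column. new_crawler_id() is a hypothetical helper.
function new_crawler_id(PDO $db) {
  // Each INSERT creates a new row, so the database hands out a fresh id.
  $db->exec("INSERT INTO crawler DEFAULT VALUES");
  // lastInsertId() reports the id generated on this connection.
  return (int) $db->lastInsertId();
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE crawler (id INTEGER PRIMARY KEY AUTOINCREMENT)");

$crawler_a = new_crawler_id($db);
$crawler_b = new_crawler_id($db);
```

Generated ids start at 1 by default, so no crawler ever receives the id 0 that marks an unclaimed link.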
This allows two or more crawlers to cooperate simultaneously,
without overlap.
And this is the beautiful environment of Summer of Code: learning to
think and solving problems in a state-of-the-art way. Only for open source and for cotton =) !
