DrupalBin
Submit Code
About
Recent Posts
admin settings not saving
1 hour 8 min
ago
Code
1 hour 52 min
ago
Code
2 hours 7 min
ago
css path
3 hours 52 min
ago
more
Tags
CCK
drupal
fapi
jquery
menu
module
Panels
php
simpletest
test
theme
views
more tags
User login
Log in using OpenID:
What is OpenID?
Username:
*
Password:
*
Create new account
Request new password
Log in using OpenID
Cancel OpenID login
Home
Fix for ***This is not code*** Article for Blog and Planet-Soc
View
Download
Fix
This fix will not be saved to the database until you submit.
Summary:
Tags:
Any tags you'd like to associate with your code, delimitered by commas (example: Views, CCK, Module, etc).
Source code:
*
Yesterday for me was the second day of Summer of Code for Drupal. My project is <a href="http://drupal.org/project/security_scanner">Security scanner component for SimpleTest module</a>. chx, my mentor, and I were working on a way to scan through all links on the site, checking for XSS vulnerabilities. In order to do this, we have to build up a crawler that catch all the links of the website and save them into a database. Then we have to use this links to inject xss or sql seed into the forms, in order to view if there are vulnerabilities or not. But now we're only at the first part of the project, so we're catching links. The basic idea is: make a scan of the page and save their links. Then extract the first link that has status field "0" (default status). Scan that url with a crawler and save its links again into the database. This means that database table "crawler_links" is such a queue of urls ready to be processed. After processing the page update the record of that page changing its status (move it to "1"). If you put this into code: <?php db_query("SELECT link_id, path FROM {crawler_links} WHERE status = 0 LIMIT 1"); ... process page db_query("UPDATE {crawler_links} SET status = 1 WHERE link_id = %d", $link_id); ?> However, if we want more than crawler working on at the same time, then this is a classic example of a "race condition": if another crawler is started while the "process page" part runs, then the second crawler will process the very same page. Not good. We can extend the meaning of the status field so that 1 means "processing began" and 2 means "processing finished", so: <?php db_query("SELECT link_id, path FROM {crawler_links} WHERE status = 0 LIMIT 1"); db_query("UPDATE {crawler_links} SET status = 1 WHERE link_id = %d AND status = 0", $link_id); ... process page db_query("UPDATE {crawler_links} SET status = 2 WHERE link_id = %d", $link_id); ?> Much better but if you are starting your crawlers from cron then it is still possible that two parallel SELECTs will run before one crawler has a chance to run an UPDATE. We need to make sure that this can't happen. However, the most popular database (MySQL with MyISAM is) does not support transactions. With transactions, the UPDATE would fail, so we would need to re-run our two queries until the UPDATE succeeds. Instead, we label our crawlers uniquely and do db_query("UPDATE {crawler_links} SET crawler_id = %d, status = 1 WHERE crawler_id = 0 LIMIT 1", $crawler_id); $page_to_visit = db_fetch_array(db_query( "SELECT id, path FROM {crawler_links} WHERE crawler_id = %d AND status = 1 LIMIT 1", $crawler_id)); ... process page db_query("UPDATE {crawler_links} SET status = 2 WHERE link_id = %d", $link_id); ?> How can we make sure the crawler_id is unique for sure? MySQL has autoincrement fields: specifically, we have a crawler table with a single "id" field. We INSERT into this table and the last insert id is going to be the $crawler_id. This permits the simultaneously cooperation of two or more crawlers, without overlap. And this is the beautiful environment of Summer of Code, learning to think, solving problems in a state-of-art way. Only for opensorce and for cotton =) !
Syntax highlighting mode:
ActionScript
ColdFusion
Diff
Drupal
Drupal 5
Drupal 6
HTML
Javascript
MySQL
PHP
Python
robots.txt
SQL
Text
Select the syntax highlighting mode to use.