Fix for ***This is not code*** Article for Blog and Planet-Soc
Yesterday was my second day of Summer of Code for Drupal. chx and I were trying to work out an algorithm that lets more than one crawler simultaneously collect all the links in a page and save them to the database. For that purpose we planned two tables: one called “crawler”, with only an auto-increment integer “id” field, and one called “crawler_links”, with “id”, “path”, “status” and “has_form”; the last one will become useful later to mark the presence of a form in the page.

The idea was: scan the page and save its links. Then extract the first link that has status “0” and crawler_id “0” (the default state). Scan that URL with a crawler and save its links to the database as well. This means the “crawler_links” table acts as a queue of URLs ready to be processed. After processing a page, update its record, changing its status (move it to “1”) and its “crawler_id” value.

For this to work, the crawler name has to be assigned atomically, so that more than one crawler can execute this code without hampering the others. What we did was take advantage of the atomicity of an auto-increment field (specifically the “id” field of the crawler table) to give every crawler a unique name.

At this point the algorithm could work well, but we need to picture the whole scenario with more than one crawler running it at the same time. The concurrency problem comes from MyISAM. InnoDB supports transactions, so if we could use InnoDB we would need nothing more than the algorithm itself. Unfortunately we can only use MyISAM, so we have to be careful, because operations from different crawlers can overlap.
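To make the two tables and the unique crawler name more concrete, here is a minimal sketch. It uses Python with SQLite rather than Drupal and MySQL, only so it can run on its own. The column types, the defaults and the helper names (`create_schema`, `register_crawler`, `enqueue_links`) are my assumptions, not the code we actually wrote.

```python
import sqlite3


def create_schema(conn):
    """Create the two tables described above (types and defaults are assumed)."""
    conn.executescript(
        """
        CREATE TABLE crawler (
            id INTEGER PRIMARY KEY AUTOINCREMENT
        );
        CREATE TABLE crawler_links (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            path       TEXT    NOT NULL,
            status     INTEGER NOT NULL DEFAULT 0,  -- 0 = not processed yet
            crawler_id INTEGER NOT NULL DEFAULT 0,  -- 0 = not claimed by any crawler
            has_form   INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def register_crawler(conn):
    """Insert an empty row and use its auto-increment id as the crawler's name.

    The database hands out each id once, so two crawlers never share a name.
    """
    cur = conn.execute("INSERT INTO crawler DEFAULT VALUES")
    conn.commit()
    return cur.lastrowid


def enqueue_links(conn, paths):
    """Append newly found links to the queue in their default (unclaimed) state."""
    conn.executemany(
        "INSERT INTO crawler_links (path) VALUES (?)",
        [(p,) for p in paths],
    )
    conn.commit()
```

Each crawler would call `register_crawler` once at startup and keep the returned id for all later queries.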
This example shows the problem:

    db_query("SELECT id,path FROM {crawler_links} WHERE crawler_id = %d AND status = 0 LIMIT 1", $crawler_id);
    db_query("UPDATE {crawler_links} SET crawler_id = %d WHERE crawler_id = 0 LIMIT 1", $crawler_id);

As you can see, if two crawlers start together, the SELECT can extract the same record twice, once for each crawler. Specifically, this happens when the first crawler's UPDATE is still in progress and the second crawler has already finished its SELECT. To solve this problem we change the order of the operations and run the UPDATE first:

    db_query("UPDATE {crawler_links} SET crawler_id = %d WHERE crawler_id = 0 LIMIT 1", $crawler_id);

then the SELECT to extract the link from the table:

    $page_to_visit = db_fetch_array(db_query("SELECT id,path FROM {crawler_links} WHERE crawler_id = %d AND status = 0 LIMIT 1", $crawler_id));

and then another UPDATE to change the status, assigning a value (“2”) that removes the link from the set of rows the SELECT query can return. This lets two or more crawlers cooperate at the same time without overlapping.

And this is the beautiful environment of Summer of Code: learning to think and solving problems in a state-of-the-art way.
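The claim-first ordering can be sketched end to end as follows. This is a hedged stand-in in Python with SQLite, not the Drupal code itself. SQLite does not accept `UPDATE ... LIMIT` by default, so the claim goes through a subquery. The `ORDER BY id` clauses and the helper names (`make_queue`, `claim_next_link`, `mark_done`) are assumptions added for the illustration.

```python
import sqlite3


def make_queue(paths):
    """Build an in-memory crawler_links queue (schema assumed from the post)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crawler_links ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " path TEXT NOT NULL,"
        " status INTEGER NOT NULL DEFAULT 0,"
        " crawler_id INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO crawler_links (path) VALUES (?)", [(p,) for p in paths]
    )
    conn.commit()
    return conn


def claim_next_link(conn, crawler_id):
    """Claim one free link first, then read back what this crawler owns."""
    # Step 1: UPDATE first. A single statement stamps one unclaimed row with
    # our id, so another crawler's later claim cannot pick the same row.
    conn.execute(
        "UPDATE crawler_links SET crawler_id = ? "
        "WHERE id = (SELECT id FROM crawler_links "
        "            WHERE crawler_id = 0 AND status = 0 "
        "            ORDER BY id LIMIT 1)",
        (crawler_id,),
    )
    conn.commit()
    # Step 2: SELECT only rows that already carry our id and are not done.
    return conn.execute(
        "SELECT id, path FROM crawler_links "
        "WHERE crawler_id = ? AND status = 0 ORDER BY id LIMIT 1",
        (crawler_id,),
    ).fetchone()


def mark_done(conn, link_id, done_status=2):
    """Step 3: change the status so the SELECT never returns this link again."""
    conn.execute(
        "UPDATE crawler_links SET status = ? WHERE id = ?", (done_status, link_id)
    )
    conn.commit()
```

A crawler would loop on `claim_next_link` until it returns `None`, calling `mark_done` after each page it has scanned. The important point is only the order: the row is tagged before it is read.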