Beating Google With CouchDB, Celery and Whoosh (Part 2)
In this series I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we’ll begin by creating the data structure for storing the pages in the database, and write the first parts of the webcrawler.
CouchDB’s Python library has a simple ORM system that makes it easy to convert between the JSON objects stored in the database and a Python object.
To create the class you just need to specify the names of the fields and their types. So, what does a search engine need to store? The URL is an obvious one, as is the content of the page. We also need to know when we last accessed the page. To make things easier we’ll also have a list of the URLs that the page links to. One of the great advantages of a database like CouchDB is that we don’t need to create a separate table to hold the links; we can just include them directly in the main document. To help return the best pages we’ll use a PageRank-like algorithm to rank each page, so we also need to store that rank. Finally, as is good practice with CouchDB, we’ll give the document a type field so we can write views that only target this document type.
from datetime import datetime
from couchdb.mapping import Document, TextField, ListField, FloatField, DateTimeField

class Page(Document):
    type = TextField(default="page")
    url = TextField()
    content = TextField()
    links = ListField(TextField())
    rank = FloatField(default=0)
    last_checked = DateTimeField(default=datetime.now)
That’s a lot of description for not a lot of code! Just add that class to your models.py file. It’s not a normal Django model, but we’re not using Django models in this project so it’s the right place to put it.
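To make the mapping concrete, here’s a quick sketch of what storing and loading one of these documents looks like with couchdb-python. The server URL and the "crawler" database name are just assumptions for illustration.

import couchdb

# Connect to a local CouchDB server and open a (hypothetical) "crawler" database.
server = couchdb.Server("http://localhost:5984/")
db = server["crawler"]

# Create a page; the mapping layer turns it into a JSON document when stored.
page = Page(url="http://example.com/", content="<html>...</html>",
            links=["http://example.com/about"])
page.store(db)

# Load it back by id; the fields behave like normal Python attributes.
same_page = Page.load(db, page.id)
print(same_page.url)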
We also need to keep track of the URLs that we are and aren’t allowed to access. Fortunately for us Python comes with a class, RobotFileParser, which handles the parsing of the file for us. So this becomes a much simpler model: we just need the domain name and a pickled RobotFileParser instance. We also need to know whether we’re accessing the site over http or https, and we’ll give it a type field to distinguish it from the Page model.
import base64, pickle, time
from robotparser import RobotFileParser  # urllib.robotparser in Python 3

class RobotsTxt(Document):
    type = TextField(default="robotstxt")
    domain = TextField()
    protocol = TextField()
    robot_parser_pickle = TextField()
We want to make the pickle/unpickle process transparent so we’ll create a property that hides the underlying pickle representation. CouchDB can’t store the binary pickle value, so we also base64 encode it and store that instead. If the object hasn’t been set yet then we create a new one on the first access.
def _get_robot_parser(self):
    if self.robot_parser_pickle is not None:
        return pickle.loads(base64.b64decode(self.robot_parser_pickle))
    else:
        # No stored parser yet, so create one pointing at this domain's robots.txt
        parser = RobotFileParser()
        parser.set_url(self.protocol + "://" + self.domain + "/robots.txt")
        self.robot_parser = parser

        return parser

def _set_robot_parser(self, parser):
    self.robot_parser_pickle = base64.b64encode(pickle.dumps(parser))

robot_parser = property(_get_robot_parser, _set_robot_parser)
For both pages and robots.txt files we need to know whether we should reaccess the page. We’ll do this by testing whether we accessed the file in the last seven days or not. For Page models we do this by adding the following function, which implements that check.
def is_valid(self):
    return (datetime.now() - self.last_checked).days < 7
For the RobotsTxt model we can take advantage of the last modified value stored in the RobotFileParser that we’re wrapping. This is a unix timestamp, so the is_valid function needs to be a little bit different, but it follows the same pattern.
def is_valid(self):
    return (time.time() - self.robot_parser.mtime()) < 7 * 24 * 60 * 60
To update the stored copy of a robots.txt file we need to get the currently stored version, read a new one, set the last modified timestamp and then write it back to the database. To avoid hitting the same server too often we can use Django’s cache to store a value for ten seconds, and sleep while that value still exists.
def update(self):
    # Avoid hammering the same server: if this domain was hit in the last ten
    # seconds, wait for the cache entry to expire before fetching robots.txt.
    while cache.get(self.domain) is not None:
        time.sleep(1)
    cache.set(self.domain, True, 10)

    # Fetch a fresh copy, record when it was read, and store it back in CouchDB.
    parser = self.robot_parser
    parser.read()
    parser.modified()
    self.robot_parser = parser

    self.store(settings.db)
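One thing to note is that this relies on Django’s cache framework being configured. Any backend will do as a sketch, but the local-memory backend only coordinates within a single process, so something shared like memcached is a better fit once several workers are crawling. A minimal, purely illustrative settings entry:

# settings.py (sketch) -- locmem is fine for development, but it is per-process,
# so use a shared backend such as memcached when running multiple crawler workers.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    }
}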
Once we’ve updated the stored file we need to be able to query it. This function just passes the URL being tested through to the underlying RobotFileParser, along with our user agent string.
def is_allowed(self, url):
    return self.robot_parser.can_fetch(settings.USER_AGENT, url)
The final piece in our robots.txt puzzle is a function to pull the right object out of the database. We’ll need a view that has the protocol and domain for each file as the key.
@staticmethod
def get_by_domain(protocol, domain):
    r = settings.db.view("robotstxt/by_domain", key=[protocol, domain])
We query that view and, if it returns a result, we load the object. If it’s still valid then we can return it right away, otherwise we need to update it.
    if len(r) > 0:
        doc = RobotsTxt.load(settings.db, r.rows[0].value)

        if doc.is_valid():
            return doc
If we’ve never loaded this domain’s robots.txt file before then we need to create a blank object. The final step is to read the file and store it in the database.
    else:
        doc = RobotsTxt(protocol=protocol, domain=domain)

    doc.update()
    return doc
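Putting the pieces together, checking whether we’re allowed to crawl a given URL looks something like this. It’s only a sketch: the example URL and the crawl() call are placeholders, but get_by_domain and is_allowed are the methods defined above.

from urlparse import urlparse  # urllib.parse in Python 3

url = "http://example.com/some/page.html"
parts = urlparse(url)

# Fetch (and refresh, if it's stale) the robots.txt for this protocol/domain pair.
robots = RobotsTxt.get_by_domain(parts.scheme, parts.netloc)

# Only go on to crawl the page if the site allows it.
if robots.is_allowed(url):
    crawl(url)  # placeholder for the crawl task we'll build in the next post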
For completeness, here is the map file required for the robotstxt/by_domain view.
function (doc) {
    if(doc.type == "robotstxt") {
        emit([doc.protocol, doc.domain], doc._id);
    }
}
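If you’d rather push the view from Python than maintain a separate couchapp, couchdb-python’s ViewDefinition helper can sync the same map function to the server. A sketch, reusing the settings.db handle from above:

from couchdb.design import ViewDefinition

# Define the "by_domain" view inside the "robotstxt" design document.
by_domain = ViewDefinition("robotstxt", "by_domain", """
function (doc) {
    if(doc.type == "robotstxt") {
        emit([doc.protocol, doc.domain], doc._id);
    }
}
""")

# Create or update the design document on the server.
by_domain.sync(settings.db)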
In this post we’ve discussed how to represent a webpage in our database, as well as keep track of what paths we are and aren’t allowed to access. We’ve also seen how to retrieve the robots.txt files and update them if they’re too old.
Now that we can test whether we’re allowed to access a URL, in the next post in this series I’ll show you how to begin crawling the web and populating our database.
Read part 3.
Comments
Interesting idea, building a search index this way.
I do wonder about something, since you want to be "beating Google".
Why are you adding Whoosh to the stack, instead of something like SOLR?
Both can also be used from django-haystack if you want to build a search index.
Having seen a db index of +1000 records crawl under Whoosh, and fly with SOLR, I'd kindly suggest to reconsider something :)
Diederik van der Boor
04 Oct 2011
You're completely right, but I think that if I were serious about "beating Google" then even SOLR wouldn't cut it! SOLR's a great piece of tech, but it's not as simple as Whoosh to get up and running. For a few shortish blog posts I didn't want to have to spend ages describing how to configure SOLR, with Whoosh it's a few lines of code and you're away. Whoosh also has the advantage of being pure-Python, so it fits with Django and Celery.
Andrew Wilkinson
05 Oct 2011