Beating Google With CouchDB, Celery and Whoosh (Part 4)

In this series I’m showing you how to build a webcrawler and search engine using standard Python-based tools like Django, Celery and Whoosh with a CouchDB backend. In previous posts we created a data structure, parsed and stored robots.txt, and stored a single webpage in our document. In this post I’ll show you how to parse out the links from our stored HTML document so we can complete the crawler, and we’ll start calculating the rank for the pages in our database.

There are several different ways of parsing out the links in a given HTML document. You can just use a regular expression to pull the urls out, or you can use a more complete but also more complicated (and slower) method of parsing the HTML using the standard Python htmlparser library, or the wonderful Beautiful Soup. The point of this series isn’t to build a complete webcrawler, but to show you the basic building blocks. So, for simplicity’s sake I’ll use a regular expression.

import re

link_single_re = re.compile(r"<a[^>]+href='([^']+)'")
link_double_re = re.compile(r'<a[^>]+href="([^"]+)"')

All we need to look for is an href attribute in an a tag. We’ll use two regular expressions to handle single and double quotes, and then build a list containing all the links in the document.
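As a rough illustration of how the two expressions can be combined into one list, here is a minimal sketch. The function name extract_links and the sample HTML are my own assumptions for this example, not code from the series.

```python
import re

# The two expressions from the post: one for single-quoted and one for
# double-quoted href attributes.
link_single_re = re.compile(r"<a[^>]+href='([^']+)'")
link_double_re = re.compile(r'<a[^>]+href="([^"]+)"')


def extract_links(html):
    """Return the href values found in the HTML (hypothetical helper).

    Single-quoted matches come first, then double-quoted ones, so the
    result is not in document order.
    """
    links = link_single_re.findall(html)
    links += link_double_re.findall(html)
    return links


# Made-up sample document containing one link of each quoting style.
sample = (
    '<p><a class="x" href="http://example.com/a">A</a> '
    "<a id='b' href='/b'>B</a></p>"
)
print(extract_links(sample))
```

Keep in mind that a regular expression this simple will also pick up other tags whose names start with "a" (such as area) and will miss unquoted attributes, which is part of the trade-off against a real HTML parser.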

Read More...

Beating Google With CouchDB, Celery and Whoosh (Part 3)

In this series I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we’ll start crawling the web and filling our database with the contents of pages.

One of the rules we set down was to not request a page too often. If, by accident, we try to retrieve a page more than once a week then we don’t want that request to actually make it to the internet. To help prevent this we’ll extend the Page class we created in the last post with a function called get_by_url. This static method will take a url and return the Page object that represents it, retrieving the page if we don’t already have a copy. You could create this as an independent function, but I prefer to use static methods to keep things tidy.
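To make the once-a-week rule concrete, here is a minimal sketch of a freshness check. The names MIN_FETCH_INTERVAL and is_fresh are hypothetical and are not taken from the series’ code; the sketch only assumes that each stored page records when it was last fetched.

```python
from datetime import datetime, timedelta

# Assumed minimum gap between two requests for the same page.
MIN_FETCH_INTERVAL = timedelta(days=7)


def is_fresh(last_checked, now=None):
    """Return True if the stored copy is less than a week old.

    A fresh copy means we should reuse it rather than hit the internet.
    """
    if now is None:
        now = datetime.now()
    return now - last_checked < MIN_FETCH_INTERVAL


# A page fetched two days ago is still fresh; one fetched eight days ago is not.
print(is_fresh(datetime(2024, 1, 1), now=datetime(2024, 1, 3)))
print(is_fresh(datetime(2024, 1, 1), now=datetime(2024, 1, 9)))
```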

We only actually want to retrieve the page from the internet in one of the three tasks that we’re going to create, so we’ll give get_by_url a parameter, update, which when set to False makes the function return None if we don’t already have a copy of the page.

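@staticmethod
def get_by_url(url, update=True):
    # Look up the stored page for this url.
    r = settings.db.view("page/by_url", key=url)
    if len(r.rows) == 1:
        doc = Page.load(settings.db, r.rows[0].value)
        if doc.is_valid():
            return doc
    elif not update:
        # No stored copy, and we were asked not to fetch one.
        return None
    else:
        doc = Page(url=url)

    # Reached for a new page, or for a stored copy that is no longer valid.
    doc.update()
    return doc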

The key line in the static method is doc.update(). This calls the function that retrieves the page and makes sure we respect the robots.txt file. Let’s look at what happens in that function.

Read More...

Beating Google With CouchDB, Celery and Whoosh (Part 2)

In this series I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we’ll begin by creating the data structure for storing the pages in the database, and write the first parts of the webcrawler.

CouchDB’s Python library has a simple ORM system that makes it easy to convert between the JSON objects stored in the database and a Python object.

To create the class you just need to specify the names of the fields and their types. So, what does a search engine need to store? The url is an obvious one, as is the content of the page. We also need to know when we last accessed the page. To make things easier we’ll also have a list of the urls that the page links to. One of the great advantages of a database like CouchDB is that we don’t need to create a separate table to hold the links; we can just include them directly in the main document. To help return the best pages we’ll use a PageRank-like algorithm to rank the page, so we also need to store that rank. Finally, as is good practice on CouchDB, we’ll give the document a type field so we can write views that only target this document type.
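To picture the shape of the document described above, here is a plain-Python sketch using a dataclass. It deliberately does not use the CouchDB library’s mapping classes, and the class name PageSketch and the exact field names are my own assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PageSketch:
    """Stand-in for the page document; not the CouchDB mapping class."""

    url: str
    content: str = ""
    # When we last accessed the page.
    last_checked: datetime = field(default_factory=datetime.now)
    # Links live inside the document, so no separate table is needed.
    links: List[str] = field(default_factory=list)
    rank: float = 0.0
    # Lets views target only this kind of document.
    type: str = "page"

    def to_json(self):
        """Serialise to the kind of JSON object the database would store."""
        doc = asdict(self)
        doc["last_checked"] = self.last_checked.isoformat()
        return json.dumps(doc)


page = PageSketch(url="http://example.com/", links=["http://example.com/about"])
print(page.to_json())
```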

Read More...

Beating Google With CouchDB, Celery and Whoosh (Part 1)

Ok, let’s get this out of the way right at the start - the title is a huge overstatement. This series of posts will show you how to create a search engine using standard Python tools like Django, Celery and Whoosh with CouchDB as the backend.

Celery is a message passing library that makes it really easy to run background tasks and to spread them across a number of nodes. The most recent release added the NoSQL database CouchDB as a possible backend. I’m a huge fan of CouchDB, and the idea of running both my database and message passing backend on the same software really appealed to me. Unfortunately the documentation doesn’t make it clear what you need to do to get CouchDB working, and what the downsides are. I decided to write this series partly to explain how Celery and CouchDB work, but also to experiment with using them together.

In this series I’m going to talk about setting up Celery to work with Django, using CouchDB as a backend. I’m also going to show you how to use Celery to create a web-crawler. We’ll then index the crawled pages using Whoosh and use a PageRank-like algorithm to help rank the results. Finally, we’ll attach a simple Django frontend to the search engine for querying it.

Read More...

Crowd Sourcing Mapping

Recently Google announced that they were making their crowd sourcing mapping tools available to users in the United States. This tool lets users edit Google Maps, adding businesses and even roads, railways and rivers. This raises interesting questions about whether the wisdom of the crowd can be applied to data that requires a high degree of accuracy.

Open Street Map has been doing this since 2004, and has put together an amazing resource of free map data, but only recently has Google begun to allow people to edit its maps for large parts of the world.

Accurate mapping data is terribly important. While the majority of Google Maps queries are likely to be “how do I get from my house to my aunt’s?” some are much more important. A war was almost caused when the border between Nicaragua and Costa Rica was incorrectly placed. While a war is a little far-fetched, it’s not hard to imagine how a mistake on a map could cost someone’s life in a medical emergency.

Read More...