Beating Google With CouchDB, Celery and Whoosh (Part 8)

In the previous seven posts I’ve gone through all the stages in building a search engine. If you want to try and run it for yourself and tweak it to make it even better then you can. I’ve put the code up on GitHub. All I ask is that if you beat Google, you give me a credit somewhere.

When you’ve downloaded the code it should prove to be quite simple to get running. First you’ll need to edit settings.py. It should work out of the box, but you should change the USER_AGENT setting to something unique. You may also want to adjust some of the other settings, such as the database connection or CouchDB URLs. To set up the CouchDB views, type python manage.py update_couchdb.

Next, to run the Celery daemons you’ll need to type the following two commands:

python manage.py celeryd -Q retrieve
python manage.py celeryd -Q process

This sets up the daemons to monitor the two queues and process the tasks. As mentioned in a previous post, two queues are needed to prevent one set of tasks from swamping the other.
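
As a rough illustration of how tasks could be split between the two queues, here is a minimal sketch that assumes the routing is done with a CELERY_ROUTES-style mapping in settings.py. The task names are made up for the example and may not match the ones in the repository; queue_for is a helper written only to show how the mapping resolves.

```python
# Hypothetical task names; the real ones live in the project's code.
CELERY_ROUTES = {
    "crawler.tasks.retrieve_page": {"queue": "retrieve"},
    "crawler.tasks.process_page": {"queue": "process"},
}


def queue_for(task_name, routes=CELERY_ROUTES, default="celery"):
    """Return the name of the queue a task would be sent to.

    Tasks without an explicit route fall back to the default queue.
    """
    return routes.get(task_name, {}).get("queue", default)


print(queue_for("crawler.tasks.retrieve_page"))
print(queue_for("crawler.tasks.process_page"))
```

Because each daemon is started with -Q and a single queue name, a backlog of tasks on one queue does not stop the other daemon from working through its own.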

Read More...

Beating Google With CouchDB, Celery and Whoosh (Part 7)

The key ingredients of our search engine are now in place, but we face a problem. We can download webpages and store them in CouchDB. We can rank them in order of importance and query them using Whoosh, but the internet is big, really big! A single server doesn’t even come close to being able to hold all the information that you would want it to - Google has an estimated 900,000 servers. So how do we scale the software we’ve written so far effectively?

The reason I started writing this series was to investigate how well Celery’s integration with CouchDB works. This gives us an immediate win in terms of scaling as we don’t need to worry about a different backend, such as RabbitMQ. Celery itself is designed to scale, so we can run celeryd daemons on as many boxes as we like and the jobs will be divided amongst them. This means that our indexing and ranking processes will scale easily.

CouchDB is not designed to scale across multiple machines, but there is some mature software, CouchDB-lounge, that does just that. I won’t go into how to set this up, but fundamentally you set up a proxy that sits in front of your CouchDB cluster and shards the data across the nodes. It deals with the job of merging view results and managing where the data is actually stored so you don’t have to. O’Reilly’s CouchDB: The Definitive Guide has a chapter on clustering that is well worth a read.
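
To give a feel for what sharding means here, the sketch below maps a document id to one of several nodes by hashing it. This is only an illustration of the general idea, not a description of how CouchDB-lounge chooses a node, and the node URLs are invented for the example.

```python
import hashlib

# Hypothetical node URLs, used purely for illustration.
NODES = [
    "http://couch1.example:5984",
    "http://couch2.example:5984",
    "http://couch3.example:5984",
]


def shard_for(doc_id, nodes=NODES):
    """Pick a node for a document by hashing its id.

    The same id always hashes to the same node, so a proxy can find a
    document again without keeping a lookup table.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


print(shard_for("http://www.example.com/"))
```

A simple modulo scheme like this moves most documents when a node is added or removed, which is one reason to leave the job to dedicated software.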

Read More...

iPhone 4S

Apple Store - London

This weekend I joined the hysterical masses and upgraded my increasingly ancient iPhone 3G to a shiny new 64GB iPhone 4S. Except that it was actually a bit of an anticlimax. I went into my local O2 shop at about 10:30am on Saturday morning, the day after the launch, and purchased a phone. No queueing, no raging hordes. I didn’t even have to shove a granny out of the way to get one. However, after handing over my credit card while cringing at the expense, it was back home to enjoy the famous Apple unboxing experience.

I wish I’d never upgraded my 3G to iOS 4.2. Up until that point it was a great phone. Afterwards it was slow and applications would repeatedly crash on start up. Did I mention it was slow?

It’s hard to express just how much quicker the 4S is compared to my 3G. Often just typing my passcode would be too quick for the 3G and it would miss one of the numbers, forcing me to go back. No danger of this with the 4S though. Starting applications, browsing the web and taking photos are all super speedy.

Although it’s the same as the iPhone 4 the screen is still incredible. It’s so bright and sharp it’s really a joy to use. It really comes into its own when browsing webpages that are designed for bigger screens. The extra detail really helps you to work out where to zoom in.

Read More...

Beating Google With CouchDB, Celery and Whoosh (Part 6)

We’re nearing the end of our plot to create a Google-beating search engine (in my dreams at least) and in this post we’ll build the interface to query the index we’ve built up. Like Google the interface is very simple, just a text box on one page and a list of results on another.

To begin with we just need a page with a query box. To make the page slightly more interesting we’ll also include the number of pages in the index, and a list of the top documents as ordered by our ranking algorithm.

In the templates on this page we reference base.html, which provides the boilerplate code needed to make an HTML page.

Read More...

Beating Google With CouchDB, Celery and Whoosh (Part 5)

In this post we’ll continue building the backend for our search engine by implementing the algorithm we designed in the last post for ranking pages. We’ll also build an index of our pages with Whoosh, a pure-Python full-text indexer and query engine.

To calculate the rank of a page we need to know what other pages link to a given url, and how many links that page has. The code below is a CouchDB map called page/links_to_url. For each page this will output a row for each link on the page with the url linked to as the key and the page’s rank and number of links as the value.

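function (doc) {
    // Only page documents carry links and a rank.
    if (doc.type == "page") {
        // Declare i locally so it does not leak into the global scope.
        for (var i = 0; i < doc.links.length; i++) {
            // Key: the url linked to. Value: this page's rank and link count.
            emit(doc.links[i], [doc.rank, doc.links.length]);
        }
    }
}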
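
To see what rows this view produces, here is a small Python sketch that mimics the map function over a few in-memory documents and groups the emitted values by key. The sample documents and the links_to_url generator are invented for illustration; in the real system CouchDB runs the JavaScript map and returns the rows when the view is queried.

```python
from collections import defaultdict


def links_to_url(doc):
    """Python stand-in for the page/links_to_url map function."""
    if doc.get("type") == "page":
        links = doc.get("links", [])
        for link in links:
            # Key: the url linked to. Value: [rank, number of links].
            yield link, [doc.get("rank"), len(links)]


# Made-up documents, shaped like the ones the map function expects.
docs = [
    {"type": "page", "url": "http://a.example/", "rank": 1.0,
     "links": ["http://b.example/", "http://c.example/"]},
    {"type": "page", "url": "http://b.example/", "rank": 0.5,
     "links": ["http://c.example/"]},
    {"type": "other", "url": "http://c.example/"},  # skipped: not a page
]

# Group the emitted values by key, as a view query for one url would.
rows_by_key = defaultdict(list)
for doc in docs:
    for key, value in links_to_url(doc):
        rows_by_key[key].append(value)

for key in sorted(rows_by_key):
    print(key, rows_by_key[key])
```

Looking up a url in rows_by_key gives the rank and link count of every page that links to it, which is the information the ranking step needs.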
Read More...