21 Oct 2011
In the previous seven posts I’ve gone through all the stages in building a search engine. If you want to try
and run it for yourself and tweak it to make it even better then you can. I’ve put the
code up on GitHub. All I ask is that if you beat Google,
you give me a credit somewhere.
When you’ve downloaded the code it should prove to be quite simple to get running. First you’ll need to edit
settings.py. It should work out of the box, but you should change the USER_AGENT
setting to something
unique. You may also want to adjust some of the other settings, such as the database connection or the
CouchDB URLs. To set up the CouchDB views, run python manage.py update_couchdb.
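As a rough sketch, the edits to settings.py might look something like this. USER_AGENT is the setting named above; the other names are illustrative, so check the downloaded file for the project's actual ones.

```python
# Hypothetical settings.py fragment -- only USER_AGENT is a
# setting named in the post; the rest are placeholders.

# Identify your crawler uniquely and politely to webmasters.
USER_AGENT = "MySearchBot/0.1 (+http://example.com/bot)"

# Where crawled pages are stored (illustrative setting name).
COUCHDB_URL = "http://localhost:5984/crawler"
```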
Next, to run the celery daemon you’ll need to type the following two commands:
python manage.py celeryd -Q retrieve
python manage.py celeryd -Q process
This sets up the daemons to monitor the two queues and process the tasks. As mentioned in a previous post,
two queues are needed to prevent one set of tasks from swamping the other.
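The way tasks end up on those two queues can be sketched with a Celery routing table like the one below. The task names are hypothetical, not necessarily the project's own; the point is that download tasks and processing tasks are pinned to separate queues.

```python
# Sketch of Celery queue routing (Celery 2.x era, as used with
# django-celery at the time). Task names here are illustrative.
# Network-bound retrieval and CPU-bound processing each get
# their own queue, so neither can starve the other.
CELERY_ROUTES = {
    "tasks.retrieve_page": {"queue": "retrieve"},
    "tasks.process_page": {"queue": "process"},
}
```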
Read More...
19 Oct 2011
The key ingredients of our search engine are now in place, but we face a problem. We can download webpages and
store them in CouchDB. We can rank them in order of importance and
query them using Whoosh, but the internet is big,
really big! A
single server doesn’t even come close to being able to hold all the information that you would want it to -
Google has an estimated
900,000
servers. So how do we effectively scale the software we’ve written so far?
The reason I started writing this series was to investigate how well Celery’s integration with CouchDB works.
This gives us an immediate win in terms of scaling as we don’t need to worry about a different backend, such
as RabbitMQ. Celery itself is designed to scale so we can run
celeryd
daemons on as many boxes as we like and the jobs will be divided amongst them. This means that
our indexing and ranking processes will scale easily.
CouchDB is not designed to scale across multiple machines, but there is some mature software,
CouchDB-lounge, that does just that. I won’t go into how
to set this up, but fundamentally you set up a proxy that sits in front of your CouchDB cluster and shards
the data across the nodes. It deals with the job of merging view results and managing where the data is
actually stored so you don’t have to. O’Reilly’s CouchDB: The Definitive Guide has a chapter
on clustering that is well worth a read.
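To make the sharding idea concrete, here is a minimal Python sketch of hash-based shard selection. This illustrates the principle only; CouchDB-lounge's actual hashing, view merging and resharding are considerably more involved.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Pick which CouchDB node stores a document.

    Illustrative only: a stable hash of the document id, modulo
    the number of shards, so the same id always lands on the
    same node. A real lounge proxy also merges view results and
    handles resharding, which this ignores.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# The same url always maps to the same shard:
shard = shard_for("http://example.com/", 4)
```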
Read More...
17 Oct 2011
This weekend I joined the hysterical masses and upgraded my increasingly ancient iPhone 3G to a shiny new 64GB
iPhone 4S. Except that it was actually a bit of an anticlimax. I went into my local
O2 shop at about 10:30am on Saturday morning, the day after the launch, and
purchased a phone. No queueing, no raging hordes. I didn’t even have to shove a granny out of the way to get
one. After handing over my credit card, cringing at the expense, it was back home to enjoy the
famous Apple unboxing experience.
I wish I’d never upgraded my 3G to iOS 4.2. Up until that point it was a great phone. Afterwards it was slow
and applications would repeatedly crash on startup. Did I mention it was slow?
It’s hard to express just how much quicker the 4S is compared to my 3G. Often just typing my passcode
would be too quick for the 3G and it would miss one of the numbers forcing me to go back. No danger of this
with the 4S though. Launching applications, browsing the web and taking photos are all super speedy.
Although the screen is the same as the iPhone 4’s, it’s still incredible. It’s so bright and sharp it’s really a
joy to use. It really comes into its own when browsing webpages that are designed for bigger screens. The
extra detail really helps you to work out where to zoom in.
Read More...
13 Oct 2011
We’re nearing the end of our plot to create a Google-beating search engine (in my dreams at least) and in
this post we’ll build the interface to query the index we’ve built up. Like Google’s, the interface is very
simple, just a text box on one page and a list of results on another.
To begin with we just need a page with a query box. To make the page slightly more interesting we’ll also
include the number of pages in the index, and a list of the top documents as ordered by our ranking algorithm.
In the templates on this page we reference base.html,
which provides the boilerplate code needed to
make an HTML page.
Read More...
11 Oct 2011
In this post we’ll continue building the backend for our search engine by implementing the algorithm we
designed in the last post for ranking pages. We’ll also build an index of our pages with
Whoosh, a pure-Python full-text indexer and
query engine.
To calculate the rank of a page we need to know what other pages link to a given url, and how many links that
page has. The code below is a CouchDB map called page/links_to_url. For each page this will output a
row for each link on the page with the url linked to as the key and the page’s rank and number of links as the
value.
function (doc) {
  if (doc.type == "page") {
    // One row per outbound link: key is the target url,
    // value carries this page's rank and its link count.
    for (var i = 0; i < doc.links.length; i++) {
      emit(doc.links[i], [doc.rank, doc.links.length]);
    }
  }
}
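To show how those rows get used, here is a hedged Python sketch of combining them into a PageRank-style score for one url. Each row's value is [linking page's rank, linking page's link count], so the new rank is a damped sum of rank-over-link-count contributions. The damping factor and formula are the standard PageRank ones; the series' ranking post may differ in detail.

```python
# Standard PageRank damping factor; an assumption, not
# necessarily the value the project uses.
DAMPING = 0.85

def rank_from_rows(rows, damping=DAMPING):
    """Combine view rows for one url into a rank.

    Each row is (linking_page_rank, linking_page_link_count),
    matching the value emitted by page/links_to_url.
    """
    contribution = sum(rank / links for rank, links in rows)
    return (1 - damping) + damping * contribution

# Two pages link here: one of rank 1.0 with 2 outbound links,
# one of rank 0.5 with a single outbound link.
new_rank = rank_from_rows([(1.0, 2), (0.5, 1)])
```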
Read More...