iPhone 4S

17 Oct 2011

This weekend I joined the hysterical masses and upgraded my increasingly ancient iPhone 3G to a shiny new 64GB iPhone 4S. Except that it was actually a bit of an anticlimax. I went into my local O2 shop at about 10:30am on Saturday morning, the day after the launch, and purchased a phone. No queueing, no raging hoards. I didn’t even have to shove a granny out of the way to get one. However, after handing over my credit card while cringing at the expense it was back home to enjoy the famous Apple unboxing experience.

I wish I’d never upgraded my 3G to iOS 4.2. Up until that point it was a great phone. Afterwards it was slow and applications would repeated crash on start up. Did I mention it was slow?

It’s hard to express just how much quicker the 4S is compared to my 3G. Often just typing my the passcode would be too quick for the 3G and it would miss one of the numbers forcing me to go back. No danger of this with the 4G though. Application starting, browsing the web, taking photos are all super speedy.

Although it’s the same as the iPhone 4 the screen is still incredible. It’s so bright and sharp it’s really a joy to use. It really comes into its own when browsing webpages that are designed for bigger screens. The extra detail really helps you to work out where to zoom in.

Beating Google With CouchDB, Celery and Whoosh (Part 6)

13 Oct 2011

We’re nearing the end of our plot to create a Google-beating search engine (in my dreams at least) and in this post we’ll build the interface to query the index we’ve built up. Like Google the interface is very simple, just a text box on one page and a list of results on another.

To begin with we just need a page with a query box. To make the page slightly more interesting we’ll also include the number of pages in the index, and a list of the top documents as ordered by our ranking algorithm.

In the templates on this page we reference base.html which provides the boiler plate code needed to make an HTML page.

Beating Google With CouchDB, Celery and Whoosh (Part 5)

11 Oct 2011

In this post we’ll continue building the backend for our search engine by implementing the algorithm we designed in the last post for ranking pages. We’ll also build a index of our pages with Whoosh, a pure-Python full-text indexer and query engine.

To calculate the rank of a page we need to know what other pages link to a given url, and how many links that page has. The code below is a CouchDB map called page/links_to_url. For each page this will output a row for each link on the page with the url linked to as the key and the page’s rank and number of links as the value.

function (doc) {
    if(doc.type == "page") {
        for(i = 0; i < doc.links.length; i++) {
            emit(doc.links[i], [doc.rank, doc.links.length]);
        }
    }
}

Beating Google With CouchDB, Celery and Whoosh (Part 4)

06 Oct 2011

In this series I’m showing you how to build a webcrawler and search engine using standard Python based tools like Django, Celery and Whoosh with a CouchDB backend. In previous posts we created a data structure, parsed and stored robots.txt and stored a single webpage in our document. In this post I’ll show you how to parse out the links from our stored HTML document so we can complete the crawler, and we’ll start calculating the rank for the pages in our database.

There are several different ways of parsing out the links in a given HTML document. You can just use a regular expression to pull the urls out, or you can use a more complete but also more complicated (and slower) method of parsing the HTML using the standard Python htmlparser library, or the wonderful Beautiful Soup. The point of this series isn’t to build a complete webcrawler, but to show you the basic building blocks. So, for simplicity’s sake I’ll use a regular expression.

link_single_re = re.compile(r"<a[^>]+href='([^']+)'")
link_double_re = re.compile(r'<a[^>]+href="([^"]+)"')

All we need to look for an href attribute in an a tag. We’ll use two regular expressions to handle single and double quotes, and then build a list containing all the links in the document.

Beating Google With CouchDB, Celery and Whoosh (Part 3)

04 Oct 2011

In this series I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we’ll start crawling the web and filling our database with the contents of pages.

One of the rules we set down was to not request a page too often. If, by accident, we try to retrieve a page more than once a week then don’t want that request to actually make it to the internet. To help prevent this we’ll extend the Page class we created in the last post with a function called get_by_url. This static method will take a url and return the Page object that represents it, retrieving the page if we don’t already have a copy. You could create this as an independent function, but I prefer to use static methods to keep things tidy.

We only actually want to retrieve the page from the internet in one of the three tasks the we’re going to create so we’ll give get_by_url a parameter, update that enables us to return None if we don’t have a copy of the page.

@staticmethod
def get_by_url(url, update=True):
    r = settings.db.view("page/by_url", key=url)
    if len(r.rows) == 1:
        doc = Page.load(settings.db, r.rows[0].value)
        if doc.is_valid():
            return doc
    elif not update:
        return None
    else:
        doc = Page(url=url)
        doc.update()
        return doc

The key line in the static method is doc.update(). This calls the function to retrieves the page and makes sure we respect the robots.txt file. Let’s look at what happens in that function.

Older Newer

Andrew Wilkinson

iPhone 4S

Beating Google With CouchDB, Celery and Whoosh (Part 6)

Beating Google With CouchDB, Celery and Whoosh (Part 5)

Beating Google With CouchDB, Celery and Whoosh (Part 4)

Beating Google With CouchDB, Celery and Whoosh (Part 3)