Scraping the Web in Golang with Colly and Goquery

If told to write a web crawler, the tools at the top of my mind would be Python based: BeautifulSoup or Scrapy. However, the ecosystem for writing web scrapers and crawlers in Go is quite robust. In particular, Colly and Goquery are extremely powerful tools that afford a similar amount of expressiveness and flexibility to their Python-based counterparts.

A Brief Introduction to Web Crawling

What is a web crawler? Essentially, a web crawler works by inspecting the HTML content of web pages and performing some type of action based on that content. Usually, pages are scraped for outbound links, which the crawler places in a queue to visit. We can also save data extracted from the current page. For example, if our crawler lands on a Wikipedia page, we may save that page’s text and title.

The simplest web crawlers perform the following algorithm:

initialize Queue
enqueue SeedURL

while Queue is not empty:
    URL = Pop element from Queue
    Page = Visit(URL)
    Links = ExtractLinks(Page)
    Enqueue Links on Queue

The Visit and ExtractLinks functions are what change from crawler to crawler; both are application specific. We might have a crawler that tries to interpret the entire graph of the web, like Google does, or something simple that just scrapes Wikipedia.
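To make the loop above concrete, here is a minimal sketch in Go. It is not how Colly works internally: the crawl function and the in-memory links map are stand-ins I made up so the example runs without touching the network. I also added a visited set, which the pseudocode leaves out, so that pages linking back to each other do not loop forever.

```go
package main

import "fmt"

// crawl runs the breadth-first loop from the pseudocode over a
// hypothetical in-memory link graph (page URL -> outbound links).
// The visited set keeps a page from being enqueued twice.
func crawl(seed string, links map[string][]string) []string {
	queue := []string{seed}
	visited := map[string]bool{seed: true}
	var order []string

	for len(queue) > 0 {
		// Pop the next URL off the front of the queue
		url := queue[0]
		queue = queue[1:]

		// "Visit" the page by recording it
		order = append(order, url)

		// "ExtractLinks" by looking up the page's outbound links
		for _, next := range links[url] {
			if !visited[next] {
				visited[next] = true
				queue = append(queue, next)
			}
		}
	}
	return order
}

func main() {
	links := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b", "/c"},
		"/b": {"/"},
	}
	fmt.Println(crawl("/", links)) // [/ /a /b /c]
}
```

In a real crawler, the map lookup would be replaced by an HTTP request plus HTML parsing, which is exactly the part Colly takes care of.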

Things quickly become more complicated as your use case grows. Want many, many more pages to be scraped? You might have to start looking into a more sophisticated crawler that runs in parallel. Want to scrape more complicated pages? You may need to find a more powerful HTML parser.
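As a rough idea of what "runs in parallel" means here, the sketch below fans a list of URLs out to a fixed number of worker goroutines. The visitAll helper and the fake visit function are hypothetical names of my own; a real crawler would do an HTTP fetch where the stand-in returns a string.

```go
package main

import (
	"fmt"
	"sync"
)

// visitAll calls visit for every URL using at most `workers` goroutines
// and collects the results keyed by URL. Hypothetical helper for illustration.
func visitAll(urls []string, workers int, visit func(string) string) map[string]string {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(map[string]string, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body := visit(u)
				// The results map is shared, so guard writes with a mutex
				mu.Lock()
				results[u] = body
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"/a", "/b", "/c"}
	pages := visitAll(urls, 2, func(u string) string {
		return "contents of " + u // stand-in for an HTTP fetch
	})
	fmt.Println(len(pages), pages["/b"]) // 3 contents of /b
}
```

Getting this right by hand (shared state, shutdown, per-domain politeness) is fiddly, which is a good argument for reaching for a framework.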

Colly

Colly is a flexible framework for writing web crawlers in Go. It’s very much batteries-included. Out of the box, you get support for:

* Rate limiting
* Parallel crawling
* Respecting robots.txt
* HTML/Link parsing

The fundamental component of a Colly crawler is a “Collector”. Collectors keep track of pages that are queued to visit, and maintain callbacks for when a page is being scraped.

Setup

Creating a Colly collector is simple, but we have lots of options that we may elect to use:

c := colly.NewCollector(
    // Restrict crawling to specific domains
    colly.AllowedDomains("godoc.org"),
    // Allow visiting the same page multiple times
    colly.AllowURLRevisit(),
    // Allow crawling to be done in parallel / async
    colly.Async(true),
)

Of course, you can also just stick with a bare colly.NewCollector() and handle these addons yourself.

We might also want to place specific limits on our crawler’s behavior to be good web citizens. Colly makes it easy to introduce rate limiting:

c.Limit(&colly.LimitRule{
    // Filter domains affected by this rule
    DomainGlob:  "*godoc.org*",
    // Set a delay between requests to these domains
    Delay: 1 * time.Second,
    // Add an additional random delay
    RandomDelay: 1 * time.Second,
})

Some websites are more picky than others when it comes to the amount of traffic they allow before cutting you off. Generally, setting a delay of a couple seconds should keep you off the “naughty list”.

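From here, we can start our collector by seeding it with a URL. If the collector was created with colly.Async(true), Visit returns right away, so follow it with c.Wait() to block until the crawl finishes: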

c.Visit("https://godoc.org")

OnHTML

We now have a well-behaved collector that can start at an arbitrary website. Next, we want it to do something: it needs to inspect pages so it can extract links and other data.

The colly.Collector.OnHTML method allows you to register a callback that runs whenever the collector reaches a portion of a page matching a given CSS selector. For starters, we can get a callback whenever our crawler sees an <a> tag that has an href attribute.

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    // Extract the link from the anchor HTML element    
    link := e.Attr("href")
    // Tell the collector to visit the link
    c.Visit(e.Request.AbsoluteURL(link))
})

As seen above, in the callback you’re given a colly.HTMLElement that contains the matching HTML data.

Now, we have the beginnings of an actual web crawler: we find links on the pages we visit, and tell our collector to visit those links in subsequent requests.
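The link callback passes the href through AbsoluteURL because href values are often relative to the current page. The sketch below shows what that kind of resolution amounts to using only net/url. The resolve helper is my own name, and I am not claiming this is exactly what Colly does under the hood.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns an href found on pageURL into an absolute URL.
// Hypothetical helper for illustration.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	hrefs := []string{"/net/http", "../about", "https://golang.org/"}
	for _, href := range hrefs {
		abs, err := resolve("https://godoc.org/pkg/", href)
		if err != nil {
			fmt.Println("skipping", href, err)
			continue
		}
		fmt.Println(abs)
	}
	// Prints:
	// https://godoc.org/net/http
	// https://godoc.org/about
	// https://golang.org/
}
```

Note that the last link points at a different domain. With AllowedDomains set on the collector, a link like that would simply not be visited.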

OnHTML is a powerful tool. It accepts CSS selectors (e.g. div.my_fancy_class or #someElementId), and you can attach multiple OnHTML callbacks to your collector to handle different page types.

Colly’s HTMLElement struct is quite useful. In addition to getting attributes with the Attr function, you can also extract text. For example, we may want to print a page’s title:

c.OnHTML("title", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})

OnRequest / OnResponse

There may be times when you don’t need a specific HTML element from a page, but instead want to know when your crawler is about to retrieve or has just retrieved a page. For this, Colly exposes the OnRequest and OnResponse callbacks.

These callbacks are called for each visited page. As for how they fit in with OnHTML, here is the order in which callbacks are called per page:

1. OnRequest
2. OnResponse
3. OnHTML
4. OnScraped (not referenced in this post, but may be useful to you)

Of particular use is the ability to abort a request within the OnRequest callback. This is handy when you want your collector to stop after a certain number of pages.

numVisited := 0
c.OnRequest(func(r *colly.Request) {
    if numVisited > 100 {
        r.Abort()
    }
    numVisited++
})
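One caveat with a plain int counter: if the collector runs with colly.Async(true), callbacks can fire from several goroutines at once, and the increment becomes a data race. A minimal sketch of a concurrency-safe alternative is below. The budget type is a hypothetical helper of mine, not part of Colly.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// budget is a hypothetical helper that permits at most max visits.
type budget struct {
	used int64
	max  int64
}

// allow reports whether another visit fits in the budget.
// It is safe to call from multiple goroutines.
func (b *budget) allow() bool {
	return atomic.AddInt64(&b.used, 1) <= b.max
}

func main() {
	b := &budget{max: 100}
	var allowed int64
	var wg sync.WaitGroup

	// Simulate 250 concurrent requests competing for 100 slots
	for i := 0; i < 250; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if b.allow() {
				atomic.AddInt64(&allowed, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(allowed) // 100
}
```

Inside an OnRequest callback, the check would then read: if !b.allow() { r.Abort() }.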

In OnResponse, you have access to the entire HTML document, which could be useful in certain contexts:

c.OnResponse(func(r *colly.Response) {
    // r.Body is a []byte, so convert it to a string before printing
    fmt.Println(string(r.Body))
})

HTMLElement

In addition to the Attr() method and Text property that colly.HTMLElement has, we can also use it to traverse child elements. The ChildAttr(), ChildText(), and ForEach() methods in particular are quite useful.

For example, we can use ChildText() to get the text of all the paragraphs in a section:

c.OnHTML("#myCoolSection", func(e *colly.HTMLElement) {
    fmt.Println(e.ChildText("p"))
})

And we can use ForEach() to iterate over an element’s children that match a specific selector:

c.OnHTML("#myCoolSection", func(e *colly.HTMLElement) {
    e.ForEach("p", func(_ int, elem *colly.HTMLElement) {
        if strings.Contains(elem.Text, "golang") {
            fmt.Println(elem.Text)
        }    
    })
})

Bringing in Goquery

Colly’s built-in HTMLElement is useful for most scraping tasks, but if we want to do particularly advanced traversals of a DOM, we’ll have to look elsewhere. For example, there’s no way (currently) to traverse up the DOM to parent elements or traverse laterally through sibling elements.

Enter Goquery, “like that j-thing, only in Go”. It’s basically jQuery. In Go. (Which is awesome.) Almost anything you’d like to scrape from an HTML document can probably be done using Goquery.

While Goquery is modeled off of jQuery, I found it to be pretty similar in many respects to the BeautifulSoup API. So, if you’re coming from the Python scraping world, then you’ll probably find yourself comfortable with Goquery.

Goquery allows us to do more complicated HTML selections and DOM traversals than Colly’s HTMLElement affords. For example, we may want to find the sibling elements of our anchor element, to get some context around the link we’ve scraped:

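// htmlData is assumed to be a string holding the page's HTML
dom, err := goquery.NewDocumentFromReader(strings.NewReader(htmlData))
if err != nil {
    log.Fatal(err)
}
dom.Find("a").Siblings().Each(func(i int, s *goquery.Selection) {
    fmt.Printf("%d, Sibling text: %s\n", i, s.Text())
})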

Also, we can easily find the parent of a selected element. This might be useful if we’re given an anchor tag from Colly, and we want to find the content of the page’s <title> tag:

anchor.ParentsUntil("~").Find("title").Text()

ParentsUntil traverses up the DOM until it finds something that matches the passed selector. We can use ~ to traverse all the way up to the top of the DOM, which then allows us to easily grab the title tag.

This is really just scratching the surface of what Goquery can do. So far, we’ve seen examples of DOM traversal, but Goquery also has robust support for DOM manipulation — editing text, adding/removing classes or properties, inserting/removing HTML elements, etc.

Bringing it back to web scraping, how do we use Goquery with Colly? It’s straightforward: each Colly HTMLElement contains a Goquery selection, which you can access through the DOM property.

c.OnHTML("div", func(e *colly.HTMLElement) {
    // Goquery selection of the HTMLElement is in e.DOM
    goquerySelection := e.DOM

    // Example Goquery usage
    fmt.Println(goquerySelection.Find("span").Children().Text())
})

It’s worth noting that most scraping tasks can be framed in such a way that you don’t need to use Goquery! Simply add an OnHTML callback for html, and you can get access to the entire page that way. However, I still found that Goquery was a nice addition to my DOM traversal toolbelt.

Writing a full web crawler

Using Colly and Goquery [1], with all the pieces explored above, we can write a simple web crawler that scrapes Emojipedia for emoji descriptions.

package main

import (
    "fmt"
    "strings"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("emojipedia.org"),
    )

    // Callback for when a scraped page contains an article element
    c.OnHTML("article", func(e *colly.HTMLElement) {
        isEmojiPage := false

        // Extract meta tags from the document
        metaTags := e.DOM.ParentsUntil("~").Find("meta")
        metaTags.Each(func(_ int, s *goquery.Selection) {
            // Search for og:type meta tags
            property, _ := s.Attr("property")
            if strings.EqualFold(property, "og:type") {
                content, _ := s.Attr("content")

                // Emoji pages have "article" as their og:type
                isEmojiPage = strings.EqualFold(content, "article")
            }
        })

        if isEmojiPage {
            // Find the emoji page title
            fmt.Println("Emoji: ", e.DOM.Find("h1").Text())
            // Grab all the text from the emoji's description
            fmt.Println(
                "Description: ",
                e.DOM.Find(".description").Find("p").Text())
        }
    })

    // Callback for links on scraped pages
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // Extract the linked URL from the anchor tag
        link := e.Attr("href")
        // Have our crawler visit the linked URL
        c.Visit(e.Request.AbsoluteURL(link))
    })

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        RandomDelay: 1 * time.Second,
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    c.Visit("https://emojipedia.org")
}

Full code can be found here on GitHub.

And that’s it! After compiling and running, you’ll see the crawler visit a number of pages and print out emoji names and descriptions whenever it stumbles onto an emoji page.

Clearly, this is just the beginning. One could easily save this data in a graph structure, or expose a web parser/scraper as a distinct package for a site that doesn’t have a public API.
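As one possible take on the "graph structure" idea, here is a minimal adjacency-set sketch for recording which page links to which. The LinkGraph type, its methods, and the short paths in main are all made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// LinkGraph is a hypothetical adjacency-set structure:
// page URL -> set of URLs that page links to.
type LinkGraph map[string]map[string]struct{}

// AddEdge records that page `from` links to page `to`.
// Adding the same edge twice has no effect.
func (g LinkGraph) AddEdge(from, to string) {
	if g[from] == nil {
		g[from] = make(map[string]struct{})
	}
	g[from][to] = struct{}{}
}

// Outbound returns the pages linked from `from`, sorted for stable output.
func (g LinkGraph) Outbound(from string) []string {
	out := make([]string, 0, len(g[from]))
	for to := range g[from] {
		out = append(out, to)
	}
	sort.Strings(out)
	return out
}

func main() {
	g := LinkGraph{}
	g.AddEdge("/", "/b")
	g.AddEdge("/", "/a")
	g.AddEdge("/", "/b") // duplicate edge is ignored
	fmt.Println(g.Outbound("/")) // [/a /b]
}
```

In the a[href] callback, an edge could be recorded with g.AddEdge(e.Request.URL.String(), e.Request.AbsoluteURL(link)). If the collector runs asynchronously, the map would need a mutex around it.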

The nice thing about Colly is that it scales with your use case. On the more advanced end of the spectrum, it supports using Redis as a backend for holding queued pages, parallel scraping, and the use of multiple collectors running simultaneously.

Where to Go from here?

The Colly documentation website is a great resource, and has tons of practical examples. Colly’s Godoc and Goquery’s Godoc are also good places to look.

You may also find gocrawl to be worth looking into. It’s written by the same developer that created Goquery.

Footnotes


  1. I got a very nice followup from the creator of Colly, asciimoo, who pointed out that this example could be done entirely with Colly’s HTMLElement. Below is his implementation of the emoji page scraper that works without calling out to Goquery:

    c.OnHTML("html", func(e *colly.HTMLElement) {
        if strings.EqualFold(e.ChildAttr(`meta[property="og:type"]`, "content"), "article") {
            // Find the emoji page title
            fmt.Println("Emoji: ", e.ChildText("article h1"))
            // Grab all the text from the emoji's description
            fmt.Println("Description: ", e.ChildText("article .description p"))
        }
    })