Writing a Crawler

June 28, 2019

First of all, this is going to be a long text, as I write down each step of my thought process while building this simple crawler. If you are in a hurry, just jump to the code.

Why write a crawler? WHY NOT?

So basically what I want is an app that takes an initial URL and returns a sitemap drawn in ASCII characters:

By the way, I am gonna call this project: spidy

$ run spidy http://lewebpag.es
/ (index)
|
|------ /projects
|       |
|       |------ /agrade
|       |
|       |------ /prank
|
|------ /blog
        |
        |------ /hello-world
        |
        |------ /01-2019-report
        |
        |------ /how-to-not-be-productive
        |
        |------ /redux-thuk-utils

Or something that resembles this masterpiece of a sitemap.

In later project posts I want to reuse this code to do more things, like writing a tool to help me check for visual regressions on my projects, or mining some data from pages, like social media handles.

Project top-level view

Our code should do the following things:

  1. Get initial URL and site domain from cli argument
  2. Crawl into next non-visited URL
  3. Search for <a> elements

    • Get its href
    • Store the href URL if the path (relative or absolute) resides inside the same domain
  4. Get next non-visited URL
  5. Go to 2, unless there are no more non-visited URLs
  6. Draw sitemap based on visited URLs

Choosing our tools

I will use nodejs just out of laziness, but maybe in the future I will rewrite this in python or some other language.

There are plenty of ways of doing HTTP requests in node: vanilla nodejs HTTP, axios, node-fetch, request, etc. Today I am not feeling crazy enough to use vanilla HTTP and write my own wrapper around it to easily make requests, so I am going down the request route.
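
Just to illustrate what I am being lazy about: this is roughly what a bare-bones GET wrapper over the built-in https module could look like (a sketch only, not the code we will actually use, and it handles neither redirects nor plain http:// URLs):

const https = require("https");

// hand-rolled GET: collect the response body chunk by chunk and resolve it as a string
function getPage(url) {
  return new Promise((resolve, reject) => {
    https
      .get(url, res => {
        let body = "";
        res.on("data", chunk => (body += chunk));
        res.on("end", () => resolve(body));
      })
      .on("error", reject);
  });
}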

I will use a regex for the <a> elements because I don't need to traverse the DOM, I just need to find those precious hrefs. Just as a note: if you need to do more than find some strings in an HTML text, DO NOT USE REGEX, you will thank me later. (HTML is a structured "programming" language, not just a bunch of random text to pattern match. Use a parser!)
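
For reference, the parser route could look something like this with a library such as cheerio (not what we are going to use here, and getHrefsWithParser is just a name I made up):

const cheerio = require("cheerio");

function getHrefsWithParser(html) {
  // load the HTML into a queryable DOM-like structure
  const $ = cheerio.load(html);

  // select every <a> that has an href attribute and collect the attribute values
  return $("a[href]")
    .map((i, el) => $(el).attr("href"))
    .get();
}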

Setting it up

We are going to use the npm package.json structure to organize our project. If you don't have node and npm installed and set up, please check this out before continuing.

I am using Linux Mint, so use your brain cells to adapt anything you see here to your preferred OS.

cd workspaces/
mkdir spidy
cd spidy/
npm init -y

Show me the code already!

Slow down, we are going to start from our top-level abstract view and walk each step down into the implementation. This is a good way to get a grasp of the entire project.

Writing gibberish

First, let's write some gibberish code and polish it later.

// 1. Get initial URL and site domain from cli argument
const domain = arguments.domain;

// 2. Crawl into next non-visited URL
let page = crawlUrl(domain);
let hrefs = getHrefsFromPage(page);
// 3. Search for <a> elements
//     * Get its href
//     * Store the href URL, if the path, relative or absolute, resides inside the same domain
let containedUrls = hrefs.filter(url => isPathInsideDomain(url));

// 4. Get next non-visited URL
let newUrl = getNextUrlToVisit();
// 5. Got to 2, unless there are no more non-visited URLs
loop;
// 6. Draw sitemap based on visited URLs
drawMap(sitemap);

Writing the functions that we found

Although this code is utterly useless in this state, we can already see some of its parts, so let's now write down those functions.

Some of the functions might be refactored later if needed.

function crawlUrl(url) {
  // request URL
  // return page
}

// change function name
function getHrefsFromHtml(html) {
  // match a regexp
  // for each match extract href
  // return hrefs
}

function isPathAbsolute(url) {
  // does the URL start with http:// or https://?
}

// add domain as an argument so we don't rely on globals
function isPathInsideDomain(url, domain) {}

function getNextUrlToVisit() {
  // check list of toVisit URLs and return first non-visited URL
}

function drawMap(sitemap) {
  // for each route
  // draw route
  // drawMap (subRoutes)
}

Filling the gaps

Let's fill above functions one by one:

crawlUrl

As I said before, we are going to use the request library (through its promise wrapper, request-promise-native):

npm install request-promise-native@1.0.7

const request = require("request-promise-native");

async function crawlUrl(url) {
  return request.get(url);
}
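
One thing to keep in mind: this promise rejects on network errors (and, with request-promise-native's default settings, on non-2xx responses too), so a caller might want to wrap it. A minimal sketch of such a guard (safeCrawlUrl and the empty-string fallback are just my own assumptions):

async function safeCrawlUrl(url) {
  try {
    return await crawlUrl(url);
  } catch (error) {
    console.error("Failed to fetch " + url + ": " + error.message);
    return ""; // an empty page yields no hrefs, so the crawl simply moves on
  }
}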

getHrefsFromHtml

We are going to use a regex to match every anchor element that has an href attribute and capture only the href part.

// match <a ...> tags and capture the quoted href value
const anchorRegex = /<a(?:.|\n)+?href=('.*?'|".*?")/gi;

function getHrefsFromHtml(html) {
  let matches;
  let hrefs = [];

  while ((matches = anchorRegex.exec(html)) !== null) {
    let href = matches[1];
    href = href.replace(/("|')/g, "");

    hrefs.push(href);
  }

  return hrefs;
}
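
For example, feeding it a small made-up snippet of HTML gives us back only the href values:

let html =
  '<p><a href="/blog">Blog</a> and <a class="ext" href="https://twitter.com/me">me</a></p>';

getHrefsFromHtml(html);
// => ["/blog", "https://twitter.com/me"]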

isPathAbsolute

function isPathAbsolute(url) {
  return url.startsWith("http://") || url.startsWith("https://");
}

isPathInsideDomain

function isPathInsideDomain(url, domain) {
  // if path is absolute
  if (isPathAbsolute(url)) {
    return url.startsWith(domain);
  }

  return true;
}
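
In plain words: absolute URLs only count when they start with our domain, and relative paths are assumed to live inside it. A few made-up examples:

isPathInsideDomain("https://example.com/blog", "https://example.com"); // true
isPathInsideDomain("https://twitter.com/me", "https://example.com"); // false
isPathInsideDomain("/projects", "https://example.com"); // true (relative path)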

"main"

We will need a main async function to call; otherwise, we wouldn't be able to use the async/await syntax inside it.

let domain = process.argv[2];

console.log(domain);
// Verify if the domain exists and it is an absolute path
if (!domain || !isPathAbsolute(domain)) {
  console.error("Provide a valid initial absolute path");
  process.exit(1);
}

async function crawl() {
  // Set vector of URLs to visit, starting with the top level website
  let toVisit = [domain];

  // Set vector of already visited websites
  let visited = [];

  // While there are URLs to visit...
  while (toVisit.length) {
    // Visit first URL in vector
    let currentUrl = toVisit.shift();

    // Await page html
    let pageHtml = await crawlUrl(currentUrl);

    // Mark URL as visited
    visited.push(currentUrl);

    // Get <a>s href
    let hrefs = getHrefsFromHtml(pageHtml);

    let nonVisitedUrls = hrefs
      // filter only URLs that are inside the domain
      .filter(url => isPathInsideDomain(url, domain))
      // remap all URLs as absolute paths
      .map(url => (url.startsWith(domain) ? url : domain + url))
      // remove last '/' from URL to not visit 'page' and 'page/'
      .map(url => (url.endsWith("/") ? url.slice(0, -1) : url))
      // filter already visited pages
      .filter(url => !visited.includes(url))
      // filter pages that are already marked to be visited
      .filter(url => !toVisit.includes(url));

    // merge toVisit URLs and new nonVisitedUrls
    toVisit = [...toVisit, ...nonVisitedUrls];
  }

  drawMap(visited);
}

crawl();

drawMap

function drawMap(urls, start = domain, tab = 0) {
  let parentUrls = urls
    .map(url => url.replace(new RegExp("^" + start), ""))
    .filter(url => /^\/?[^\/]*$/.test(url));

  if (tab === 0) {
    parentUrls.shift();
    console.log("/ (index)");
  }

  parentUrls.forEach(parent => {
    let prefix = "|   ".repeat(tab);
    console.log(prefix + "|");
    console.log(prefix + "|---- " + parent);

    let childrenUrls = urls.filter(url => url.startsWith(start + parent));

    drawMap(childrenUrls, start + parent + "/", tab + 1);
  });
}
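
To get a feel for the recursion, here is a made-up list of visited URLs and what the function prints for it (example.com is just a placeholder):

let visited = [
  "https://example.com",
  "https://example.com/blog",
  "https://example.com/blog/hello-world",
  "https://example.com/projects"
];

drawMap(visited, "https://example.com");
// / (index)
// |
// |---- /blog
// |   |
// |   |---- hello-world
// |
// |---- /projects

Note that nested routes lose their leading slash, since each recursive call appends a trailing "/" to start before stripping it from the URLs.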

Running

Now, to see our script working, we just need to run:

node spidy.js https://marceloll.com

It will take some time.

The code here is just a very simple crawler, so it isn't optimized, but it does the job.

The entire code

spidy.js

package.json

Conclusion

We now have a simple tool to crawl a webpage and its internal links, but it still lacks a few features:

  • Optimization (do we really need to make only one request at a time? See the sketch below for one idea)
  • Extensibility (for example, call a function on each page)
  • A flag to recognize subdomains as internal URLs
  • Accept localhost as a URL

Try to add these features on your own.
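
If you want a head start on the first item, one possible direction (just a sketch, I have not wired it into the rest of the code) is to fetch the whole toVisit batch concurrently instead of one page at a time:

// inside crawl(), the body of the while loop could become something like this
while (toVisit.length) {
  let batch = toVisit;
  toVisit = [];

  // fire all requests at once and wait for every page of the batch
  let pages = await Promise.all(batch.map(url => crawlUrl(url)));

  batch.forEach((currentUrl, i) => {
    visited.push(currentUrl);

    let nonVisitedUrls = getHrefsFromHtml(pages[i])
      .filter(url => isPathInsideDomain(url, domain))
      .map(url => (url.startsWith(domain) ? url : domain + url))
      .map(url => (url.endsWith("/") ? url.slice(0, -1) : url))
      .filter(url => !visited.includes(url))
      .filter(url => !toVisit.includes(url));

    toVisit = [...toVisit, ...nonVisitedUrls];
  });
}

Keep in mind that a single failed request would reject the whole Promise.all, so some error handling (like the safeCrawlUrl wrapper sketched earlier) would still be needed.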

Have you found a bug in the code?

Do you have suggestions to improve this tutorial?

Or do you just think my code sucks and want to tell me why?

Go ahead and use the comments section for that!
