devdocs

Table of contents:

- Overview
- Configuration
  - Attributes
  - Filter stacks
  - Filter options
- Keeping scrapers up-to-date

Overview

Starting from a root URL, scrapers recursively follow links that match a set of rules, passing each valid response through a chain of filters before writing the file to the local filesystem. They also build an index of the pages’ metadata (determined by one filter), which is dumped into a JSON file at the end.
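The loop described above can be sketched as follows. This is a self-contained toy, not the actual devdocs API: the `Scraper` class, the `fetch` callback, and the filter lambdas are all hypothetical stand-ins.

```ruby
require 'set'

# Toy sketch of the scrape loop: follow links from a root URL, run each
# valid response through the filter chain, and record the result in an
# index. Hypothetical stand-in, not devdocs' actual Scraper class.
class Scraper
  def initialize(root_url, filters)
    @root_url = root_url
    @filters = filters # callables; each hands its output to the next
    @index = {}        # page results, dumped to JSON at the end
  end

  # `fetch` maps a URL to [body, links]; body is nil for invalid responses.
  def run(fetch)
    queue = [@root_url]
    seen = Set.new
    until queue.empty?
      url = queue.shift
      next unless seen.add?(url) # each URL is requested only once
      body, links = fetch.call(url)
      next if body.nil?
      @index[url] = @filters.reduce(body) { |doc, f| f.call(doc) }
      queue.concat(links)
    end
    @index
  end
end
```

A `fetch` lambda backed by a Hash of canned pages is enough to exercise the loop in isolation.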

Scrapers rely on the following libraries:

- Typhoeus for making HTTP requests
- HTML::Pipeline for applying filters
- Nokogiri for parsing HTML

There are currently two kinds of scrapers: UrlScraper which downloads files via HTTP and FileScraper which reads them from the local filesystem. They function almost identically (both use URLs), except that FileScraper substitutes the base URL with a local path before reading a file. FileScraper uses the placeholder localhost base URL by default and includes a filter to remove any URL pointing to it at the end.

To be processed, a response must meet the following requirements:

(FileScraper only checks if the file exists and is not empty.)

Each URL is requested only once (case-insensitive).
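That guarantee can be sketched with a set keyed on the lowercased URL. The `UrlRegistry` class below is a minimal illustration, not devdocs' actual implementation.

```ruby
require 'set'

# Sketch of case-insensitive URL deduplication: a URL is processed only
# if its lowercased form has not been seen before. Hypothetical class.
class UrlRegistry
  def initialize
    @seen = Set.new
  end

  # Returns true the first time a URL (ignoring case) is offered.
  def request?(url)
    @seen.add?(url.downcase) ? true : false
  end
end
```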

Configuration

Configuration is done via class attributes and divided into three main categories:

- Attributes
- Filter stacks
- Filter options

Note: scrapers are located in the lib/docs/scrapers directory. The class’s name must be the CamelCase equivalent of the filename.

Attributes

Filter stacks

Each scraper has two filter stacks: html_filters and text_filters. They are combined into a pipeline (using the HTML::Pipeline library) which causes each filter to hand its output to the next filter’s input.

HTML filters are executed first and manipulate a parsed version of the document (a Nokogiri node object), whereas text filters manipulate the document as a string. This separation avoids parsing the document multiple times.
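A minimal sketch of that two-phase design, using a trivial stand-in node instead of a real Nokogiri object (the `Node` struct and `run_pipeline` method are illustrative assumptions):

```ruby
# Stand-in for a parsed node; a real pipeline would use a Nokogiri object.
Node = Struct.new(:name, :content) do
  def to_html
    "<#{name}>#{content}</#{name}>"
  end
end

# HTML filters receive the parsed node; the document is then serialized
# exactly once, and text filters receive the resulting string.
def run_pipeline(html_filters, text_filters, node)
  html_filters.each { |f| node = f.call(node) }
  text = node.to_html # single parse/serialize boundary
  text_filters.each { |f| text = f.call(text) }
  text
end
```

Because serialization happens only at the boundary between the two stacks, the document is never re-parsed between filters.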

Filter stacks are like sorted sets. They can be modified using the following methods:

push(*names)                 # append one or more filters at the end
insert_before(index, *names) # insert one or more filters before another (index can be a name)
insert_after(index, *names)  # insert one or more filters after another (index can be a name)
replace(index, name)         # replace one filter with another (index can be a name)

“names” are require paths relative to Docs (e.g. jquery/clean_html → Docs::Jquery::CleanHtml).
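The semantics above can be sketched as an ordered, duplicate-free list of names. `FilterStack` here is a toy written from the method descriptions, not devdocs' actual class.

```ruby
# Sketch of a filter stack: an ordered set of require-path names with
# the four modification methods described above. Hypothetical class.
class FilterStack
  attr_reader :names

  def initialize(names = [])
    @names = names.uniq
  end

  def push(*names)
    names.each { |n| @names << n unless @names.include?(n) }
    self
  end

  def insert_before(index, *names)
    @names.insert(position(index), *names)
    self
  end

  def insert_after(index, *names)
    @names.insert(position(index) + 1, *names)
    self
  end

  def replace(index, name)
    @names[position(index)] = name
    self
  end

  private

  # index can be an integer position or an existing filter's name
  def position(index)
    index.is_a?(Integer) ? index : @names.index(index)
  end
end
```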

Default html_filters:

Default text_filters:

Additionally:

Filter options

The filter options are stored in the options Hash. The Hash is empty by default and inheritable: each subclass receives a recursive copy of its parent’s Hash.
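The inheritable behavior can be sketched with Ruby's `inherited` hook and a Marshal-based recursive copy. The `Doc` base class below is a hypothetical stand-in; devdocs' actual implementation may differ.

```ruby
# Sketch of an inheritable options Hash: each subclass receives a
# recursive (deep) copy of its parent's Hash, so mutating nested values
# in a subclass never affects the parent. Hypothetical base class.
class Doc
  class << self
    def options
      @options ||= {}
    end

    def inherited(subclass)
      super
      # Marshal round-trip duplicates nested hashes and arrays too.
      subclass.instance_variable_set(:@options, Marshal.load(Marshal.dump(options)))
    end
  end
end
```

A shallow `dup` would not be enough here: the parent and child would share nested arrays, and a child's `<<` would leak into the parent.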

More information about how filters work is available on the Filter Reference page.

Keeping scrapers up-to-date

In order to keep scrapers up-to-date, the get_latest_version(opts) method should be overridden. If self.release is defined, this method should return the latest version of the documentation. If self.release is not defined, it should return the Epoch time at which the documentation was last modified. If the documentation will never change, simply return 1.0.0. The result of this method is periodically reported in a “Documentation versions report” issue, which helps maintainers keep track of outdated documentation.
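A hedged sketch of the two branches of that contract, wrapped in a plain object so it is runnable on its own. `ReleaseChecker`, its constructor arguments, and the version regex are illustrative assumptions; real scrapers override get_latest_version on the scraper class itself.

```ruby
# Sketch of the get_latest_version contract: return a version string
# when a release is defined, otherwise the Unix epoch time of the last
# modification. Hypothetical class, not devdocs' API.
class ReleaseChecker
  def initialize(release: nil, last_modified: nil)
    @release = release
    @last_modified = last_modified
  end

  def get_latest_version(page_body)
    if @release
      # Extract a semver-looking string from the page; regex is illustrative.
      page_body[/v(\d+\.\d+\.\d+)/, 1]
    else
      @last_modified.to_i
    end
  end
end
```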

To make life easier, there are a few utility methods that you can use in get_latest_version: