devdocs

Overview

Filters use the HTML::Pipeline library. They take an HTML string or Nokogiri node as input, optionally modify it or extract information from it, and output the result. Together they form a pipeline in which each filter hands its output to the next filter's input. Every documentation page passes through this pipeline before being copied to the local filesystem.
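This hand-off can be sketched in plain Ruby (a hypothetical MiniPipeline and two lambda filters for illustration; the real pipeline is built by HTML::Pipeline):

```ruby
# Illustration only: each filter's output becomes the next filter's input.
class MiniPipeline
  def initialize(filters)
    @filters = filters
  end

  def call(input)
    @filters.reduce(input) { |output, filter| filter.call(output) }
  end
end

strip_whitespace = ->(html) { html.strip }
remove_rules     = ->(html) { html.gsub('<hr>', '') }

pipeline = MiniPipeline.new([strip_whitespace, remove_rules])
pipeline.call("  <p>Docs</p><hr>  ") # => "<p>Docs</p>"
```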

Filters are subclasses of the Docs::Filter class and require a call method. A basic implementation looks like this:

module Docs
  class CustomFilter < Filter
    def call
      doc
    end
  end
end

Filters that manipulate the Nokogiri node object (doc and related methods) are HTML filters and must not manipulate the HTML string (html). Conversely, filters that manipulate the string representation of the document are text filters and must not manipulate the Nokogiri node object. The two types are divided into two stacks within the scrapers. These stacks are then combined into a single pipeline that runs the HTML filters before the text filters, so that the document is parsed only once.

The call method must return either doc or html, depending on the type of filter.
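As a sketch of that contract (simplified stand-ins, not the real Docs::Filter classes): an HTML filter manipulates and returns the node object, while a text filter transforms and returns the string:

```ruby
# Simplified stand-ins for the two filter types. A real HTML filter
# works on a Nokogiri node; a Hash stands in for it here.
class FakeHtmlFilter
  def initialize(doc)
    @doc = doc
  end

  def call
    @doc[:children].delete('hr') # manipulate the "node" in place
    @doc                         # HTML filters return doc
  end
end

class FakeTextFilter
  def initialize(html)
    @html = html
  end

  def call
    @html.gsub('<hr>', '') # text filters return the (new) html string
  end
end
```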

Instance methods

Docs::Filter provides a number of helper methods to its subclasses; the ones used in the examples on this page include doc, html, css, at_css, slug and root_page?.

Core filters

Custom filters

Scrapers can have any number of custom filters but require at least the two described below.

Note: filters are located in the lib/docs/filters directory. The class’s name must be the CamelCase equivalent of the filename.
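For example, the CustomFilter class from the Overview would live in a file named custom_filter.rb. The snake_case-to-CamelCase convention can be sketched with a small helper (illustrative only, not DevDocs' actual loading code):

```ruby
# Illustrative helper: derive the expected class name from a filter's
# filename under lib/docs/filters.
def expected_class_name(filename)
  File.basename(filename, '.rb').split('_').map(&:capitalize).join
end

expected_class_name('custom_filter.rb') # => "CustomFilter"
```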

CleanHtmlFilter

The CleanHtml filter is tasked with cleaning the HTML markup where necessary and removing anything superfluous or nonessential. Only the core documentation should remain at the end.

Nokogiri’s many jQuery-like methods make it easy to search and modify elements — see the API docs.

Here’s an example implementation that covers the most common use-cases:

module Docs
  class MyScraper
    class CleanHtmlFilter < Filter
      def call
        css('hr').remove
        css('#changelog').remove if root_page?

        # Set id attributes on <h3> instead of an empty <a>
        css('h3').each do |node|
          node['id'] = node.at_css('a')['id']
        end

        # Make proper table headers
        css('td.header').each do |node|
          node.name = 'th'
        end

        # Remove code highlighting: assigning content replaces the
        # element's children with a single plain-text node
        css('pre').each do |node|
          node.content = node.content
        end

        doc
      end
    end
  end
end


EntriesFilter

The Entries filter is responsible for extracting the page’s metadata, represented by a set of entries, each with a name, type and path.
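An entry can be pictured as a simple value object (a sketch with hypothetical values; the real Entry model lives in the DevDocs codebase):

```ruby
# Illustration only: the three attributes every entry must provide.
Entry = Struct.new(:name, :type, :path)

entry = Entry.new('map()', 'Array', 'array/map')
entry.to_a # => ["map()", "Array", "array/map"]
```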

Two models are used under the hood to represent this metadata.

Each scraper must implement its own EntriesFilter by subclassing the Docs::EntriesFilter class. The base class already implements the call method and provides four methods which subclasses can override: get_name, get_type, additional_entries and include_default_entry? (all four appear in the example below).

The following accessors are also available, but must not be overridden: name and type, which memoize the results of get_name and get_type (the example below reads type from inside get_name).


Example:

module Docs
  class MyScraper
    class EntriesFilter < Docs::EntriesFilter
      def get_name
        node = at_css('h1')
        result = node.content.strip
        result << ' event' if type == 'Events'
        result << '()' if node['class'].try(:include?, 'function')
        result
      end

      def get_type
        object, method = *slug.split('/')
        method ? object : 'Miscellaneous'
      end

      def additional_entries
        return [] if root_page?

        css('h2').map do |node|
          [node.content, node['id']]
        end
      end

      def include_default_entry?
        !at_css('.obsolete')
      end
    end
  end
end
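The get_type logic above splits the slug on "/" and falls back to 'Miscellaneous' for top-level pages. Extracted into a standalone method and applied to hypothetical slugs:

```ruby
# Standalone version of the get_type logic from the example above.
def type_for(slug)
  object, method = *slug.split('/')
  method ? object : 'Miscellaneous'
end

type_for('Array/map') # => "Array"
type_for('about')     # => "Miscellaneous"
```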
