The new HTML sanitizer in Rails 4.2

The article below was originally written by Kasper Timm Hansen (@kaspth on github & twitter) about his work during the Google Summer of Code 2013.

Kasper and I worked a lot changing the underlying implementation of the sanitize helper to give Rails developers a more robust, faster and secure solution to sanitize user input.

This new implementation should be fully backward compatible, with no changes to the API, which should make the update easier.

You can see more information about the previous and the new implementation on this talk I presented in a Brazillian conference this year (the slides are in English).

Now, I’ll let Kasper share his words with you.

Scrubbing Rails Free of HTML-scanner

Everyone, at least one time, has already needed to use the sanitize method to scrub some pesky HTML away.

<% sanitize @article.body %>

If you were to run this on Rails 4.1 (and before) this would take advantage of the html-scanner, a vendored library inside Rails, for the sanitization. Since the summer of 2013 I have been working to destroy that notion by wiping the traces of html-scanner throughout Rails. Before you become concerned of my mental health, I didn’t do this unwarranted. I’m one of the Google Summer of Code students working on Ruby on Rails. My project proposal was to kick html-scanner to the curb (technical term) and grab a hold of Loofah instead. Why did the old library need replacing, though?

The out washed HTML-scanner

html-scanner has been with us for a long time now. The copyright notice in the library clocks it in at 2006, when Assaf Arkin created it. This library relies on Regular Expressions to recognize HTML (and XML) elements. This made the code more brittle. It was easier to introduce errors via complex Regular Expressions, which also gave it a higher potential for security issues.

The Rails Team wanted something more robust and faster, so we picked Loofah. Loofah uses Nokogiri for parsing, which provides a Ruby interface to either a C or Java parser depending on the Ruby implementation you use. This means Loofah is fast. It’s up to 60 to 100% faster than html-scanner on larger documents and fragments.

I started by taking a look at the SanitizeHelper in Action View, which consists of four methods and some settings. The four methods of the are sanitize, sanitize_css, strip_tags and strip_links.

Let’s take a look at the sanitize method.

Comparing with the old implementation, sanitize still uses the WhiteListSanitizer class to do it’s HTML stripping. However, since Action View was pulled out of Action Pack and both needed to use this functionality, we’ve extracted this to it’s own gem.

Developers meet Rails::Html::WhiteListSanitizer

When you use sanitize, you’re really using WhiteListSanitizer‘s sanitize method. Let me show you the new version.

def sanitize(html, options = {})
  return nil unless html
  return html if html.empty?

No surprises here.

  loofah_fragment = Loofah.fragment(html)

The first trace of Loofah. A fragment is a part of a document, but without a DOCTYPE declaration and html and body tags. A piece of a document essentially. Internally Nokogiri creates a document and pulls the parsed html out of the body tag, leaving us with a fragment.

  if scrubber = options[:scrubber]
    # No duck typing, Loofah ensures subclass of Loofah::Scrubber
    loofah_fragment.scrub!(scrubber)

You can pass your own Scrubber to sanitize! Giving you the power to choose if and how elements are sanitized. As the comment alludes, any scrubber has to be either a subclass of Loofah::Scrubber or it can wrap a block. I’ll show an example later.

  elsif allowed_tags(options) || allowed_attributes(options)
    @permit_scrubber.tags = allowed_tags(options)
    @permit_scrubber.attributes = allowed_attributes(options)
    loofah_fragment.scrub!(@permit_scrubber)

We have been very keen on maintaining backwards compatibility throughout this project, so you can still supply Enumerables of tags and attributes to sanitize. That’s what the PermitScrubber used here handles. It manages these options and makes them work independently. If you pass one it’ll use the standard behavior for the other. See the documentation on what the standard behavior is.
You can also set the allowed tags and attributes on the class level. Like this:

Rails::Html::Sanitizer.allowed_tags = Set.new %w(for your health)

That’s simply what allowed_tags and allowed_attributes methods are there for. They’ll return the tags or attributes from the options and fallback to the class level setting if any.

  else
    remove_xpaths(loofah_fragment, XPATHS_TO_REMOVE)
    loofah_fragment.scrub!(:strip)
  end

The StripScrubber built in to Loofah will strip the tags but leave the contents of elements. Which is usually what we want. We use remove_xpaths to remove elements along with their subtrees in the few instances where we don’t. If you have trouble with the syntax above, they’re XPath Selectors.

  loofah_fragment.to_s
end

Lastly we’ll take the elements and extract the remaining markup with to_s. Internally Nokogiri will call either to_xml or to_html depending on the kind of document or fragment you have.

Rub, buff or clean it off, however you like

So there you have it. I could go through how the other sanitizers work, but they’re not that complex. So go code spelunking in the source.

If this was the first time you’ve seen a Loofah::Scrubber, be sure to check out the source for PermitScrubber and see an example of how to implement one. You can also subclass PermitScrubber and get the sanitization you need without worrying about the implementation details of stripping elements and scrubbing attributes. Take a look at TargetScrubber – the weird PermitScrubber – and how it uses that to get scrubbing fast.

Before I scrub off though, I promised you an example of a custom scrubber. I’ll use the option that wraps a block here, but you could easily create a subclass of Loofah::Scrubber (in a helper maybe?) and override scrub(node). So here goes:

<% sanitize @article.body scrubber: loofah::scrubber.new |node| node.name="script" %>

The code above changes all the HTML tags included in the article body to be a tag <script>.

<sarcasm>
If you’re going to introduce bugs, why not make everything a potential risk of running arbitrary code?
</sarcasm>


9 responses to “The new HTML sanitizer in Rails 4.2”

  1. Does this actually mean, that now rails will depend from native dependencies from the start?

  2. Micah Geisel says:

    Yeah, IIRC there was a policy in place for rails (and its dependencies) to be 100% Ruby. Has this changed?

  3. We do avoid adding native dependencies on the Rails framework but for his case we chose to use they because Nokogiri bundle the dependencies inside the gem and it also already has binary distributions to JRuby and Windows platforms.

    The easy to start / development workflow should not change.

  4. Isaac Andrade says:

    Is the change compatible with Rails 4.1?

  5. It should be. The Public API is the same and the result in most case are also the same.

  6. This seems a cool project. Maybe we will have an adapter for it on Rails later.

  7. Yorick says:

    Is this meant to be working in rails 4.2? I tried calling sanitize but i think it still calls the old deprecated sanitizer. If you look here (https://github.com/rails/rails-html-sanitizer/blob/master/lib/rails-html-sanitizer.rb#L33) it tries to override sanitizer_vendor but this method lives in the ClassMethods module so it has no effect.