{"id":4134,"date":"2014-07-24T09:00:08","date_gmt":"2014-07-24T12:00:08","guid":{"rendered":"http:\/\/blog.plataformatec.com.br\/?p=4134"},"modified":"2014-07-24T11:05:31","modified_gmt":"2014-07-24T14:05:31","slug":"the-new-html-sanitizer-in-rails-4-2","status":"publish","type":"post","link":"http:\/\/blog.plataformatec.com.br\/2014\/07\/the-new-html-sanitizer-in-rails-4-2\/","title":{"rendered":"The new HTML sanitizer in Rails 4.2"},"content":{"rendered":"

The article below was originally written by Kasper Timm Hansen (@kaspth on GitHub<\/a> & Twitter<\/a>) about his work during the Google Summer of Code 2013.<\/p>\n

Kasper and I worked a lot on changing the underlying implementation of the sanitize<\/code> helper to give Rails developers a more robust, faster and more secure solution for sanitizing user input.<\/p>\n

This new implementation should be fully backward compatible, with no changes to the API, which should make the update easier.<\/p>\n

You can see more information about the previous and the new implementation in this talk<\/a> I presented at a Brazilian conference this year (the slides are in English).<\/p>\n

Now, I’ll let Kasper share his words with you.<\/p>\n

Scrubbing Rails Free of HTML-scanner<\/h3>\n

Everyone has, at least once, needed the sanitize<\/code><\/a> method to scrub some pesky HTML away.<\/p>\n

\n<%= sanitize @article.body %>\n<\/pre>\n

If you were to run this on Rails 4.1 or earlier, it would use html-scanner, a vendored library inside Rails, for the sanitization. Since the summer of 2013 I have been working to destroy that notion by wiping the traces of html-scanner throughout Rails. Before you become concerned about my mental health, I didn’t do this without reason. I’m one of the Google Summer of Code students<\/a> working on Ruby on Rails. My project proposal<\/a> was to kick html-scanner to the curb (technical term) and grab hold of Loofah<\/a> instead. Why did the old library need replacing, though?<\/p>\n

The washed-out HTML-scanner<\/h3>\n

html-scanner has been with us for a long time now. The copyright notice<\/a> in the library clocks it in at 2006, when Assaf Arkin created it. This library relies on Regular Expressions<\/a> to recognize HTML (and XML) elements. This made the code more brittle. It was easier to introduce errors via complex Regular Expressions<\/a>, which also gave it a higher potential for security issues.<\/p>\n

The Rails Team wanted something more robust and faster, so we picked Loofah. Loofah uses Nokogiri<\/a> for parsing, which provides a Ruby interface to either a C or Java parser depending on the Ruby implementation you use. This means Loofah is fast. It’s 60 to 100% faster<\/a> than html-scanner on larger documents and fragments.<\/p>\n

I started by taking a look at the SanitizeHelper<\/code><\/a> in Action View, which consists of four methods and some settings. The four methods are sanitize<\/code>, sanitize_css<\/code>, strip_tags<\/code> and strip_links<\/code>.<\/p>\n

Let’s take a look at the sanitize<\/code> method.<\/p>\n

Compared with the old implementation, sanitize<\/code> still uses the WhiteListSanitizer<\/code> class to do its HTML stripping. However, since Action View was pulled out of Action Pack and both needed to use this functionality, we’ve extracted this to its own gem<\/a>.<\/p>\n

Developers meet Rails::Html::WhiteListSanitizer<\/h3>\n

When you use sanitize<\/code>, you’re really using WhiteListSanitizer<\/code>‘s sanitize<\/code><\/a> method. Let me show you the new version.<\/p>\n

\ndef sanitize(html, options = {})\n  return nil unless html\n  return html if html.empty?\n<\/pre>\n

No surprises here.<\/p>\n

\n  loofah_fragment = Loofah.fragment(html)\n<\/pre>\n

The first trace of Loofah. A fragment<\/a> is a part of a document, but without a DOCTYPE declaration and html and body tags. A piece of a document essentially. Internally Nokogiri creates a document and pulls the parsed html out of the body tag, leaving us with a fragment.<\/p>\n

\n  if scrubber = options[:scrubber]\n    # No duck typing, Loofah ensures subclass of Loofah::Scrubber\n    loofah_fragment.scrub!(scrubber)\n<\/pre>\n

You can pass your own Scrubber<\/code><\/a> to sanitize<\/code>, giving you the power to choose if and how elements are sanitized. As the comment alludes, any scrubber has to be either a subclass of Loofah::Scrubber<\/code> or wrap a block. I’ll show an example later.<\/p>\n

\n  elsif allowed_tags(options) || allowed_attributes(options)\n    @permit_scrubber.tags = allowed_tags(options)\n    @permit_scrubber.attributes = allowed_attributes(options)\n    loofah_fragment.scrub!(@permit_scrubber)\n<\/pre>\n

We have been very keen on maintaining backwards compatibility throughout this project, so you can still supply Enumerable<\/code>s of tags and attributes to sanitize<\/code>. That’s what the PermitScrubber<\/code><\/a> used here handles. It manages these options and makes them work independently. If you pass one, it’ll use the standard behavior for the other. See the documentation<\/a> for what the standard behavior is.
\nYou can also set the allowed tags and attributes at the class level, like this:<\/p>\n

\nRails::Html::Sanitizer.allowed_tags = Set.new %w(for your health)\n<\/pre>\n

That’s simply what the allowed_tags<\/code> and allowed_attributes<\/code> methods are there for. They return the tags or attributes from the options and fall back to the class-level setting, if any.<\/p>\n
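The lookup order described above can be sketched in plain Ruby. This is an illustration only: SketchSanitizer is a hypothetical stand-in, not the gem's class, and it models nothing beyond the fallback from per-call options to the class-level setting.

```ruby
require 'set'

# Hypothetical stand-in for the sanitizer, showing only the option lookup.
class SketchSanitizer
  class << self
    # Class-level defaults; nil means nothing has been configured.
    attr_accessor :allowed_tags, :allowed_attributes
  end

  # Per-call options win; otherwise fall back to the class-level setting.
  def allowed_tags(options)
    options[:tags] || self.class.allowed_tags
  end

  def allowed_attributes(options)
    options[:attributes] || self.class.allowed_attributes
  end
end

SketchSanitizer.allowed_tags = Set.new(%w[strong em])
sanitizer = SketchSanitizer.new

per_call   = sanitizer.allowed_tags(tags: %w[a]) # taken from the options
from_class = sanitizer.allowed_tags({})          # falls back to the class
unset      = sanitizer.allowed_attributes({})    # nothing configured anywhere
```

When both lookups come back empty, as with unset here, the condition in the elsif branch is falsy and the method moves on to its else branch.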

\n  else\n    remove_xpaths(loofah_fragment, XPATHS_TO_REMOVE)\n    loofah_fragment.scrub!(:strip)\n  end\n<\/pre>\n

The StripScrubber<\/code><\/a> built in to Loofah<\/code> will strip the tags but leave the contents of elements, which is usually what we want. We use remove_xpaths<\/code> to remove elements along with their subtrees in the few instances where we don’t. If the syntax of those paths looks unfamiliar, they’re XPath selectors<\/a>.<\/p>\n

\n  loofah_fragment.to_s\nend\n<\/pre>\n

Lastly we’ll take the elements and extract the remaining markup with to_s<\/code>. Internally Nokogiri will call either to_xml<\/code> or to_html<\/code><\/a> depending on the kind of document or fragment you have.<\/p>\n

Rub, buff or clean it off, however you like<\/h3>\n

So there you have it. I could go through how the other sanitizers work, but they’re not that complex. So go code spelunking in the source<\/a>.<\/p>\n

If this was the first time you’ve seen a Loofah::Scrubber<\/code>, be sure to check out the source<\/a> for PermitScrubber<\/code> and see an example of how to implement one. You can also subclass PermitScrubber<\/code> and get the sanitization you need without worrying about the implementation details of stripping elements and scrubbing attributes. Take a look at TargetScrubber<\/code> – the weird PermitScrubber<\/code> – and how it uses that to get scrubbing fast.<\/p>\n

Before I scrub off though, I promised you an example of a custom scrubber. I’ll use the option that wraps a block here, but you could easily create a subclass of Loofah::Scrubber<\/code> (in a helper maybe?) and override scrub(node)<\/code><\/a>. So here goes:<\/p>\n

\n<%= sanitize @article.body,\n  scrubber: Loofah::Scrubber.new { |node| node.name = \"script\" } %>\n<\/pre>\n

The code above turns every HTML tag in the article body into a <script><\/code> tag.<\/p>\n

<sarcasm><\/code>
\nIf you’re going to introduce bugs, why not make everything a potential risk of running arbitrary code?
\n<\/sarcasm><\/code><\/p>\n

\n
\n<\/p>\n","protected":false},"excerpt":{"rendered":"

The article below was originally written by Kasper Timm Hansen (@kaspth on github & twitter) about his work during the Google Summer of Code 2013. Kasper and I worked a lot changing the underlying implementation of the sanitize helper to give Rails developers a more robust, faster and secure solution to sanitize user input. This … \u00bb<\/a><\/p>\n","protected":false},"author":15,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[1],"tags":[7],"aioseo_notices":[],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134"}],"collection":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/comments?post=4134"}],"version-history":[{"count":12,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134\/revisions"}],"predecessor-version":[{"id":4147,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134\/revisions\/4147"}],"wp:attachment":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/media?parent=4134"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/categories?post=4134"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/tags?post=4134"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}