{"id":4134,"date":"2014-07-24T09:00:08","date_gmt":"2014-07-24T12:00:08","guid":{"rendered":"http:\/\/blog.plataformatec.com.br\/?p=4134"},"modified":"2014-07-24T11:05:31","modified_gmt":"2014-07-24T14:05:31","slug":"the-new-html-sanitizer-in-rails-4-2","status":"publish","type":"post","link":"http:\/\/blog.plataformatec.com.br\/2014\/07\/the-new-html-sanitizer-in-rails-4-2\/","title":{"rendered":"The new HTML sanitizer in Rails 4.2"},"content":{"rendered":"<p>The article below was originally written by Kasper Timm Hansen (@kaspth on <a href=\"https:\/\/github.com\/kaspth\">GitHub<\/a> &amp; <a href=\"https:\/\/twitter.com\/kaspth\">Twitter<\/a>) about his work during the Google Summer of Code 2013.<\/p>\n<p>Kasper and I put a lot of work into changing the underlying implementation of the <code>sanitize<\/code> helper to give Rails developers a more robust, faster and more secure way to sanitize user input.<\/p>\n<p>This new implementation should be fully backward compatible, with no changes to the API, which should make upgrading easier.<\/p>\n<p>You can see more information about the previous and the new implementation <a href=\"https:\/\/speakerdeck.com\/rafaelfranca\/rails-the-hidden-parts\">in this talk<\/a>, which I presented at a Brazilian conference this year (the slides are in English).<\/p>\n<p>Now, I&#8217;ll let Kasper share his words with you.<\/p>\n<h3>Scrubbing Rails Free of HTML-scanner<\/h3>\n<p>Everyone has, at least once, needed to use the <a href=\"http:\/\/api.rubyonrails.org\/classes\/ActionView\/Helpers\/SanitizeHelper.html#method-i-sanitize\"><code>sanitize<\/code><\/a> method to scrub some pesky HTML away.<\/p>\n<pre lang=\"ruby\">\n<%= sanitize @article.body %>\n<\/pre>\n<p>If you ran this on Rails 4.1 or earlier, the sanitization would be done by html-scanner, a library vendored inside Rails. 
Since the summer of 2013 I have been working to destroy that notion by wiping every trace of html-scanner from Rails. Before you become concerned about my mental health: I didn&#8217;t do this without reason. I&#8217;m one of the <a href=\"http:\/\/weblog.rubyonrails.org\/2013\/5\/27\/rails-google-summer-of-code-projects\/\">Google Summer of Code students<\/a> working on Ruby on Rails. My <a href=\"https:\/\/github.com\/kaspth\/gsoc-application\">project proposal<\/a> was to kick html-scanner to the curb (technical term) and grab hold of <a href=\"https:\/\/github.com\/flavorjones\/loofah\">Loofah<\/a> instead. Why did the old library need replacing, though?<\/p>\n<h3>The washed-out HTML-scanner<\/h3>\n<p>html-scanner has been with us for a long time now. The <a href=\"https:\/\/github.com\/rails\/rails\/blob\/e8709aef56d46c9b597af05ebd847309231a888c\/actionview\/lib\/action_view\/vendor\/html-scanner\/html\/selector.rb#L1-L4\">copyright notice<\/a> in the library clocks it in at 2006, when Assaf Arkin created it. This library <a href=\"https:\/\/github.com\/rails\/rails\/blob\/e8709aef56d46c9b597af05ebd847309231a888c\/actionview\/lib\/action_view\/vendor\/html-scanner\/html\/tokenizer.rb\">relies on regular expressions<\/a> to recognize HTML (and XML) elements. That makes the code brittle: it is easy to introduce errors via <a href=\"https:\/\/github.com\/rails\/rails\/blob\/973490a2879358db9269ecf75e03b2777f9c9e24\/actionpack\/lib\/action_view\/vendor\/html-scanner\/html\/selector.rb#L522-L680\">complex regular expressions<\/a>, which also gives the library a higher potential for security issues.<\/p>\n<p>The Rails Team wanted something more robust and faster, so we picked Loofah. Loofah uses <a href=\"http:\/\/nokogiri.org\">Nokogiri<\/a> for parsing, which provides a Ruby interface to either a C or Java parser depending on the Ruby implementation you use. This means Loofah is fast. 
It can be <a href=\"https:\/\/gist.github.com\/flavorjones\/170193\">60 to 100% faster<\/a> than html-scanner on larger documents and fragments.<\/p>\n<p>I started by taking a look at the <a href=\"https:\/\/github.com\/rails\/rails\/blob\/e8709aef56d46c9b597af05ebd847309231a888c\/actionview\/lib\/action_view\/helpers\/sanitize_helper.rb\"><code>SanitizeHelper<\/code><\/a> in Action View, which consists of four methods and some settings. The four methods are <code>sanitize<\/code>, <code>sanitize_css<\/code>, <code>strip_tags<\/code> and <code>strip_links<\/code>.<\/p>\n<p>Let&#8217;s take a look at the <code>sanitize<\/code> method.<\/p>\n<p>Compared with the old implementation, <code>sanitize<\/code> still uses the <code>WhiteListSanitizer<\/code> class to do its HTML stripping. However, since Action View was pulled out of Action Pack and both needed to use this functionality, we&#8217;ve extracted this to its own <a href=\"https:\/\/github.com\/rafaelfranca\/rails-html-sanitizer\">gem<\/a>.<\/p>\n<h3>Developers, meet Rails::Html::WhiteListSanitizer<\/h3>\n<p>When you use <code>sanitize<\/code>, you&#8217;re really using <code>WhiteListSanitizer<\/code>&#8217;s <a href=\"https:\/\/github.com\/rails\/rails-html-sanitizer\/blob\/48c0f014c99c90124bd568940060d3fcebb788a6\/lib\/rails\/html\/sanitizer.rb\"><code>sanitize<\/code><\/a> method. Let me show you the new version.<\/p>\n<pre lang=\"ruby\">\ndef sanitize(html, options = {})\n  return nil unless html\n  return html if html.empty?\n<\/pre>\n<p>No surprises here.<\/p>\n<pre lang=\"ruby\">\n  loofah_fragment = Loofah.fragment(html)\n<\/pre>\n<p>The first trace of Loofah. A <a href=\"https:\/\/github.com\/flavorjones\/loofah\/blob\/51b1e38b81dae4707df181bf5167971d13d976ea\/lib\/loofah\/html\/document_fragment.rb\">fragment<\/a> is a part of a document, but without a DOCTYPE declaration and html and body tags. Essentially, a piece of a document. 
Internally, Nokogiri creates a document and pulls the parsed HTML out of the body tag, leaving us with a fragment.<\/p>\n<pre lang=\"ruby\">\n  if scrubber = options[:scrubber]\n    # No duck typing, Loofah ensures subclass of Loofah::Scrubber\n    loofah_fragment.scrub!(scrubber)\n<\/pre>\n<p>You can pass your own <a href=\"https:\/\/github.com\/flavorjones\/loofah\/blob\/51b1e38b81dae4707df181bf5167971d13d976ea\/lib\/loofah\/scrubber.rb\"><code>Scrubber<\/code><\/a> to <code>sanitize<\/code>, giving you the power to choose if and how elements are sanitized. As the comment alludes to, any scrubber has to be either an instance of a <code>Loofah::Scrubber<\/code> subclass or a <code>Loofah::Scrubber<\/code> that wraps a block. I&#8217;ll show an example later.<\/p>\n<pre lang=\"ruby\">\n  elsif allowed_tags(options) || allowed_attributes(options)\n    @permit_scrubber.tags = allowed_tags(options)\n    @permit_scrubber.attributes = allowed_attributes(options)\n    loofah_fragment.scrub!(@permit_scrubber)\n<\/pre>\n<p>We have been very keen on maintaining backwards compatibility throughout this project, so you can still supply <code>Enumerable<\/code>s of tags and attributes to <code>sanitize<\/code>. That&#8217;s what the <a href=\"https:\/\/github.com\/rails\/rails-html-sanitizer\/blob\/48c0f014c99c90124bd568940060d3fcebb788a6\/lib\/rails\/html\/scrubbers.rb\"><code>PermitScrubber<\/code><\/a> used here handles. It manages these options and makes them work independently. If you pass only one, it&#8217;ll use the standard behavior for the other. See the <a href=\"https:\/\/github.com\/rails\/rails-html-sanitizer\/blob\/48c0f014c99c90124bd568940060d3fcebb788a6\/lib\/rails\/html\/scrubbers.rb#L3-45\">documentation<\/a> for what the standard behavior is.<br \/>\nYou can also set the allowed tags and attributes on the class level. 
Like this:<\/p>\n<pre lang=\"ruby\">\nRails::Html::Sanitizer.allowed_tags = Set.new %w(for your health)\n<\/pre>\n<p>That&#8217;s simply what the <code>allowed_tags<\/code> and <code>allowed_attributes<\/code> methods are there for. They return the tags or attributes from the options and fall back to the class-level setting, if any.<\/p>\n<pre lang=\"ruby\">\n  else\n    remove_xpaths(loofah_fragment, XPATHS_TO_REMOVE)\n    loofah_fragment.scrub!(:strip)\n  end\n<\/pre>\n<p>The <a href=\"https:\/\/github.com\/flavorjones\/loofah\/blob\/51b1e38b81dae4707df181bf5167971d13d976ea\/lib\/loofah\/scrubbers.rb#L86-96\"><code>StripScrubber<\/code><\/a> built into <code>Loofah<\/code> will strip the tags but leave the contents of elements, which is usually what we want. We use <code>remove_xpaths<\/code> to remove elements along with their subtrees in the few instances where we don&#8217;t. If the entries in <code>XPATHS_TO_REMOVE<\/code> look unfamiliar, they&#8217;re <a href=\"https:\/\/en.wikipedia.org\/wiki\/XPath\">XPath selectors<\/a>.<\/p>\n<pre lang=\"ruby\">\n  loofah_fragment.to_s\nend\n<\/pre>\n<p>Lastly, we take the elements and extract the remaining markup with <code>to_s<\/code>. Internally, Nokogiri will call either <a href=\"https:\/\/github.com\/sparklemotion\/nokogiri\/blob\/5546170ab7bb789645d8e96ff0eb585d73748636\/lib\/nokogiri\/xml\/node.rb#L609-L614\"><code>to_xml<\/code> or <code>to_html<\/code><\/a> depending on the kind of document or fragment you have.<\/p>\n<h3>Rub, buff or clean it off, however you like<\/h3>\n<p>So there you have it. I could go through how the other sanitizers work, but they&#8217;re not that complex. 
So go code spelunking in the <a href=\"https:\/\/github.com\/rails\/rails-html-sanitizer\">source<\/a>.<\/p>\n<p>If this is the first time you&#8217;ve seen a <code>Loofah::Scrubber<\/code>, be sure to check out <a href=\"https:\/\/github.com\/rails\/rails-html-sanitizer\/blob\/48c0f014c99c90124bd568940060d3fcebb788a6\/lib\/rails\/html\/scrubbers.rb\">the source<\/a> for <code>PermitScrubber<\/code> to see an example of how to implement one. You can also subclass <code>PermitScrubber<\/code> and get the sanitization you need without worrying about the implementation details of stripping elements and scrubbing attributes. Take a look at <code>TargetScrubber<\/code> &#8211; the weird <code>PermitScrubber<\/code> &#8211; and how it uses that to get scrubbing fast.<\/p>\n<p>Before I scrub off, though, I promised you an example of a custom scrubber. I&#8217;ll use the option that wraps a block here, but you could just as easily create a subclass of <code>Loofah::Scrubber<\/code> (in a helper maybe?) and override <a href=\"https:\/\/github.com\/flavorjones\/loofah\/blob\/51b1e38b81dae4707df181bf5167971d13d976ea\/lib\/loofah\/scrubber.rb#L85-L87\"><code>scrub(node)<\/code><\/a>. 
So here goes:<\/p>\n<pre lang=\"ruby\">\n<%= sanitize @article.body,\n  scrubber: Loofah::Scrubber.new { |node| node.name = \"script\" } %>\n<\/pre>\n<p>The code above turns every HTML tag in the article body into a <code>&lt;script&gt;<\/code> tag.<\/p>\n<p><code>&lt;sarcasm&gt;<\/code><br \/>\nIf you&#8217;re going to introduce bugs, why not make everything a potential risk of running arbitrary code?<br \/>\n<code>&lt;\/sarcasm&gt;<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The article below was originally written by Kasper Timm Hansen (@kaspth on github &amp; twitter) about his work during the Google Summer of Code 2013. Kasper and I put a lot of work into changing the underlying implementation of the sanitize helper to give Rails developers a more robust, faster and more secure way to sanitize user input. 
This &#8230; <a class=\"read-more-link\" href=\"http:\/\/blog.plataformatec.com.br\/2014\/07\/the-new-html-sanitizer-in-rails-4-2\/\">\u00bb<\/a><\/p>\n","protected":false},"author":15,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[1],"tags":[7],"aioseo_notices":[],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134"}],"collection":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/comments?post=4134"}],"version-history":[{"count":12,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134\/revisions"}],"predecessor-version":[{"id":4147,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134\/revisions\/4147"}],"wp:attachment":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/media?parent=4134"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/categories?post=4134"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/tags?post=4134"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}