{"id":4134,"date":"2014-07-24T09:00:08","date_gmt":"2014-07-24T12:00:08","guid":{"rendered":"http:\/\/blog.plataformatec.com.br\/?p=4134"},"modified":"2014-07-24T11:05:31","modified_gmt":"2014-07-24T14:05:31","slug":"the-new-html-sanitizer-in-rails-4-2","status":"publish","type":"post","link":"http:\/\/blog.plataformatec.com.br\/2014\/07\/the-new-html-sanitizer-in-rails-4-2\/","title":{"rendered":"The new HTML sanitizer in Rails 4.2"},"content":{"rendered":"
The article below was originally written by Kasper Timm Hansen (@kaspth on github<\/a> & twitter<\/a>) about his work during the Google Summer of Code 2013.<\/p>\n Kasper and I worked a lot changing the underlying implementation of the This new implementation should be fully backward compatible, with no changes to the API, which should make the update easier.<\/p>\n You can find more information about the previous and the new implementation in this talk<\/a> I presented at a Brazilian conference this year (the slides are in English).<\/p>\n Now, I’ll let Kasper share his words with you.<\/p>\n Everyone has, at least once, needed to use the If you ran this on Rails 4.1 or earlier, it would use html-scanner, a vendored library inside Rails, for the sanitization. Since the summer of 2013 I have been working to destroy that notion by wiping the traces of html-scanner from Rails. Before you become concerned about my mental health: I didn’t do this without reason. I’m one of the Google Summer of Code students<\/a> working on Ruby on Rails. My project proposal<\/a> was to kick html-scanner to the curb (technical term) and grab hold of Loofah<\/a> instead. Why did the old library need replacing, though?<\/p>\n html-scanner has been with us for a long time now. The copyright notice<\/a> in the library clocks it in at 2006, when Assaf Arkin created it. This library relies on Regular Expressions<\/a> to recognize HTML (and XML) elements. This made the code brittle. It was easy to introduce errors via complex Regular Expressions<\/a>, which also gave it a higher potential for security issues.<\/p>\n The Rails Team wanted something more robust and faster, so we picked Loofah. Loofah uses Nokogiri<\/a> for parsing, which provides a Ruby interface to either a C or Java parser depending on the Ruby implementation you use. This means Loofah is fast. 
It’s 60 to 100% faster<\/a> than html-scanner on larger documents and fragments.<\/p>\n I started by taking a look at the Let’s take a look at the Comparing with the old implementation, When you use No surprises here.<\/p>\n The first trace of Loofah. A fragment<\/a> is a part of a document, but without a DOCTYPE declaration and html and body tags. Essentially, a piece of a document. Internally Nokogiri creates a document and pulls the parsed HTML out of the body tag, leaving us with a fragment.<\/p>\n You can pass your own We have been very keen on maintaining backwards compatibility throughout this project, so you can still supply That’s simply what The Lastly we’ll take the elements and extract the remaining markup with So there you have it. I could go through how the other sanitizers work, but they’re not that complex. So go code spelunking in the source<\/a>.<\/p>\n If this is the first time you’ve seen a Before I scrub off though, I promised you an example of a custom scrubber. I’ll use the option that wraps a block here, but you could easily create a subclass of The code above changes all the HTML tags included in the article body to be a tag \n <\/a><\/span><\/span> The article below was originally written by Kasper Timm Hansen (@kaspth on github & twitter) about his work during the Google Summer of Code 2013. Kasper and I worked a lot changing the underlying implementation of the sanitize helper to give Rails developers a more robust, faster and secure solution to sanitize user input. 
This … \u00bb<\/a><\/p>\n","protected":false},"author":15,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[1],"tags":[7],"aioseo_notices":[],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134"}],"collection":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/comments?post=4134"}],"version-history":[{"count":12,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134\/revisions"}],"predecessor-version":[{"id":4147,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4134\/revisions\/4147"}],"wp:attachment":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/media?parent=4134"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/categories?post=4134"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/tags?post=4134"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}sanitize<\/code> helper to give Rails developers a more robust, faster and secure solution to sanitize user input.<\/p>\n
Scrubbing Rails Free of HTML-scanner<\/h3>\n
sanitize<\/code><\/a> method to scrub some pesky HTML away.<\/p>\n
\n<%= sanitize @article.body %>\n<\/pre>\n
The out washed HTML-scanner<\/h3>\n
SanitizeHelper<\/code><\/a> in Action View, which consists of four methods and some settings. The four methods are
sanitize<\/code>,
sanitize_css<\/code>,
strip_tags<\/code> and
strip_links<\/code>.<\/p>\n
sanitize<\/code> method.<\/p>\n
sanitize<\/code> still uses the
WhiteListSanitizer<\/code> class to do its HTML stripping. However, since Action View was pulled out of Action Pack and both needed to use this functionality, we’ve extracted this to its own gem<\/a>.<\/p>\n
Developers meet Rails::Html::WhiteListSanitizer<\/h3>\n
sanitize<\/code>, you’re really using
WhiteListSanitizer<\/code>‘s
sanitize<\/code><\/a> method. Let me show you the new version.<\/p>\n
\ndef sanitize(html, options = {})\n return nil unless html\n return html if html.empty?\n<\/pre>\n
\n loofah_fragment = Loofah.fragment(html)\n<\/pre>\n
\n if scrubber = options[:scrubber]\n # No duck typing, Loofah ensures subclass of Loofah::Scrubber\n loofah_fragment.scrub!(scrubber)\n<\/pre>\n
Scrubber<\/code><\/a> to
sanitize<\/code>! Giving you the power to choose if and how elements are sanitized. As the comment alludes, any scrubber has to be either a subclass of
Loofah::Scrubber<\/code> or it can wrap a block. I’ll show an example later.<\/p>\n
\n elsif allowed_tags(options) || allowed_attributes(options)\n @permit_scrubber.tags = allowed_tags(options)\n @permit_scrubber.attributes = allowed_attributes(options)\n loofah_fragment.scrub!(@permit_scrubber)\n<\/pre>\n
Enumerable<\/code>s of tags and attributes to
sanitize<\/code>. That’s what the
PermitScrubber<\/code><\/a> used here handles. It manages these options and makes them work independently. If you pass only one, it’ll use the standard behavior for the other. See the documentation<\/a> for what the standard behavior is.
\nYou can also set the allowed tags and attributes at the class level, like this:<\/p>\n\nRails::Html::Sanitizer.allowed_tags = Set.new %w(for your health)\n<\/pre>\n
allowed_tags<\/code> and
allowed_attributes<\/code> methods are there for. They’ll return the tags or attributes from the options and fall back to the class level setting, if any.<\/p>\n
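The fallback described above can be sketched in plain Ruby. This is only a minimal illustration, not the gem's actual source: `SketchSanitizer` is a hypothetical stand-in class, and the `:tags` and `:attributes` option keys are assumed to be the ones passed to `sanitize`.

```ruby
require "set"

# Hypothetical stand-in for the sanitizer class; it models only the
# option-versus-class-level fallback, not any actual HTML scrubbing.
class SketchSanitizer
  class << self
    # Class level settings; both stay nil until someone assigns them.
    attr_accessor :allowed_tags, :allowed_attributes
  end

  # A per-call option wins; otherwise use the class level setting (may be nil).
  def allowed_tags(options)
    options[:tags] || self.class.allowed_tags
  end

  def allowed_attributes(options)
    options[:attributes] || self.class.allowed_attributes
  end
end
```

With `SketchSanitizer.allowed_tags = Set.new(%w(p em))` set once, a call that passes `tags: %w(a)` gets its own list back, while a call with no `:tags` option gets the class level set. Because each method falls back independently, passing only tags leaves the attributes on whatever the class level setting provides.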
\n else\n remove_xpaths(loofah_fragment, XPATHS_TO_REMOVE)\n loofah_fragment.scrub!(:strip)\n end\n<\/pre>\n
StripScrubber<\/code><\/a> built into
Loofah<\/code> will strip the tags but leave the contents of elements, which is usually what we want. We use
remove_xpaths<\/code> to remove elements along with their subtrees in the few instances where we don’t. If you have trouble with the syntax above, they’re XPath Selectors<\/a>.<\/p>\n
\n loofah_fragment.to_s\nend\n<\/pre>\n
to_s<\/code>. Internally Nokogiri will call either
to_xml<\/code> or
to_html<\/code><\/a> depending on the kind of document or fragment you have.<\/p>\n
Rub, buff or clean it off, however you like<\/h3>\n
Loofah::Scrubber<\/code>, be sure to check out the source<\/a> for
PermitScrubber<\/code> and see an example of how to implement one. You can also subclass
PermitScrubber<\/code> and get the sanitization you need without worrying about the implementation details of stripping elements and scrubbing attributes. Take a look at
TargetScrubber<\/code> – the weird
PermitScrubber<\/code> – and how it uses that to get scrubbing fast.<\/p>\n
Loofah::Scrubber<\/code> (in a helper maybe?) and override
scrub(node)<\/code><\/a>. So here goes:<\/p>\n
\n<%= sanitize @article.body,\n scrubber: Loofah::Scrubber.new { |node| node.name = \"script\" } %>\n<\/pre>\n
<script><\/code>.<\/p>\n
<sarcasm><\/code>
\nIf you’re going to introduce bugs, why not make everything a potential risk of running arbitrary code?
\n<\/sarcasm><\/code><\/p>\n
\n<\/p>\n","protected":false},"excerpt":{"rendered":"