To Escape or Not to Escape, That Is The Question

Posted by Jon

ESAPI Canonicalization

From the Encoder.canonicalize JavaDoc:

Canonicalization is simply the operation of reducing a possibly encoded string down to its simplest form. This is important, because attackers frequently use encoding to change their input in a way that will bypass validation filters, but still be interpreted properly by the target of the attack. Note that data encoded more than once is not something that a normal user would generate and should be regarded as an attack.

Everyone says you shouldn't do validation without canonicalizing the data first. This is easier said than done. The canonicalize method can be used to simplify just about any input down to its most basic form. Note that canonicalize doesn't handle Unicode issues, it focuses on higher level encoding and escaping schemes. In addition to simple decoding, canonicalize also handles:

  • Perverse but legal variants of escaping schemes
  • Multiple escaping (%2526 or &#x26;lt;)
  • Mixed escaping (%26lt;)
  • Nested escaping (%%316 or &%6ct;)
  • All combinations of multiple, mixed, and nested encoding/escaping (%2&#x35;3c or &#x2526gt;)

What's wrong with this?

Not all data that looks doubly encoded is actually doubly encoded.

Hideous Java example:

public void getTitleSuggestionsEntryPoint(HttpServletRequest req) {
    String hint = req.getParameter("hint");
    // bug 12345, infosec told us to do this
    String safeHint = getEsapiEncoder().canonicalize(hint);
    Suggestions suggestions = dao.getTitleSuggestions(safeHint);
    // ...
}

public Suggestions getTitleSuggestions(String hint) {
  try {
    Connection conn = getConnection();
    // Keep the SQL text and the user's hint in separate variables;
    // the hint is what gets bound to the LIKE placeholder.
    String sql = "SELECT * FROM Movies WHERE title LIKE ?";
    PreparedStatement pstmt = conn.prepareStatement(sql);
    pstmt.setString(1, hint);
    ResultSet rs = pstmt.executeQuery();
    // ... build and return the Suggestions from rs
  } catch (SQLException e) {
    throw new RuntimeException(e);
  }
}

OK, the above is totally contrived, I get it. Even more contrived, say the application allows its users to enter the "%" and "_" characters to assist in the query, which the SQL LIKE operator interprets as "zero or more characters" and "any single character", respectively.
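To make the wildcard behavior concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the Java DAO. The table contents and the title_suggestions helper are assumptions for illustration; only the LIKE semantics matter.

```python
import sqlite3

# Hypothetical in-memory stand-in for the Movies table in the Java example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT)")
conn.executemany(
    "INSERT INTO Movies (title) VALUES (?)",
    [("The Shining",), ("Dr. Strangelove",), ("Full Metal Jacket",)],
)

def title_suggestions(hint):
    """Bind the hint as a LIKE pattern; % and _ inside it act as wildcards."""
    rows = conn.execute(
        "SELECT title FROM Movies WHERE title LIKE ? ORDER BY title", (hint,)
    )
    return [title for (title,) in rows]

print(title_suggestions("%Metal%"))      # % stands for zero or more characters
print(title_suggestions("The Shinin_"))  # _ stands for exactly one character
```

Binding the hint as a parameter keeps it out of the SQL text, but the bound value is still a pattern, so every "%" and "_" in it carries meaning.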

Now I love Kubrick films. And I want to get suggestions based on '2001 Space Odyssey' as a hint. But I can't spell odyssey without having to look it up (truth). So I enter the following: %2001 Space%. If this is sent via a GET request, the hint parameter should look like this on the wire: %252001%20Space%25. When the Java container receives the request, it decodes the first level of URL encoding, setting the hint variable to %2001 Space%. Then canonicalize runs, and safeHint is set to " 01 Space%" (note the leading space).
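The round trip for %2001 Space% can be traced with Python's urllib.parse. This is only a sketch: quote and unquote stand in for the browser, the servlet container, and the extra decode, and the variable names are mine.

```python
from urllib.parse import quote, unquote

typed = "%2001 Space%"                    # what the user typed into the form

on_the_wire = quote(typed)                # what the client puts in the query string
after_container = unquote(on_the_wire)    # the container's single, correct decode
decoded_again = unquote(after_container)  # one decode too many

print(on_the_wire)
print(repr(after_container))
print(repr(decoded_again))
```

One decode gives back exactly what was typed. The second decode treats the user's own "%20" as an escape and silently changes the query.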

Wait, what?

The value %20 could be interpreted as the URL encoded value for a space character. However, this data is headed for a SQL LIKE pattern, not a URL, so that interpretation is incorrect. Encoder.canonicalize doesn't know this. As it iterates over its codecs, it calls PercentCodec.decode, which reasons that if a sequence walks like a URL encoded value and quacks like a URL encoded value, it must be a URL encoded value. So it decodes it, with no idea that the data is going into a SQL LIKE context.

And here's the above as a simple test:

(jython-env)jpasski@jpasski-mac: ~/.virtualenvs/jython-env
$ jip install org.owasp.esapi:esapi:2.1.0
# ...
$ touch /Users/jpasski/.virtualenvs/jython-env/
$ jython-all
Jython 2.5.3 (2.5:c56500f08d34+, Aug 13 2012, 14:48:36)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_17
Type "help", "copyright", "credits" or "license" for more information.
>>> from org.owasp.esapi.reference import DefaultEncoder
>>> encoder = DefaultEncoder.getInstance()
>>> encoder.canonicalize("%2001 Space%")
u' 01 Space%'
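For readers without a Jython setup, the same effect can be imitated in plain Python. This is a rough sketch of the "apply every decoder until nothing changes" idea, not ESAPI's actual implementation: the codec list, the function names, and the lack of any intrusion detection are all my simplifications.

```python
import html
import re

_PERCENT = re.compile(r"%([0-9a-fA-F]{2})")

def percent_decode(text):
    """Decode anything shaped like %XX, with no idea where the text is headed."""
    return _PERCENT.sub(lambda m: chr(int(m.group(1), 16)), text)

# Hypothetical codec list; the real Encoder is configured with its own codecs.
CODECS = [html.unescape, percent_decode]

def naive_canonicalize(text):
    """Run every decoder over the text repeatedly until it stops changing."""
    while True:
        decoded = text
        for decode in CODECS:
            decoded = decode(decoded)
        if decoded == text:
            return text
        text = decoded

print(repr(naive_canonicalize("%2001 Space%")))  # the LIKE pattern from above
print(repr(naive_canonicalize("%26lt;")))        # the mixed escaping example
```

Nothing in the loop ever asks which context the data is destined for, which is the whole problem.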

So What To Do?

Don't use Encoder.canonicalize. However, do canonicalize!

Encoder.canonicalize conflates decoding / unescaping with canonicalization. It also decodes / unescapes data without regard to the data's actual context. If it only did canonicalization I'd have no issue with it. But it doesn't.

Canonicalization is reducing multiple ways of describing the same data to just one way. For example, these two UNIX-style file paths are equivalent: /foo/../bar and /bar, with the latter being the canonical form. Here the canonical form ought to be checked against a known trusted prefix, or else CWE-73 rears its ugly head.

As another example, these two sequences are not equivalent: &lt; and <. In an HTML context, the former is a named character reference that represents the escaped form of the latter. The latter, unescaped, puts the HTML tokenizer into the tag open state. Since they aren't equivalent, they shouldn't be treated as such. However, Encoder.canonicalize makes them equivalent.
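Both points can be sketched in a few lines of Python. The trusted prefix /srv/uploads and the helper names are hypothetical, and posixpath.normpath only normalizes lexically (it does not resolve symlinks), so treat this as an illustration rather than a complete CWE-73 defense.

```python
import html
import posixpath

TRUSTED_BASE = "/srv/uploads"  # hypothetical trusted prefix

def canonical_path(base, user_path):
    """Join and lexically normalize; symlinks are not resolved here."""
    return posixpath.normpath(posixpath.join(base, user_path))

def is_inside(base, user_path):
    """Check the canonical form, not the raw input, against the prefix."""
    candidate = canonical_path(base, user_path)
    return candidate == base or candidate.startswith(base + "/")

# Canonicalization: two spellings of the same path collapse to one.
print(posixpath.normpath("/foo/../bar"))
print(is_inside(TRUSTED_BASE, "reports/../summary.txt"))
print(is_inside(TRUSTED_BASE, "../../etc/passwd"))

# Escaping is a different operation: it produces another representation
# for one specific context, and the two strings are not the same data.
print(html.escape("<") == "&lt;")
print("&lt;" == "<")
```

The path check works on the canonical form because both spellings name the same thing. The entity and the raw character do not, so folding one into the other changes the data.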

Again, what to do?

Ideally, the application developer needs to understand the context to which the data is sent. A control applied at the entry point is far removed from the eventual sink contexts and cannot know what each of them requires. One piece of tainted data could easily end up in two or more different contexts, each of which has its own security and usability requirements.

But there just isn't a silver bullet.

Canonicalize data when it can have equivalent but different forms. Validate the input against business requirements, regardless of whether canonicalization occurs. And then, at the point where the context changes, understand the new context's security requirements. That understanding can come from a security library or from research / custom coding, but the need for it doesn't change. Once a developer understands the context, no magic needs to be performed. She or he can use the correct sanitizer, be that an escaper, filter, or validator, and move on to other things like actually adding features.
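As a rough sketch of what that can look like, here is one piece of input validated once and then handled per sink. Everything in it is an assumption for illustration: the business rule, the helper names, the made-up table rows, and the choice of a search feature where "%" and "_" are meant literally (the opposite of the contrived app above).

```python
import html
import re
import sqlite3

# Assumed business rule: 1 to 50 letters, digits, spaces, % or _.
HINT_RULE = re.compile(r"[A-Za-z0-9 %_]{1,50}")

def validate_hint(hint):
    """Reject input that fails the business rule; never rewrite it."""
    if not HINT_RULE.fullmatch(hint):
        raise ValueError("hint does not meet the business rule")
    return hint

def for_html(text):
    """HTML text sink: escape the markup-significant characters."""
    return html.escape(text)

def literal_like(text, esc="\\"):
    """LIKE sink where wildcards are not wanted: escape esc, % and _."""
    for ch in (esc, "%", "_"):  # escape character first, so it is not doubled
        text = text.replace(ch, esc + ch)
    return text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT)")
conn.execute("INSERT INTO Movies VALUES ('100% Kubrick')")  # made-up row
conn.execute("INSERT INTO Movies VALUES ('1001 Kubrick')")  # made-up row

def search_literal(hint):
    """Substring search that treats the user's % and _ as plain characters."""
    pattern = "%" + literal_like(validate_hint(hint)) + "%"
    rows = conn.execute(
        "SELECT title FROM Movies WHERE title LIKE ? ESCAPE '\\' ORDER BY title",
        (pattern,),
    )
    return [title for (title,) in rows]

print(search_literal("100%"))
print(for_html("Tom & Jerry"))
```

The validator knows nothing about SQL or HTML, and each sink helper knows nothing about the business rule. That separation is the point: the sanitizer is chosen where the context is known.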