Is there a less idiotic and/or actually effective way to do this? (recursively resolving HTML unicode entities)

I'm parsing an untrusted URI, but its URI-hood must be honored. I'm trying to protect against javascript: links, but I feel like I need to recurse on it, since you could have:

javascriptjavascript::

and after stripping out all instances of javascript: get back our old friend javascript: once again.
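To make that concrete, here is a minimal sketch of the problem. The names stripOnce and stripUntilStable are made up for illustration, and this only demonstrates why a single pass is not enough; it is not a recommended defense.

```javascript
// Single pass: remove every literal "javascript:" found in the input.
function stripOnce(uri) {
  return uri.replace(/javascript:/gi, '');
}

// Repeat the pass until the string stops changing (a fixed point).
function stripUntilStable(uri) {
  var previous;
  do {
    previous = uri;
    uri = stripOnce(uri);
  } while (uri !== previous);
  return uri;
}

// One pass reassembles the very thing it was meant to remove.
console.log(stripOnce('javascriptjavascript::'));        // "javascript:"
console.log(stripUntilStable('javascriptjavascript::')); // ""
```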

My other concern is analogously nested Unicode entities. For instance, we could have:

"j&#X41vascript:alert('pwnt')"

...but we could also have:

"j&#&#X5841vascript:alert('pwnt')"

...though I seem to be doing it wrong (whereas a successful attacker obviously won't.)

function resolveEntities(uri) {
  var s = document.createElement('span')
    , nestTally = uri.match(/&/) ? 0 : 1
    , limitReached = false;

  s.innerHTML = uri;
  while (s.textContent.match(/&/)) {
    s.innerHTML = s.textContent;
    if(nestTally++ >= 5) {
      limitReached = true;
      break;
    }
  }

  return encodeURI(s.textContent);
}

Answers


Rather than specifying what you want to blacklist (e.g. javascript: URIs), it's better to specify what you want to whitelist (e.g. http and https only). What about something like this:

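function sanitizeUri(uri) {
  // Scheme names are case-insensitive, so accept "HTTP://" as well as "http://".
  if (!/^https?:\/\//i.test(uri)) {
    // Anything else gets an http:// prefix, so it no longer starts with a javascript: scheme.
    uri = "http://" + uri;
  }
  return uri;
}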
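A stricter variant of the same idea, as a rough sketch: instead of prefixing, parse the URI and reject anything whose scheme isn't on the list. This assumes an environment with the standard URL constructor (modern browsers, Node.js). hasAllowedScheme and ALLOWED_SCHEMES are names invented here, not part of any library.

```javascript
// Assumption: only these two schemes are acceptable for this application.
var ALLOWED_SCHEMES = ['http:', 'https:'];

// Returns true only if the input parses as an absolute URL with an allowed scheme.
function hasAllowedScheme(uri) {
  var parsed;
  try {
    // The URL constructor throws a TypeError for input it cannot parse as an absolute URL.
    parsed = new URL(uri);
  } catch (e) {
    return false;
  }
  // parsed.protocol is lowercased by the parser and includes the trailing colon.
  return ALLOWED_SCHEMES.indexOf(parsed.protocol) !== -1;
}
```

With this, hasAllowedScheme("javascript:alert('pwnt')") is false, and so is a bare "example.com", so the caller has to decide whether to reject such input or prefix it first. Note that this checks the string it is given: if the value is later inserted into HTML, it still has to be escaped for that context.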

Didn't you already ask almost the same question before? Anyway, my suggestion remains the same: use a proper HTML sanitizer.

The particular sanitizer I linked to strips javascript: URLs automatically, but you can also set it up to allow only certain whitelisted URL schemes like Thomas suggests. As he notes, this is a good idea, since it's much safer to only allow schemes like http and https which you know to be safe.

(In particular, whether a given obscure URL scheme is safe or not may depend not only on the user's browser, but also on their OS and on what third-party software they may have installed — a lot of programs like to register themselves as handlers for their own URL schemes.)

