How can I programmatically get all of a string's unicode entities to resolve themselves?

I'm trying to mitigate XSS. How can I protect against this:

j&#X41vascript:alert('test2')

in the href of a link?

What I've tried so far just assigns the literal, unresolved value of the above string as a relative path in the href, not a proper javascript: href capable of triggering code execution. I'm wondering how an attacker might be able to exploit this.

I've tried the following:

a = document.createElement('a');

and then both this:

a.href = "j&#X41vascript:alert('test2')";

and this:

a.setAttribute('href', "j&#X41vascript:alert('test2')");

But both return "j&#X41vascript:alert('test2')" upon then querying a.href, not the desired (or undesired, depending on your perspective) javascript:alert('test2');

If I can get all the entities to resolve, then I can parse out all occurrences of javascript: in the resulting string, and be safe -- right?

The other thing I was wondering: what if someone submits j&amp;#X41vascript:steal_cookie();? I mean, theoretically, they could have infinite levels of nesting, and it would all ultimately resolve, right?

Edit: how does this code look?

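function resolve_entities(str) {
  var s = document.createElement('span')
    , nestTally = str.match(/&/) ? 0 : 1
    , limit = 5
    , limitReached = false;

  s.innerHTML = str;
  // Keep re-parsing for as long as the decoded text still contains an ampersand.
  while (s.textContent.match(/&/)) {
    s.innerHTML = s.textContent;
    if (nestTally++ >= limit) {
      limitReached = true;
      break; // stop after `limit` rounds instead of looping forever
    }
  }
  return s.textContent;
}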


XML/HTML character entities like &#x41; or &amp; are decoded when the string containing them is parsed as XML or HTML. Typically, this happens when they are sent from the server to the browser as part of an HTML page, although there are other situations (such as assigning to element.innerHTML in JavaScript) which can cause a string to be parsed as XML or HTML.

Reading or writing to element attributes in JavaScript does not trigger XML/HTML parsing, and thus does not expand character entities. If you write

a.href = "j&#x41;vascript:alert('test')";

then the href attribute of that a element will be j&#x41;vascript:alert('test'), ampersands and all.
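To see this without a DOM, the standard URL constructor can stand in for what reading a.href does, since both resolve the string against a base URL. This is a minimal sketch, assuming a placeholder base of https://example.com/page; the helper name describeHref is made up for illustration.

```javascript
// Hypothetical helper: resolves `value` against `base`, roughly the way a
// link's href is resolved, and reports the parts that matter here.
function describeHref(value, base) {
  const url = new URL(value, base);
  return { protocol: url.protocol, pathname: url.pathname };
}

const base = 'https://example.com/page'; // assumed placeholder base URL

// Undecoded character reference: "j&" is not a valid scheme, so the whole
// string is treated as a relative URL (and the "#" starts the fragment).
const literal = describeHref("j&#x41;vascript:alert('test')", base);

// The same string after one round of HTML decoding: a real javascript: URL.
const decoded = describeHref("jAvascript:alert('test')", base);

console.log(literal.protocol, literal.pathname);
console.log(decoded.protocol);
```

The first value stays on the page's own scheme as a relative path, which matches what the question observed; only the already-decoded form is recognised as a javascript: URL.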

What's important to note is that, whenever a string is parsed as XML or HTML, character entities are decoded exactly once. Thus, &#x41; becomes A, while &amp;#x41; becomes &#x41;. It will not "all ultimately resolve", unless you're doing something silly like reading from .textContent and assigning to .innerHTML repeatedly.
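The decode-exactly-once behaviour can be illustrated with a small stand-in for the parser. This is only a sketch: decodeOnce is a hypothetical helper that handles numeric references and four named ones, not a full HTML parser, but it makes a single pass in the same way.

```javascript
// Hypothetical helper: decodes a small subset of character references in ONE
// pass. The replacement text is never rescanned, which is the point here.
function decodeOnce(text) {
  const named = { amp: '&', lt: '<', gt: '>', quot: '"' };
  const pattern = /&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|(amp|lt|gt|quot));/g;
  return text.replace(pattern, (match, hex, dec, name) => {
    if (name !== undefined) return named[name];
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched instead of throwing.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

const once = decodeOnce('j&amp;#x41;vascript:');  // one parse
const twice = decodeOnce(once);                   // a second, separate parse

console.log(once);
console.log(twice);
```

A doubly encoded value only turns into a working scheme if something feeds the output of the first parse into a second one, which is what the .textContent to .innerHTML loop does.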

Once the parsing is complete, it's completely irrelevant whether any character sequences in the output might or might not look like XML/HTML character entities — that is, unless you then take the output and feed it through an XML/HTML parser again. (Doing that is very rarely useful, and usually only happens due to bugs such as assigning to .innerHTML when one should have assigned to .textContent.)

Anyway, looking at the comments, you say you're writing some client-side JavaScript code that's getting some untrusted data from a server you don't control, and you're worried that simply assigning the data to .innerHTML could allow XSS attacks. If so, there are two cases:

  1. The data you receive is meant to be plain text. In that case, you should just assign it to .textContent and be done with it.

  2. The data you receive is, in fact, meant to be HTML. In that case you do need to undertake the difficult and laborious job of sanitizing it. This JavaScript HTML sanitizer from the Caja project might help.
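For the href case from the question specifically, one small piece of that sanitizing could be to allow-list URL schemes rather than search for javascript: in the string. The following is a minimal sketch and not part of Caja or any library: isSafeHref, ALLOWED_SCHEMES, and the example.com base are all hypothetical names chosen for illustration.

```javascript
// Hypothetical allow-list; adjust to the schemes your application needs.
const ALLOWED_SCHEMES = new Set(['http:', 'https:', 'mailto:']);

// Returns true only if `href`, resolved against `base`, uses an allowed scheme.
// `base` is an assumed placeholder standing in for the page's own URL.
function isSafeHref(href, base = 'https://example.com/') {
  let url;
  try {
    url = new URL(href, base);
  } catch (e) {
    return false; // values the URL parser rejects are treated as unsafe
  }
  return ALLOWED_SCHEMES.has(url.protocol);
}

console.log(isSafeHref('https://example.org/page'));
console.log(isSafeHref("JaVaScRiPt:alert('test2')"));
```

The check should run on the value exactly as it will be assigned to the attribute from JavaScript. Because the URL parser normalizes case and strips stray whitespace before the scheme is compared, the check does not depend on spotting every spelling of javascript: by hand.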

