Convert text to XML using XSLT without knowing what the “schema” would be

I have a text document that I want to convert to XML using XSLT for easier processing. The source file is fairly generic, for example:

[{c=1,d=2},{cc=11,dd=22}]%{f=4,g=5,h={i=6,j=[7,8]}}%

I'd like to transform this to an XML file such as this:

<document>
    <header>
        <item>
            <c>1</c>
            <d>2</d>
        </item>
        <item>
            <cc>11</cc>
            <dd>22</dd>
        </item>
    </header>
    <content>
        <f>4</f>
        <g>5</g>
        <h>
            <i>6</i>
            <j>
                <elt>7</elt>
                <elt>8</elt>
            </j>
        </h>
    </content>
</document>

So in essence, the string before an "=" is the tag name and everything after it is the content (with nesting); the only additions are the document, header, content and elt nodes. The original file will likely contain each value and every "}" on a separate line, but that is not guaranteed (I don't know whether that matters).

I found some answers for similar cases where text is converted to XML, but there the resulting node names and nesting levels are always known beforehand. My gut feeling is that there should be a relatively simple solution to this, but unfortunately I only know that XSLT is powerful and useful, not how to write it...

Thanks in advance for the help, DeColaman

Answers


As Michael suggested, this indeed looks like a nice exercise for REx. The sample shows some similarity to JSON, but for demonstration, let's guess an even simpler REx grammar:

source     ::= item '%' item '%' eof
item       ::= '{' ( named-item ( ',' named-item )* )? '}'
             | '[' ( item ( ',' item )* )? ']'
             | element
named-item ::= name '=' item
<?TOKENS?>
name       ::= [a-z]+
element    ::= [0-9]+
eof        ::= $

Put it in a file named source.ebnf, and use REx to generate an XSLT-coded parser from it, by configuring options XSLT and parse tree, or using command line -xslt -tree.

The parser contains a function named p:parse-source that accepts the input as a string and turns it into a concrete syntax tree according to the above grammar. The syntax tree contains an element for each nonterminal or named token, and a TOKEN element for each unnamed token.

That syntax tree then must be transformed into the target structure. Import the generated parser from file source.xslt into the XSLT below:

<xsl:stylesheet xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
                xmlns:p="source">

  <xsl:import href="source.xslt"/>
  <xsl:output indent="yes"/>
  <xsl:variable name="input" select="'[{c=1,d=2},{cc=11,dd=22}]%{f=4,g=5,h={i=6,j=[7,8]}}%'"/>

  <xsl:template match="/">
    <xsl:variable name="parse-tree" select="p:parse-source($input)"/>
    <xsl:choose>
      <xsl:when test="not($parse-tree/self::source)">
        <xsl:sequence select="$parse-tree"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="item">
          <xsl:apply-templates select="$parse-tree/item"/>
        </xsl:variable>
        <xsl:element name="document">
          <xsl:element name="header">
            <xsl:sequence select="$item/*[1]/node()"/>
          </xsl:element>
          <xsl:element name="content">
            <xsl:sequence select="$item/*[2]/node()"/>
          </xsl:element>
        </xsl:element>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="item">
    <xsl:variable name="items">
      <xsl:apply-templates select="*[not(self::TOKEN)]"/>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="count($items/*) eq 1">
        <xsl:sequence select="$items"/>      
      </xsl:when>
      <xsl:otherwise>
        <xsl:element name="item">
          <xsl:sequence select="$items"/>
        </xsl:element>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="named-item">
    <xsl:element name="{name}">
      <xsl:variable name="item">
        <xsl:apply-templates select="item"/>
      </xsl:variable>
      <xsl:sequence select="$item/*/node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="element">
    <xsl:element name="elt">
      <xsl:sequence select="node()"/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

Running the above on an XSLT 2.0 processor, e.g. Saxon, will generate the desired result.


You're basically trying to write a parser for some grammar. That is quite feasible, but it helps to know exactly what the grammar is, and it helps to know a little about how to write a recursive descent parser. From your sample it looks like a recursive grammar (values can nest to arbitrary depth), which means you can't do it purely using regular expressions.

You might like to take a look at REx, Gunther Rademacher's tool for generating parsers in XQuery or (recently) XSLT. It's not well documented but it's very powerful.
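
To make the recursive-descent idea concrete, here is a minimal hand-written sketch in Python (not XSLT, and not what REx generates). It assumes the grammar guessed from the sample: lowercase names, numeric values, "{...}" for named members, "[...]" for lists, and "%" closing the header and the content. Skipping whitespace between tokens, emitting "item" for non-scalar list members, and all function and class names are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# One token: a lowercase name, a number, or a punctuation character.
# Leading whitespace is skipped (an assumption, since values may sit on separate lines).
TOKEN = re.compile(r"\s*([a-z]+|[0-9]+|[\[\]{}=,%])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input near offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """One method per grammar rule; nesting is handled by plain recursion."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise ValueError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def parse_source(self):  # source ::= item '%' item '%' eof
        header = self.parse_item()
        self.expect("%")
        content = self.parse_item()
        self.expect("%")
        if self.peek() is not None:
            raise ValueError(f"trailing input: {self.peek()!r}")
        return header, content

    def parse_item(self):  # item ::= '{' ... '}' | '[' ... ']' | number
        token = self.peek()
        if token == "{":
            return ("map", self.parse_members("{", "}", self.parse_named_item))
        if token == "[":
            return ("list", self.parse_members("[", "]", self.parse_item))
        if token is not None and token.isdigit():
            self.pos += 1
            return ("value", token)
        raise ValueError(f"expected an item, found {token!r}")

    def parse_members(self, opener, closer, parse_member):
        # Comma-separated, possibly empty sequence between opener and closer.
        self.expect(opener)
        members = []
        if self.peek() != closer:
            members.append(parse_member())
            while self.peek() == ",":
                self.pos += 1
                members.append(parse_member())
        self.expect(closer)
        return members

    def parse_named_item(self):  # named-item ::= name '=' item
        name = self.peek()
        if name is None or not name.isalpha():
            raise ValueError(f"expected a name, found {name!r}")
        self.pos += 1
        self.expect("=")
        return (name, self.parse_item())


def fill(parent, item):
    """Append the XML form of a parsed item to the given parent element."""
    kind, payload = item
    if kind == "value":
        parent.text = payload
    elif kind == "map":
        for name, child in payload:
            fill(ET.SubElement(parent, name), child)
    else:  # list: scalars become <elt>, everything else becomes <item>
        for child in payload:
            fill(ET.SubElement(parent, "elt" if child[0] == "value" else "item"), child)


def to_xml(text):
    header, content = Parser(text).parse_source()
    document = ET.Element("document")
    fill(ET.SubElement(document, "header"), header)
    fill(ET.SubElement(document, "content"), content)
    return ET.tostring(document, encoding="unicode")
```

Calling to_xml('[{c=1,d=2},{cc=11,dd=22}]%{f=4,g=5,h={i=6,j=[7,8]}}%') returns the document from the question as a single unindented line. The same structure (one function per grammar rule, recursing for nested items) is what a generated parser automates.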

