Finding a substring token in a string in C++

OK suppose I'm parsing some XML (the problem exists while reading any "language" but XML is one many people are familar with).

The XML looks like the following:

<Tag>
  <[CDATA[ blah blah]]>
  <Tag2>
    <Tag3/>
  </Tag2>
<Tag>

Now I want to find the various tokens in that stream. The important tokens are as follows (Please excuse my crappy "token" names ;)).

<           = Open Token
<[CDATA[    = Open CDATA Token
]]>         = Close CDATA Token
<!          = Open Comment Token
/>          = Close Open Token
</          = Open Close Token
>           = Close Token

The problem I have is that I have an array of the above and I am trying to correctly identify one of the above Token's as I read in the file character by character.

So I read the first character, '<'. The instant thought is that this matches with the "Open Token" so we'll select that. However that also matches with the first character of the "Open Close Token". So lets say we read the second character and its a'T'. So I instantly know this is the "Open Token" and not the "Open Close Token".

Equally when finishing a tag, e.g "/>". I read the first character and I get '/'. This matches the "Close Open Token". But its not complete so I should check for the next character, which in this case is '>' giving me "/>" which does match the Close Token.

My issue is that when the numbers of these tokens increases significantly it becomes quite hard to track what the possible matches are. Is there an elegant way to do this? Or should I, simply when I meet the first character of one of the "token strings" push that token on to a vector and then ONLY check those tokens on subsequent reads? If the next character does not match then I can clear the list of tokens and then start over.

Is this the correct way to approach the problem? Is there a better way?

(Edit: Please don't point me towards Lexx, YACC etc ... I'm trying to learn some fundamentals here)

Any help would be much appreciated :)

Answers


You need to track state in the parser - where am I now? what am I expecting next? - in a context-specific way. When you see what you get next, you check it against a list of valid values for current state, and possibly store a completed parsed data item, and possibly change state.

Parsing XML only looks easy by the way - if you really want to do this work by hand yourself, there are lots of corner cases to handle. Your parser is a Finite State Machine, but a non-trivial example of such.


I've been doing a lot of this type of parsing lately (mostly with C#).

I don't know exactly what you are trying to accomplish so I'm not sure how much help this is, but I would parse the entire thing and store it in some sort of data array.

Find the start tag. Then parse whatever text comes next (you know when you've reached the end of the text because you'll either hit whitespace or punctuation).

You can put in a special test for "!" and maybe set a flag in your data structure when it was found. I've found that it's just not practical to do a quick scan for known sequences. You need to break up the entire thing, character by character.

You can see one of my C# results at http://www.softcircuits.com/Blog/post/2010/02/07/Parsing-HTML-Tags-in-C.aspx.


Need Your Help

Fast combinatoric generator in Python

python performance generator combinatorics

As part of a large project in python, I need a fast generator function that produces all possible sets of non-negative integer numbers smaller than n, such that each set has at most s elements and ...

A2R library colored dendrogram, allow more than six chars per label

r

A follow up question to my previous one. R Colored dendrogram suggestions?

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.