What's the best way to parse a body of text against multiple (15+) regexes on each line?

I have a body of text to scan, and each line contains at least two and sometimes four pieces of information. The problem is that each line can be any one of 15-20 different actions.

In Ruby, the current code looks something like this:

text.split("\n").each do |line|  #around 20 times..

..............

      expressions['actions'].each do |pat, reg| #around 20 times

.................

This nested loop is obviously the problem. I did manage to make it faster (by about 50% in C++) by combining all the regexes into one, but that is still not the speed I need: I have to parse thousands of these files fast.
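For readers unfamiliar with the "one combined regex" idea, here is a minimal Ruby sketch. It is not the code from the question: the pattern covers only the line shapes in the sample text, and the name `parse_with_combined_regex`, the group names, and the hash keys are all made up for illustration. It also assumes amounts are plain digits with an optional decimal part.

```ruby
# One alternation instead of a loop over ~20 separate patterns.
# Hypothetical pattern: it only covers the line shapes in the sample text.
COMBINED = /\A(?:
    (?<player>.+?)[ ](?<verb>posts|calls|raises[ ]to)[ ]\$(?<amount>\d+(?:\.\d+)?)
  | (?<folder>.+?)[ ]folds
  | The[ ]button[ ]is[ ]in[ ]seat[ ]\#(?<seat>\d+)
  | Dealt[ ]to[ ](?<hero>.+?)[ ]\[(?<cards>[^\]]+)\]
)\z/x

# Returns a hash describing the action, or nil for lines it does not recognise.
def parse_with_combined_regex(line)
  m = COMBINED.match(line)
  return nil unless m

  # Only the groups of the alternative that matched are non-nil.
  if m[:verb]
    { type: m[:verb], player: m[:player], amount: m[:amount] }
  elsif m[:folder]
    { type: "folds", player: m[:folder] }
  elsif m[:seat]
    { type: "button", seat: m[:seat].to_i }
  else
    { type: "dealt", player: m[:hero], cards: m[:cards].split }
  end
end
```

Each line is matched once, and the named groups that come back non-nil tell you which alternative fired. Whether this is actually faster than the loop depends on the regex engine and the patterns, so it is worth measuring.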

Right now I match the lines with regexes, but this is intolerably slow. I started in Ruby and moved to C++ hoping for a speed boost, and it just isn't happening.

I've read casually about PEGs and grammar-based parsing, but they look somewhat difficult to implement. Is this the direction I should head in, or are there other routes?
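One possible "other route", short of a full PEG or grammar library, is to dispatch on literal prefixes and suffixes with plain string methods and skip regexes altogether. The sketch below is only an illustration under assumptions: the name `parse_by_keyword` and the hash keys are invented, it handles only the line shapes in the sample text, and it needs Ruby 2.5+ for `delete_prefix` and `delete_suffix`. It has not been benchmarked against the regex versions.

```ruby
# Verbs that are followed by a dollar amount (hypothetical, from the sample text).
MONEY_VERBS = [" posts", " calls", " raises to"].freeze

# Classifies a line by its fixed prefix or suffix, then slices out the fields.
# Returns nil for lines it does not recognise.
def parse_by_keyword(line)
  if line.end_with?(" folds")
    { type: "folds", player: line.delete_suffix(" folds") }
  elsif line.start_with?("Dealt to ") && line.end_with?("]")
    body = line.delete_prefix("Dealt to ").delete_suffix("]")
    # Split on the last " [" so player names containing spaces stay intact.
    name, _sep, cards = body.rpartition(" [")
    { type: "dealt", player: name, cards: cards.split }
  elsif line.start_with?("The button is in seat #")
    { type: "button", seat: line.delete_prefix("The button is in seat #").to_i }
  else
    head, sep, amount = line.rpartition(" $")
    return nil if sep.empty? # no dollar amount on this line

    verb = MONEY_VERBS.find { |v| head.end_with?(v) }
    return nil unless verb

    { type: verb.strip, player: head.delete_suffix(verb), amount: amount }
  end
end
```

The line formats here are rigid, with fixed keywords at known positions, so the parser rarely needs the generality of a regex. A grammar-based parser takes the same idea further by describing every line shape formally.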

Basically, I'm parsing poker hand histories, and each line of a hand history usually contains 2-3 pieces of information that I need to collect: who the player was, how much money or which cards the action involved, and so on.

Sample text that needs to be parsed:

buriedtens posts $5
The button is in seat #4
*** HOLE CARDS ***
Dealt to Mayhem 31337 [8s Ad]
Sherwin7 folds
OneMiKeee folds
syhg99 calls $5
buriedtens raises to $10

After I collect this information, each action is turned into an XML node.
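The XML schema is not shown in the question, so the following is only a guess at what that last step could look like. The element name `action`, the attribute names, and the helper `action_to_xml` are all hypothetical. It uses Ruby's built-in `String#encode(xml: :attr)`, which escapes and double-quotes an attribute value.

```ruby
# Builds one self-closing XML element from a hash of collected fields.
# Every key becomes an attribute; values are escaped and quoted by encode.
def action_to_xml(fields)
  attrs = fields.map { |name, value| "#{name}=#{value.to_s.encode(xml: :attr)}" }
  "<action #{attrs.join(' ')}/>"
end

puts action_to_xml(type: "posts", player: "buriedtens", amount: 5)
# prints: <action type="posts" player="buriedtens" amount="5"/>
```

Escaping matters because player names are free-form and may contain characters such as `<` or `&`.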

Right now my Ruby implementation is much faster than my C++ one, but that's probably just because I haven't written C code in well over 4-5 years.

UPDATE: I don't want to post all the code here, but so far my hands/second figures look like this:

588 hands/second -- boost::spirit in C++
60 hands/second -- 1 very long and complicated regex in C++ (all the regexes put together)
33 hands/second -- normal regex style in Ruby

I'm currently testing ANTLR to see if we can go any further, but as of right now I'm very happy with Spirit's results.

Related question: Efficiently querying one string against multiple regexes.

Answers


I would suggest

Good luck

