Match alphanumeric string in nltk grammar

I'm trying to use NTLK grammar and parse algorithms as they seem pretty simple to use. Though, I can't find a way to match an alphanumeric string properly, something like:

import nltk
grammar = nltk.parse_cfg ("""
# Is this possible?
TEXT -> \w*  
""")

parser = nltk.RecursiveDescentParser(grammar)

print parser.parse("foo")

Is there an easy way to achieve this?

Answers


It would be very difficult to do cleanly. The base parser classes rely on exact matches or the production RHS to pop content, so it would require subclassing and rewriting large parts of the parser class. I attempted it a while ago with the feature grammar class and gave up.

What I did instead is more of a hack, but basically, I extract the regex matches from the text first, and add them to the grammar as productions. It will be very slow if you are using a large grammar since it needs to recompute the grammar and parser for every call.

import re

import nltk
from nltk.grammar import Nonterminal, Production, ContextFreeGrammar

grammar = nltk.parse_cfg ("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")

productions = grammar.productions()

def literal_production(key, rhs):
    """ Return a production <key> -> n 

    :param key: symbol for lhs:
    :param rhs: string literal:
    """
    lhs = Nonterminal(key)
    return Production(lhs, [rhs])

def parse(text):
    """ Parse some text.
"""

    # extract new words and numbers
    words = set([match.group(0) for match in re.finditer(r"[a-zA-Z]+", text)])
    numbers = set([match.group(0) for match in re.finditer(r"\d+", text)])

    # Make a local copy of productions
    lproductions = list(productions)

    # Add a production for every words and number
    lproductions.extend([literal_production("WORD", word) for word in words])
    lproductions.extend([literal_production("NUMBER", number) for number in numbers])

    # Make a local copy of the grammar with extra productions
    lgrammar = ContextFreeGrammar(grammar.start(), lproductions)

    # Load grammar into a parser
    parser = nltk.RecursiveDescentParser(lgrammar)

    tokens = text.split()

    return parser.parse(tokens)

print parse("foo hello world 123 foo")

Here's more background where this was discussed on the nltk-users group on google groups: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion


Need Your Help

Inherit from Seq

f#

I want to create my own custom collection type.

WCF + Ajax query

c# ajax wcf

I have problem with ajax query, i try make post query for wcf service and this don't work. Help me please.

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.