Match alphanumeric string in nltk grammar
I'm trying to use NLTK's grammar and parsing algorithms, as they seem pretty simple to use. However, I can't find a way to match an alphanumeric string properly, something like:
    import nltk

    grammar = nltk.parse_cfg("""
    # Is this possible?
    TEXT -> \w*
    """)

    parser = nltk.RecursiveDescentParser(grammar)
    print parser.parse("foo")
Is there an easy way to achieve this?
It would be very difficult to do cleanly. The base parser classes rely on exact matches against the production RHS to pop content, so it would require subclassing and rewriting large parts of the parser class. I attempted it a while ago with the feature grammar class and gave up.
What I did instead is more of a hack: I extract the regex matches from the text first and add them to the grammar as productions. It will be very slow if you are using a large grammar, since it needs to rebuild the grammar and parser on every call.
    import re

    import nltk
    from nltk.grammar import Nonterminal, Production, ContextFreeGrammar

    grammar = nltk.parse_cfg("""
    S -> TEXT
    TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
    """)

    productions = grammar.productions()


    def literal_production(key, rhs):
        """ Return a production <key> -> n

        :param key: symbol for lhs:
        :param rhs: string literal:
        """
        lhs = Nonterminal(key)
        return Production(lhs, [rhs])


    def parse(text):
        """ Parse some text.
        """
        # extract new words and numbers
        words = set([match.group(0) for match in re.finditer(r"[a-zA-Z]+", text)])
        numbers = set([match.group(0) for match in re.finditer(r"\d+", text)])

        # Make a local copy of productions
        lproductions = list(productions)

        # Add a production for every word and number
        lproductions.extend([literal_production("WORD", word) for word in words])
        lproductions.extend([literal_production("NUMBER", number) for number in numbers])

        # Make a local copy of the grammar with extra productions
        lgrammar = ContextFreeGrammar(grammar.start(), lproductions)

        # Load grammar into a parser
        parser = nltk.RecursiveDescentParser(lgrammar)

        tokens = text.split()

        return parser.parse(tokens)


    print parse("foo hello world 123 foo")
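The code above targets the NLTK 2 API. As far as I know, NLTK 3 dropped `nltk.parse_cfg` in favor of `nltk.CFG.fromstring`, so one way to carry the same hack over is to generate the extra productions as grammar text. Below is a minimal Python 3 sketch of that step using only the standard library. `TOKEN_CLASSES` and `grammar_text_for` are names made up for illustration, and the patterns are assumed to match whole whitespace-separated tokens.

```python
import re

BASE_GRAMMAR = """
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
"""

# Hypothetical token classes (an assumption for this sketch).
# Each pattern must match an entire token, not just part of it.
TOKEN_CLASSES = {
    "WORD": re.compile(r"[a-zA-Z]+"),
    "NUMBER": re.compile(r"\d+"),
}


def grammar_text_for(tokens):
    """Return the base grammar plus one lexical rule per non-empty token class."""
    lines = [BASE_GRAMMAR.strip()]
    for symbol, pattern in TOKEN_CLASSES.items():
        # Deduplicate and sort so the generated text is deterministic.
        matches = sorted({tok for tok in tokens if pattern.fullmatch(tok)})
        if matches:
            # Quoting is safe here because the patterns above admit
            # only letters or digits, never quote characters.
            alternatives = " | ".join("'{}'".format(tok) for tok in matches)
            lines.append("{} -> {}".format(symbol, alternatives))
    return "\n".join(lines)


tokens = "foo hello world 123 foo".split()
print(grammar_text_for(tokens))
```

The resulting string could then be passed to `nltk.CFG.fromstring` and a parser rebuilt from it, which has the same per-call cost as the approach above. Tokens that match no class (for example a mixed token like `foo123`) get no production, so the parser would reject them unless another class is added.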
Here's more background from a discussion of this on the nltk-users group on Google Groups: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion