RegEx to exclude number using PHP
This question is a continuation of my previous question:
I want to split a paragraph string into an array of sentences using a regular expression on the dot character (.). The next problem is about numbers.
Here is an example:
In this year 2013. Hello Mr. Andre, your money is Rp 40.000.
The correct output would be:
Array
(
    [0] => In this year 2013
    [1] => Hello Mr. Andre, your money is Rp 40.000
)
The title problem (Mr.) was already solved in my previous question. I tried adding a regex for numbers, but it still doesn't work.
My code, which does not work:
$titles_number = array('(^[0-9]*)', '(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences = preg_split('/(' . implode('', $titles_number) . ')\./', $text);
print_r($sentences);
Can I do this in one go (one regex that handles both problems)? Tell me if it can't be done. Thanks in advance.
This will be easier to accomplish with preg_match_all():
preg_match_all(
    '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./',
    $subject,
    $result,
    PREG_PATTERN_ORDER
);
print_r($result);
- [^\s.] matches the first character of a sentence: anything that is neither whitespace nor a dot (i.e., it skips over any whitespace between sentences)
- [^.]* gobbles up any non-dot characters
- \. matches a dot IF...
- (?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) ...it's part of an honorific...
- (?=\d) ...or part of a number
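As a rough check, the pattern can be run against the sample paragraph from the question. This is only a sketch: the variable $sentences and the array_map() step that strips the closing dot are additions for illustration, since each full match in $result[0] still ends with its terminating dot.

```php
<?php
// Sample paragraph taken from the question.
$subject = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';

$pattern = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./';

// With PREG_PATTERN_ORDER, $result[0] holds the full matches, one per sentence.
preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER);

// Each match still includes its closing dot; drop that single character
// to get the array the question asks for (illustrative extra step).
$sentences = array_map(
    function ($s) {
        return substr($s, 0, -1);
    },
    $result[0]
);

print_r($sentences);
```

For this input, $sentences should contain "In this year 2013" and "Hello Mr. Andre, your money is Rp 40.000".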
(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) is legal because the alternation is at the top level. That is, it acts like several discrete lookbehinds, each with a fixed length. That's why I had to repeat the \. in every branch instead of using (?<=(?:Prof|Dr|Mr|Mrs|Ms)\.).
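To illustrate the "several discrete lookbehinds" reading, here is a small sketch that compares the single lookbehind with an explicit group of separate ones. The test sentence and the variable names are made up for the example, and the number check is left out to keep the two patterns short.

```php
<?php
// Hypothetical test input with several honorifics.
$subject = 'Dr. Lee met Prof. Tan. Mrs. Wu left.';

// One lookbehind with a top-level alternation, as in the answer.
$combined = '/[^\s.][^.]*(?:\.(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)[^.]*)*\./';

// The same condition spelled out as separate fixed-length lookbehinds.
$discrete = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.)|(?<=Dr\.)|(?<=Mr\.)|(?<=Mrs\.)|(?<=Ms\.))[^.]*)*\./';

preg_match_all($combined, $subject, $a);
preg_match_all($discrete, $subject, $b);

// Both patterns are expected to produce identical matches.
var_dump($a[0] === $b[0]);
print_r($a[0]);
```

Both patterns should split the input into "Dr. Lee met Prof. Tan." and "Mrs. Wu left.", so the comparison is expected to print bool(true).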
\.(?=\d) seems to be sufficient for identifying a dot that's part of a number. If you really have to check for digits before and after the dot, you can use (?=(?<=\d\.)\d) instead.
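The difference between the two number checks only shows up on unusual input. The sketch below uses a made-up paragraph in which a sentence starts with a digit right after a dot, with no space in between; $loose and $strict are illustrative names for the two variants.

```php
<?php
// Hypothetical input: "2013" follows the dot directly, but starts a new sentence.
$subject = 'It ended.2013 was good. The price is Rp 40.000.';

// Variant from the answer: a dot followed by a digit is part of a number.
$loose = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./';

// Stricter variant: the dot must have a digit on both sides.
$strict = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=(?<=\d\.)\d))[^.]*)*\./';

preg_match_all($loose, $subject, $looseResult);
preg_match_all($strict, $subject, $strictResult);

print_r($looseResult[0]);  // "It ended.2013 was good." stays in one piece
print_r($strictResult[0]); // "It ended." and "2013 was good." are separated
```

In both variants "Rp 40.000" stays intact, because that dot has a digit on each side.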
If this is for anything more serious than a homework problem, you should discard regexes and look for a natural-language processing library. Crude as all this is, it's very close to the limit of what you can do with regexes.