RegEx to exclude numbers using PHP

This question is a continuation of my previous question:

RegEx to exclude academic title

I want to split a paragraph into an array of sentences with a regular expression, using the dot (.) as the sentence delimiter. The next problem is dots that appear inside numbers.

Here is an example:

In this year 2013. Hello Mr. Andre, your money is Rp 40.000.

The correct output would be:

Array
(
    [0] => In this year 2013
    [1] => Hello Mr. Andre, your money is Rp 40.000
)

The title problem (Mr.) was already solved in my previous question. I tried adding a pattern for numbers, but it still doesn't work.

My non-working code:

$titles_number=array('(^[0-9]*)','(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles_number).')\./',$text);
print_r($sentences);

Can I do this in one go (one regex that handles both problems)? Tell me if it can't be done. Thanks in advance.

Answers


This will be easier to accomplish with preg_match_all():

preg_match_all(
    '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./',
    $subject, $result, PREG_PATTERN_ORDER);
print_r($result[0]);

explanation:

  • [^\s.] matches the first character of a sentence: anything that is neither whitespace nor a dot (i.e., it skips over any whitespace between sentences)
  • [^.]* gobbles up any non-dot characters
  • \. matches a dot IF...
  • (?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) ...it's part of an honorific...
  • (?=\d) ...or part of a number
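
To see the pieces above working together, here is a minimal sketch that runs the pattern on the sample text from the question. Each match still includes the sentence-final dot, so the sketch strips it to get the output the question asks for; that stripping step and the variable names $pattern and $sentences are my additions, not part of the snippet above.

```php
<?php
// Sample text from the question.
$subject = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';

// Same pattern as in the answer.
$pattern = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./';

preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER);

// Every match ends with the dot that closes the sentence.
// Drop exactly that one character (assumed post-processing step).
$sentences = array_map(
    function ($s) {
        return substr($s, 0, -1);
    },
    $result[0]
);

print_r($sentences);
```

With this input, $sentences should hold the two sentences from the question: "In this year 2013" and "Hello Mr. Andre, your money is Rp 40.000".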

notes:

  1. (?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) is legal because the alternation is at the top level. That is, it acts like several discrete lookbehinds, each with a fixed length. That's why I had to repeat the \. in every branch instead of using (?<=(?:Prof|Dr|Mr|Mrs|Ms)\.).

  2. \.(?=\d) seems to be sufficient for identifying a dot that's part of a number. If you really have to check for digits before and after the dot, you can use (?=(?<=\d\.)\d) instead.

  3. If this is for anything more serious than a homework problem, you should discard regexes and look for a natural-language processing library. Crude as all this is, it's very close to the limit of what you can do with regexes.
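
The difference between the two number checks in note 2 only shows up when a dot is followed by a digit but not preceded by one. Here is a rough sketch comparing them; the input string and the variable names ($honorifics, $lenient, $strict) are made up for illustration.

```php
<?php
// Hypothetical input: in "No.5" the dot is followed by a digit
// but not preceded by one; in "3.50" it has digits on both sides.
$text = 'Item No.5 costs 3.50. Done.';

// Honorific lookbehind shared by both variants.
$honorifics = '(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)';

// Variant from the answer: a dot followed by a digit is kept inside the sentence.
$lenient = '/[^\s.][^.]*(?:\.(?:' . $honorifics . '|(?=\d))[^.]*)*\./';

// Variant from note 2: the dot needs a digit on both sides.
$strict = '/[^\s.][^.]*(?:\.(?:' . $honorifics . '|(?=(?<=\d\.)\d))[^.]*)*\./';

preg_match_all($lenient, $text, $lenientResult);
preg_match_all($strict, $text, $strictResult);

print_r($lenientResult[0]);
print_r($strictResult[0]);
```

The lenient variant keeps "Item No.5 costs 3.50." as one sentence, while the strict one ends a sentence at "Item No." and starts the next at "5 costs 3.50.". Which behaviour you want depends on your text.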

