Java Regex: Replace character unless preceded by other character

I am using Java and regular expressions and need to split some data into multiple entities. In my input a single quote character (') marks the end of an entity UNLESS it's preceded by the escape character, which is a question mark (?).

My RegEx is (?<!\\?)\\' and I'm using a Scanner to split the input into separate entities. So the following cases work correctly:

Hello'There  becomes 2 entities: Hello and There
Hello?'There remains 1 entity:   Hello?'There

However when I encounter the case where I want to escape the question mark it doesn't work. So:

Hello??'There     should become 2 entities:   Hello?? and There
Hello???'There    should become 1 entity:     Hello???'There
Hello????'There   should become 2 entities:   Hello???? and There
Hello?????'There  should become 1 entity:     Hello?????'There
Hello?????There   should become 1 entity:     Hello?????There
Hello??????There  should become 1 entity:     Hello??????There

Thus the rule is: if a quote is preceded by an even number of question marks (including zero), the input should be split there. If it is preceded by an odd number of question marks, the quote is escaped and the input should not be split.

Can someone help fix my Regex (hopefully with an explanation!) to cope with the multiple cases?

Thanks,

Phil

Answers


Don't use split() for this. That seems like the obvious solution, but it's much easier to match the entities themselves than it is to match the delimiters. Most of the regex-enabled languages have built-in methods for this, like Python's findall() or Ruby's scan(), but in Java we're still stuck with writing boilerplate. Here's an example:

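import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDemo
{
  public static void main(String[] args)
  {
    // An entity is a run of ordinary characters (anything but ? and ')
    // or escape pairs (a ? followed by any one character).
    Pattern p = Pattern.compile("([^?']|\\?.)+");
    String[] inputs = {
        "Hello??'There",
        "Hello???'There",
        "Hello????'There",
        "Hello?????'There",
        "Hello?????There",
        "Hello??????There"
    };
    for (String s : inputs)
    {
      System.out.printf("%n%s :%n", s);
      Matcher m = p.matcher(s);
      // Each successful find() yields one complete entity.
      while (m.find())
      {
        System.out.printf("  %s%n", m.group());
      }
    }
  }
}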

output:

Hello??'There :
  Hello??
  There

Hello???'There :
  Hello???'There

Hello????'There :
  Hello????
  There

Hello?????'There :
  Hello?????'There

Hello?????There :
  Hello?????There

Hello??????There :
  Hello??????There
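
If you need the entities as a collection rather than printed output, the same matching loop can be wrapped in a small helper. This is only a sketch: the class name EntityMatcher and the method name entities are made up for illustration, and it assumes the same pattern as above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from any library.
class EntityMatcher {
    // One or more of: a character that is neither ? nor ',
    // or a ? followed by the single character it escapes.
    private static final Pattern ENTITY = Pattern.compile("([^?']|\\?.)+");

    // Returns every entity found in the input, in order.
    static List<String> entities(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(entities("Hello??'There"));   // [Hello??, There]
        System.out.println(entities("Hello???'There"));  // [Hello???'There]
    }
}
```

Because the pattern requires at least one character per match, two adjacent unescaped quotes produce no empty entity between them. If empty entities matter in your data, you would need to handle that case separately.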

The arbitrary-max-length gimmick Thomas used, besides being a disgusting hack (no offense intended, Thomas!), is unreliable because they keep introducing bugs into the Pattern.java code that handles that stuff. But don't think of this solution as another workaround; lookbehinds should never be your first resort, even in flavors like .NET where they work reliably and restriction-free.


Try this expression to match even cases: (?<=[^\?](?>\?\?){0,1000})'

  • (?<=...)' is a positive lookbehind, i.e. every ' which is preceded by the expression between (?<= and ) will match
  • (?>\?\?) is an atomic group of 2 consecutive question marks
  • (?>\?\?){0,1000} means there can be 0 to 1000 of those groups. Note that you can't write (?>\?\?)* since the expression needs to have a maximum length (a maximum number of groups). However, you should be able to increase the upper bound by a lot, depending on the rest of the expression
  • [^\?](?>\?\?)... means the groups of 2 question marks must be preceded by some character but not a question mark (otherwise you'd match the odd case)
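
For reference, here is roughly how that expression could be used from Java with String.split(). Treat it as a sketch: the class name LookbehindSplit is made up, the backslashes are doubled because the regex lives in a Java string literal, and [^\?] is written as [^?] since a question mark needs no escape inside a character class.

```java
import java.util.Arrays;

// Hypothetical demo class; the name is illustrative only.
class LookbehindSplit {
    // A quote preceded by a non-? character and then 0 to 1000 pairs of ?.
    static final String DELIMITER = "(?<=[^?](?>\\?\\?){0,1000})'";

    static String[] split(String input) {
        return input.split(DELIMITER);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "Hello'There", "Hello?'There", "Hello??'There", "Hello???'There"
        };
        for (String s : inputs) {
            System.out.println(s + " -> " + Arrays.toString(split(s)));
        }
    }
}
```

Note one limitation of this form: because [^?] has to consume a character, a quote at the very start of the input, or one preceded only by question marks back to the start of the input, is not treated as a delimiter.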
