Java Regex: Replace character unless preceded by other character

I am using Java and Regular Expressions and need to split some data into multiple entities. In my input a single quote character (') marks the end of an entity UNLESS it's preceded by the escape character, which is a question mark (?).

My RegEx is (?<!\\?)\\' and I'm using a Scanner to split the input into separate entities. So the following cases work correctly:

Hello'There  becomes 2 entities: Hello and There
Hello?'There remains 1 entity:   Hello?'There
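
A minimal sketch of this setup, for reference. The class and method names (ScannerSplitSketch, entities) are illustrative assumptions, not the original code; only the delimiter regex is taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerSplitSketch {
    // Delimiter from the question: a quote not immediately preceded by '?'.
    static final String DELIMITER = "(?<!\\?)\\'";

    // Hypothetical helper: collects every token the Scanner produces.
    static List<String> entities(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(entities("Hello'There"));
        System.out.println(entities("Hello?'There"));
        System.out.println(entities("Hello??'There"));
    }
}
```

With this delimiter the third input stays a single entity, because the lookbehind only inspects the one character directly before the quote.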

However when I encounter the case where I want to escape the question mark it doesn't work. So:

Hello??'There     should become 2 entities:   Hello?? and There
Hello???'There    should become 1 entity:     Hello???'There
Hello????'There   should become 2 entities:   Hello???? and There
Hello?????'There  should become 1 entity:     Hello?????'There
Hello?????There   should become 1 entity:     Hello?????There
Hello??????There  should become 1 entity:     Hello??????There

Thus the rule is if there are an even number of question marks, followed by a quote, it should be split. If there are an odd number of question marks then it should not split.

Can someone help fix my Regex (hopefully with an explanation!) to cope with the multiple cases?




Don't use split() for this. That seems like the obvious solution, but it's much easier to match the entities themselves than it is to match the delimiters. Most of the regex-enabled languages have built-in methods for this, like Python's findall() or Ruby's scan(), but in Java we're still stuck with writing boilerplate. Here's an example:

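import java.util.regex.Matcher;
import java.util.regex.Pattern;

// An entity is a run of ordinary characters and escape pairs ('?' plus any one character).
Pattern p = Pattern.compile("([^?']|\\?.)+");
String[] inputs = { "Hello??'There", "Hello???'There", "Hello????'There",
                    "Hello?????'There", "Hello?????There", "Hello??????There" };
for (String s : inputs) {
  System.out.printf("%n%s :%n", s);
  Matcher m = p.matcher(s);
  while (m.find())
    System.out.printf("  %s%n", m.group());  // each match is one entity
}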


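Output:

Hello??'There :
  Hello??
  There

Hello???'There :
  Hello???'There

Hello????'There :
  Hello????
  There

Hello?????'There :
  Hello?????'There

Hello?????There :
  Hello?????There

Hello??????There :
  Hello??????There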
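
If you need the entities as a collection rather than printed, the same matching loop can be wrapped in a small helper. This is only a sketch: the class and method names (EntityMatcher, entities) are made up for illustration, and the pattern is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityMatcher {
    // One entity: a run of ordinary characters and escape pairs
    // ('?' followed by any single character).
    private static final Pattern ENTITY = Pattern.compile("([^?']|\\?.)+");

    // Hypothetical helper: returns each match of ENTITY in order.
    static List<String> entities(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(entities("Hello??'There"));
        System.out.println(entities("Hello???'There"));
    }
}
```

Two caveats with this sketch: a dangling ? at the very end of the input is dropped, since \?. needs a character after the question mark, and empty entities between adjacent quotes are skipped rather than returned as empty strings.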

The arbitrary-max-length gimmick Thomas used, besides being a disgusting hack (no offense intended, Thomas!), is unreliable in Java because bugs keep turning up in the code that handles variable-length lookbehinds. But don't think of this solution as just another workaround; lookbehinds should never be your first resort, even in flavors like .NET where they work reliably and without restrictions.

Try this expression to match the even cases, i.e. a quote preceded by an even number of question marks: (?<=[^\?](?>\?\?){0,1000})'

  • (?<=...)' is a positive lookbehind, i.e. every ' that is preceded by a match for the expression between (?<= and ) will match
  • (?>\?\?) is an atomic group of 2 consecutive question marks
  • (?>\?\?){0,1000} means there can be 0 to 1000 of those groups. Note that you can't write (?>\?\?)* since the expression needs to have a maximum length (a maximum number of groups). However, you should be able to increase the upper bound by a lot, depending on the rest of the expression
  • [^\?](?>\?\?)... means the groups of 2 question marks must be preceded by some character but not a question mark (otherwise you'd match the odd case)
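
In Java source the backslashes have to be doubled. A rough sketch of using the expression as a split delimiter follows; the class and method names (LookbehindSplit, entities) are only illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LookbehindSplit {
    // A quote preceded by a non-'?' character and then 0 to 1000 pairs of '?'.
    private static final Pattern DELIMITER =
            Pattern.compile("(?<=[^\\?](?>\\?\\?){0,1000})'");

    // Hypothetical helper: splits the input at every unescaped quote.
    static List<String> entities(String input) {
        return Arrays.asList(DELIMITER.split(input));
    }

    public static void main(String[] args) {
        System.out.println(entities("Hello??'There"));
        System.out.println(entities("Hello???'There"));
        System.out.println(entities("Hello????'There"));
    }
}
```

Because the lookbehind requires one non-? character before the pairs, a quote at the very start of the input (or one preceded only by question marks) is not treated as a delimiter by this sketch.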
