Split complicated strings using Java and Regex

Using Java and regex, I want to extract strings from a line of text. The text can be in any of the following formats:

  1. key1(value1) key2(value2)
  2. key1(value1) key2
  3. key1 key2(value2)
  4. key1 key2
  5. key1

I can successfully extract the keys and values when format #1 is used: I split the text on spaces and then use the following pattern to extract them:

Pattern p = Pattern.compile("\\((.*?)\\)",Pattern.DOTALL);

Complicated logic that counts the occurrences of "(" and matches them against the occurrences of spaces could handle cases #2 and #3, but the code becomes far too long. Further complications arise when the values contain spaces too, because splitting the text on spaces then becomes problematic.

Is there a better regex for splitting/holding that I can use for the cases shown above?

Answers


Consider the following PowerShell example of a universal regex.

(?<=^|[\s)\n])[\n]*([^(\n\s]*)([(]([^)\n]*)[)])?

Example

    $Matches = @()
    $String = 'key1(value1) key2(value2)
key3(value3) key3.5
key4 key5(value5)  GoofyStuff(I like kittens)
key6 key7 ForReal-Things(be sure to vote)
key8'
    Write-Host start with 
    write-host $String
    Write-Host
    Write-Host found
    ([regex]'(?<=^|[\s)\n])([^(\n\s]*)([(]([^)\n]*)[)])?').matches($String) | foreach {
        # Group 1 holds the key; skip matches where it is empty
        if ($_.Groups[1].Value) {
            write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'"
            # Group 3 holds the optional value inside the parentheses
            if ($_.Groups[3].Value) {
                write-host "value at $($_.Groups[3].Index) = '$($_.Groups[3].Value)'"
            } # end if
        } # end if
    } # next match

Yields

start with
key1(value1) key2(value2)
key3(value3) key3.5
key4 key5(value5)  GoofyStuff(I like kittens)
key6 key7 ForReal-Things(be sure to vote)
key8

found
key at 0 = 'key1'
value at 5 = 'value1'
key at 13 = 'key2'
value at 18 = 'value2'
key at 27 = 'key3'
value at 32 = 'value3'
key at 40 = 'key3.5'
key at 48 = 'key4'
key at 53 = 'key5'
value at 58 = 'value5'
key at 67 = 'GoofyStuff'
value at 78 = 'I like kittens'
key at 95 = 'key6'
key at 100 = 'key7'
key at 105 = 'ForReal-Things'
value at 120 = 'be sure to vote'
key at 138 = 'key8'

Summary

  • (?<=^|[\s)\n]*) looks for the beginning of a key: each key is assumed to be at the start of the string, or right after a \n, ")", or space. This variant might not work in Java, as there is a bug/feature in how Java handles lookarounds of undefined size.
  • (?<=^|[\s)\n]) is the same lookbehind without the trailing *, so its size is bounded. This lookaround seems to work in C# and PowerShell.

  • ([^(\n\s]*) returns all characters up to the next "(", \n, or \s

  • ([(]([^)\n]*)[)])? returns the value inside the parentheses if it exists

    The extra logic inside the loop tests the groups of each match to validate that a key name or value was found. In PowerShell the automatic $Matches variable is populated by the -match operator; this example instead iterates over the collection returned by the .matches() call.
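
Since the question asks for Java, here is a minimal sketch of the same idea in Java. It is not a direct port of the pattern above: it drops the lookbehind entirely and requires at least one key character (+ instead of *), which sidesteps the Java lookbehind concern mentioned in the summary. The class name KeyValueExtractor and the method parse are hypothetical names chosen for illustration, and the sketch assumes that keys contain no whitespace or parentheses and that values contain no ")".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class KeyValueExtractor {
    // Group 1: the key (one or more characters that are not whitespace or parentheses).
    // Group 2: the optional value between "(" and ")"; it may contain spaces.
    private static final Pattern PAIR =
            Pattern.compile("([^\\s()]+)(?:\\(([^)]*)\\))?");

    // Returns one {key, value} array per key found; value is null when the key has none.
    public static List<String[]> parse(String text) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "key1(value1) key2 GoofyStuff(I like kittens)";
        for (String[] pair : parse(line)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Because Matcher.find() scans forward from the end of the previous match, the text never has to be split on spaces first, so spaces inside a value are not a problem. A key without parentheses simply yields null for group 2.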

