Regex to extract everything inside quotes

I am trying to write a regex to match all strings which appear in between enclosing characters (most likely " - double quotes). This is a scenario I commonly encounter while trying to parse a line in a csv file.

So I have a sample line like:

"Smith, John",25,"21/45, North Avenue",IBM

I tried the following regex:

"(.*)"

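But it fetches something like the following, a single match running from the first quote on the line to the last:

"Smith, John",25,"21/45, North Avenue"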

I am expecting output as follows:

Smith, John
25
21/45, North Avenue
IBM

The regex I have written is an attempt to capture what comes between " in my example. However, above is the output I am expecting.

There is a kind of ambiguity though: I am not looking for a match like: ,25,. This kinda makes me wonder if a regex is even feasible here.

What is the correct way to write this?

Answers


If you really want to roll your own CSV parser, you'll need to teach your regex a few rules:

  1. A field may be unquoted as long as it doesn't contain quotes, commas or newlines.
  2. A quoted field may contain any characters; quotes are escaped by doubling.
  3. Commas are used as separators.

So, to match one CSV field, you can use the following regex:

(?mx)       # Verbose, multiline mode
(?<=^|,)    # Assert there is a comma or start of line before the current position.
(?:         # Start non-capturing group:
 "          # Either match an opening quote, followed by
 (?:        # a non-capturing group:
  ""        #  Either an escaped quote
 |          #  or
  [^"]+     #  any characters except quotes
 )*         # End of inner non-capturing group, repeat as needed.
 "          # Match a closing quote.
|           # OR
 [^,"\r\n]+ # Match any number of characters except commas, quotes or newlines
)           # End of outer non-capturing group
(?=,|$)     # Assert there is a comma or end-of-line after the current position

See it live on regex101.com.
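
As a rough sketch of how these rules might be applied from Python: Python's re module rejects the lookbehind (?<=^|,) because its two alternatives differ in width, so the version below swaps in (?:^|(?<=,)), which I assume is equivalent for this purpose. The names FIELD and split_fields are made up for illustration, and the quote stripping and un-doubling are an extra step the regex itself does not perform.

```python
import re

# Adaptation for Python: (?<=^|,) is replaced by (?:^|(?<=,)) because
# re only accepts fixed-width lookbehinds. Assumed equivalent here.
FIELD = re.compile(r'''
    (?:^|(?<=,))            # start of line, or just after a comma
    (?:
        "(?:""|[^"]+)*"     # quoted field; "" is an escaped quote
      |
        [^,"\r\n]+          # unquoted field: no commas, quotes or newlines
    )
    (?=,|$)                 # followed by a comma or end of line
''', re.VERBOSE | re.MULTILINE)


def split_fields(line):
    """Return the fields of one CSV line (hypothetical helper)."""
    fields = []
    for match in FIELD.finditer(line):
        text = match.group(0)
        if text.startswith('"'):
            # Drop the enclosing quotes and un-double escaped quotes.
            text = text[1:-1].replace('""', '"')
        fields.append(text)
    return fields


print(split_fields('"Smith, John",25,"21/45, North Avenue",IBM'))
# prints ['Smith, John', '25', '21/45, North Avenue', 'IBM']
```

Note that, like the regex above, this skips empty unquoted fields (two commas in a row), since the unquoted branch requires at least one character.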


Please don't use regex for this; CSV should be handled by a parser.

Here is a ready-to-use parser: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader

You can also use the OLEDB built-in parser: http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser
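
The same advice carries over to other languages. As a small sketch, assuming you are working in Python rather than C#, the standard library's csv module handles the sample line from the question without any regex:

```python
import csv
import io

line = '"Smith, John",25,"21/45, North Avenue",IBM'

# csv.reader accepts any iterable of lines; StringIO wraps the single line.
rows = list(csv.reader(io.StringIO(line)))
fields = rows[0]

print(fields)
# prints ['Smith, John', '25', '21/45, North Avenue', 'IBM']
```

The reader also copes with doubled quotes inside quoted fields, which is where hand-written patterns tend to go wrong.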

Hope this helps


Firstly, that will only capture one group. Secondly, you need to be non-greedy:

(?:"(.*?)")
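
To see why greediness matters, here is a short Python comparison of the pattern from the question against the non-greedy one, run on the sample line. The variable names greedy and lazy are just for illustration.

```python
import re

line = '"Smith, John",25,"21/45, North Avenue",IBM'

# Greedy: .* runs to the end of the line, then backs up to the LAST quote.
greedy = re.findall(r'"(.*)"', line)

# Non-greedy: .*? stops at the first closing quote it can reach.
lazy = re.findall(r'"(.*?)"', line)

print(greedy)  # ['Smith, John",25,"21/45, North Avenue']
print(lazy)    # ['Smith, John', '21/45, North Avenue']
```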

The pattern alone does not solve your problem of multiple matches in a single line; for that you need to apply it repeatedly to the input. Here are two examples:

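import re

# Sample CSV line from the question
string = '"Smith, John",25,"21/45, North Avenue",IBM'
pattern = r'(?:"(.*?)")'
# findall returns the contents of the capture group for every match
print(re.findall(pattern, string))
# prints ['Smith, John', '21/45, North Avenue']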

In C#:

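using System;
using System.Text.RegularExpressions;

string pattern = @"(?:""(.*?)"")";
string input = @"""Smith, John"",25,""21/45, North Avenue"",IBM";
// Groups[1] holds the text between the quotes; m.Value would include the quotes.
foreach (Match m in Regex.Matches(input, pattern))
    Console.WriteLine("'{0}' found at index {1}.", m.Groups[1].Value, m.Index);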
