Splitting a string after punctuation while including punctuation

I'm trying to split a string of words into a list of words via regex. I'm still a bit of a beginner with regular expressions.

I'm using nltk.regexp_tokenize, which is yielding results that are close, but not quite what I want.

This is what I have so far:

>>> import re, codecs, nltk
>>> sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"    
>>> pattern = r"""(?x)
    #words with internal hyphens
    | \w+(-\w+)*
    #ellipsis
    | \.\.\.
    #other punctuation tokens
    | [][.,;!?"'():-_`]
    """ 
>>> nltk.regexp_tokenize(sentence.decode("utf8"), pattern)
[u'd\xe9test\xe9', u'Rochard', u'!', u'm', u"'", u'\xe9tais', u'\xe0', u'qu', u"'", u'on', u'...', u"'", u'C', u"'", u'est', u'hyper-cool', u'.', u"'", u':', u')', u':', u'P']

I would like to have the output as follows:

[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0', u"qu'", u'on', u'...', u"'", u"C'", u'est', u'hyper-cool', u'.', u"'", u':)', u':P']

I have a workaround for the "emoticons", so what I'm most concerned with are quotes.

Answers


It seems that the desired output is not consistent with your input sentence:

  1. [u"qu'", u'on']: I can't see where these two tokens come from; neither appears in your sentence.
  2. Why is u'.' not part of u'hyper-cool'? (Assuming you want the punctuation as part of the word.)
  3. Why is u"'" not part of u"C'"? (Assuming you want the punctuation as part of the word.)

Also, if you just want a regex split, is there any reason to use nltk apart from splitting the lines? I have no experience with nltk, so I am proposing a pure regex solution.

>>> sentence
u"d\xe9test\xe9 Rochard ! m'\xe9tais \xe0... 'C'est hyper-cool.' :) :P"
>>> pattern=re.compile(
    u"(" #Capturing group
    "(?:" #Non-capturing group
    "[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]?" #0-1 punctuation
    "[\w\-]+"                           #Alphanumeric Unicode word with hyphen
    "[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]?" #0-1 punctuation
    ")"
    "|(?:[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]+)" #1 or more punctuation
    ")",re.UNICODE)
>>> pattern.findall(sentence)
[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0.', u'..', u"'C'", u'est', u'hyper-cool.', u"'", u':)', u':P']

See if this works for you.
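If the goal is the exact token list from the question (elided words such as m' and C' keeping their apostrophe, everything else split off), an alternation with a lookahead is one way to get there. This is only a sketch, assuming Python 3 (where str is already Unicode, so no decode step is needed) and plain re.findall rather than nltk. The name tokenize and the small emoticon alternative are illustrative assumptions, not part of the original code.

```python
import re

# Order matters: the first alternative that matches at a position wins.
TOKEN_RE = re.compile(r"""
    \w+'(?=\w)            # elided word keeps its apostrophe: m', qu', C'
  | \w+(?:-\w+)*          # words, optionally with internal hyphens
  | \.\.\.                # ellipsis
  | :[)(PDp]              # a few simple emoticons (illustrative only)
  | [\]\[.,;!?"'():_`-]   # any other single punctuation mark
""", re.VERBOSE | re.UNICODE)

def tokenize(text):
    """Return the tokens of text; whitespace is skipped implicitly."""
    return TOKEN_RE.findall(text)

sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"
print(tokenize(sentence))
```

The lookahead (?=\w) is what separates an elision apostrophe (followed by a letter) from a closing quote (followed by a space or the end of the string). All groups are non-capturing, so findall returns whole matches.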

If you need more information on capturing groups, non-capturing groups, character classes, Unicode matching, and findall, I would suggest taking a look at the documentation for Python's re package. Also, I am not sure whether the way you continue the string across multiple lines is appropriate in this scenario. If you need more information on splitting a string across lines (as opposed to multi-line strings), please have a look into this.
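Two of those points are easy to trip over, so here is a small sketch of each, assuming Python 3 and only the standard re module. The variable names are made up for the illustration.

```python
import re

text = "hyper-cool"

# With a capturing group, findall returns the group, not the whole match.
captured = re.findall(r"\w+(-\w+)*", text)
print(captured)        # ['-cool']

# A non-capturing group (?:...) makes findall return the whole match.
whole = re.findall(r"\w+(?:-\w+)*", text)
print(whole)           # ['hyper-cool']

# Inside a character class, ":-_" is a range from ':' to '_',
# which happens to include the uppercase ASCII letters.
as_range = re.findall(r"[:-_]", "a-B_c")
print(as_range)        # ['B', '_']

# Putting the hyphen last makes it a literal character.
as_literal = re.findall(r"[:_-]", "a-B_c")
print(as_literal)      # ['-', '_']
```

The character classes in both the question and the answer contain :-_ in that order, so they also match uppercase letters and never match a literal hyphen through that part of the class.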

