Matching across a line vs matching words regex

Why is it that when I match across newlines, I can no longer seem to identify individual words? For example:

content = "COAL_STORIES
AUSTRALIA - blah blah blah
BOTSWANA – blah blah blah 

URANIUM_STORIES 
AUSTRALIA – blah
INDIA - blah

COPPER_STORIES
AUSTRALIA - blah blah blah
AUSTRALIA - blah blah blah
CHINA - blah blah blah

ALUMINIUM_STORIES"


sections = content.scan(/\w.*_.*\b/)

gives an array:

    [0] "COAL_STORIES",

But if I try that using the 'm' flag everything gets matched:

sections = content.scan(/\w.*_.*\b/m) gives an array:

    [0] "COAL_STORIES\nAUSTRALIA - blah blah blah\nBOTSWANA – blah blah blah \n\nURANIUM_STORIES \nAUSTRALIA – blah\nINDIA - blah\n\nCOPPER_STORIES\nAUSTRALIA - blah blah blah\nAUSTRALIA - blah blah blah\nCHINA - blah blah blah\n\nALUMINIUM_STORIES"

As far as I can tell I'm still looking for the same word boundaries?


To elaborate on Casimir's comment:

.* is greedy: it will match the longest possible string it can, including newlines if you let it (which you did by adding the m flag, as in /.../m; in Ruby that flag makes . match newline characters too).

In your first example .* will not match newlines, so \b is forced to match a word boundary on the same line as where \w matched.
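
A minimal sketch of that difference, using a made-up two-line string rather than your data (the variable names are only for illustration):

```ruby
# Hypothetical two-line input, not taken from the question.
text = "foo_bar\nbaz_qux"

# Without the m flag, . stops at the newline, so the greedy .* can only
# back off to the underscore on the first line.
single_line = text[/f.*_/]   # => "foo_"

# With the m flag, . also matches "\n", so the greedy .* runs to the end
# of the string and backs off to the LAST underscore.
multi_line = text[/f.*_/m]   # => "foo_bar\nbaz_"

puts single_line.inspect
puts multi_line.inspect
```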

In your second example .* will match across lines, so when \w matches your first character, \b is free to match any word boundary, even many lines away, as long as there's an _ somewhere between the two. Specifically, for you, it looks like:

  • \w matched the first character in your input: "C" from "COAL_STORIES"
  • .* matched everything after that, up to and including "ALUMINIUM" on the last line
  • _ matched "_"
  • .* matched "STORIES"
  • \b matched the end of "STORIES"
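
You can check the steps above with a short script. This is only a sketch: the string is rebuilt from the output you posted, with the dashes normalised to plain hyphens, and the last pattern is one possible alternative I'm assuming fits your data, not the only way to do it.

```ruby
# Input rebuilt from the output in the question (dashes normalised to "-").
content = "COAL_STORIES\nAUSTRALIA - blah blah blah\nBOTSWANA - blah blah blah \n\n" \
          "URANIUM_STORIES \nAUSTRALIA - blah\nINDIA - blah\n\n" \
          "COPPER_STORIES\nAUSTRALIA - blah blah blah\nAUSTRALIA - blah blah blah\n" \
          "CHINA - blah blah blah\n\n" \
          "ALUMINIUM_STORIES"

# No m flag: . cannot cross a newline, so each match stays on one line.
without_m = content.scan(/\w.*_.*\b/)

# m flag: the first .* swallows everything up to the last underscore,
# so the whole string comes back as a single match.
with_m = content.scan(/\w.*_.*\b/m)

# Assumed alternative: anchor to the start of a line and allow only word
# characters, so nothing in the pattern can span a newline.
headings = content.scan(/^\w+_\w+/)

p without_m
p with_m.length
p headings
```

In Ruby, ^ matches at the start of every line regardless of flags, so the anchored pattern does not need the m flag at all.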

