Using ? with sed
I just want to get the number of a file that may or may not be gzip'd. However, it appears that a regular expression in sed does not support a ?. Here's what I tried:
echo 'file_1.gz'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and nothing was returned. Then I added a ? to the string being analyzed:
echo 'file_1.gz?'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
So, it looks like the ? used in most regex's is not supported in sed, right? Well then, I would just like sed to give a 1 for file_1 and file_1.gz. What's the best way to do that in a bash script if execution time is critical?
The equivalent to x? is \(x\|\).
However, many versions of sed support an option to enable "extended regular expressions" which includes ?. In GNU sed the flag is -r. Note that this also changes unescaped parens to do grouping. eg:
echo 'file_1.gz'|sed -n -r 's/.*_(.*)(\.gz)?/\1/p'
Actually, there's another bug in your regex which is that the greedy .* in the parens is going to swallow up the ".gz" if there is one. sed doesn't have a non-greedy equivalent to * as far as I know, but you can use | to work around this. | in sed (and many other regex implementations) will use the leftmost match that works, so you can do something like this:
echo 'file_1.gz'|sed -r 's/(.*_(.*)\.gz)|(.*_(.*))/\2\4/'
This tries to match with .gz, and only tries without it if that doesn't work. Only one of group 2 or 4 will actually exist (since they are on opposite sides of the same |) so we just concatenate them to get the value we want.