Regex/wildcard replace on a string in PHP
I have a mass of text that gets loaded into the header, and within it lies this link.
<link rel="canonical" href="could_be_anything_here_at_all" />
I'm looking to replace it with a new value, but the href changes from page to page, so a simple str_replace isn't possible.
I've looked at using preg_replace, but can't get my head around what seems like a simple problem.
$regex = '/(^<link rel="canonical")(\/>$)/';
$match = preg_match_all($regex, $content, $matches);
var_dump($matches);
- The / / start and end the expression?
- The () indicate separate 'expressions' which have to be matched for the string to be returned?
- The ^ filters for results that begin with the following string?
- The $ filters for results that end with the following string?
So I'm looking for a string that begins with <link rel="canonical" and ends with />
I've shown the steps I'm after, and my stab at it. Please help me write and ultimately understand how to do it. I'm really at a loss on this one.
The regular expression you've written is all over the place. Let's go over the pattern:
Whatever happens, it will begin with <link and end with a ></link> or /> (gotta account for those pesky non-respecting-of-standards web buccaneers). You're looking for the rel attribute, if it has one, and it needs to be canonical.
We can start writing the regular expression: #<link([^>]+)(/>|></link>)#is. This will match every link tag that is closed with /> or ></link>. You can then inspect the attributes captured by the first group using simple strpos calls.
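As a rough sketch of that two-step approach: match every link tag first, then check the captured attribute text for rel="canonical". The $content value here is a made-up sample, and I've used stripos instead of strpos so the check ignores case; both are assumptions for illustration.

```php
<?php
// Hypothetical sample header text standing in for the real $content.
$content = '<link rel="stylesheet" href="style.css" />' . "\n"
         . '<link rel="canonical" href="could_be_anything_here_at_all" />';

// Group 1 captures the attribute text, group 2 the tag ending.
preg_match_all('#<link([^>]+)(/>|></link>)#is', $content, $matches, PREG_SET_ORDER);

$canonicalTag = null;
foreach ($matches as $m) {
    // $m[0] is the whole tag, $m[1] the attributes between "<link" and the ending.
    if (stripos($m[1], 'rel="canonical"') !== false) {
        $canonicalTag = $m[0];
        break;
    }
}

var_dump($canonicalTag);
```

Once you have the full tag in $canonicalTag, a plain str_replace of that exact string becomes possible again.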
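If you are sure that rel="canonical" will be the first attribute of the link tag, you can expand the regular expression further into #<link rel="canonical" href="?'?([^"']+)"?'?\s*(/>|></link>)#is. The \s* allows for the space that sits before /> in your example; without it the pattern would not match that tag. This matches the attributes in that exact order, which is fine only if you are sure the order never changes.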
In order of appearance:
[^>]+ matches anything but a > character one or more times
the i and s flags stand for: case-insensitive, and let . match newlines as well (s has no effect here, since the pattern contains no dot)
"?'? matches 0 or 1 ", followed by 0 or 1 '
If anything else is unclear, let me know.
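For the replacement itself, here is a minimal sketch using preg_replace_callback. It is only one way to do it and makes a few assumptions: rel comes before href, the href value is double-quoted as in your example, and $content and $newHref are hypothetical sample values. The pattern is a variation that captures the text around the href value so it can be put back unchanged.

```php
<?php
// Hypothetical sample input; in the question this is the loaded header text.
$content = '<head><title>Demo</title>'
         . '<link rel="canonical" href="could_be_anything_here_at_all" />'
         . '</head>';

// Hypothetical replacement URL.
$newHref = 'https://example.com/new-page';

// Group 1 keeps everything up to the opening quote of href,
// group 2 keeps the closing quote and the tag ending.
$pattern = '#(<link\s+rel="canonical"\s+href=")[^"]*("\s*(?:/>|></link>))#i';

// A callback avoids having to escape "$" or "\" inside the replacement URL.
$updated = preg_replace_callback(
    $pattern,
    function (array $m) use ($newHref) {
        return $m[1] . htmlspecialchars($newHref, ENT_QUOTES) . $m[2];
    },
    $content
);

echo $updated, PHP_EOL;
```

Only the part between the quotes is swapped; the rest of the tag stays exactly as it was found.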
Edit: to answer your questions
- The / / start and end the expression? They're called delimiters, and they "encase" the expression. PHP's preg functions use PCRE (Perl-compatible regular expressions), which lets you set flags on the expression (i, s, m, x, etc.), and those have to sit outside the expression. They go after the closing delimiter, and this is the point of the delimiter. You can use any character that isn't alphanumeric, a backslash, or whitespace: the first character of the pattern is taken as the delimiter, and if it appears inside the expression it has to be escaped with a backslash. People tend to use / because JS uses that character for regex literals; I tend to prefer # in PHP to avoid the / ambiguities arising from closing HTML tags.
- The () indicate separate 'expressions' which have to be matched for the string to be returned? Not quite. () creates a capturing group: it marks a part of the match and lets you get it back in the results if you pass a variable for the matches. Every part of the regular expression can use wildcards & co and has to match either way; $matches[0] always holds the full match, and each () adds one further entry for the part it enclosed.
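- The ^ filters for results that begin with the following string? Nope. Outside a character class ([...]), ^ is an anchor: it only matches at the very start of the subject string (or at the start of every line if you add the m flag). It doesn't filter for "words" that begin with something, so your pattern would only match if the whole of $content began with <link. Inside a character class, as in [^>], it means "anything but".
- The $ filters for results that end with the following string? Same as above, just "end" rather than "start": it matches at the very end of the subject string or just before a final newline (or at the end of every line with the m flag).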
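The four points above can be tried out with a small sketch like this one. The $content string is a made-up sample in which the link tag sits on the second line, and the variable names are just for illustration.

```php
<?php
// Hypothetical sample: the link tag is on the second line of the text.
$content = "<title>Demo</title>\n<link rel=\"canonical\" href=\"/page\" />";

// Same pattern with two delimiters; with / the slash inside the pattern must be escaped.
$withSlash = preg_match('/href="([^"]+)" \/>/', $content, $a);
$withHash  = preg_match('#href="([^"]+)" />#', $content, $b);

// $a[0] is the whole match, $a[1] the part captured by the () group.
var_dump($a[0], $a[1]);

// ^ anchors to the start of the whole string, so this finds nothing...
$anchored = preg_match('#^<link#', $content);
// ...unless the m modifier lets ^ match at the start of every line.
$multiline = preg_match('#^<link#m', $content);

// $ anchors to the end of the string, which here is the end of the tag.
$atEnd = preg_match('#/>$#', $content);

var_dump($anchored, $multiline, $atEnd);
```

This also shows why the pattern in the question returns nothing: with ^ and $ in place, the whole text would have to start with <link and end with />.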