Regular Expressions, PHP and Newline Characters
Ahhhh, regular expressions. They are so handy, but can be such a pain in the ass to use. While coding up a basic script to do a small amount of screen scraping, I remembered a problem I encountered a couple of years ago, one that I was unable to solve at the time. It involves using regular expressions to match data in a string that contains newline characters. For the uninitiated, a newline character (“\n” on *NIX) tells the browser or software program to begin a new line. In PHP, you can use echo "\n" to create a new line in the output (it is not visible in the rendered page, only when viewing source), which can be handy when you are iterating through an array and spewing out lots of data to the screen.
Back to the problem at hand! If you are using PHP’s PCRE (Perl Compatible Regular Expressions, i.e. preg_match) to match text, you need to realize that by default the period (.) in a pattern does not match a newline, so a pattern like .* will not match across lines, even if you pass in a string that contains many lines.
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
echo $string;
?>
The above code shows this. If you look at your browser, you will see the bold text. If you view source, you will see three separate lines.
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
preg_match('|<div>.*</div>|',$string,$matches);
print_r($matches);
?>
If we try to match the entire div like the code above, it fails and outputs an empty array. This is because the period does not match a newline by default, so .* stops at the end of the first line and never reaches the closing div tag on the third line.
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
preg_match('|<div>.*</div>|s',$string,$matches);
print_r($matches);
?>
The code above adds a trailing option (also called a pattern modifier). There are a variety of trailing options available, but the “s” above (at the end of the pattern, after the closing delimiter) tells the regular expression to make periods match any character, including newline. Now, $matches[0] contains the code for the entire div.
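To see the two cases side by side, here is a minimal sketch that runs both patterns against the same string and checks the return value of preg_match() (1 for a match, 0 for no match). The variable names $withoutS, $withS, $plain and $dotall are just placeholders for illustration.

```php
<?php
// Same sample string as above: three lines separated by "\n".
$string = "<div>\n<b>This is the second line</b>\n</div>";

// Without the "s" modifier the dot cannot cross the first newline: no match.
$withoutS = preg_match('|<div>.*</div>|', $string, $plain);

// With the "s" modifier the dot also matches "\n": the whole div matches.
$withS = preg_match('|<div>.*</div>|s', $string, $dotall);

var_dump($withoutS);               // int(0)
var_dump($withS);                  // int(1)
var_dump($dotall[0] === $string);  // bool(true)
```

Checking the return value is a little more direct than inspecting $matches, since preg_match() can also return false when something goes wrong.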
April 12th, 2007 at 10:03 am
thanks for the “s” info.
I was using ([\n]*.)* but if PCRE is limited in memory (number of returned matches is limited) you’ll get a segmentation fault with PHP
Hosting companies limit the max number of returns as it can be considered a security hole
April 12th, 2007 at 10:53 am
I didn’t know that. I have never run into a problem with regular expressions and memory, but the strings I am matching against are never that large. Regular expressions are very memory intensive compared with other string functions…
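If limits like that are a worry, one option is to check for PCRE errors explicitly rather than treating every failure as “no match”. The sketch below is only an illustration: match_or_fail() is a made-up helper name, and it assumes PHP 7 or later for the type declarations. preg_match() returns false on error, and preg_last_error() reports which PREG_*_ERROR constant applies (the limits themselves are controlled by the pcre.backtrack_limit and pcre.recursion_limit ini settings).

```php
<?php
// Hypothetical helper (not part of PHP): run preg_match() and surface PCRE
// errors instead of silently treating them as "no match".
function match_or_fail(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        // preg_last_error() returns a PREG_*_ERROR constant,
        // for example PREG_BACKTRACK_LIMIT_ERROR.
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $matches; // empty array when nothing matched
}

$string = "<div>\n<b>This is the second line</b>\n</div>";
$matches = match_or_fail('|<div>.*</div>|s', $string);
var_dump($matches[0] === $string);
```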
September 23rd, 2007 at 4:59 pm
Finally an answer to my troubles. That non-working preg_match line almost killed me. Damn dot!
Thanks
April 7th, 2008 at 8:57 am
Thanks heaps - that bloody ’s’ at the end of the match statement was the winner!
November 14th, 2008 at 3:05 pm
Thank you
I encountered a similar problem.
I googled :
php regular expression “any character including newline”
and I found your page that helped me to solve my problem.
June 4th, 2009 at 11:21 pm
Great man! I was looking at google for the answer to my problem and it was right here!
I added the ? sign to not be greedy and to get just the first closure. It’s something like
preg_match('|.*?|',$string,$matches);
Thanks for the post!
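The tags in the pattern above look like they were stripped out of the comment, so the following is a guess at what was meant: a sketch that assumes the pattern was <div>.*?</div> with the “s” option, run against a made-up string containing two divs.

```php
<?php
// Hypothetical input with two divs, each spanning several lines.
$string = "<div>\nfirst\n</div>\n<div>\nsecond\n</div>";

// Greedy: .* runs to the last </div>, so both divs end up in one match.
preg_match('|<div>.*</div>|s', $string, $greedy);

// Lazy: .*? stops at the first </div> it can reach.
preg_match('|<div>.*?</div>|s', $string, $lazy);

var_dump($greedy[0] === $string);               // bool(true)
var_dump($lazy[0] === "<div>\nfirst\n</div>");  // bool(true)
```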
July 7th, 2009 at 2:41 pm
Hi Robert.
Where did you find that option? Is there a page that lists the options we can pass to preg_match?
That option (“s”) is very useful, and the option “i” is useful too, but how many more exist?
July 7th, 2009 at 11:05 pm
I have an old O’Reilly Programming PHP book that lists the trailing options. I have had it 5+ years and I just placed it in a pile of stuff I was going to donate to the Goodwill. For posterity, below are the options:
/regex/i - case-insensitive
/regex/s - make period (.) match any character including newline
/regex/x - ignore whitespace (and # comments) inside the pattern
/regex/m - make caret (^) match after, and dollar sign ($) match before, internal newlines (\n)
/regex/e - if the replacement string is PHP code, eval() it to get the actual replacement string (preg_replace only)
/regex/U - reverses the greediness of the subpattern; * and + now match as little as possible
/regex/u - causes pattern strings to be treated as UTF-8
/regex/X - causes a backslash followed by a character with no special meaning to trigger error
/regex/A - causes the beginning of the string to be anchored as if the first character of the pattern were ^
/regex/D - causes the $ character to match only at the very end of the string, not before a trailing newline
/regex/S - causes the expression parser to more carefully examine the structure of the pattern, so it may run slightly faster next time (such as in a loop)
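A few of these are easier to see in action than to describe. Here is a rough sketch using a made-up two-line string; the names $text, $pattern, $words and $tag are placeholders of mine and do not come from the book.

```php
<?php
// Hypothetical sample text: two lines, with a trailing newline.
$text = "first line\nSecond line\n";

// i: case-insensitive matching.
var_dump(preg_match('/second/i', $text));   // int(1)

// m: ^ and $ also match at internal line breaks.
var_dump(preg_match('/^Second/', $text));   // int(0)
var_dump(preg_match('/^Second/m', $text));  // int(1)

// x: whitespace and # comments inside the pattern are ignored.
$pattern = '/
    ^ (\w+)    # first word of a line
    \s+ line   # followed by the word line
/mx';
preg_match_all($pattern, $text, $words);
print_r($words[1]);                         // first, Second

// D: $ matches only at the very end of the string, not before a final "\n".
var_dump(preg_match('/line$/', $text));     // int(1)
var_dump(preg_match('/line$/D', $text));    // int(0)

// U: quantifiers become lazy by default.
preg_match('/<.+>/U', '<b>bold</b>', $tag);
var_dump($tag[0]);                          // string(3) "<b>"
```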
July 31st, 2009 at 6:51 pm
Thanks!!! You saved my life at 4 in the morning!!!
November 26th, 2009 at 6:38 am
Viewing this in the current version of Firefox 3.5.5, seems to have some encoding problems, so many of the interesting characters, such as ‘\n’ get garbled.
Still helpful though.
December 2nd, 2009 at 12:42 am
Yeah, looks like things got garbled when I moved the site over to a new server. Not sure how, might have been the mysqldump? Grrrr… I’ll fix it one of these days.
December 16th, 2009 at 10:46 am
What a shame about those garbled characters, makes this post completely useless now that you can’t actually see the proper solution
December 18th, 2009 at 7:09 pm
You’ve shamed me into fixing the formatting errors…