Regular Expressions, PHP and Newline Characters

Ahhhh, regular expressions. They are so handy, but can be such a pain in the ass to use. While coding up a basic script to do a small amount of screen scraping, I remembered a problem I encountered a couple of years ago, one that I was unable to solve at the time. It involves using regular expressions to match data in a string that contains newline characters. For the uninitiated, a newline character ("\n" on *NIX) breaks text into multiple lines, i.e., it tells the browser or software program to begin a new line. In PHP, you can use echo "\n" to create a new line in the output sent to the browser (it is not visible on the rendered page, only when viewing source), which can be handy when you are iterating through an array and spewing out lots of data to the screen.

Back to the problem at hand! If you are using PHP's PCRE functions (Perl Compatible Regular Expressions, e.g. preg_match) to match text, you need to realize that by default the period (.) matches any character except a newline. A pattern like .* therefore stops at the end of the current line, even if you pass in a string that contains many lines.
<?php
// Double quotes make PHP turn each \n into a real newline character.
$string = "<div>\n<b>This is the second line</b>\n</div>";
echo $string;
?>
The above code shows this. If you look at your browser, you will see the bold text. If you view source, you will see three separate lines.
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
// No modifier: the dot will not cross the newline after <div>.
preg_match('|<div>.*</div>|', $string, $matches);
print_r($matches);
?>
If we try to match the entire div like the code above, it fails and outputs an empty array. This is because the period does not match a newline by default, so .* cannot get past the end of the first line and the pattern never reaches the closing </div> on the third line.
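To make the failure a little more concrete, here is a minimal sketch (the extra patterns are mine, chosen only for illustration). It suggests that the regex does see the whole string; it is only the dot that refuses to step over a newline.

```php
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";

// Without the "s" modifier, .* stops at the first newline,
// so this matches only the opening tag on line one.
preg_match('|<div>.*|', $string, $first);

// Text on line two can still be matched, as long as the pattern
// does not need the dot to cross a newline.
preg_match('|<b>.*</b>|', $string, $second);

// A real newline inside the pattern (note the double quotes)
// matches the newline in the subject.
$found = preg_match("|<div>\n<b>|", $string);

print_r($first);   // [0] => <div>
print_r($second);  // [0] => <b>This is the second line</b>
var_dump($found);  // int(1)
```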
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
// The trailing "s" lets the dot match newlines too.
preg_match('|<div>.*</div>|s', $string, $matches);
print_r($matches);
?>
The code above adds a trailing option (the PHP manual calls these pattern modifiers). There are a variety of trailing options available, but the "s" above (after the closing delimiter of the pattern) tells the regular expression to make periods match any character, including newline. Now, $matches[0] contains the code for the entire div.
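If you would rather check the result in code than eyeball print_r, preg_match also returns the number of matches it found (0 or 1). A small sketch running the same pattern with and without the "s"; the variable names are just placeholders:

```php
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";

// Same pattern, with and without the trailing "s" modifier.
$without = preg_match('|<div>.*</div>|', $string, $plain);
$with    = preg_match('|<div>.*</div>|s', $string, $dotall);

var_dump($without);                // int(0): no match, $plain stays empty
var_dump($with);                   // int(1): one match
var_dump($dotall[0] === $string);  // bool(true): the whole div was captured
```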

Posted In: PHP, regular expressions

Commentary

juv 2007-04-12 17:03:58

thanks for the "s" info. I was using ([\n]*.)* but if PCRE is limited in memory (number of returned matches is limited) you'll get a segmentation fault with PHP. Hosting companies limit the max number of returns as it can be considered a security hole.
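For what it's worth, preg_match returns false (rather than 0) when the engine gives up, and preg_last_error() reports why. Below is a rough sketch of checking for that; match_or_explain is a hypothetical helper name, the sample data is made up, and whether a nested pattern like the one above actually hits a limit depends on your PHP version and PCRE settings, so only the plain "s" version is run here.

```php
<?php
// Hypothetical helper: tell "no match" (0) apart from
// "the regex engine gave up" (false).
function match_or_explain(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        return 'error code ' . preg_last_error();
    }
    return $result === 1
        ? 'matched ' . strlen($matches[0]) . ' bytes'
        : 'no match';
}

// A larger multi-line subject: 5,000 lines inside one div.
$big = "<div>\n" . str_repeat("line of text\n", 5000) . "</div>";

// With "s", a plain .* crosses newlines without a nested group.
echo match_or_explain('|<div>.*</div>|s', $big), "\n";
```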

Administrator 2007-04-12 17:53:17

I didn't know that. I have never run into a problem with regular expressions and memory, but the strings I am matching against are never that large. Regular expressions are very memory intensive compared with other string functions...

Zboruri ieftine 2007-09-23 23:59:58

Finally an answer to my troubles. That non-working preg_match line almost killed me. Damn dot! :) Thanks

Nasir 2008-04-07 15:57:57

Thanks heaps - that bloody 's' at the end of the match statement was the winner!

Jerome 2008-11-14 23:05:23

Thank you I encountered a similar problem. I googled : php regular expression "any character including newline" and I found your page that helped me to solve my problem.

FedeX 2009-06-05 06:21:19

Great man! I was looking at google for the answer to my problem and it was right here! I added the ? sign to not be greedy and to get just the first closure. It's something like preg_match('|.*?|',$string,$matches); Thanks for the post!
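A quick sketch of the difference the ? makes. The pattern in the comment above appears to have lost its tags, so the <div> and </div> used here are an assumption, and the two-div sample string is made up for illustration.

```php
<?php
// Two divs in one multi-line string (made-up sample data).
$string = "<div>\nfirst\n</div>\n<div>\nsecond\n</div>";

// Greedy: .* runs to the LAST </div>, swallowing both divs.
preg_match('|<div>.*</div>|s', $string, $greedy);

// Lazy: .*? stops at the FIRST </div>.
preg_match('|<div>.*?</div>|s', $string, $lazy);

var_dump($greedy[0] === $string);                 // bool(true)
var_dump($lazy[0] === "<div>\nfirst\n</div>");    // bool(true)
```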

David Valdez 2009-07-07 21:41:01

Hi Robert. Where you found that option? exists a page where we can know what options we can passed to preg_match? that option ("s") is very useful,and the option "i" is useful too but how many more exists?

Administrator 2009-07-08 06:05:22

I have an old O'Reilly Programming PHP book that lists the trailing options. I have had it 5+ years and I just placed it in a pile of stuff I was going to donate to the Goodwill. For posterity, below are the options:

/regex/i - case-insensitive
/regex/s - make period (.) match any character including newline
/regex/x - ignore whitespace (and # comments) in the pattern
/regex/m - make caret (^) match after, and dollar sign ($) match before, internal newlines (\n)
/regex/e - if the replacement string is PHP code, eval() it to get the actual replacement string
/regex/U - reverses the greediness of the subpattern; * and + now match as little as possible
/regex/u - causes pattern and subject strings to be treated as UTF-8
/regex/X - causes a backslash followed by a letter with no special meaning to trigger an error
/regex/A - causes the beginning of the string to be anchored as if the first character of the pattern were ^
/regex/D - causes the $ character to match only at the very end of the string, not before a trailing newline
/regex/S - causes the expression parser to more carefully examine the structure of the pattern, so it may run slightly faster next time (such as in a loop)
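A few of these are easier to see than to describe, so here is a short sketch exercising i, m, D, x and U on made-up sample strings. The /e option is left out because it was removed in PHP 7 (preg_replace_callback covers the same ground).

```php
<?php
$text = "Hello\nWORLD\n";

// i: case-insensitive.
var_dump(preg_match('/world/', $text));    // int(0)
var_dump(preg_match('/world/i', $text));   // int(1)

// m: ^ and $ also match at internal newlines.
var_dump(preg_match('/^WORLD$/', $text));  // int(0)
var_dump(preg_match('/^WORLD$/m', $text)); // int(1)

// D: $ matches only at the very end, not before a trailing newline.
var_dump(preg_match('/WORLD$/', $text));   // int(1)
var_dump(preg_match('/WORLD$/D', $text));  // int(0)

// x: whitespace in the pattern is ignored, so it can be spaced out.
var_dump(preg_match('/ H e l l o /x', $text)); // int(1)

// U: quantifiers become lazy by default.
preg_match('/<.+>/', '<a><b>', $greedy);   // $greedy[0] is "<a><b>"
preg_match('/<.+>/U', '<a><b>', $lazy);    // $lazy[0] is "<a>"
```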

Gaurav Kumar 2009-08-01 01:51:57

Thanks!!! You saved my life at 4 in the morning!!!

JJ 2009-11-26 14:38:38

Viewing this in the current version of Firefox 3.5.5, seems to have some encoding problems, so many of the interesting characters, such as '\n' get garbled. Still helpful though.

Administrator 2009-12-02 08:42:33

Yeah, looks like things got garbled when I moved the site over to a new server. Not sure how, might have been the mysqldump? Grrrr... I'll fix it one of these days.

bob 2009-12-16 18:46:46

What a shame about those garbled characters, makes this post completely useless now that you can't actually see the proper solution

Administrator 2009-12-19 03:09:17

You've shamed me into fixing the formatting errors...

Annabel 2011-01-23 17:50:01

Thanks, this was helpful.

Anonymous 2011-03-05 17:02:40

Thanks! Really helped out on a short notice!

moke 2012-02-12 01:55:19

wow.. thankss,... :)

Ankit Mittal 2012-02-23 05:17:50

Thanks for s info, I was tried for a long time to detect a new line as \n\r? or \s? but it not work. Again thanks.....
