Regular Expressions, PHP and Newline Characters

Ahhhh, regular expressions. They are so handy, but can be such a pain in the ass to use. While coding up a basic script to do a small amount of screen scraping, I remembered a problem I encountered a couple of years ago; one that I was unable to solve at the time. It involves using regular expressions to match data in a string that contains newline characters. For the uninitiated, a newline character ("\n" on *NIX) splits text into multiple lines, i.e., it tells the software reading the text to begin a new line. In PHP, you can use echo "\n" to start a new line in the output. A browser will not show the break on screen, because HTML collapses whitespace, but you will see it when viewing source, which can be handy when you are iterating through an array and spewing out lots of data.

Back to the problem at hand… if you are using PHP's PCRE functions (Perl Compatible Regular Expressions, e.g. preg_match) to match text, you need to realize that by default the dot (.) matches any character except a newline. A pattern such as .* will therefore not match across a line break, even if you pass in a string that contains many lines.


<?php
// A three-line string: the "\n" escapes become real newline characters.
$string = "<div>\n<b>This is the second line</b>\n</div>";
echo $string;

?>


The above code shows this. If you look at your browser, you will see the bold text. If you view source, you will see three separate lines.


<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
// No modifier: "." does not match "\n", so this finds nothing.
preg_match('|<div>.*</div>|', $string, $matches);
print_r($matches);

?>


If we try to match the entire div like the code above, it fails: preg_match returns 0 and print_r outputs an empty array. This is because the dot does not match newline characters by default, so .* stops at the end of the first line and never reaches the closing div on the third line.
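A quick way to see where the match breaks down is to search for text that only appears on the third line. The sketch below reuses the same string; the variable names ($foundClosing, $foundBlock and so on) are made up for illustration. A literal pattern finds the closing tag without any modifier, while the .* version fails because the dot stops at the first newline.

```php
<?php
// Same three-line string as in the examples above.
$string = "<div>\n<b>This is the second line</b>\n</div>";

// Literal text on the third line is found without any modifier.
$foundClosing = preg_match('|</div>|', $string, $closing);

// ".*" cannot cross the first "\n", so it never reaches "</div>".
$foundBlock = preg_match('|<div>.*</div>|', $string, $block);

var_dump($foundClosing); // int(1)
var_dump($foundBlock);   // int(0)
```

So the pattern does get applied to the whole string. It is only the dot that refuses to step over a line break.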


<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
// Trailing "s" modifier: "." now matches newlines as well.
preg_match('|<div>.*</div>|s', $string, $matches);
print_r($matches);

?>


The code above adds a trailing pattern modifier. There are a variety of modifiers available, but the "s" above (at the end of the pattern, after the closing delimiter) is PCRE_DOTALL, which makes the dot match any character, including newline. Now, $matches[0] contains the code for the entire div.
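For comparison, here is a small sketch of two other ways to get the same result as the trailing "s": the inline (?s) option inside the pattern, and a character class that covers both whitespace and non-whitespace. The variable names are hypothetical and the string is the one used throughout this post.

```php
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";

// Trailing "s" modifier (PCRE_DOTALL): "." also matches "\n".
preg_match('|<div>.*</div>|s', $string, $withModifier);

// Inline form of the same option, set inside the pattern itself.
preg_match('|(?s)<div>.*</div>|', $string, $withInline);

// No modifier at all: [\s\S] matches whitespace or non-whitespace,
// which together cover every character, newlines included.
preg_match('|<div>[\s\S]*</div>|', $string, $withClass);

var_dump($withModifier[0] === $string); // bool(true)
var_dump($withInline[0] === $string);   // bool(true)
var_dump($withClass[0] === $string);    // bool(true)
```

All three capture the whole div here. The trailing modifier is probably the easiest to read, while the character class can help when you only want one part of a pattern to cross line breaks.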

5 Responses to “Regular Expressions, PHP and Newline Characters”

  1. juv Says:

    thanks for the “s” info.

    I was using ([\n]*.)* but if PCRE is limited in memory (number of returned matches is limited) you’ll get a segmentation fault with PHP

    Hosting companies limit the max number of returns as it can be considered a security hole

  2. Administrator Says:

    I didn’t know that. I have never run into a problem with regular expressions and memory, but the strings I am matching against are never that large. Regular expressions are very memory intensive compared with other string functions…

  3. Zboruri ieftine Says:

    Finally an answer to my troubles. That non-working preg_match line almost killed me. Damn dot! :)

    Thanks

  4. Nasir Says:

    Thanks heaps - that bloody ’s’ at the end of the match statement was the winner!

  5. Jerome Says:

    Thank you

    I encountered a similar problem.

    I googled :
    php regular expression “any character including newline”

    and I found your page that helped me to solve my problem.
