Regular Expressions, PHP and Newline Characters
Nov 27th, 2006
Ahhhh, regular expressions. They are so handy, but can be such a pain in the ass to use. While coding up a basic script to do a small amount of screen scraping, I remembered a problem I encountered a couple of years ago, one that I was unable to solve at the time. It involves using regular expressions to match data in a string that contains newline characters. For the uninitiated, a newline character ("\n" on *NIX) ends the current line and starts a new one; i.e., it tells the browser or other software to begin a new line. In PHP, you can use echo "\n" to add a line break to the output. The browser will not render it on screen (you only see it when viewing source), but it is handy for keeping the generated source readable when you are iterating through an array and spewing out lots of data.
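As a quick illustration of the echo "\n" trick, here is a minimal sketch. The $items array and the list markup are made up for the example; the only point is that each item ends up on its own line when you view source.

```php
<?php
// Hypothetical sample data, not from any real script.
$items = array('apple', 'banana', 'cherry');

$output = '';
foreach ($items as $item) {
    // "\n" in double quotes is a real newline character.
    // '\n' in single quotes would be a literal backslash followed by the letter n.
    $output .= '<li>' . $item . '</li>' . "\n";
}

echo $output;
```

Note the double quotes around "\n". PHP only interprets the escape sequence inside double-quoted strings.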
Back to the problem at hand! If you are using PHP's PCRE functions (Perl Compatible Regular Expressions, e.g. preg_match) to match text, you need to realize that, by default, the period (.) in a pattern matches any character except a newline. As a result, a pattern like .* will not match across lines, even if you pass in a string that contains many lines.
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
echo $string;
?>
The above code builds a string that spans three lines. If you look at your browser, you will see only the bold text. If you view source, you will see three separate lines.
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
preg_match('|<div>.*</div>|',$string,$matches);
print_r($matches);
?>
If we try to match the entire div like the code above, it fails and print_r outputs an empty array. This is because the period does not match newline characters by default, so .* stops at the end of the first line and never reaches the closing </div> on the third line.
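If you want to convince yourself that the period is the culprit, a small experiment along these lines should do it. This is only a sketch; the variable names are mine, and it relies on preg_match returning 1 for a match and 0 for no match.

```php
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";

// The dot will not match a lone newline, so this returns 0.
$dotVsNewline = preg_match('|.|', "\n");

// This matches, but .* gives up at the first newline, so $m[0] is just "<div>".
$firstLineOnly = preg_match('|<div>.*|', $string, $m);

// Here .* can never reach </div> on the third line, so this returns 0.
$wholeDiv = preg_match('|<div>.*</div>|', $string);

var_dump($dotVsNewline, $m[0], $wholeDiv);
```

So the engine does see the whole string. It is just that .* refuses to step over the "\n" characters.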
<?php
$string = "<div>\n<b>This is the second line</b>\n</div>";
preg_match('|<div>.*</div>|s',$string,$matches);
print_r($matches);
?>
The code above adds a pattern modifier. There are a variety of modifiers available, but the "s" above (placed after the closing delimiter of the pattern) tells the regular expression engine to make periods match any character, including newlines. Now, $matches[0] contains the code for the entire div.
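One thing to watch for when scraping real pages: once the period can cross newlines, a greedy .* will happily run all the way to the last closing tag. The sketch below uses a made-up two-div string (not from the example above) and compares the greedy pattern with a lazy .*? variant. Treat it as an illustration rather than a recommended way to parse HTML.

```php
<?php
// Hypothetical input with two divs.
$html = "<div>\nfirst\n</div>\n<div>\nsecond\n</div>";

// Greedy: with "s", .* runs to the LAST </div>, so both divs come back as one match.
preg_match('|<div>.*</div>|s', $html, $greedy);

// Lazy: .*? stops at the first </div> it can reach, so each div is matched separately.
preg_match_all('|<div>.*?</div>|s', $html, $lazy);

print_r($greedy);
print_r($lazy[0]);
```

Here $greedy[0] holds the entire string, while $lazy[0] is an array with one entry per div.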