Sunday, May 1, 2011

PHP/regex: remove empty paragraph tags from a string

If I have a string like this:

<p>&nbsp;</p>
<p></p>
<p class="a"><br /></p>
<p class="b">&nbsp;</p>
<p>blah blah blah this is some real content</p>
<p>&nbsp;</p>
<p></p>
<p class="a"><br /></p>

how can I turn it into just this?

<p>blah blah blah this is some real content</p>

The solution needs to match both &nbsp; entities and regular spaces.

From Stack Overflow
  • This regex will work against your example:

    <p[^>]*>(?:\s+|(?:&nbsp;)+|(?:<br\s*/?>)+)*</p>
  • $result = preg_replace('#<p[^>]*>(\s|&nbsp;?)*</p>#', '', $input);

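    This doesn't catch literal non-breaking space characters (U+00A0) in the markup, only the &nbsp; entity, but those are very rare to see. Note that it also won't match a paragraph that contains only <br />, as two of the paragraphs in your example do.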

    Since you're dealing with HTML, if this is user input I'd suggest using HTML Purifier, which will also deal with XSS vulnerabilities. The configuration directive that removes empty p tags there is %AutoFormat.RemoveEmpty.

  • As the original replier stated, regex isn't the best solution here; what you want is some sort of HTML stripper.

    The function on this site should help you out: http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page

    You'll just need a bit of string manipulation to get the newlines and so on back into the format you want.
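
For reference, here is a minimal sketch of how the regex answers above might be combined into one PHP call. The function name remove_empty_paragraphs is hypothetical, and the pattern is a variation on the answers' regexes rather than either poster's exact code: it assumes valid UTF-8 input, adds \b so that tags such as <pre> are not treated as <p>, also matches the literal U+00A0 character, and swallows the whitespace that follows each removed paragraph.

```php
<?php
// Hypothetical helper; the name and exact pattern are illustrative only.
function remove_empty_paragraphs(string $html): string
{
    // A paragraph is treated as empty when it holds nothing but whitespace,
    // literal non-breaking spaces (U+00A0), &nbsp; entities, or <br> tags.
    // Trailing \s* also removes the newline left behind by each match.
    $pattern = '#<p\b[^>]*>(?:[\s\x{00A0}]|&nbsp;|<br\s*/?>)*</p>\s*#iu';

    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on failure (for example, invalid UTF-8
    // with the u modifier); fall back to the untouched input in that case.
    return $result === null ? $html : trim($result);
}

$input = <<<HTML
<p>&nbsp;</p>
<p></p>
<p class="a"><br /></p>
<p class="b">&nbsp;</p>
<p>blah blah blah this is some real content</p>
<p>&nbsp;</p>
<p></p>
<p class="a"><br /></p>
HTML;

echo remove_empty_paragraphs($input), "\n";
// prints: <p>blah blah blah this is some real content</p>
```

This only handles the flat, well-formed markup shown in the question. For arbitrary or user-supplied HTML, the HTML Purifier suggestion above is likely the safer route.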
