ignore html tags in preg replace

Tags: php, html, preg-replace

How do I make this preg_replace ignore HTML tags?
I run it in a foreach loop over the search keywords, so if someone searches for "apple span", the replacement for "span" also matches the "span" in the <span> tags that were just added for "apple", and the HTML breaks:

preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);

Thanks in advance!

asked Sep 14, 2015 by BennettSage
0 votes

1 Answer

0 votes

I suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Regular expressions are quite powerful, but with them you run into problems like the one you describe, which are not (always) easy to solve robustly.

The general saying is: Don't parse HTML with regular expressions.

It's a good rule to keep in mind. As with any rule, it does not always apply, but it is worth making up your own mind about it.

XPath allows you to find all texts that contain the search terms, looking at the text only and ignoring all XML elements.

Then you only need to wrap those texts in the <span> and you're done.
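A minimal sketch of that idea, assuming the input is a well-formed XML fragment and each match lies inside a single text node (matches that stretch across several tags need more work). The helper name highlightKeyword and the <root> wrapper element are assumptions made for illustration:

```php
<?php
// Sketch only: highlight a keyword in text nodes, never in tag names.
// highlightKeyword() is a hypothetical helper, not code from the question.
function highlightKeyword($html, $keyword)
{
    if ($keyword === '') {
        return $html; // nothing to search for
    }

    $doc = new DOMDocument;
    // Assumption: $html is well-formed XML; wrap it so there is one root element.
    $doc->loadXML('<root>' . $html . '</root>');
    $xp = new DOMXPath($doc);

    // Copy the node list first, because the loop below modifies the tree.
    $textNodes = iterator_to_array($xp->query('//text()'));

    foreach ($textNodes as $text) {
        // Case-insensitive search, like the /i flag in the question.
        while (false !== $pos = mb_stripos($text->data, $keyword, 0, 'UTF-8')) {
            $match = $text->splitText($pos);                          // starts at the keyword
            $rest  = $match->splitText(mb_strlen($keyword, 'UTF-8')); // text after the keyword

            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $match->parentNode->replaceChild($span, $match);
            $span->appendChild($match);

            $text = $rest; // keep searching behind the match
        }
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

echo highlightKeyword('An <span>apple</span> span', 'span'), "\n";
// An <span>apple</span> <span class="search_hightlight">span</span>
```

Only the word "span" in the text is wrapped; the existing <span> element is left alone because tag names are never part of a text node.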

Edit: Finally some code ;)

First, it uses XPath to locate the elements that contain the search text. My query looks like this; it could probably be written better, I'm not a super XPath pro:

'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'

$search contains the text to search for. It must not contain any " (double quote) character, because that would break the query; see "Cleaning/sanitizing xpath attributes" for a workaround if you need quotes.
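One common way around the quote restriction is to build the XPath string literal with concat(). The following is a sketch of that technique, not the code from the linked workaround; xpathLiteral is a hypothetical helper name:

```php
<?php
// Hypothetical helper: turn arbitrary text into an XPath 1.0 string literal.
// XPath 1.0 has no escape character, so a value containing both quote types
// has to be assembled with concat().
function xpathLiteral($value)
{
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';   // no double quotes: wrap in double quotes
    }
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";   // no single quotes: wrap in single quotes
    }
    // Both quote types: split at each double quote and join the pieces
    // with a single-quoted '"' argument.
    $parts = explode('"', $value);
    return 'concat("' . implode('", \'"\', "', $parts) . '")';
}

$search = 'text "that" isn\'t plain';
echo '//*[contains(., ' . xpathLiteral($search) . ')]', "\n";
```

The returned literal can then replace the hand-written "'.$search.'" part of the query.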

This query returns all parent elements whose text nodes, put together, form a string that contains your search term.

Such a list is not easy to process further as-is, so I created a TextRange class that represents a list of DOMText nodes. It lets you do string operations on a list of text nodes as if they were one string.
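To give an idea of what such a class can look like, here is a small stand-in. It is a hypothetical sketch and not the TextRange class from the full code; the method names (split, getNodes, __toString) are inferred from how the routine in this answer uses them, and offsets are assumed to be UTF-8 character offsets:

```php
<?php
// Hypothetical stand-in for the TextRange class; NOT the original implementation.
class TextRange
{
    private $nodes = array(); // DOMText nodes in document order

    public function __construct($nodes)
    {
        foreach ($nodes as $node) {
            $this->nodes[] = $node;
        }
    }

    public function getNodes()
    {
        return $this->nodes;
    }

    // The text of all nodes, as if they were one string.
    public function __toString()
    {
        $string = '';
        foreach ($this->nodes as $node) {
            $string .= $node->data;
        }
        return $string;
    }

    // Cut the range at a character offset: this range keeps the part before
    // the offset, the part from the offset onwards is returned as a new range.
    public function split($offset)
    {
        $head = array();
        $tail = array();
        $pos  = 0;
        foreach ($this->nodes as $node) {
            $length = mb_strlen($node->data, 'UTF-8');
            if ($pos + $length <= $offset) {
                $head[] = $node;                            // entirely before the cut
            } elseif ($pos >= $offset) {
                $tail[] = $node;                            // entirely after the cut
            } else {
                $tail[] = $node->splitText($offset - $pos); // cut falls inside this node
                $head[] = $node;
            }
            $pos += $length;
        }
        $this->nodes = $head;
        return new self($tail);
    }
}
```

With this shape, $range->split($start) returns a range that begins at the match, and splitting that range again at the length of the search text leaves exactly the matching text nodes in it.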

This is the base skeleton of the routine:

$str = '...'; # some XML

$search = 'text that span';

printf("Searching for: (%d) '%s'\n", strlen($search), $search);

$doc = new DOMDocument;
$doc->loadXML($str); // load the XML; without this the document stays empty
$xp = new DOMXPath($doc);

$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor) {
    throw new Exception('Anchor element not found.');
}

// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r) {
    throw new Exception('XPath failed.');
}

// process search results
foreach ($r as $i => $node) {
    $textNodes = $xp->query('.//child::text()', $node);

    // extract $search textnode ranges, create fitting nodes if necessary
    $range = new TextRange($textNodes);
    $ranges = array();
    while (FALSE !== $start = strpos($range, $search)) {
        $base = $range->split($start);          // $base starts at the match
        $range = $base->split(strlen($search)); // $range continues after the match
        $ranges[] = $base;                      // $base now covers the match only
    }

    // wrap each matching textnode in a highlight span
    foreach ($ranges as $range) {
        foreach ($range->getNodes() as $node) {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            // replaceChild() returns the replaced text node; move it into the span
            $node = $node->parentNode->replaceChild($span, $node);
            $span->appendChild($node);
        }
    }
}

For my example XML:

        This is some text that span across a page to search in.
    and more text that span

It produces the following result:

        This is some text that span across a page to search in.
    and more text that span

This shows that the approach even finds text that is distributed across multiple tags. That is not easily possible with regular expressions.

You can find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class, which I have left out of the example in this answer).

It does not work properly on the viper codepad because that site uses an older LIBXML version. It works fine with my LIBXML version, 20707. I created a related question about this issue: XPath query result order.

A note of warning: This example uses a binary string search (strpos) and the resulting offsets to split text nodes with the DOMText::splitText function. That can lead to wrong offsets, because strpos returns byte offsets while splitText needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.

The example works anyway because the example data is pure US-ASCII, where byte offsets and UTF-8 character offsets are the same.
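To see the two kinds of offsets drift apart, here is a small sketch with non-ASCII input. The sample string is made up for illustration, and the mbstring extension is assumed to be available:

```php
<?php
// "héllo wörld", written with escapes so the file encoding does not matter (PHP 7+).
$text   = "h\u{E9}llo w\u{F6}rld";
$search = "w\u{F6}rld";

$byteOffset = strpos($text, $search);                // counts bytes
$charOffset = mb_strpos($text, $search, 0, 'UTF-8'); // counts characters

printf("byte offset: %d, character offset: %d\n", $byteOffset, $charOffset);
// byte offset: 7, character offset: 6

// DOMText::splitText() is given the character offset here.
$doc  = new DOMDocument('1.0', 'UTF-8');
$node = $doc->appendChild($doc->createElement('p'))
            ->appendChild($doc->createTextNode($text));
$tail = $node->splitText($charOffset);

var_dump($tail->data === $search); // the new node starts exactly at the match
```

The "é" takes two bytes in UTF-8, so the byte offset is one higher than the character offset; passing it to splitText would cut one character too late.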

For a real-life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos (and, for the same reason, mb_strlen instead of strlen for the length of the match):

 while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
answered Sep 14, 2015 by PhiTYG