How to replace text URLs and exclude URLs in HTML tags

Tags: php, html, regex, url

I need your help here.

I want to turn this:

sometext sometext http://www.somedomain.com/index.html sometext sometext


into:

sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext


I have managed it by using this regex:

preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text);


The problem is that it also replaces the img URL, for example:

sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext


is turned into:

sometext sometext <img src="<a href="http://domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext


Please help.
    

asked Sep 14, 2015 by HVIDevinyezb
0 votes
77 views



7 Answers

0 votes

Streamlined version of Gumbo's above:

$html = <<<HTML
<html>
<body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;

Let's use an XPath that only fetches those elements that actually are textnodes containing http:// or https:// or ftp:// and that are not themselves textnodes of anchor elements.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and (
        contains(.,"http://") or
        contains(.,"https://") or
        contains(.,"ftp://") )]'
);

The XPath above will give us a TextNode with the following data:

 and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like 

Since PHP5.3 we could also use PHP inside the XPath to use the Regex pattern to select our nodes instead of the three calls to contains.
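
A minimal sketch of that variant, assuming a made-up sample string and a simplified scheme-only pattern (neither is from the answer above). DOMXPath::registerPhpFunctions() exposes PHP functions under the php: prefix once the http://php.net/xpath namespace is registered. The result of preg_match is compared with 1 explicitly, so the predicate does not depend on how the return value is converted to an XPath type.

```php
<?php
// Sample markup, made up for this sketch.
$html = '<p>see http://example.com/a and '
      . '<a href="http://example.com/b">http://example.com/b</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// Make PHP functions callable from XPath via the php: prefix.
$xPath->registerNamespace('php', 'http://php.net/xpath');
$xPath->registerPhpFunctions('preg_match');

// Select text nodes outside of anchors whose content contains a URL scheme.
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~i", .) = 1]'
);

foreach ($texts as $text) {
    echo trim($text->data), "\n";
}
```

Only the text node outside the anchor is selected; the URL inside the existing link is left alone.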

Instead of splitting the text nodes apart in the standards-compliant way, we will use a document fragment and simply replace the entire text node with the fragment. Non-standard here only means that the method we use for this, appendXML(), is not part of the W3C specification of the DOM API.

foreach ($texts as $text) {
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML(
        preg_replace(
            "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
            // wrap each matched URL in an anchor; appendXML() expects well-formed XML
            '<a href="$1">$1</a>',
            $text->data
        )
    );
    $text->parentNode->replaceChild($fragment, $text);
}
echo $dom->saveXML($dom->documentElement);

and this will then output:


<html><body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another <a href="http://example.com">http://example.com</a> with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body></html>

answered Sep 14, 2015 by SethRtgjokv
0 votes

You shouldn’t do that with regular expressions – at least not regular expressions only. Use a proper HTML DOM parser like the one of PHP’s DOM library instead. You then can iterate the nodes, check if it’s a text node and do the regular expression search and replace the text node appropriately.

Something like this should do it:

$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$doc = new DOMDocument();
$doc->loadHTML($str);
// for every element in the document
foreach ($doc->getElementsByTagName('*') as $elem) {
    // for every child node in each element
    foreach ($elem->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            // split the text content to get an array of 1+2*n elements for n URLs in it
            $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
            $n = count($parts);
            if ($n > 1) {
                $parentNode = $node->parentNode;
                // insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node
                for ($i=1; $i<$n; $i+=2) {
                    $a = $doc->createElement('a');
                    $a->setAttribute('href', $parts[$i]);
                    $a->setAttribute('target', '_blank');
                    $a->appendChild($doc->createTextNode($parts[$i]));
                    $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                    $parentNode->insertBefore($a, $node);
                }
                // insert the last part before the original DOMText node
                $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                // remove the original DOMText node
                $node->parentNode->removeChild($node);
            }
        }
    }
}

However, the DOMNodeLists returned by getElementsByTagName and childNodes are live: every change to the DOM is reflected in those lists, so the foreach loops above would also iterate over the newly added nodes. Instead, you need to use for loops, keep track of how many nodes were added, and adjust the index pointers and the pre-calculated array boundaries accordingly.

But since that is quite difficult in a somewhat complex algorithm like this one (you would need one index pointer and one array boundary for each of the three for loops), a recursive algorithm is more convenient:

function mapOntoTextNodes(DOMNode $node, $callback) {
    if ($node->nodeType === XML_TEXT_NODE) {
        return $callback($node);
    }
    // childNodes is live, so $n and $i are adjusted whenever the callback adds nodes
    for ($i=0, $n=$node->childNodes->length; $i<$n; ++$i) {
        $nodesChanged = 0;
        switch ($node->childNodes->item($i)->nodeType) {
            case XML_ELEMENT_NODE:
                $nodesChanged = mapOntoTextNodes($node->childNodes->item($i), $callback);
                break;
            case XML_TEXT_NODE:
                $nodesChanged = $callback($node->childNodes->item($i));
                break;
        }
        if ($nodesChanged !== 0) {
            $n += $nodesChanged;
            $i += $nodesChanged;
        }
    }
    // changes inside an element do not alter the number of its siblings
    return 0;
}
function foo(DOMText $node) {
    $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
    $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $n = count($parts);
    if ($n > 1) {
        $parentNode = $node->parentNode;
        $doc = $node->ownerDocument;
        for ($i=1; $i<$n; $i+=2) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $parts[$i]);
            $a->setAttribute('target', '_blank');
            $a->appendChild($doc->createTextNode($parts[$i]));
            $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
            $parentNode->insertBefore($a, $node);
        }
        $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
        $parentNode->removeChild($node);
    }
    return $n-1;
}

$str = '
sometext http://www.somedomain.com/index.html sometext sometext sometext
';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elems = $doc->getElementsByTagName('body');
mapOntoTextNodes($elems->item(0), 'foo');

Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can either pass the whole DOMDocument node or just a specific DOMNode (in this case just the BODY node).

The function foo then finds and replaces the plain URLs in the DOMText node’s content. It splits the content string into non-URL/URL parts using preg_split while capturing the delimiter, which yields an array of 1+2·n items. The non-URL parts become new DOMText nodes and the URL parts become new A elements; all of them are inserted before the original DOMText node, which is removed at the end. Since mapOntoTextNodes walks the tree recursively, it suffices to call it once on a specific DOMNode.
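
The index bookkeeping can also be avoided by copying the text nodes into a plain array before changing anything, since a PHP array is not affected by later DOM changes. The following is only a sketch of that idea, not part of the answer above: the function name linkifyTextNodes, the simplified URL pattern, and the choice to skip text inside existing anchors are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: turns plain URLs in text nodes into links and
// returns the inner HTML of the body element.
function linkifyTextNodes(string $html): string
{
    // Simplified pattern (an assumption): a scheme followed by anything
    // that is not whitespace, an angle bracket, or a quote.
    $pattern = '~((?:https?|ftp)://[^\s<>"\']+)~i';

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // silence warnings about imperfect markup

    $xpath = new DOMXPath($doc);
    // Snapshot the text nodes into an array before modifying the tree.
    $textNodes = iterator_to_array(
        $xpath->query('//body//text()[not(ancestor::a)]'),
        false
    );

    foreach ($textNodes as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // odd indexes hold the captured URLs
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->setAttribute('target', '_blank');
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of body, without the html/body wrapper.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkifyTextNodes('<p>see http://example.com/x now <img src="http://example.com/i.jpg"></p>');
```

Because the URL is set through setAttribute and createTextNode, special characters are escaped by the DOM itself. The simplified pattern keeps trailing punctuation as part of the URL, so the stricter pattern from the question may be the better choice for real text.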

answered Sep 14, 2015 by RussellReece
0 votes

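Thanks for the reply, but it still doesn't work. I have fixed it using this function:

function livelinked($text) {
    $old = array();
    $new = array();
    // the /e modifier was removed in PHP 7 and is not needed for matching
    preg_match_all("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", $text, $ccs);
    foreach ($ccs[3] as $i => $cc) {
        // skip image URLs; stripos() returns false (not 0) when nothing is found
        if (stripos($cc, "jpg") === false && stripos($cc, "gif") === false && stripos($cc, "png") === false) {
            $old[] = $ccs[1][$i];
            $new[] = '<a href="' . $ccs[1][$i] . '" target="_blank">' . $cc . '</a>';
        }
    }
    return str_replace($old, $new, $text);
}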
answered Sep 14, 2015 by EarleneAguil
0 votes

If you'd like to keep using a regex (and in this case, a regex is quite appropriate), you can have the regex match only URLs that "stand alone". A word boundary (\b) is not enough for this, because it also matches between a quote character and http. A negative lookbehind, (?<!\S), makes the regex match only where http is immediately preceded by whitespace or the beginning of the text:

preg_replace("#(?<!\S)((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1" target="_blank">$1</a>$4', $text);
            // ^^^^^^^ thar she blows

Thus, "http://..." won't match, but http:// as its own word will.
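
The stand-alone idea can be checked with a short sketch. It assumes a whitespace-or-start lookbehind, (?<!\S), as the test for "stand alone" and a simplified URL pattern that is not the one from the question. A plain \b would not be enough here, since it also matches between a quote character and http, which the last line demonstrates.

```php
<?php
// Illustrative pattern: (?<!\S) requires start of text or whitespace before the URL.
$pattern = '~(?<!\S)((?:https?|ftp)://[^\s<>"\']+)~i';

$text = 'go to http://example.com/page and <img src="http://example.com/a.jpg">';

// Only the URL preceded by a space is wrapped; the one inside src="..." is not.
$linked = preg_replace($pattern, '<a href="$1" target="_blank">$1</a>', $text);
echo $linked, "\n";

// For comparison: a word boundary does match right after a double quote.
$wordBoundaryMatchesQuoted = preg_match('~\bhttp://~', '"http://example.com"');
var_dump($wordBoundaryMatchesQuoted);
```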

answered Sep 14, 2015 by GildaProudl
0 votes

DOMDocument is more mature and runs much faster, so this is just an alternative for anyone who wants to use PHP Simple HTML DOM Parser:

require_once 'simple_html_dom.php'; // assumes the Simple HTML DOM library file is on the include path

$html = str_get_html('http://www.somedomain.com/index.html sometext sometext
<a href="http://www.somedomain.com/index.html">http://www.somedomain.com/index.html</a>
sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext');

foreach ($html->find('text') as $element)
{
    // you can add any tag into the array to exclude from replace
    if (!in_array($element->parent()->tag, array('a')))
        // the /e modifier was removed in PHP 7; a plain replacement string is enough
        $element->innertext = preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1" target="_blank">$1</a>$4', $element->innertext);
}

echo $html;
answered Sep 14, 2015 by NannieLillic
0 votes

Match a whitespace character (\s) at the start and at the end of the URL string. This ensures that

"http://url.com"

is not matched, while

http://url.com

is matched.

answered Sep 14, 2015 by LorPleasant

...