remove script tag from html content


Tags: php, regex, htmlpurifier

I am using HTML Purifier (http://htmlpurifier.org/)

I just want to remove <script> tags only.
I don't want to remove inline formatting or any other things.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

asked Sep 24, 2015 by akhilesh
0 votes
2 views



$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful for quickly fixing some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember, anything that a user inputs should be considered unsafe.
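
To make the warning concrete, here is a minimal sketch of one way a single regex pass can be defeated. The pattern is an assumed, typical script-stripping expression and the input string is a hypothetical attacker payload; neither comes from a real application.

```php
<?php
// Assumed pattern: a typical "strip script elements" regex, for illustration only.
$pattern = '#<script(.*?)>(.*?)</script>#is';

// Hypothetical attacker input: a script tag split apart by empty decoy script elements.
$input = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

// One pass removes only the decoys; the leftover pieces join into a real script tag.
$once = preg_replace($pattern, '', $input);

echo $once, PHP_EOL; // <script>alert(1)</script>
```

Running the replacement in a loop until nothing changes closes this particular hole, but it is only one of many such tricks, which is why a parser is the safer tool for untrusted input.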

A better solution here is to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable, and (nearly) safe it is to do the same:

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have removed the HTML intentionally because even this can bork.
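
For reuse, the same DOMDocument approach could be wrapped in a function along these lines. This is a sketch, not the answer's own code: the function name remove_script_elements is hypothetical, and silencing parser warnings with libxml_use_internal_errors() is an added assumption for input that is not perfectly valid HTML.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function remove_script_elements(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so snapshot it before removing nodes.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    return $dom->saveHTML();
}

$clean = remove_script_elements('<p>Hello</p><script>alert(1)</script><p>Bye</p>');
echo $clean;
```

Note that saveHTML() serializes a whole document, so a fragment comes back wrapped in the doctype, html and body elements that the parser added.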

answered Sep 24, 2015 by tseetha
0 votes

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

$length = $script_tags->length;

// for each tag, remove it from the DOM; the node list is live, so walk it
// backwards to keep the remaining indexes valid
for ($i = $length - 1; $i >= 0; $i--) {
  $script_tags->item($i)->parentNode->removeChild($script_tags->item($i));
}

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me using a small test HTML document that contained a script element and the text "hey".

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.

answered Sep 24, 2015 by r3tt
0 votes

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

Regex operations can fail, so it's safer to do it like this (PHP 7 or later):

$html = preg_replace("/<script.*?\/script>/s", "", $html) ?? $html;

That way, when the "accident" happens, preg_replace returns null and we get the original $html back instead of losing the content.
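
One caveat with falling back to the original string is that the fallback still contains the scripts. A stricter variant, sketched below, refuses the input instead. The function name strip_scripts_or_fail, the exact pattern, and the u flag are assumptions made for this example; the u flag is there only so that invalid UTF-8 gives a reproducible failure.

```php
<?php
// Hypothetical helper; the pattern and the /u flag are assumptions for this sketch.
function strip_scripts_or_fail(string $html): string
{
    $result = preg_replace('#<script\b.*?</script>#isu', '', $html);

    // preg_replace() returns null on failure; preg_last_error() reports the reason code.
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }

    return $result;
}

echo strip_scripts_or_fail('<b>hi</b><script>alert(1)</script>'), PHP_EOL; // <b>hi</b>

try {
    // Invalid UTF-8 makes a /u pattern fail, which stands in for the "accident".
    strip_scripts_or_fail("\xFF<script>alert(1)</script>");
} catch (RuntimeException $e) {
    echo 'rejected: ', $e->getMessage(), PHP_EOL;
}
```

Whether to throw, log, or fall back depends on where the HTML goes next; for anything rendered in a browser, rejecting the input is the more cautious choice.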

answered Sep 24, 2015 by tushar2k6
0 votes

I had been struggling with this question. I discovered you only really need one function: explode('>', $html). The single common denominator of any tag is < and >. After that, it's usually quotation marks ("). You can extract information easily once you find the common denominator. This is what I came up with:

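$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);
$counter = null; // index of the piece holding the opening script tag

foreach ($h as $k => $v) {

    $v = trim($v); // clean it up a bit

    // assumed pattern for an opening script tag
    if (preg_match('/^<script\b/i', $v)) {
        $counter = $k; // match opening tag and start counter for backtrace
    // assumed pattern for a piece that ends with the closing script tag
    } elseif ($counter !== null && preg_match('/<\/script$/i', $v)) {
        for ($i = $k - $counter; $i >= 0; $i--) {
            $h[$k - $i] = null; // backtrace and clear everything in between
        }
        $counter = null;
    }
}

$ht = [];
foreach ($h as $piece) {
    if ($piece !== null) {
        $ht[] = $piece; // drop the cleared pieces so the implode works right
    }
}
$html = implode('>', $ht); // all scripts stripped.

echo $html;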

I see this really only working for script tags, because you will never have nested script tags. Of course, you can easily add more code that does the same check and gathers nested tags.

I call it accordion coding. implode() and explode() are the easiest ways to get your logic flowing if you have a common denominator.

answered Sep 24, 2015 by android_master

...