preg_replace - Strip out HTML and malicious code, leaving punctuation and foreign languages in PHP


function stripalpha( $item ) {
    $search     = array(
         '@<script[^>]*?>.*?</script>@si'   // strip out javascript
        ,'@<style[^>]*?>.*?</style>@siu'    // strip style tags
        ,'@<[\/\!]*?[^<>]*?>@si'            // strip out html tags
        ,'@<![\s\S]*?--[ \t\n\r]*>@'        // strip multi-line comments including cdata
        ,'/\s{2,}/'
        ,'/(\s){2,}/'
    );
    $pattern    = array(
         '#[^a-zA-Z ]#'                     // non alpha characters
        ,'/\s+/'                            // more than one whitespace
    );
    $replace    = array(
         ''
        ,' '
    );
    $item = preg_replace( $search, '', html_entity_decode( $item ) );
    $item = trim( preg_replace( $pattern, $replace, strip_tags( $item ) ) );
    return $item;
}

One person suggested replacing the entire script with this one-liner:

$clear = preg_replace('/[^a-zA-Z0-9\-]/', '', urldecode($_GET['id']));

But that gives an error on the $_GET command: unknown variable id.
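That error most likely appears because the page was requested without an id parameter in the query string, so $_GET['id'] does not exist. A minimal sketch of a guarded version, assuming PHP 7 or later for the ?? operator; the helper name cleanId is hypothetical and not part of the suggested one-liner:

```php
<?php
// Hypothetical helper: reads 'id' from a query array (normally $_GET)
// and falls back to an empty string when the key is missing.
function cleanId(array $query): string
{
    $raw = $query['id'] ?? '';

    // A query such as ?id[]=x yields an array; treat that as empty input.
    if (!is_string($raw)) {
        $raw = '';
    }

    // Keep only ASCII letters, digits and hyphens, as in the one-liner.
    return preg_replace('/[^a-zA-Z0-9\-]/', '', urldecode($raw));
}

echo cleanId($_GET), "\n";
```

Note that this pattern still removes dots, exclamation points and non-ASCII letters, so it only avoids the error and does not solve the punctuation problem.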

What I'm looking for is the simplest script to remove HTML code and weird characters, replacing carriage returns with spaces, while leaving punctuation such as dots, commas, and exclamation points.

There are a lot of similar questions, but none seem to answer this question right, and their scripts strip away all characters, including sentence punctuation and foreign text such as Arabic or Spanish.

For example, if the string contains www.mygreatwebsite.com

the cleaner script returns wwwmygreatwebsitecom, which looks weird.

If someone is excited and writes 'hey great website!', it removes the exclamation points.

All the similar questions out there that I've looked at remove all characters....

I'd like to leave in punctuation and foreign language characters, with one simple regex command that clears out the stuff people paste into forms but leaves punctuation.

Naturally, carriage returns should be replaced with spaces.

Any suggestions?

To remove HTML code, it's easy: use strip_tags

$text = strip_tags($html); 

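But this works only if the string doesn't contain CSS or JavaScript code, because strip_tags removes the tags themselves and leaves their contents in place.

So a better way to deal with this problem is to use DOMDocument and XPath to find the text nodes that don't have a style or script tag as an ancestor: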
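A small sketch of that limitation, using a made-up HTML fragment: the tags disappear, but the JavaScript and CSS inside them end up in the text.

```php
<?php
// Made-up input for illustration only.
$html = '<p>Hello</p><script>alert("x");</script><style>p { color: red; }</style>';

// strip_tags drops the tags but keeps everything between them.
$text = strip_tags($html);

echo $text, "\n";
```

The script body and the CSS rule are still present in $text after the call.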

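$dom = new DOMDocument;
$dom->loadHTML($html);

$xp = new DOMXPath($dom);

// select every text node that is not inside a script or style element
$textNodeList = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');

$text = '';

// join the text nodes, separated by spaces
foreach ($textNodeList as $textNode) {
    $text .= ' ' . $textNode->nodeValue;
}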

To replace all weird characters and white-space characters, that is everything except punctuation, letters and digits, with a space:

$text = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text);

Where \pP is the character class for all punctuation characters, \pL for all letters, and \pN for all numeric characters. (To be more precise about the characters you want to preserve, take a look at the available character classes (search for "unicode character properties"))
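To see what this pattern keeps, here is a short sketch run against the examples from the question plus a Spanish and Arabic sample. The wrapper name keepText is hypothetical, and the file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical wrapper around the pattern: every run of characters that is
// not punctuation, a letter or a number becomes a single space.
function keepText(string $text): string
{
    return preg_replace('~[^\pP\pL\pN]+~u', ' ', $text);
}

$samples = [
    "www.mygreatwebsite.com",
    "hey great website!",
    "¡Hola, señor!\r\nمرحبا",
    "1 + 1 = 2",
];

foreach ($samples as $sample) {
    echo keepText($sample), "\n";
}
```

Dots, commas and exclamation points survive, and the carriage return plus line feed collapses to one space. Symbols such as + and = belong to the symbol classes (\pS), so they are replaced as well. Combining marks (\pM), for example Arabic vowel signs, are also outside these three classes, so adding \pM to the pattern may be worth considering for such text.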

Obviously, you can trim the text to finish:

$text = trim($text); 
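Put together, the steps might look like the sketch below. The function name cleanText is hypothetical. Two things here are assumptions beyond the answer: the input is taken to be UTF-8, and the XML declaration prefix is a commonly used workaround to make loadHTML read it as UTF-8 rather than its default encoding. Parser warnings for sloppy pasted markup are silenced with libxml_use_internal_errors.

```php
<?php
// Hypothetical helper combining the steps: DOM text extraction,
// character filtering, then trimming. Requires the dom extension.
function cleanText(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    // Assumption: $html is UTF-8; the prefix hints that to the parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Text nodes that are not inside script or style elements.
    $xp = new DOMXPath($dom);
    $nodes = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $text = '';
    foreach ($nodes as $node) {
        $text .= ' ' . $node->nodeValue;
    }

    // Keep punctuation, letters and numbers; everything else becomes a space.
    // preg_replace returns null on failure, so fall back to an empty string.
    $text = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text) ?? '';

    return trim($text);
}

echo cleanText("<p>¡Hola, señor!</p>\n<script>alert(1);</script><p>www.mygreatwebsite.com</p>"), "\n";
```

Filtering after the DOM step means the regex only ever sees text content, so it does not need to know anything about HTML.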
