preg_replace - Strip out HTML and malicious code, leaving punctuation and foreign languages, in PHP
function stripalpha( $item )
{
    $search = array(
         '@<script[^>]*?>.*?</script>@si'   // strip out javascript
        ,'@<style[^>]*?>.*?</style>@siu'    // strip style tags
        ,'@<[\/\!]*?[^<>]*?>@si'            // strip out html tags
        ,'@<![\s\S]*?--[ \t\n\r]*>@'        // strip multi-line comments including cdata
        ,'/\s{2,}/'
        ,'/(\s){2,}/'
    );
    $pattern = array(
         '#[^a-zA-Z ]#'                     // non alpha characters
        ,'/\s+/'                            // more than 1 whitespace
    );
    $replace = array(
         ''
        ,' '
    );
    $item = preg_replace( $search, '', html_entity_decode( $item ) );
    $item = trim( preg_replace( $pattern, $replace, strip_tags( $item ) ) );
    return $item;
}
One person suggested replacing the entire script with this one-liner:
$clear = preg_replace('/[^a-zA-Z0-9\-]/', '', urldecode($_GET['id']));
but it gives an error on the $_GET command: unknown variable id.
What I'm looking for is the simplest script that removes HTML code and weird characters, replaces carriage returns with spaces, and leaves punctuation such as dots, commas, and exclamation points.
There are a lot of similar questions, but none seem to answer this question right, and their scripts strip away all characters, including sentence punctuation and foreign-language text such as Arabic or Spanish.
For example, if the string contains www.mygreatwebsite.com
the cleaner script returns wwwmygreatwebsitecom, which looks weird.
If someone is excited and writes 'Hey great website!', it removes the exclamation points.
All the similar questions out there that I've looked at remove all characters....
I'd like to leave in punctuation and foreign-language characters, using one simple regex command that clears out the stuff people paste into forms but leaves punctuation.
Naturally, carriage returns should be replaced with spaces.
Any suggestions?
To remove HTML code, it's easy: use strip_tags.
$text = strip_tags($html);
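However, this only works well if the string doesn't contain CSS or JavaScript code, because strip_tags removes the tags themselves but leaves the text between them.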
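A small sketch of that limitation. The input string here is a made-up example, not taken from the question:

```php
<?php
// Hypothetical input: a paragraph followed by an inline script.
$html = '<p>Hello!</p><script>alert("x");</script>';

// strip_tags() drops the tags but keeps whatever text sat between them,
// so the JavaScript source survives as plain text.
$text = strip_tags($html);

echo $text, PHP_EOL; // Hello!alert("x");
```

The script body ends up glued to the visible text, which is why the tags' content has to be excluded some other way.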
So a better way to deal with this problem is to use DOMDocument and XPath to find the text nodes that don't have a style or script tag as an ancestor:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
// select every text node that is not inside a <script> or <style> element
$textNodeList = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');
$text = '';
foreach ($textNodeList as $textNode) {
    $text .= ' ' . $textNode->nodeValue;
}
To replace all weird characters and white-space characters (everything except punctuation, letters and digits) with a space:
$text = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text);
where \pP is the character class for punctuation characters, \pL is for letters, and \pN is for digits. (To be more precise about the characters you want to preserve, take a look at the available character classes; search for "unicode character properties".)
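To illustrate what this pattern keeps, here is a minimal sketch. The helper name keepTextAndPunctuation() and the sample strings are assumptions made up for the example; only the pattern itself comes from the answer.

```php
<?php
// Hypothetical helper; the name is not from the original post.
function keepTextAndPunctuation(string $text): string
{
    // Every run of characters that is not punctuation (\pP), a letter (\pL)
    // or a number (\pN) is collapsed into one space. The u modifier makes
    // PCRE treat the pattern and the subject as UTF-8.
    $clean = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text);

    // preg_replace() returns null on failure (for example invalid UTF-8).
    return trim($clean ?? '');
}

echo keepTextAndPunctuation("www.mygreatwebsite.com"), PHP_EOL;
echo keepTextAndPunctuation("Hey, great website!\r\nGracias, señor"), PHP_EOL;
```

The dots, commas and exclamation point survive, the ñ is kept as a letter, and the carriage return plus line feed becomes a single space. Note that symbols such as $ or + fall under \pS and combining accents under \pM, so they are replaced as well unless those classes are added to the list.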
Obviously, you can trim the text to finish:
$text = trim($text);
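The steps above could be combined into one function along these lines. This is only a sketch: the name cleanText() is hypothetical, and it assumes UTF-8 input plus the dom and mbstring extensions. The mb_encode_numericentity() call is an addition that is not part of the answer; it is there so that libxml does not misread non-ASCII characters.

```php
<?php
// Hypothetical wrapper; cleanText() is not a name from the original post.
function cleanText(string $html): string
{
    if (trim($html) === '') {
        return ''; // DOMDocument::loadHTML() rejects empty input
    }

    // Assumption: input is UTF-8. Non-ASCII characters are turned into
    // numeric entities so the HTML parser sees plain ASCII.
    $html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $previous = libxml_use_internal_errors(true); // silence sloppy-markup warnings
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Text nodes that are not inside <script> or <style>.
    $xp = new DOMXPath($dom);
    $nodes = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $text = '';
    foreach ($nodes as $node) {
        $text .= ' ' . $node->nodeValue;
    }

    // Keep punctuation, letters and numbers; everything else becomes a space.
    $text = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text);
    return trim($text ?? '');
}

$input = "<p>Hey, great website!</p>\n<script>alert(1)</script>"
       . "<style>p { color: red }</style><p>Gracias,<br>señor</p>";
echo cleanText($input), PHP_EOL;
```

With this sample input the script and style bodies are dropped, and the remaining text comes back as one line with its punctuation and the ñ intact.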