python - Remove non-printable words -


I'm trying to store the frequencies of the words in a text in a Python dictionary. I apply normalizations to the text to remove accent marks, symbols, punctuation, etc. After that, the text still contains words that raise a UnicodeEncodeError when printed, for example '\xe2\x80\x9c'. How can I get rid of these words?

You can use the regex module (pip3 install regex) to find ASCII or non-ASCII characters:

>>> import regex
>>> s = 'españa'
>>> s
'españa'
>>> regex.findall(r'\p{ASCII}', s)
['e', 's', 'p', 'a', 'a']
>>> regex.findall(r'\P{ASCII}', s)
['ñ']

With the standard re module, you can use a character class or a negated character class:

>>> import re
>>> re.findall(r'[a-zA-Z]', s)
['e', 's', 'p', 'a', 'a']
>>> re.findall(r'[^a-zA-Z]', s)
['ñ']
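Since the question is about dropping whole words rather than single characters, here is a minimal sketch that applies the same character class to entire tokens. The helper name `ascii_words` and the whitespace tokenization are assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical helper: a token is kept only if every character is an ASCII letter.
ASCII_WORD = re.compile(r'[A-Za-z]+')

def ascii_words(text):
    """Return the whitespace-separated tokens made up solely of ASCII letters."""
    return [tok for tok in text.split() if ASCII_WORD.fullmatch(tok)]

# '\u201c' is the left double quotation mark, whose UTF-8 bytes are \xe2\x80\x9c.
print(ascii_words('hola españa \u201c adios'))  # ['hola', 'adios']
```

Note that this discards 'españa' entirely. If you would rather keep such words in unaccented form, strip the diacritics first.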

You can also normalize the string to remove diacritics:

>>> import unicodedata
>>> ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
'espana'
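To see why this works, it may help to look at what NFD normalization does to a single character: it splits 'ñ' into a base letter plus a combining mark of category 'Mn' (nonspacing mark), and the filter then drops the mark. The function name `strip_marks` below is a hypothetical wrapper around the one-liner above.

```python
import unicodedata

def strip_marks(text):
    """Decompose text (NFD) and drop all nonspacing marks (category 'Mn')."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# Show the code points that 'ñ' decomposes into.
for c in unicodedata.normalize('NFD', 'ñ'):
    print(f'U+{ord(c):04X}', unicodedata.category(c), unicodedata.name(c))
# U+006E Ll LATIN SMALL LETTER N
# U+0303 Mn COMBINING TILDE

print(strip_marks('españa'))  # espana
```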

You can use the same methods on Zalgo text. Here s2 holds the Zalgo string displayed after its UTF-8 encoding:

>>> s2.encode('utf-8') b's\xcc\x83\xcc\x84\xcd\xa7\xcc\x92\xcd\x8c\xcd\x8c\xcd\x9b\xcc\x8c\xcd\x8f\xcc\xa8\xcd\xa1\xcd\x94\xcc\xae\xcd\x89\xcd\x9a\xcc\xb0\xcc\xa3o\xcc\xbf\xcc\x8a\xcd\xa8\xcd\xa7\xcc\xa1\xcc\xaf\xcc\xab\xcc\xa9\xcc\xb0\xcd\x85\xcc\x96m\xcc\x8b\xcc\x87\xcc\x8c\xcd\xaa\xcd\xa1\xcd\x8f\xcc\xa6\xcc\xb0\xcd\x99\xcc\xa0\xcc\xa9\xcc\xa6\xcd\x88e\xcc\x82\xcc\x85\xcd\x8b\xcd\xa9\xcd\x8b\xcd\x8c\xcc\xa5\xcd\x9a\xcc\xba\xcc\xac \xcc\x94\xcd\xaa\xcd\x9f\xcc\xb6\xcc\xb8\xcc\xaa\xcc\xae\xcc\xb9z\xcc\x83\xcd\xa6\xcd\xa9\xcc\xb7\xcd\x98\xcc\x9d\xcc\x98\xcc\xa9\xcd\x9a\xcd\x9a\xcc\xac\xcc\x99a\xcc\x80\xcc\x88\xcc\x88\xcd\xa2\xcc\xb4\xcc\x98\xcc\xbb\xcc\xa6\xcc\xb2\xcc\x99l\xcc\x87\xcd\x82\xcc\x89\xcc\x86\xcc\x88\xcc\x94\xcc\x8d\xcc\xb7\xcc\xb6\xcd\xa1\xcc\xb3\xcc\xa5\xcc\x96\xcc\x9c\xcc\xae\xcc\xba\xcd\x99\xcc\x9dg\xcc\x8f\xcd\xa3\xcd\xad\xcc\x8c\xcd\x8b\xcc\x91\xcd\x83\xcc\x8f\xcc\xb0\xcd\x88o\xcd\xab\xcc\x90\xcd\xa4\xcd\x90\xcd\x84\xcd\xa3\xcd\x90\xcd\x9e\xcc\xa9\xcd\x96\xcd\x8e\xcc\xb9\xcc\xab\xcc\x96\xcc\xb9 \xcc\x87\xcc\xbf\xcc\x9b\xcc\x98\xcc\x97\xcd\x96\xcc\xae\xcc\x97t\xcd\xa6\xcc\xa0\xcc\x9f\xcc\xae\xcc\xb1\xcc\xb9\xcc\x9d\xcc\x9c\xcc\xade\xcd\x97\xcd\x83\xcc\xbe\xcd\xae\xcd\x8c\xcd\x84\xcc\xa7\xcc\xaa\xcc\x9d\xcc\xa6\xcc\xaa\xcc\xb1x\xcc\x84\xcc\x81\xcc\x8d\xcd\xa5\xcd\xad\xcd\xa9\xcd\x98\xcc\xa8\xcc\x9e\xcd\x9a\xcd\x93t\xcd\xac\xcc\x8b\xcc\x82\xcc\x87\xcc\xb4\xcd\x87\xcc\xb2\xcc\xab\xcd\x8e\xcd\x8d\xcc\xb9\xcd\x88' 


s̃̄ͧ̒͌͌͛̌͏̨͔̮͉͚̰̣͡o̡̯̫̩̰̖̿̊ͨͧͅm̋̇̌ͪ͡͏̦̰͙̠̩̦͈ê̥͚̺̬̅͋ͩ͋͌ ̶̸̪̮̹̔ͪ͟z̷̝̘̩͚͚̬̙̃ͦͩ͘à̴̘̻̦̲̙̈̈͢l̷̶̳̥̖̜̮̺͙̝̇͂̉̆̈̔̍͡g̰͈̏ͣͭ̌͋̑̓̏o̩͖͎̹̫̖̹ͫ̐ͤ͐̈́ͣ͐͞ ̛̘̗͖̮̗̇̿t̠̟̮̱̹̝̜̭ͦȩ̪̝̦̪̱͗̓̾ͮ͌̈́x̨̞͚͓̄́̍ͥͭͩ͘t̴͇̲̫͎͍̹͈ͬ̋̂̇


>>> ''.join(regex.findall(r'\p{ASCII}', s2))
'some zalgo text'
>>> ''.join(c for c in unicodedata.normalize('NFD', s2) if unicodedata.category(c) != 'Mn')
'some zalgo text'
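Putting the pieces together for the original goal of counting word frequencies, a rough sketch could look like the following. The function name `word_frequencies`, the whitespace tokenization, and the lowercasing are assumptions made here, and `str.isascii()` requires Python 3.7 or later.

```python
import unicodedata
from collections import Counter

def word_frequencies(text):
    """Count words after stripping diacritics and dropping non-ASCII tokens."""
    # Step 1: decompose and remove nonspacing marks, as in the answer above.
    decomposed = unicodedata.normalize('NFD', text)
    no_marks = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    # Step 2: keep only tokens that are purely ASCII letters (assumed policy).
    words = [w.lower() for w in no_marks.split() if w.isascii() and w.isalpha()]
    return Counter(words)

freqs = word_frequencies('España españa \u201c text')
print(freqs)  # Counter({'espana': 2, 'text': 1})
```

The stray '\u201c' token never reaches the dictionary, so printing the keys should no longer raise UnicodeEncodeError on an ASCII-only terminal.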
