python - Bag of words representation using sklearn plus SnowballStemmer


I have a list of songs, like:

list2 = ["first song", "second song", "third song"...] 

Here is my code:

    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.corpus import stopwords

    vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
    bagofwords = vectorizer.fit(list2)
    bagofwords = vectorizer.transform(list2)

It works, but I also want to stem the words in the list.

I've tried to do it this way:

    def tokeni(self, data):
        return [SnowballStemmer("english").stem(word) for word in data.split()]

    vectorizer = CountVectorizer(stop_words=stopwords.words('english'),
                                 tokenizer=self.tokeni)

But it didn't work. What am I doing wrong?

Update: with the tokenizer, I get words like "oh...", "s-like...", "knees,", whereas without the tokenizer I don't get words with dots, commas, etc.

You can pass a custom preprocessor instead. It should work just as well, and it retains the functionality of the default tokenizer:

    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.stem import SnowballStemmer

    list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"]

    def preprocessor(data):
        # Stem each whitespace-separated word, then re-join so that the
        # default tokenizer can still split the text and drop punctuation.
        return " ".join([SnowballStemmer("english").stem(word) for word in data.split()])

    vectorizer = CountVectorizer(preprocessor=preprocessor).fit(list2)
    print(vectorizer.vocabulary_)

    # should print this mapping (key order may differ):
    # {'raining': 2, 'raini': 1, 'rain': 0}
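One caveat with stemming in the preprocessor: the text is split on whitespace before the default tokenizer strips punctuation, so "raining!" reaches the stemmer with the "!" attached and ends up as its own vocabulary entry. A minimal sketch of an alternative, assuming you are fine with subclassing CountVectorizer, is to stem after the built-in analyzer has tokenized the text. The class name StemmedCountVectorizer is hypothetical and not part of sklearn.

```python
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")


class StemmedCountVectorizer(CountVectorizer):
    """Hypothetical helper: a CountVectorizer that stems every token."""

    def build_analyzer(self):
        # The parent analyzer lowercases the text, tokenizes it with the
        # default token pattern (which drops punctuation) and removes any
        # configured stop words. Stemming is applied to its output.
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(token) for token in analyzer(doc)]


list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"]

vectorizer = StemmedCountVectorizer().fit(list2)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

With this variant "raining!" is tokenized to "raining" first and only then stemmed, so it should fall into the same bucket as "rain". Note that stop words passed via stop_words are matched against the unstemmed tokens here.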
