python - Bag of words representation using sklearn plus Snowballstemmer -

I have a list of songs, like this:

list2 = ["first song", "second song", "third song"...]

Here is my code:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Build the vocabulary, dropping English stop words
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
bagofwords = vectorizer.fit(list2)
bagofwords = vectorizer.transform(list2)

It works, but I also want to stem the words.

I tried to do it this way:

from nltk.stem import SnowballStemmer

def tokeni(self, data):
    # Split on whitespace only, then stem each piece
    return [SnowballStemmer("english").stem(word) for word in data.split()]

vectorizer = CountVectorizer(stop_words=stopwords.words('english'),
                             tokenizer=self.tokeni)

But it didn't work. What am I doing wrong?

Update: with the tokenizer I get words like "oh...", "s-like...", and "knees,", whereas without the tokenizer I don't get any words with dots, commas, etc.

You can pass a custom preprocessor instead. It should work just as well, and it retains the functionality of the default tokenizer:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer

list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"]

def preprocessor(data):
    # Stem each whitespace-separated word, then rejoin so the default tokenizer still runs
    return " ".join([SnowballStemmer("english").stem(word) for word in data.split()])

vectorizer = CountVectorizer(preprocessor=preprocessor).fit(list2)
print(vectorizer.vocabulary_)

# should print this:
# {'raining': 2, 'raini': 1, 'rain': 0}
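
The stray punctuation described in the update most likely comes from `data.split()`. A custom `tokenizer` replaces CountVectorizer's default token pattern (`(?u)\b\w\w+\b`), so splitting on whitespace alone leaves dots and commas attached to the words. If you would rather keep the `tokenizer` hook, one option is to apply that same pattern first and then stem each token. This is only a rough sketch: the helper names `regex_tokens`, `make_stemming_tokenizer`, and `build_vectorizer` are made up for illustration, and `stem` is assumed to be any callable that maps a word to its stem, such as `SnowballStemmer("english").stem`.

```python
import re

# CountVectorizer's documented default token pattern:
# runs of two or more word characters, punctuation treated as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def regex_tokens(text):
    """Tokenize with the default pattern instead of a plain whitespace split."""
    return TOKEN_PATTERN.findall(text)


def make_stemming_tokenizer(stem):
    """Return a tokenizer that tokenizes first and stems second.

    `stem` is any word -> stem callable (hypothetical parameter name).
    """
    def tokenize(text):
        return [stem(token) for token in regex_tokens(text)]
    return tokenize


def build_vectorizer():
    """Wire the tokenizer into CountVectorizer (requires nltk and scikit-learn)."""
    from nltk.stem import SnowballStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = SnowballStemmer("english")  # create once, reuse for every word
    return CountVectorizer(tokenizer=make_stemming_tokenizer(stemmer.stem))


sample = "oh... s-like... knees,"
print(sample.split())        # whitespace split keeps the punctuation
print(regex_tokens(sample))  # the default pattern drops it
```

Unlike the preprocessor approach, this stems after the punctuation has been stripped, so an input like "raining!" would reach the stemmer as "raining". Newer scikit-learn releases may warn that `token_pattern` is unused when a `tokenizer` is supplied. That warning is harmless here.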
