python - Bag of words representation using sklearn plus SnowballStemmer
I have a list of songs, like

    list2 = ["first song", "second song", "third song"...]

Here is my code:
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.corpus import stopwords

    vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
    bagofwords = vectorizer.fit(list2)
    bagofwords = vectorizer.transform(list2)

It works, but I want to stem the words in the list.
I've tried doing it this way:

    def tokeni(self, data):
        return [SnowballStemmer("english").stem(word) for word in data.split()]

    vectorizer = CountVectorizer(stop_words=stopwords.words('english'), tokenizer=self.tokeni)

But it didn't work. What am I doing wrong?
Update: with the tokenizer I get words like "oh...", "s-like..." and "knees,", whereas without the tokenizer I don't get words with dots, commas, etc.
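The difference described in the update can probably be reproduced without sklearn at all. The sketch below assumes that CountVectorizer's default tokenizer is equivalent to applying its documented default `token_pattern` with `re.findall`; the sample line is made up from the words quoted above.

```python
import re

# Assumption: this mirrors CountVectorizer's documented default token_pattern.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical lyric line built from the words mentioned in the update.
line = "oh... s-like... knees, and toes"

# A tokenizer based on str.split() only cuts on whitespace,
# so punctuation stays attached to the words.
split_tokens = line.split()

# The regex keeps only runs of two or more word characters,
# which drops dots, commas and single-letter fragments.
regex_tokens = re.findall(DEFAULT_TOKEN_PATTERN, line)

print(split_tokens)  # ['oh...', 's-like...', 'knees,', 'and', 'toes']
print(regex_tokens)  # ['oh', 'like', 'knees', 'and', 'toes']
```

So a custom `tokenizer` that just calls `data.split()` replaces the regex step, and the stemmer then receives tokens such as "knees," with the comma still attached.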
You can pass a custom preprocessor instead. It should work just as well, and it retains the functionality of the default tokenizer:
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.stem import SnowballStemmer

    list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"]

    def preprocessor(data):
        # Stem each whitespace-separated word, then rejoin the text
        # so that the default tokenizer still runs afterwards.
        return " ".join([SnowballStemmer("english").stem(word) for word in data.split()])

    vectorizer = CountVectorizer(preprocessor=preprocessor).fit(list2)
    print(vectorizer.vocabulary_)
    # should print this (key order may differ):
    # {'raining': 2, 'raini': 1, 'rain': 0}
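Note that 'raining' survives in the vocabulary above because the stemmer sees "raining!" with the punctuation still attached. One possible alternative, offered as a sketch rather than the definitive fix, is to stem after tokenization by wrapping the analyzer returned by `build_analyzer()`. The helper name `stemmed_words` is made up for illustration, and stop words are left out here; they could be passed to the inner CountVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer

list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"]

stemmer = SnowballStemmer("english")

# The default analyzer lowercases the text and applies the default token pattern.
default_analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    # Hypothetical helper: tokenize first, then stem each clean token.
    return [stemmer.stem(token) for token in default_analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_words).fit(list2)
print(sorted(vectorizer.vocabulary_))  # ['rain', 'raini']
```

Because punctuation is stripped before stemming, "raining!" and "raining" both end up as "rain".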