CountVectorizer to Extract Features from Texts in Python, in Detail | by Rashida Nasrin Sucky | Oct, 2023

Everything you need to know to use CountVectorizer efficiently in Sklearn

The most basic data processing that any Natural Language Processing (NLP) task requires is converting text data to numeric data. As long as the data is in text form, we cannot do any kind of computation on it.

There are multiple methods available for this text-to-numeric conversion. This article explains one of the most basic vectorizers, the CountVectorizer method in the scikit-learn library.

This method is very simple. It takes the frequency of occurrence of each word as the numeric value. An example will make it clear.

In the following block:

  • We will import the CountVectorizer method.
  • Call the method.
  • Fit the text data to the CountVectorizer method and convert that to an array.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# This is the text to be vectorized
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. \
I am trying to learn how to use count vectorizer."]

cv = CountVectorizer()
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_arr

Output:

array([[1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]],
dtype=int64)

Here I have the numeric values representing the text data above.

How do we know which values represent which words in the text?

To make that clear, it will be helpful to convert the array into a DataFrame where column names will be the words themselves.

# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
cnt_df = pd.DataFrame(data=cnt_arr, columns=cv.get_feature_names_out())
cnt_df

Now, it shows clearly. The value of the word ‘also’ is 1, which means ‘also’ appeared only once in the text. The word ‘aunt’ appeared twice in the text, so the value of the word ‘aunt’ is 2.
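To see where such counts come from, they can be reproduced by hand. The following is only a rough sketch in plain Python, not how the library is implemented. It assumes CountVectorizer's default settings (lowercasing, and keeping only tokens of two or more word characters) and uses a shortened sample of the text, so ‘aunt’ appears only once here. The names sample, tokens, vocabulary, and row are made up for this illustration.

```python
import re
from collections import Counter

# Shortened sample text (an assumption for illustration, not the full text above)
sample = "Hello Everyone! This is Lilly. My aunt's name is also Lilly."

# Lowercase the text, then keep tokens of two or more word characters.
# Single letters such as the "s" left over from "aunt's" are dropped.
tokens = re.findall(r"\b\w\w+\b", sample.lower())

# Count each token and order the vocabulary alphabetically,
# which is the column order seen in the DataFrame
counts = Counter(tokens)
vocabulary = sorted(counts)
row = [counts[word] for word in vocabulary]

print(vocabulary)
print(row)
```

Each position in row lines up with the word at the same position in vocabulary, which is the same relationship the DataFrame columns show.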

In the last example, all the sentences were in one string. So, we got only one row of data for four sentences. Let’s rearrange the text and…
