In today's article, I'll be taking you through text pre-processing in machine learning.
NLP, or Natural Language Processing, is the field where computers try to understand human language. NLP is like trying to teach a computer to understand your mom's "fine" when it means anything but fine. It's about making machines realize that "I'm hungry" might mean "Let's order pizza," not "I'm about to eat my keyboard."
Photo by Pietro Jeng on Unsplash
When building machine learning models, we feed them data. But machines can't understand text the way humans do, so when we work with text we must pre-process it for our models to understand it and work effectively.
So buckle up and let me walk you through it. I'll be using the Kaggle coronavirus tweet dataset https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification
Let's import our basic libraries and load the dataset
import pandas as pd

tweets = pd.read_csv("corona_NLP_train.csv")
tweets.head()

'''output
   UserName  ScreenName   Location     TweetAt                                      OriginalTweet           Sentiment
0      3799       48751     London  16-03-2020  @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...             Neutral
1      3800       48752         UK  16-03-2020  advice Talk to your neighbours family to excha...            Positive
2      3801       48753  Vagabonds  16-03-2020  Coronavirus Australia: Woolworths to give elde...            Positive
3      3802       48754        NaN  16-03-2020  My food stock is not the only one which is emp...            Positive
4      3803       48755        NaN  16-03-2020  Me, ready to go at supermarket during the #COV...  Extremely Negative
'''
Since we will be focusing on text pre-processing I will be using the OriginalTweet and sentiment columns only, so let's go ahead and drop the rest and rename our OriginalTweet column to text
tweets.drop(['UserName', 'ScreenName', 'Location', 'TweetAt'], axis=1, inplace=True)
tweets.rename(columns={'OriginalTweet': 'text'}, inplace=True)
tweets.head()

'''output
                                                text           Sentiment
0  @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...             Neutral
1  advice Talk to your neighbours family to excha...            Positive
2  Coronavirus Australia: Woolworths to give elde...            Positive
3  My food stock is not the only one which is emp...            Positive
4  Me, ready to go at supermarket during the #COV...  Extremely Negative
'''
The first thing I'll do is convert our text to lowercase. "Why?" you might ask. Because computers see 'Apple' and 'apple' as different fruits unless you tell them otherwise.
tweets['text'] = tweets['text'].str.lower()
tweets.head()

'''output
                                                text           Sentiment
0  @menyrbie @phil_gahan @chrisitv https://t.co/i...             Neutral
1  advice talk to your neighbours family to excha...            Positive
2  coronavirus australia: woolworths to give elde...            Positive
3  my food stock is not the only one which is emp...            Positive
4  me, ready to go at supermarket during the #cov...  Extremely Negative
'''
Our text is now all lowercase. In the next step I will remove hyperlinks and punctuation.
# removing hyperlinks -- note the capital \S (any non-whitespace character);
# a lowercase \s here would fail to match the URLs at all
tweets['text'] = tweets['text'].str.replace(r'http\S+', '', regex=True)

# removing punctuation
tweets['text'] = tweets['text'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)

tweets.head()

'''output
                                                text           Sentiment
0  menyrbie philgahan chrisitv httpstcoifz9fan2pa...             Neutral
1  advice talk to your neighbours family to excha...            Positive
2  coronavirus australia woolworths to give elder...            Positive
3  my food stock is not the only one which is emp...            Positive
4  me ready to go at supermarket during the covid...  Extremely Negative
'''
As you can observe from the output, the punctuation is gone. One pitfall to watch for: if you write \s (whitespace) instead of \S (non-whitespace) in the URL pattern, the links never match, and the punctuation step then mangles them into leftovers like the httpstcoifz9fan2pa fragment in row 0. Removing links and punctuation is key, since they do not help predict whether a tweet's sentiment is negative or positive.
Our next step will be removing stopwords. What are stopwords, and why do we remove them?
Imagine you're at a party, and everyone's talking, but all you hear are the words "the," "a," "an," "in," "on," "at." These are stopwords, the linguistic equivalent of filler words. They're common words that don't carry significant meaning on their own but are essential for grammar and flow in human language. For this reason we remove them: they carry almost no signal for our models.
We use the NLTK library to remove stopwords.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

tweets['text'] = tweets['text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)
tweets.head()

'''output
                                                text           Sentiment
0  menyrbie philgahan chrisitv httpstcoifz9fan2pa...             Neutral
1  advice talk neighbours family exchange phone n...            Positive
2  coronavirus australia woolworths give elderly ...            Positive
3  food stock one empty please dont panic enough ...            Positive
4  ready go supermarket covid19 outbreak im paran...  Extremely Negative
'''
The next step is one of my favorites: with the TextBlob library, it is possible to correct spelling mistakes in our text.
from textblob import TextBlob

# note: correct() is slow, so expect this to take a while on a dataset this size
tweets['text'] = tweets['text'].apply(lambda x: str(TextBlob(x).correct()))
Our next step will be tokenization. What is tokenization?
Tokenization is the process where we turn text into bite-sized pieces that computers can chew on. It's the process of breaking down text into individual units or "tokens." These tokens can be words, phrases, or even characters, depending on how you want to slice your linguistic data.
Let's consider the sentence "I love you Deon, you are my pokoloco". No one has ever told me this, but let's continue. If we tokenize the statement, we get 'I', 'love', 'you', 'Deon' ... and so on, as in the sketch below.
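Here is a minimal sketch of that in code, using NLTK's word_tokenize (it relies on the punkt tokenizer models, which nltk.download fetches):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

sentence = "I love you Deon, you are my pokoloco"
tokens = word_tokenize(sentence)
print(tokens)
# expected output: ['I', 'love', 'you', 'Deon', ',', 'you', 'are', 'my', 'pokoloco']
# note how word_tokenize splits the comma off as its own token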
I will also talk about stemming and lemmatization, and then we will put it all into code at once.
Stemming and lemmatization
Stemming is like using a machete on words. It's a rough, rule-based process where you chop off the ends of words to get to their root form, or what you hope is close enough to the root (e.g., a stemmer might reduce "interchanger" to "interchang").
Lemmatization, on the other hand, is like sending your words to a spa. It's more sophisticated: it uses a dictionary together with the context and part of speech to return words to their base or dictionary form, known as a lemma. For example, "better" would become "good" because it understands that "better" is a comparative form of "good".
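Before we apply this to our tweets, here is a minimal side-by-side sketch of the two (the pipeline below only uses the lemmatizer; the PorterStemmer here is just for comparison):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # 'studi' -- the machete: suffix chopped off
print(lemmatizer.lemmatize("studies"))  # 'study' -- the spa: a real dictionary form

# lemmatize() treats words as nouns by default; pass the part of speech
# ('a' for adjective) to get the "better" -> "good" behaviour described above
print(lemmatizer.lemmatize("better", pos='a'))  # 'good'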
With all that being said, let's write our code.
import nltk
nltk.download('wordnet')
nltk.download('punkt')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

tweets['lematized_tokens'] = tweets['text'].apply(lemmatize_text)
tweets.head()
'''output
                                                text           Sentiment                                   lematized_tokens
0  menyrbie philgahan chrisitv httpstcoifz9fan2pa...             Neutral  [menyrbie, philgahan, chrisitv, httpstcoifz9fa...
1  advice talk neighbour family exchange phone nu...            Positive  [advice, talk, neighbour, family, exchange, ph...
2  coronavirus australia woolworth give elderly d...            Positive  [coronavirus, australia, woolworth, give, elde...
3  food stock one empty please dont panic enough ...            Positive  [food, stock, one, empty, please, dont, panic,...
4  ready go supermarket covid19 outbreak im paran...  Extremely Negative  [ready, go, supermarket, covid19, outbreak, im...
'''
Now it is time for some exploratory text analysis. We will start by computing the word count of each tweet.
tweets['word_length'] = tweets['text'].str.split().apply(len)
tweets.head()

'''output
                                                text           Sentiment                                   lematized_tokens  word_length
0  menyrbie philgahan chrisitv httpstcoifz9fan2pa...             Neutral  [menyrbie, philgahan, chrisitv, httpstcoifz9fa...            6
1  advice talk neighbour family exchange phone nu...            Positive  [advice, talk, neighbour, family, exchange, ph...           27
2  coronavirus australia woolworth give elderly d...            Positive  [coronavirus, australia, woolworth, give, elde...           13
3  food stock one empty please dont panic enough ...            Positive  [food, stock, one, empty, please, dont, panic,...           24
4  ready go supermarket covid19 outbreak im paran...  Extremely Negative  [ready, go, supermarket, covid19, outbreak, im...           24
'''
You can use the word count to visualize how tweet length is distributed across the sentiment classes. I'll skip that plot for now (there's a minimal sketch below if you're curious) and go on to visualize the most frequently used words.
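For the curious, a minimal sketch of that skipped plot, overlaying one word-count histogram per sentiment class with the same matplotlib we use below:

import matplotlib.pyplot as plt

# overlay one word-count histogram per sentiment class
for sentiment, group in tweets.groupby('Sentiment'):
    plt.hist(group['word_length'], bins=30, alpha=0.5, label=sentiment)

plt.xlabel("Words per tweet")
plt.ylabel("Number of tweets")
plt.title("Word count distribution by sentiment")
plt.legend()
plt.show()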
from collections import Counter
import matplotlib.pyplot as plt

# we first flatten the list of lematized tokens
all_tokens = [token for sublist in tweets['lematized_tokens'] for token in sublist]

# count the frequency of each word
word_counts = Counter(all_tokens)

# get the most common words
most_common_words = word_counts.most_common(20)  # show the top 20

# extract words and counts for plotting
words, counts = zip(*most_common_words)

# create a bar chart
plt.figure(figsize=(10, 6))
plt.bar(words, counts)
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.title("Most Frequent Words")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
Let's look at word clouds.
What’s a Word Cloud?
Imagine if your text data was a party, and each word was a guest. The more often a word appears, the bigger and more prominent it becomes in the crowd. That’s a word cloud in essence. It’s a visual depiction of word frequency where:
Size Matters: Bigger words are used more often.
Color: Sometimes words are color-coded for additional information or just for aesthetics.
Orientation: Words can be placed in various orientations, creating a chaotic yet beautiful mess.
Let's look at the word cloud of our text data.
from wordcloud import WordCloud

wordcloud = WordCloud(width=800, height=400,
                      stopwords=stop_words,
                      min_font_size=10,
                      background_color='white').generate(' '.join(all_tokens))

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
We can wrap up our text pre-processing by generating a word cloud for each sentiment class.
import matplotlib.pyplot as plt

# create a new DataFrame with only the 'text' and 'Sentiment' columns
sentiment_df = tweets[['text', 'Sentiment']]

# group the data by sentiment and concatenate the text
sentiment_text = sentiment_df.groupby('Sentiment')['text'].apply(lambda x: ' '.join(x))

# create a word cloud for each sentiment
for sentiment in sentiment_text.index:
    text = sentiment_text[sentiment]
    wordcloud = WordCloud(width=1200, height=800,
                          stopwords=stop_words,
                          min_font_size=10,
                          background_color='white').generate(text)
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Word Cloud for Sentiment: {sentiment}')
    plt.show()
I'll wrap up for now. I hope you now have a clear grasp of text pre-processing.