"Filtering the Noise: A Comprehensive Study on Identifying and Categorizing Offensive Language in Social Media"
Authors:
Karina Karira
Simran Lahrani
Yashraj Mulwani
Under the guidance of our mentor: Mrs. Vidya Zope
Introduction
Social media has revolutionized communication, enabling individuals to connect and share information like never before. However, this digital revolution has also brought forth a troubling phenomenon: the widespread use of offensive language. Offensive language, encompassing hate speech, personal attacks, and harassment, can have significant negative impacts on individuals and communities. It fuels division, perpetuates harmful stereotypes, and, in some cases, leads to real-world harm.
In this study, we delve into the impact of offensive language on social media and explore strategies for identifying and categorizing such content. Understanding these dynamics is crucial for creating more inclusive, supportive, and respectful digital spaces.
Abstract
In this project, we apply logistic regression to a natural language processing (NLP) task. We begin by extracting and transforming text data into numerical features, using methods like TF-IDF or word embeddings. Logistic regression is implemented, either from scratch or with a machine learning library. We select an NLP task, prepare a labeled dataset, and split it into training and testing sets.
The model is trained on the training data and evaluated using various metrics. An error analysis is performed to pinpoint areas where the model makes mistakes, enabling us to fine-tune the model for improved performance. This project sheds light on the application of logistic regression in NLP and the significance of error analysis in model development.
Methodology applied
As part of data cleaning and preprocessing, we followed the steps below:
Text Preprocessing: Text preprocessing is the initial step. It involves cleaning and standardizing the text data. Common preprocessing steps include converting text to lowercase and removing special characters, punctuation, and HTML tags.
Tokenization: Tokenization splits the text into individual words or tokens. This step converts the text into a format suitable for analysis by breaking it down into its constituent elements.
Stopword Removal: Common stopwords (e.g., "the," "and," "is") are often removed from the text. Stopwords don't typically carry significant information and can be excluded to focus on more meaningful terms.
Feature Extraction (TF-IDF Vectorization): Feature extraction is a crucial step in NLP. Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is a common technique used to convert tokens into numerical features. TF-IDF assigns weights to words based on their importance within a specific document and across the entire dataset.
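For illustration, TF-IDF features can be computed with scikit-learn's TfidfVectorizer. This is a generic sketch rather than the exact vectorizer configuration used in the project, and the sample corpus is invented:

from sklearn.feature_extraction.text import TfidfVectorizer

# a tiny invented corpus for demonstration purposes
corpus = [
    "I love this product",
    "Awful experience, I would never buy this again",
]

# lowercasing and English stopword removal mirror the preprocessing steps above
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(corpus)  # sparse matrix of shape (n_documents, n_terms)

print(vectorizer.get_feature_names_out())
print(X.toarray())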
The dataset used for this purpose (along with the stopword list needed later) can be obtained with the following code:
import nltk
nltk.download('twitter_samples')
nltk.download('stopwords')
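Once downloaded, the individual tweets can be read back as plain strings. This snippet is an illustrative addition, not part of the original code listing:

from nltk.corpus import twitter_samples

# the corpus ships with 5,000 positive and 5,000 negative example tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')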
The tweets are processed before being fed to the algorithm. The code for processing the tweets can be given as:
# this function takes a tweet as input and performs a series of text preprocessing steps, including tokenization, removal of common stopwords,
# punctuation, and other specific elements commonly found in tweets (e.g., hashtags, retweet indicators). It also stems the remaining words
# before returning them in a list. This can be useful when preparing text data for natural language processing tasks, such as sentiment analysis
# or text classification.
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags (only the hash # sign is removed from the word)
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and    # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)        # stem the word
            tweets_clean.append(stem_word)
    return tweets_clean
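As a quick sanity check, the function can be applied to a sample tweet (an illustrative call with an invented tweet, not from the original post):

print(process_tweet("RT @user: I love #NLP :) https://example.com"))
# prints a short list of lowercase stems such as ['love', 'nlp', ':)']
# (exact output depends on the NLTK stopword list and the Porter stemmer)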
Algorithms used
In this project, we have focused on classifying whether the sentiment conveyed by a tweet is positive or negative. The classification algorithm we use for this is Logistic Regression.
Logistic Regression
(Figure: sigmoid representation of logistic regression)
Implementing a Logistic regression requires the following functions to be implemented:
Sigmoid Function (Logistic Function): The sigmoid function maps real-valued numbers to a range between 0 and 1, representing probabilities. It is the core of logistic regression, modeling the likelihood that a given input belongs to a particular class.
Cost Function (Log-Likelihood): The cost function, often referred to as log-likelihood, quantifies the error between predicted probabilities and actual labels. Minimizing this cost is the primary objective in logistic regression training.
Gradient Descent: Gradient descent is employed to iteratively update the model's parameters (theta) by computing the gradient of the cost function. This process fine-tunes the model for better accuracy.
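In equation form, with m training examples, feature vector x, label y in {0, 1}, learning rate alpha, and weight vector theta, these three pieces are:

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

$$\theta := \theta - \frac{\alpha}{m} X^T (h - y)$$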
Logistic regression and the gradient descent algorithm can be implemented from scratch to categorise the tweets; in this blog, we implement logistic regression.
The tweet samples are split into training and testing sets.
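A typical way to build this split, reusing the all_positive_tweets and all_negative_tweets lists loaded earlier and assuming 4,000 tweets per class for training and 1,000 for testing, is sketched below (the exact split used in the project may differ):

import numpy as np

# 4,000 tweets of each class for training, the remaining 1,000 for testing
train_x = all_positive_tweets[:4000] + all_negative_tweets[:4000]
test_x = all_positive_tweets[4000:] + all_negative_tweets[4000:]

# labels as column vectors: 1.0 for positive tweets, 0.0 for negative ones
train_y = np.append(np.ones((4000, 1)), np.zeros((4000, 1)), axis=0)
test_y = np.append(np.ones((1000, 1)), np.zeros((1000, 1)), axis=0)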
import numpy as np

def sigmoid(z):
    '''
    Input:
        z: the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    h = 1 / (1 + np.exp(-z))
    return h
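The training loop is not reproduced here; a minimal sketch of the cost computation and gradient-descent update using the sigmoid above (with assumed names: x is an (m, 3) feature matrix with a bias column, such as the one produced by the extract_features helper sketched later, and y is an (m, 1) vector of 0/1 labels) could look like:

def gradient_descent(x, y, theta, alpha, num_iters):
    # x: (m, 3) feature matrix with a bias column; y: (m, 1) vector of 0/1 labels
    # theta: (3, 1) weight vector; alpha: learning rate; num_iters: iterations
    m = x.shape[0]
    for _ in range(num_iters):
        # predicted probabilities for every training example
        h = sigmoid(np.dot(x, theta))
        # negative log-likelihood cost over the whole training set
        J = -(1 / m) * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
        # gradient step: move theta against the gradient of the cost
        theta = theta - (alpha / m) * np.dot(x.T, (h - y))
    return float(J), theta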
Testing the model
The above model can be tested using the predict_tweet function. This function returns the predicted probability that a tweet is positive, which is then thresholded at 0.5 to assign a category.
def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3, 1) vector of weights
    Output:
        y_pred: the probability of a tweet being positive or negative
    '''
    # extract the features of the tweet and store them in x
    x = extract_features(tweet, freqs)
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred
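Note that predict_tweet relies on a freqs dictionary and an extract_features helper that are not shown in this post. A minimal sketch consistent with the (3, 1) theta, using a bias term plus positive- and negative-word counts as features (the (word, label) key convention is an assumption), might look like:

def extract_features(tweet, freqs):
    # freqs is assumed to map (word, label) pairs to how often the word
    # appears in tweets of that label (1.0 = positive, 0.0 = negative)
    words = process_tweet(tweet)
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in words:
        x[0, 1] += freqs.get((word, 1.0), 0)  # occurrences in positive tweets
        x[0, 2] += freqs.get((word, 0.0), 0)  # occurrences in negative tweets
    return x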
Adding a few lines around predict_tweet lets us determine whether the tweet is positive or negative. The code can be given as:
my_tweet = 'Awful experience. I would never buy this product again!'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')
The output for the above code is:
['aw', 'experi', 'would', 'never', 'buy', 'product']
[[0.49718744]]
Negative sentiment
Similarly, we check a positive statement to test the model:
my_tweet = 'He was amazed by his performance.'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')
The output for the above code is:
['amaz', 'perform']
[[0.50299493]]
Positive sentiment
Computing the accuracy of the model
def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    # the list for storing predictions
    y_hat = []
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0.0 to the list
            y_hat.append(0.0)
    # accuracy is the fraction of predictions that match the true labels
    accuracy = np.sum(np.asarray(y_hat) == np.squeeze(test_y)) / test_y.shape[0]
    return accuracy
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")
The model gave an accuracy of 99.50% on the test set, i.e., it correctly identifies the category of almost every tweet.
Conclusion
In this project, we explored how Natural Language Processing libraries can be used to determine the category of a tweet. Tweet categorization can be a valuable tool for improving the user experience, enhancing content delivery, performing sentiment analysis, and gaining insights into online conversations and behaviors.