"Filtering the Noise: A Comprehensive Study on Identifying and Categorizing Offensive Language in Social Media"

Authors:

  • Karina Karira

  • Simran Lahrani

  • Yashraj Mulwani

Under the guidance of our mentor: Mrs. Vidya Zope

Introduction

Social media has revolutionized communication, enabling individuals to connect and share information like never before. However, this digital revolution has also brought forth a troubling phenomenon: the widespread use of offensive language. Offensive language, encompassing hate speech, personal attacks, and harassment, can have significant negative impacts on individuals and communities. It fuels division, perpetuates harmful stereotypes, and, in some cases, leads to real-world harm.

In this study, we delve into the impact of offensive language on social media and explore strategies for identifying and categorizing such content. Understanding these dynamics is crucial for creating more inclusive, supportive, and respectful digital spaces.

Abstract

In this project, we apply logistic regression to a natural language processing (NLP) task. We begin by extracting and transforming text data into numerical features, using methods like TF-IDF or word embeddings. Logistic regression is implemented, either from scratch or with a machine learning library. We select an NLP task, prepare a labeled dataset, and split it into training and testing sets.

The model is trained on the training data and evaluated using various metrics. An error analysis is performed to pinpoint areas where the model makes mistakes, enabling us to fine-tune the model for improved performance. This project sheds light on the application of logistic regression in NLP and the significance of error analysis in model development.

Methodology applied

As part of data cleaning and preprocessing, we followed the steps below:

Text Preprocessing: Text preprocessing is the initial step. It involves cleaning and standardizing the text data. Common preprocessing steps include converting text to lowercase and removing special characters, punctuation, and HTML tags.

Tokenization: Tokenization splits the text into individual words or tokens. This step converts the text into a format suitable for analysis by breaking it down into its constituent elements.

Stopword Removal: Common stopwords (e.g., "the," "and," "is") are often removed from the text. Stopwords don't typically carry significant information and can be excluded to focus on more meaningful terms.

Feature Extraction (TF-IDF Vectorization): Feature extraction is a crucial step in NLP. Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is a common technique used to convert tokens into numerical features. TF-IDF assigns weights to words based on their importance within a specific document and across the entire dataset.
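
To illustrate this step, here is a minimal sketch of TF-IDF vectorization using scikit-learn's TfidfVectorizer (scikit-learn is an assumption on our part; the classifier implemented later in this post uses simple word-frequency features rather than TF-IDF):

from sklearn.feature_extraction.text import TfidfVectorizer

# two illustrative documents; any list of preprocessed strings works
corpus = [
    "i love this product",
    "this is the worst product ever",
]

# lowercasing and English stopword removal mirror the preprocessing steps above
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(corpus)  # sparse matrix of shape (n_documents, n_terms)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray())                         # TF-IDF weight of each term in each document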

The dataset used for this purpose can be obtained with the following code:

import nltk
nltk.download('twitter_samples')
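
Once the corpus is downloaded, the positive and negative tweet samples can be loaded as shown below (the variable names are illustrative):

from nltk.corpus import twitter_samples

# load the positive and negative example tweets bundled with NLTK
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

print(f'Number of positive tweets: {len(all_positive_tweets)}')
print(f'Number of negative tweets: {len(all_negative_tweets)}')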

The tweets are processed before being fed to the algorithm. The code for processing the tweets can be given as:

# this function takes a tweet as input and performs a series of text preprocessing steps, including tokenization, removal of common stopwords,
# punctuation, and other specific elements commonly found in tweets (e.g., hashtags, retweet indicators). It also stems the remaining words
# before returning them in a list. This can be useful when preparing text data for natural language processing tasks, such as sentiment analysis
# or text classification.
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean
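
As a quick sanity check, process_tweet can be run on a sample tweet (the tweet text below is made up for illustration):

custom_tweet = "RT @user: My day is going great! :) #sunny https://example.com"
# prints the cleaned, stemmed tokens with the handle, URL and "RT" marker removed
print(process_tweet(custom_tweet))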

Algorithms used

In this project, we focus on classifying whether the sentiment conveyed by a tweet is positive or negative. The classification algorithm we use for this task is Logistic Regression.

Logistic Regression is a fundamental machine learning algorithm used for binary classification tasks. It models the probability of an input belonging to one of two classes (typically 0 or 1). The algorithm leverages the sigmoid function and gradient descent for training.

[Figure: representation of the sigmoid function in logistic regression]

Implementing logistic regression requires the following functions:

  • Sigmoid Function (Logistic Function): The sigmoid function maps real-valued numbers to a range between 0 and 1, representing probabilities. It is the core of logistic regression, modeling the likelihood that a given input belongs to a particular class.

  • Cost Function (Log-Likelihood): The cost function, often referred to as log-likelihood, quantifies the error between predicted probabilities and actual labels. Minimizing this cost is the primary objective in logistic regression training.

  • Gradient Descent: Gradient descent is employed to iteratively update the model's parameters (theta) by computing the gradient of the cost function. This process fine-tunes the model for better accuracy.

We can implement logistic regression together with the gradient descent algorithm to categorise the tweets. In this blog, we implement logistic regression, starting with the sigmoid function; a sketch of the cost function and gradient-descent update follows it.

The samples are split into training and testing sets. The sigmoid function is implemented as follows:

import numpy as np

def sigmoid(z):
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    h = 1/(1+np.exp(-z))
    return h
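
The cost function and gradient descent update described earlier are not listed in this post; a minimal sketch of how they could look, built on the sigmoid function above, is given below (the learning rate alpha and the number of iterations are left to the caller):

def gradient_descent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: (m, n+1) matrix of features, including the bias column
        y: (m, 1) vector of labels
        theta: (n+1, 1) weight vector
        alpha: learning rate
        num_iters: number of iterations to run
    Output:
        J: the final cost
        theta: the learned weight vector
    '''
    m = x.shape[0]
    for _ in range(num_iters):
        z = np.dot(x, theta)
        h = sigmoid(z)
        # log-likelihood cost averaged over all training samples
        J = -1.0/m * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
        # gradient step: move theta against the gradient of the cost
        theta = theta - (alpha/m) * np.dot(x.T, (h - y))
    return J.item(), theta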

Testing the model

The above model can be tested using the predict_tweet function. This function returns the probability that the tweet carries positive sentiment.


def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability of a tweet being positive or negative
    '''
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)

    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred
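
The predict_tweet function relies on an extract_features helper that is not listed in this post. A minimal sketch is given below, assuming the common three-element layout of a bias term plus the summed positive and negative frequencies of the tweet's words (looked up in the freqs dictionary):

def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary mapping each (word, label) pair to its frequency
    Output:
        x: a (1, 3) feature vector: [bias, positive frequency sum, negative frequency sum]
    '''
    word_l = process_tweet(tweet)
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term

    for word in word_l:
        # total count of the word in tweets labelled positive
        x[0, 1] += freqs.get((word, 1.0), 0)
        # total count of the word in tweets labelled negative
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x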

Adding a few lines to the above code helps us determine whether the tweet is positive or negative. The code can be given as:

my_tweet = 'Awful experience. I would never buy this product again!'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')

The output for the above code is:

['aw', 'experi', 'would', 'never', 'buy', 'product']
[[0.49718744]]
Negative sentiment

Similarly, we check a positive statement to test the model:

my_tweet = 'He was amazed by his performance.'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')

The output for the above code is:

['amaz', 'perform']
[[0.50299493]]
Positive sentiment

Computing the accuracy of the model


def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    # the list for storing predictions
    y_hat = []
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0 to the list
            y_hat.append(0.0)
    accuracy = np.sum(np.asarray(y_hat) == np.squeeze(test_y))/test_y.shape[0]
    return accuracy
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

The model achieved an accuracy of 99.50% on the test set, i.e. it correctly classifies the sentiment of nearly every tweet.

Conclusion

In this project, we explored how natural language processing libraries can be used to determine the category of a tweet. Tweet categorization can be a valuable tool for improving the user experience, enhancing content delivery, performing sentiment analysis, and gaining insights into online conversations and behaviors.