Fake News, A Real Problem — The Plague Eating Up Social Media
The destructive and catastrophic impact of fake news cannot be overemphasised, and it must never be underestimated. Fake news starts subtly and goes unnoticed in its early stages, but when allowed to breed it births violent outcomes capable of instigating social and political conflict and inflicting psychological harm on the individuals it targets. Social media platforms in particular are being used today to dish out misinformation at lightning speed. One option is to avoid the news altogether, which seems nearly impossible; the other is to use tools such as machine learning to fight the plague of fake news. That is the intent of this project.
Project Scope And Boundary
Kaggle’s fake news Twitter dataset was used for this analysis.
The news niche focuses on political news in the United States.
The news articles examined in the dataset are about two years old.
The objective of this article is to outline the end-to-end steps of building and training a machine learning model to classify fake and true news using the best-performing algorithm, and of deploying this model using Streamlit.
The dataset source is Kaggle: the Fake News dataset from the InClass Prediction Competition. All notebooks and scripts used can be found in the GitHub repo apache-21 (https://github.com/apache 21/fake_news_detection). This article walks through the steps outlined in the project workflow below.
1. Data Source : Data gotten from kaggle.com
2. Data Preprocessing of text
a. Exploratory Data Analysis
b. Data cleaning and feature engineering
3. Model Selection and Evaluation
4. Data Pipeline
5. Model Deployment
6. Consolidation and Discussion
Libraries used for project development:
- pandas for data analysis
- NumPy for numerical computation
- Matplotlib for visualisation
- spaCy for information extraction tasks such as NER, POS tagging, dependency parsing, and word vectors
- NLTK for text preprocessing, converting text into numbers for the model
- Seaborn for visualization
- TextBlob for text processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation
- re to search for and match patterns in the tweets
- wordcloud, a visualization technique for text data in which each word is drawn at a size reflecting its importance or frequency in the context
- pickle to save the model and load it back later
Preparing the dataset
The dataset from Kaggle is provided in two CSV files, already separated into true and fake news. It was loaded using the pandas library; since the data is textual, data cleaning, pre-processing, EDA and model-building operations were then carried out.
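The loading and labelling step can be sketched as below. The real workflow would call `pd.read_csv` on the two Kaggle files (the file names are not given in the article, so they are assumptions); tiny stand-in frames are used here so the sketch runs on its own.

```python
import pandas as pd

# In the real workflow the two Kaggle files would be loaded with e.g.
# pd.read_csv("True.csv") / pd.read_csv("Fake.csv") -- file names assumed.
true_df = pd.DataFrame({
    "title": ["Senate passes budget bill"],
    "text": ["The senate voted on the annual budget today."],
    "subject": ["politicsNews"],
})
fake_df = pd.DataFrame({
    "title": ["Aliens endorse candidate"],
    "text": ["Sources say extraterrestrials back the campaign."],
    "subject": ["News"],
})

# Label each frame before merging: 1 = true news, 0 = fake news
true_df["class"] = 1
fake_df["class"] = 0

# Merge into a single dataset and shuffle so the classes are mixed
df = pd.concat([true_df, fake_df], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```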
Below is an overview of the dataset
Examining the Class and the Subject of the News Content in the Dataset
The fake and genuine news datasets were merged, and the frequency of each class and of each news subject category in the combined dataset is visualized below:
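A minimal sketch of those frequency plots, using Matplotlib's bar plots on a small stand-in frame (the real counts come from the merged Kaggle data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the merged dataset, with the "class" (0 = fake, 1 = true)
# and "subject" columns used in the article.
df = pd.DataFrame({
    "class": [1, 1, 0, 0, 0],
    "subject": ["politicsNews", "politicsNews", "News", "politics", "politics"],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["class"].value_counts().plot.bar(ax=axes[0], title="Class frequency")
df["subject"].value_counts().plot.bar(ax=axes[1], title="Subject frequency")
fig.tight_layout()
fig.savefig("class_subject_counts.png")
```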
The subject domain appears to be heavy on political news.
Performing Named Entity Recognition on the Data
Named Entity Recognition (NER) ‒ also called entity identification or entity extraction ‒ is a natural language processing (NLP) technique that automatically identifies named entities in a text and classifies them into predefined categories.
Named entity recognition was carried out on the text data to extract the key names and entities present in the dataset; the result is visualized below.
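A sketch of how that extraction could look with spaCy. The helper names are ours, not the article's, and the spaCy call assumes the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
from collections import Counter

def entity_frequencies(pairs):
    """Count how often each (entity text, entity label) pair occurs."""
    return Counter(pairs)

def extract_entities(texts):
    """Run spaCy NER over raw texts and yield (entity, label) pairs.

    Assumes: python -m spacy download en_core_web_sm
    """
    import spacy  # imported lazily so the counting helper is usable on its own
    nlp = spacy.load("en_core_web_sm")
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield (ent.text, ent.label_)

# With a spaCy model available, the dominant names would surface via:
# entity_frequencies(extract_entities(df["text"])).most_common(10)
```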
To draw more insight from the dataset, new features were engineered, such as:
Transforming text data into numerical values
Machine learning algorithms thrive on numerical values; hence, a bag-of-words model (scikit-learn's CountVectorizer) was used to achieve the numerical transformation.
A helper function, get_top_n_words, is defined for this numerical transformation and to retrieve and visualize the top words.
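A minimal version of such a helper, built on CountVectorizer (the exact implementation in the article's notebook may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=10):
    """Fit a bag-of-words model and return the n most frequent words."""
    vec = CountVectorizer(stop_words="english")
    bag = vec.fit_transform(corpus)          # documents x vocabulary counts
    totals = bag.sum(axis=0)                 # total count per vocabulary word
    freqs = [(word, int(totals[0, idx])) for word, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda wf: wf[1], reverse=True)[:n]

corpus = [
    "trump said the election was rigged",
    "the president said the vote was fair",
    "trump and the president met today",
]
top_words = get_top_n_words(corpus, n=3)
```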
Observation from the above visualization:
Comparing the top 10 most frequent words in titles and in news text, we can infer that both fake and true news are dominated by politics; more specifically, both classes are heavily weighted toward American politics. As a result, the model is likely to be biased toward classifying news related only to American politics, and probably only from that time frame. Mitigating this bias would require more recent and more diverse news data.
Model Selection and Data Pipeline
The dataset was split into train and test sets, and a helper function was created to remove unwanted patterns from the text. The data is then passed through another helper function that preprocesses and transforms the text into numerical values before the model makes predictions.
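A sketch of the cleaning helper and the split, assuming the usual re-based pattern removal for tweets (the exact patterns and the split ratio are not stated in the article, so both are illustrative):

```python
import re
from sklearn.model_selection import train_test_split

def clean_text(text):
    """Strip URLs, mentions, and non-letter characters; lower-case the rest."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"@\w+|#", " ", text)             # mentions and hashtags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # punctuation and digits
    return re.sub(r"\s+", " ", text).strip().lower()

texts = ["Check this out! http://t.co/abc @user", "BREAKING: 100% true news!!!"]
labels = [0, 1]
cleaned = [clean_text(t) for t in texts]

# Hold part of the data out for validation (50% is a toy ratio here)
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.5, random_state=42)
```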
List of Classifier Models Used
Classical machine learning algorithms were utilized for this classifier; a deep learning model (LSTM) was also used.
- Multinomial Naive Bayes
- Random forest classifier
- Gradient boosting classifier
- LSTM (deep learning model)
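The three classical candidates can be instantiated with scikit-learn as below (the LSTM is covered separately). Hyperparameters shown are illustrative defaults, not the article's tuned values:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Candidate classical models keyed by name, ready to be fitted in a loop
models = {
    "multinomial_nb": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}
```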
Below is the helper function — a data pipeline that takes the dataset or tweet, preprocesses it, and feeds it into the model.
Data Pipeline which Entails Transforming the Data to Numerical Data with Removal of Irrelevant Patterns Present.
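Such a pipeline can be expressed with scikit-learn's Pipeline class, which vectorizes the cleaned text and feeds the counts to the classifier in one object, so the same preprocessing is applied at prediction time. The tiny corpus below is a stand-in for the real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Bag-of-words transformation + classifier chained into one estimator
pipeline = Pipeline([
    ("bow", CountVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])

X = [
    "president signs new budget law",
    "senate debates the health bill",
    "aliens secretly control the election",
    "miracle cure hidden by the government",
]
y = [1, 1, 0, 0]  # 1 = true, 0 = fake

pipeline.fit(X, y)
pred = pipeline.predict(["senate passes budget law"])
```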
The performance of the model on the validation set is examined using the evaluation metrics below:
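The standard classification metrics can be computed with scikit-learn; the toy label vectors below only illustrate the calls:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Toy ground truth and predictions standing in for the validation split
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)     # fraction of correct predictions
prec = precision_score(y_true, y_pred)   # of predicted true news, how many are true
rec = recall_score(y_true, y_pred)       # of actual true news, how many were found
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)    # rows: actual class, columns: predicted
```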
Training a Deep Neural Network (LSTM) on the Dataset
The dataset was first preprocessed for the neural network, after which the LSTM model was trained. The models were evaluated, and the results are shown in the table below:
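A minimal Keras sketch of such an LSTM classifier, assuming the text has already been tokenized into padded integer sequences (the layer sizes are illustrative, not the article's architecture):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Stand-in corpus already converted to padded integer sequences;
# in the real project a tokenizer maps each word to an index first.
X = np.array([[1, 2, 3, 4, 0, 0], [5, 6, 7, 8, 9, 0]])
y = np.array([1, 0])

model = Sequential([
    Embedding(input_dim=1000, output_dim=16),  # word index -> dense vector
    LSTM(8),                                   # sequence -> fixed-size state
    Dense(1, activation="sigmoid"),            # binary fake/true output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=1, verbose=0)
```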
The random forest seems to outperform the other classifiers; thus, it will be the algorithm of choice in the deployment phase.
It is nevertheless interesting to compare its performance against that of the deep learning model (LSTM) on the dataset.
Picking the Boss Model and Saving It with Pickle
The random forest comes out on top; hence, it is chosen, and the model is saved with pickle so it can be used for future predictions.
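The save-and-reload step can be sketched as below; a tiny stand-in pipeline takes the place of the fully trained random forest, and the pickle file name is an assumption:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Fit the chosen model (tiny stand-in data here) and pickle the whole
# pipeline so the deployment app can load it without retraining.
model = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])
model.fit(["senate passes bill", "aliens run government"], [1, 0])

with open("fake_news_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (e.g. inside the Streamlit app) the model is restored and reused
with open("fake_news_model.pkl", "rb") as f:
    restored = pickle.load(f)
pred = restored.predict(["senate passes bill"])
```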
The fake news classification app is now deployed on the web using Streamlit and is readily available to end users.
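A sketch of what such a Streamlit app could look like. The pickle file name and the helper are assumptions; note that Streamlit executes a script top to bottom via `streamlit run app.py`, so in a real app the body of `main()` would simply sit at module level:

```python
import pickle

def predict_label(model, text):
    """Map the model's 0/1 output to a human-readable label."""
    return "True news" if model.predict([text])[0] == 1 else "Fake news"

def main():
    # Streamlit UI -- launch with: streamlit run app.py
    import streamlit as st
    st.title("Fake News Classifier")
    with open("fake_news_model.pkl", "rb") as f:  # file name assumed
        model = pickle.load(f)
    text = st.text_area("Paste a news article or tweet:")
    if st.button("Classify"):
        st.write(predict_label(model, text))
```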
Fake news classification source code link : apache-21