HDSC August ’21 Capstone Project Presentation: Fake News Classification Web App

A Project by Team Apache

Problem Brief

Fake News: A Real Problem, the Plague Eating Up Social Media

The destructive and catastrophic import of fake news cannot be overemphasized, nor should it be underestimated. Fake news starts subtly and goes unnoticed in its early stages, but when allowed to breed it births violent outcomes capable of instigating social and political conflict and inflicting psychological harm on the individuals it targets. Social media platforms in particular are being used today to dish out misinformation at lightning speed. One option is to avoid the news altogether, which seems nearly impossible; another is to utilize tools such as machine learning to fight the plague of fake news. The latter is the intent of this project.

Project Scope And Boundary

Kaggle’s fake news Twitter dataset was used for this analysis.


The news niche focuses on political news in the United States.

The news articles examined in the dataset are about two years old.

Project Aim

The objective of this article is to outline the end-to-end steps of building and training a machine learning model to classify fake and true news using the best-performing algorithm, and of deploying that model using Streamlit.

The dataset source is Kaggle: the Fake News dataset from the InClass Prediction Competition. All notebooks and scripts used can be found in Team Apache’s GitHub repo apache-21 (https://github.com/apache-21/fake_news_detection). This article illustrates the six steps outlined in the project workflow section below.

1. Data Source: data obtained from kaggle.com

2. Data Preprocessing of text

a. Exploratory Data Analysis

b. Data cleaning and feature engineering

c. Visualization

3. Model Selection and Evaluation

4. Data Pipeline

5. Model Deployment

6. Consolidation and Discussion

Data Preprocessing

Libraries used for project development:
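The original import cell is not reproduced here; a typical set of imports for this kind of text-classification project would look like the sketch below (the exact library list is an assumption, not the repo’s actual cell):

```python
# Core data handling and visualization
import re                        # regex cleaning of raw text
import string                    # punctuation reference table
import numpy as np               # numerical operations
import pandas as pd              # loading and manipulating the CSVs
import matplotlib.pyplot as plt  # plots for EDA

# Text vectorization, modelling, and evaluation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
```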

Project Methodology

Preparing the dataset

The dataset from Kaggle is provided in two CSV files, already separated into true and fake news. The dataset was loaded using the pandas library; since it is textual data, data cleaning, pre-processing, EDA and model-building operations were then carried out.

Below is an overview of the dataset

Examining the Class and Subject of the News Content in the Dataset

The fake news and genuine news datasets were merged, and the distribution of classes and subjects (news categories) in the dataset, with their respective frequencies, is visualized below:


The subject distribution appears to be dominated by political news.

Performing Name Entity Recognition on the data

Named Entity Recognition (NER) ‒ also called entity identification or entity extraction ‒ is a natural language processing (NLP) technique that automatically identifies named entities in a text and classifies them into predefined categories.

Named entity recognition was carried out on the text data to extract the key names and entities present in the dataset; the visualization is shown below.

Feature Engineering

In order to draw more insight from the dataset, new features were engineered, such as:

Transforming text data into numerical values

Machine learning algorithms thrive on numerical values; hence, tools such as scikit-learn’s CountVectorizer, a bag-of-words model, were used to achieve the numerical transformation.

A helper function, get_top_n_words, is defined to perform this numerical transformation and to retrieve the most frequent words, with visualization.
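The repo holds the actual implementation; a plausible sketch of such a helper, built on CountVectorizer, looks like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=10):
    """Fit a bag-of-words model on the corpus and return the n most
    frequent words together with their total counts."""
    vec = CountVectorizer(stop_words="english").fit(corpus)
    bag = vec.transform(corpus)           # document-term count matrix
    counts = bag.sum(axis=0)              # total count per vocabulary word
    freqs = [(word, counts[0, idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda item: item[1], reverse=True)[:n]
```

The returned pairs can then be fed straight into a bar chart for the frequency plots shown in this section.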

Visualization of the most frequent words in titles in the data

Observation from the above visualization:

Based on the comparison between the top 10 most frequent words in titles and in news text, we can infer that both fake and true news are dominated by political stories; more specifically, the heavy focus on American politics is shared between true and fake news. This could bias the model toward classifying only news relating to American politics, and probably only from that time frame. To mitigate this bias, more recent and more diverse news data would be needed.

Model Selection and Data Pipeline

The dataset was split into train and test sets, and a helper function was created to remove unwanted patterns from the text. The data is then passed through another helper function that preprocesses and transforms the text into numerical values before the model makes predictions.

List of Classifier Models Used

Classical machine learning algorithms were utilized for this classifier; a deep learning model (LSTM) was also used.

Multinomial Naive Bayes

Logistic regression

Random forest classifier

Gradient boosting classifier

LSTM (a deep learning model)

Below is the helper function — a data pipeline that takes the dataset or tweet, preprocesses it, and feeds it into the model.

Data pipeline: transforming the data to numerical form and removing irrelevant patterns.
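A minimal sketch of such a pipeline, assuming a regex-based cleaning step feeding a CountVectorizer and the random forest (the exact patterns removed and hyperparameters are assumptions; see the repo for the real helper):

```python
import re

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

def remove_patterns(text: str) -> str:
    """Strip URLs, digits, and punctuation, collapse whitespace, lowercase."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)       # digits and punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

pipeline = Pipeline([
    ("bow", CountVectorizer(preprocessor=remove_patterns, stop_words="english")),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
```

Calling `pipeline.fit(X_train, y_train)` and then `pipeline.predict([some_text])` runs cleaning, vectorization, and classification in one step.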

Model Evaluation

The performance of the model on the validation set is examined using the evaluation metrics below:

Confusion matrix

Accuracy score

F1 score

Training a Deep Neural Network (LSTM) on the Dataset

The dataset was first preprocessed for the neural network, after which the LSTM model was trained. The models were evaluated, and the results are shown in the table below:

The random forest seems to outperform the other classifiers; thus, it is the algorithm of choice for the deployment phase.

However, it is still interesting to compare the performance of the deep learning model (LSTM) on the dataset.

Picking the Boss Model and Saving the Model Using Pickle

The random forest seems to be the boss; hence, it is chosen and the model is saved using pickle so it can be loaded later for prediction.
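The save/load round trip looks like the sketch below. A tiny stand-in model is trained here just to show the mechanics; in the project, the fitted random forest pipeline is what gets pickled, and the file name `model.pkl` is illustrative:

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier

# Stand-in for the fitted pipeline from the previous steps
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit([[0, 0], [1, 1]], [0, 1])

path = os.path.join(tempfile.gettempdir(), "model.pkl")

with open(path, "wb") as f:      # serialize the trained model
    pickle.dump(model, f)

with open(path, "rb") as f:      # restore it later, e.g. in the web app
    restored = pickle.load(f)
```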

Model Deployment

The fake news classification app is now deployed on the web using Streamlit, making it readily available to end users.
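A minimal sketch of such a Streamlit app, assuming the pickled model from the previous step and a 1 = true / 0 = fake label encoding (both assumptions; the repo holds the actual app script):

```python
import pickle

def load_model(path: str = "model.pkl"):
    """Load the pickled pipeline saved in the previous step."""
    with open(path, "rb") as f:
        return pickle.load(f)

def classify(model, text: str) -> str:
    """Return a human-readable label for a single news text."""
    label = model.predict([text])[0]
    return "True news" if label == 1 else "Fake news"

def main():
    # Streamlit is imported here so the helpers above stay importable
    # even in environments where Streamlit is not installed.
    import streamlit as st

    st.title("Fake News Classifier")
    text = st.text_area("Paste a news article or headline:")
    if st.button("Classify") and text:
        st.write(classify(load_model(), text))

if __name__ == "__main__":
    main()
```

Running `streamlit run app.py` serves the widget-based UI in the browser.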

Fake news classification source code link : apache-21

(https://github.com/apache21/fake_news_detection/blob/main/fake_new_detection_app/fake-news 5.ipynb)

Our mission is to develop an army of creative problem solvers using an innovative approach to internships.