HDSC Winter ’22 Premiere Project Presentation: Air Quality in Madrid (2001–2018)

HamoyeHQ
Mar 3, 2022

A Project by Team RNN

Introduction

Authorities in Madrid, Spain have been forced to take critical measures to combat the continuous deterioration of air quality in the city. One such measure is the prohibition of cars in the city centre.

This dataset, collected over 18 years, can help answer critical questions about the causes and effects of air pollution, and possible solutions to it, in Madrid and other parts of the world.

It packs 18 years (2001–2018) of hourly data into a single file in a practical format, which makes it a great playground for time series analysis and other prediction tasks.

Objectives

The project aims to answer the following questions: How do the levels of different gases correlate with one another? Are there any changes in trends? Can they be mapped to the recent decisions made by the city council, or do they relate to rainy dates? What is the best model to predict pollution levels? How do the levels interpolate between station locations? Are some gases more common at different elevations?

Our Approach

We carried out data profiling and cleaning, exploratory data analysis, and K-Means clustering to draw inferences.

Dataset Used for Analysis

The dataset, obtained from Kaggle, records the levels of different pollutants in Madrid from 2001 to 2018.

Collaborators

This is an open-source project for the Hamoye Data Science Internship. We are a team of data scientists, data storytellers, and data engineers; each team member was assigned a specific role.

Dataset Exploration

As with all real-life data, the dataset contained missing values and imperfections, so it had to be cleaned. Missing values in the columns of interest (CO, NMHC, NO_2, NOx, O_3, PM10, SO_2, TCH, station, NO) were filled with zeros (0) because exact information was unavailable.
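The profiling and zero-filling step could look roughly like the sketch below. It assumes the data is loaded into a pandas DataFrame; the rows are made-up illustrative values, not readings from the Madrid dataset, and only a few of the columns named above are shown.

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration only (not real Madrid readings).
sample = pd.DataFrame({
    "CO": [0.4, np.nan, 0.3],
    "NO_2": [40.0, 35.0, np.nan],
    "O_3": [np.nan, np.nan, 12.0],
    "station": [28079004, 28079004, 28079008],
})

# Profile the data: count missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Fill every missing value with zero.
filled = sample.fillna(0)
print(filled)
```

Note that a zero fill changes the distribution of each pollutant, since a zero is indistinguishable from a genuine zero reading afterwards.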

Data Cleaning

The following strategy was used to handle missing data:

  1. Missing values in other features are replaced with the median of that feature for the same station. This preserves the statistical properties of each feature as much as possible. The median is used instead of the mean because the mean can be skewed by outliers.
  2. Any left-over missing values are replaced with the mean of that feature for the same month.
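The two-step strategy above can be sketched with pandas as follows. This is a minimal illustration, not the team's actual code: the function name `impute_pollutants`, the column names `station` and `date`, and the reading of "month" as calendar month are all assumptions, and the demo rows are made up.

```python
import numpy as np
import pandas as pd

def impute_pollutants(df, columns, station_col="station", date_col="date"):
    """Fill gaps with per-station medians, then per-month means (hypothetical helper)."""
    out = df.copy()
    # Step 1: median of each feature within the same station.
    station_median = out.groupby(station_col)[columns].transform("median")
    out[columns] = out[columns].fillna(station_median)
    # Step 2: anything still missing gets the mean for the same calendar month
    # (assumption: "month" means month of the year, 1 to 12).
    month = pd.to_datetime(out[date_col]).dt.month
    month_mean = out.groupby(month)[columns].transform("mean")
    out[columns] = out[columns].fillna(month_mean)
    return out

# Made-up demo rows: station 2 has no CO readings at all,
# so only the month-level mean can fill it.
demo = pd.DataFrame({
    "date": ["2001-01-01 01:00", "2001-01-01 02:00", "2001-01-01 03:00",
             "2001-01-01 04:00", "2001-01-01 01:00"],
    "station": [1, 1, 1, 1, 2],
    "CO": [1.0, 2.0, 6.0, np.nan, np.nan],
})
cleaned = impute_pollutants(demo, ["CO"])
print(cleaned)
```

In the demo, the gap at station 1 is filled with that station's median, while the station with no readings falls through to the month-level mean computed after step 1.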

AQI Visualization of the variables

Visualizing the station dataset

Countplot for Elevation

Boxplot for Longitude

Scatterplot showing the relationship between Latitude and Elevation

Scatterplot showing the relationship between Longitude and Elevation

From the visualizations of the station dataset above, the following can be inferred:

There is a weak positive correlation between latitude and elevation (they tend to rise together), meaning that stations at higher latitudes tend to sit at higher elevations, where conditions typically associated with altitude, such as lower oxygen, stronger winds, and colder temperatures, can affect air quality.

The same holds for longitude, but the relationship between latitude and elevation is stronger than that between longitude and elevation.
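One way to put numbers on these relationships is the Pearson correlation coefficient. The sketch below assumes the station file has columns named `lat`, `lon`, and `elevation`; the coordinates and elevations are made-up illustrative values, so the resulting coefficients are not the ones behind the plots above.

```python
import pandas as pd

# Hypothetical station metadata; values are illustrative, not the real stations.
stations = pd.DataFrame({
    "lat": [40.40, 40.42, 40.44, 40.46, 40.48],
    "lon": [-3.72, -3.68, -3.71, -3.66, -3.69],
    "elevation": [620, 650, 640, 700, 690],
})

# Pearson correlation of each coordinate with elevation.
r_lat = stations["lat"].corr(stations["elevation"])
r_lon = stations["lon"].corr(stations["elevation"])

print(f"latitude vs elevation:  r = {r_lat:.2f}")
print(f"longitude vs elevation: r = {r_lon:.2f}")
```

A coefficient near 0 indicates a weak linear relationship and one near 1 a strong positive one, so comparing the two values shows which coordinate tracks elevation more closely.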

Air Quality Index Prediction

Conclusion and Recommendation

The predictions from our linear regression model showed the Air Quality Index rising and falling, with all values above 100. An Air Quality Index above 100 is generally considered unhealthy. The government should therefore enact laws to curb the activities responsible for releasing these gases into the atmosphere.
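For readers who want to reproduce this kind of check, here is a minimal sketch of fitting a linear regression and comparing its predictions with the 100 threshold. The post does not state which features the model used, so the two input columns, the training values, and the name `UNHEALTHY_THRESHOLD` are assumptions made purely for illustration.

```python
import numpy as np

UNHEALTHY_THRESHOLD = 100  # AQI level treated as unhealthy in the text

# Made-up training data: two pollutant readings per row and an AQI target.
X = np.array([[40.0, 20.0], [55.0, 25.0], [60.0, 35.0],
              [45.0, 30.0], [70.0, 40.0]])
y = np.array([110.0, 130.0, 150.0, 125.0, 170.0])

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict for new (also made-up) readings and compare with the threshold.
X_new = np.array([[50.0, 28.0], [65.0, 38.0]])
pred = np.column_stack([np.ones(len(X_new)), X_new]) @ coef
share_unhealthy = float(np.mean(pred > UNHEALTHY_THRESHOLD))

print("predicted AQI:", pred)
print("share of predictions above threshold:", share_unhealthy)
```

The same comparison works with any regression library; only the fitting step changes.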

