Predictive Maintenance of Industrial Equipment
TEAM LSTM, HDSC ’25 SPECIAL COHORT
TEAM MEMBERS: Ali Aliyu, Godson Osiozele Gabriel, Oluwatimileyin Oyelumade and Sakinah Abolude
ABSTRACT
This research focuses on forecasting equipment health and failure timelines through a comprehensive predictive maintenance framework. Leveraging time-series sensor data from industrial systems, it applies advanced feature engineering techniques including rolling statistics and frequency domain analysis to capture degradation patterns. With robust machine learning models like Random Forest and Gradient Boosting, and deep learning architectures such as LSTM and BiLSTM, the study explores the relationships between multivariate sensor inputs and remaining useful life (RUL). Model evaluation employs cross-validation and metrics like RMSE and R² to assess prediction fidelity. The approach reveals key indicators influencing asset deterioration, offering insights for maintenance teams and engineers. By enhancing predictive accuracy, the framework supports data-driven decision-making, reduces downtime, and contributes to more sustainable operations in industrial environments.
INTRODUCTION
Predictive maintenance has emerged as a critical strategy in industrial operations, minimizing unscheduled downtime and optimizing asset performance. Traditional reactive approaches often fail to capture early signs of equipment failure, leading to costly repairs and production losses. This study delves into the complex dynamics of equipment health, aiming to anticipate failures before they occur using machine learning and deep learning techniques.
Industrial systems are monitored through a network of sensors capturing operational parameters across time. However, extracting actionable insights from such high-dimensional, noisy data poses significant challenges. This work addresses those challenges by engineering features that reflect long-term degradation behavior, such as rolling aggregates and FFT-derived frequency components. It then applies a suite of predictive models to estimate the RUL of equipment units under varying operational conditions.
The goal is to build a scalable framework for predictive maintenance that enables engineers to prioritize repairs, schedule maintenance proactively, and extend asset life. By uncovering the underlying patterns in sensor data and validating models on realistic test sets, the study contributes to the broader goal of intelligent asset management in industrial ecosystems.
LITERATURE REVIEW
Predictive maintenance has gained significant traction across manufacturing, aerospace, and energy sectors as a proactive strategy for equipment health management. Numerous studies emphasize the value of forecasting Remaining Useful Life (RUL) by analyzing sensor-based time-series data.
Early approaches relied on statistical techniques and reliability theory to model wear and failure, such as Weibull analysis and condition-based thresholding. However, these methods struggled with the complexity and volume of modern sensor data.
The introduction of machine learning (ML) marked a shift in predictive maintenance. Random Forests and Gradient Boosting Machines proved effective in handling tabular sensor readings and ranking feature importance (Zhao et al., 2017). These models, while interpretable and fast, lacked the ability to model temporal dependencies, a critical aspect of equipment degradation.
Consequently, deep learning became the focal point in recent literature. Long Short-Term Memory (LSTM) networks are widely adopted due to their capacity to capture long-range dependencies in multivariate sequences (Shao et al., 2020). Enhancements such as Bidirectional LSTMs, GRUs, and Attention mechanisms have demonstrated improvements in RUL prediction by amplifying the model’s sensitivity to directional trends and anomaly signals.
Other works explore hybrid architectures, like CNN-LSTM models, to extract localized patterns before learning sequential dynamics. These models are particularly useful in vibration signal processing and frequency-domain analysis, often leveraging FFT-transformed inputs.
Datasets such as CMAPSS, released by NASA, are extensively used for benchmarking. Researchers employ feature engineering techniques — rolling statistics, frequency components, sensor fusion — to transform raw measurements into meaningful predictive indicators. Ensemble methods and cross-validation strategies are frequently applied to assess model robustness and generalization.
Despite algorithmic advances, challenges remain in handling variable operational settings, imbalanced failure distributions, and real-time deployment at scale. Recent research trends point toward self-supervised learning, transfer learning across equipment types, and reinforcement learning for maintenance scheduling.
Overall, literature reflects a steady evolution from basic rule-based triggers to intelligent, data-driven frameworks that can anticipate equipment failures with high precision and adaptability.
METHODOLOGY
DATA COLLECTION AND PREPARATION
Sensor data was sourced from publicly available predictive maintenance datasets, including the CMAPSS dataset developed by NASA and other benchmark industrial repositories. These datasets comprise multi-sensor time-series recordings for various engine units over operational cycles, capturing parameters such as temperature, pressure, vibration, and flow rates.
To streamline the analysis and ensure relevance to real-world industrial scenarios, we focused specifically on units with complete life-cycle histories, enabling precise calculation of Remaining Useful Life (RUL). Data augmentation involved integrating failure labels and matching metadata across multiple files, such as condition settings and component specifications.
Raw data underwent a rigorous preparation process:
- Null value handling: missing readings were not tolerated; series were verified complete to preserve time-series integrity.
- Rolling statistics: Applied moving averages, standard deviations, and windowed max/min calculations to highlight temporal degradation behavior.
- Frequency domain transformations: Used Fast Fourier Transform (FFT) to extract dominant vibration signatures and uncover hidden failure patterns.
- Data fusion: Sensor readings were synchronized by engine cycle to maintain coherence and facilitate sequence-based modeling.
- Normalization: feature scaling was applied using Min-Max or standard scaling to ensure comparability across sensors (a sketch of the rolling and scaling steps follows this list).
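A minimal sketch of the rolling-statistics and normalization steps, assuming a pandas DataFrame with hypothetical engine_id, cycle, and sensor column names (the actual CMAPSS files use their own column layout):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

SENSORS = ["sensor2", "sensor3", "sensor4"]  # illustrative subset
WINDOW = 30  # cycles

def add_rolling_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append rolling mean/std/min/max per engine for each sensor."""
    df = df.sort_values(["engine_id", "cycle"]).copy()
    grouped = df.groupby("engine_id")
    for s in SENSORS:
        roll = grouped[s].rolling(WINDOW, min_periods=1)
        df[f"{s}_mean"] = roll.mean().reset_index(level=0, drop=True)
        df[f"{s}_std"] = roll.std().reset_index(level=0, drop=True).fillna(0.0)
        df[f"{s}_min"] = roll.min().reset_index(level=0, drop=True)
        df[f"{s}_max"] = roll.max().reset_index(level=0, drop=True)
    return df

def scale_features(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Min-Max scale the feature columns to [0, 1]."""
    df = df.copy()
    df[cols] = MinMaxScaler().fit_transform(df[cols])
    return df
```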
To model temporal dependencies, the dataset was reshaped into overlapping sequences using a sliding window technique, with a configurable sequence length, to feed deep learning architectures such as LSTM and BiLSTM. Additionally, categorical identifiers like engine-id were retained to group predictions and evaluate performance per machine instance.
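The sliding-window reshaping can be sketched as follows (assuming the hypothetical engine_id column above and an RUL label column, computed later as max_cycle - current_cycle):

```python
import numpy as np

def make_sequences(df, feature_cols, seq_len=30):
    """Slice each engine's history into overlapping seq_len-cycle windows.

    Returns X with shape (n_windows, seq_len, n_features) and y holding
    the RUL at the last cycle of each window.
    """
    X, y = [], []
    for _, unit in df.groupby("engine_id"):
        feats = unit[feature_cols].to_numpy()
        rul = unit["RUL"].to_numpy()
        for end in range(seq_len, len(unit) + 1):
            X.append(feats[end - seq_len:end])
            y.append(rul[end - 1])
    return np.asarray(X), np.asarray(y)
```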
The cleaned and engineered dataset contains thousands of sequences representing various stages of component health, establishing a robust foundation for predictive modeling and residual life forecasting.
Approach 1: RUL For Predictive Maintenance with FFT
This approach presents a hybrid framework combining statistical and frequency-domain feature engineering, including rolling window techniques and Fast Fourier Transform (FFT), with advanced deep learning methods such as Bidirectional LSTM networks. Using the CMAPSS FD001 dataset, we demonstrate the approach in estimating Remaining Useful Life (RUL) for aircraft engines. The results were unexpected: we anticipated that the LSTM models would outperform our baseline models, and they did not.
1. Dataset and Preprocessing
- Data Used: CMAPSS FD001
- 100 engines in training
- 100 engines in test set
- 21 sensors + 3 operational settings
Preprocessing Steps:
- Removal of non-informative columns (setting1, sensor1, etc.)
- Calculated RUL as max_cycle - current_cycle (see the snippet below)
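A one-line version of the RUL calculation, applied to the training frame (the train_df, engine_id, and cycle names are the hypothetical ones used above):

```python
# RUL per row: cycles remaining until the engine's last recorded cycle.
max_cycle = train_df.groupby("engine_id")["cycle"].transform("max")
train_df["RUL"] = max_cycle - train_df["cycle"]
```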
2. Feature Engineering
2.1 Rolling Window Statistics
For each selected sensor, the following features were extracted using a window size of 30 cycles:
- Mean, Standard Deviation, Min, Max
2.2 Frequency-Domain Features
FFT was used to extract dominant frequency features from vibration-related sensors. This highlighted hidden signal characteristics beyond time domain trends.
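The report does not specify the exact FFT feature definition; one plausible sketch extracts the strongest spectral magnitudes from a windowed signal (the function, frame, and column names here are illustrative assumptions):

```python
import numpy as np

def fft_features(signal: np.ndarray, n_peaks: int = 3) -> np.ndarray:
    """Return magnitudes of the n_peaks strongest non-DC frequency bins."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    spectrum = spectrum[1:]                    # drop the residual DC bin
    return np.sort(spectrum)[-n_peaks:][::-1]  # strongest first

# Example: dominant-frequency magnitudes for one 30-cycle sensor window
# from a hypothetical train_df frame.
window = train_df.loc[train_df["engine_id"] == 1, "sensor4"].to_numpy()[:30]
print(fft_features(window))
```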
3. Feature Scaling and Data Preparation
All features were normalized using Min-Max scaling to improve model convergence and stability. Sequence data was prepared in 30-cycle slices per engine for LSTM ingestion.
4. Baseline Models
Two models were trained for comparison:
- Linear Regression (LR)
- Random Forest Regressor (RF)
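A minimal training-and-evaluation sketch for the two baselines (the hyperparameters and the synthetic stand-in arrays are assumptions, not values from the report):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-ins for the engineered per-cycle features and RUL labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 24)), rng.uniform(0, 150, 500)
X_test, y_test = rng.normal(size=(100, 24)), rng.uniform(0, 150, 100)

models = {
    "LR": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: RMSE={rmse:.2f}, R²={r2_score(y_test, pred):.2f}")
```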
The Linear Regression model performed poorly on this dataset, even after all the feature engineering, for a few key reasons:
- Non-linear relationship: Remaining Useful Life (RUL) in this dataset likely does not have a simple linear relationship with the sensor readings and operational settings. As an engine degrades, the changes in sensor data are often non-linear. Linear Regression is designed to model linear relationships, so it struggles to capture these complexities.
- Time-series nature: Linear Regression doesn’t inherently account for the sequential and temporal dependencies within the data. The state of an engine at a given cycle depends on its history. Ignoring this temporal aspect can lead to poor predictions, especially as the RUL decreases and the degradation accelerates.
- Feature Engineering limitations for LR: While we engineered rolling statistics and FFT features, Linear Regression is less able to leverage these types of features than models like Random Forest or LSTMs, which can intrinsically capture more complex interactions and patterns.
- Evaluation Metric Sensitivity: The R² score is particularly sensitive to how well the model explains the variance in the data. A large negative R² score indicates that the model is performing significantly worse than simply predicting the mean of the target variable, further emphasizing that the linear assumption was not appropriate for this problem.
Essentially, Linear Regression is too simple a model to capture the complex, non-linear, and temporal patterns present in this engine degradation dataset after the feature engineering.
Meanwhile, the Random Forest model most likely performed well for the following reasons:
- Non-linear relationships: Random Forests are ensemble models based on decision trees. Decision trees can capture non-linear relationships and interactions between features without requiring explicit feature engineering for these non-linearities. This is a significant advantage over Linear Regression for this dataset.
- Handling complex feature interactions: The model can implicitly learn complex interactions between the various sensor readings and operational settings, as well as the engineered rolling features and FFT features. This ability to combine information from different features helps in making more accurate predictions of RUL.
- Robustness to outliers and noise: Random Forests are generally less sensitive to outliers and noisy data compared to Linear Regression, which can be beneficial in real-world sensor data that might contain some anomalies.
- Ensembling effect: By combining predictions from multiple decision trees (the “forest”), the model reduces variance and improves generalization compared to a single decision tree.
While Random Forest doesn’t explicitly model the time series sequence like an LSTM, the rolling statistical features we engineered helped provide some temporal context to the model. By including the mean, standard deviation, min, and max over a window, we gave the Random Forest information about the recent history of the sensor readings, which is crucial for RUL prediction.
In summary, the Random Forest's ability to handle non-linearities and complex feature interactions, together with its inherent robustness, makes it a strong candidate for this type of regression problem, especially when combined with engineered features that capture temporal aspects of the data, as we did here.
5. LSTM Architecture
An LSTM model was constructed using:
- Training over 50 epochs with a batch size of 54
- Dense output predicting RUL from a 30-cycle sequence
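A minimal Keras sketch consistent with this description (the hidden-layer size, feature count, and synthetic stand-in tensors are assumptions; the report specifies only the epochs, batch size, sequence length, and dense output):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 24  # assumed per-cycle feature count
X_train = np.random.rand(1000, 30, n_features).astype("float32")  # stand-in
y_train = (np.random.rand(1000) * 150).astype("float32")          # stand-in RUL

model = keras.Sequential([
    layers.Input(shape=(30, n_features)),  # 30-cycle sequences
    layers.LSTM(64),                       # illustrative unit count
    layers.Dense(1),                       # single-value RUL output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=50, batch_size=54, validation_split=0.2)
```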
The RUL prediction plots on the split validation dataset suggest that the regular LSTM model, with the current architecture and training, had difficulty accurately predicting RUL, particularly in capturing the degradation trend and the specific RUL value at the end of an engine's life in the test set. The lag and overprediction in the validation set, and the scatter in the test set, visually support the quantitative evaluation metrics (high RMSE and negative R²) we observed for this model, reinforcing the limitations discussed below in capturing the complex temporal dynamics and non-linear degradation in this dataset.
The LSTM model yielded a Test RMSE of 70.30 and a Test R² score of -1.86. Some reasons why the regular LSTM might have produced these results:
- Vanishing Gradient Problem: While LSTMs are designed to mitigate vanishing gradients, they can still struggle with very long sequences or complex dependencies over time. A standard LSTM might not effectively capture the crucial early degradation signals or trends that influence RUL later in an engine’s life.
- Unidirectional Processing: A regular LSTM processes the sequence data in only one direction (from past to future). In some time series problems, considering information from both the past and the “future” within a sequence can be beneficial. The Bidirectional LSTM addresses this by processing sequences in both directions.
- Feature Impact within the LSTM: While many features were included, the standard LSTM might not be effectively leveraging all of them or their interactions within the sequence context compared to other models or a more complex LSTM architecture.
The negative R² score (-1.86) indicates that the regular LSTM model's predictions were worse than simply predicting the average RUL for all test engines. This suggests that the model struggled to find meaningful patterns in the sequences that consistently improved upon a simple baseline prediction on this specific test set.
6. Deep Learning: Bidirectional LSTM Architecture
A hybrid BiLSTM model was constructed using:
- 2 LSTM layers with dropout
- Dense output predicting RUL from a 30-cycle sequence
Training spanned 50 epochs and showed continuous improvement in validation loss.
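A sketch of the hybrid BiLSTM consistent with the description above (unit counts, dropout rate, and the batch size carried over from the earlier model are assumptions; the report specifies two recurrent layers with dropout, 50 epochs, and a dense RUL head):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 24  # assumed per-cycle feature count, as before
X_train = np.random.rand(1000, 30, n_features).astype("float32")  # stand-in
y_train = (np.random.rand(1000) * 150).astype("float32")          # stand-in RUL

model = keras.Sequential([
    layers.Input(shape=(30, n_features)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dropout(0.2),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dropout(0.2),
    layers.Dense(1),  # RUL output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=50, batch_size=54, validation_split=0.2)
```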
Consider the plots for the Bidirectional LSTM model: the “Validation — RUL Prediction (Bidirectional LSTM)” plot and the “Test Set — RUL Prediction (Bidirectional LSTM)” plot.
Validation Set Plot:
Compared to the regular LSTM validation plot, the Bidirectional LSTM’s predictions on the validation set appear to follow the true RUL trend somewhat more closely.
There still seems to be some lag and overprediction, especially at lower RUL values, but the overall fit seems slightly better than the unidirectional LSTM.
The predictions might be a bit smoother in some sections compared to the regular LSTM.
Test Set Plot:
The test set plot for the Bidirectional LSTM also compares the true RUL for the last cycle of each test engine with the model’s predictions.
Like the validation set, the predictions on the test set seem to be somewhat closer to the true RUL values compared to the regular LSTM’s test set predictions.
While there is still scatter, the points show a slightly better general trend towards the ideal red line compared to the regular LSTM’s test plot.
Overall:
The visual improvements in both the validation and test set plots for the Bidirectional LSTM, although not drastic, align with the quantitative evaluation metrics we obtained. The Bidirectional LSTM had a lower Test RMSE (59.58) and a less negative R² score (-1.10) compared to the regular LSTM (Test RMSE: 70.30, R² score: -1.86).
This suggests that allowing the model to process the sequences in both forward and backward directions helped it capture more relevant information and slightly improve its predictive capability. However, the negative R² score on the test set still indicates that even the Bidirectional LSTM performed worse than a simple mean baseline for predicting the RUL at the last cycle of the test engines.
This implies that while the Bidirectional LSTM is better at capturing temporal dependencies within the sequences, the specific challenge of predicting the precise RUL at the very end of an engine's life for unseen test data remains significant with this model architecture and feature set. The Random Forest model, which does not rely on sequences in the same way but leverages the engineered rolling and FFT features on a per-cycle basis, appears to have been more effective in capturing the patterns relevant to the final RUL in this specific evaluation setup.
Approach 2: RUL For Predictive Maintenance with Hypothesis Testing
This approach explored traditional and deep learning-based regression methods and tested three hypotheses:
- H1: FFT features from sensor 12 reduce RMSE by ≥10% over baseline LSTM.
- H2: A CNN-LSTM hybrid outperforms a pure LSTM model with RMSE < 20 cycles on FD001.
- H3: Model trained on early-life cycles provides a warning 30% earlier than midpoint-cycle trained models.
NOTE: Compared to the previous approach, no prior feature engineering or FFT features were added to the dataset; only data cleaning and normalization were performed before training the models.
Dataset Description
We used the FD001 subset of the CMAPSS dataset, which contains readings from 100 engines across 26 features (operational settings + sensor values).
Sensors analyzed: sensor1 through sensor21 (after dropping noisy signals)
Data Preprocessing
The preprocessing steps included:
- Normalizing sensor readings using Min-Max scaling (MinMaxScaler)
- Computing RUL as the difference between max cycle per engine and current cycle
- Dropping uninformative features such as setting1, setting2, setting3, sensor1, sensor5, sensor6, sensor10, sensor16, sensor18, and sensor19 (a sketch of these steps follows this list)
- Visualizing sensor degradation trends vs. RUL for multiple engines
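A compact sketch of these preprocessing steps (the df, engine_id, and cycle names are assumptions, and the drop list follows the report's non-exhaustive listing):

```python
from sklearn.preprocessing import MinMaxScaler

DROP = ["setting1", "setting2", "setting3", "sensor1", "sensor5",
        "sensor6", "sensor10", "sensor16", "sensor18", "sensor19"]

df = df.drop(columns=DROP)
# RUL: max cycle per engine minus the current cycle.
df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]
# Min-Max scale the remaining sensor columns.
sensor_cols = [c for c in df.columns if c.startswith("sensor")]
df[sensor_cols] = MinMaxScaler().fit_transform(df[sensor_cols])
```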
Baseline Models
We implemented two baseline models:
- Linear Regression (LR)
- Random Forest Regression (RF)
Both baselines were evaluated on the last cycle of each engine; the resulting metrics are reported under Results and Discussion below.
Methodology
Three models were considered for the hypothesis tests: Linear Regression, Random Forest, and LSTM. For Linear Regression and Random Forest, the steps were data preparation, model initialization and training, prediction on test data, evaluation data selection, and performance evaluation. For the LSTM, the steps were data preparation and sequence creation, train-validation split, model architecture definition, model compilation, model training, prediction on test data, and performance evaluation.
Results and Discussion
The performance metrics of each model are shown in the table below, along with the plots for each model.
In the context of predictive maintenance, an RMSE of around 32 cycles suggests that the Linear Regression model can provide a general estimate of RUL, but the errors are quite substantial, which might limit its practical application for precise failure prediction or scheduling maintenance activities. The R² score further reinforces that a simple linear model may not fully capture the complex, non-linear degradation patterns inherent in aircraft engine data. This performance serves as a baseline to compare against more complex models like Random Forest and LSTM, which are better equipped to handle non-linear relationships and temporal dependencies.
Despite being a more complex, non-linear model, the Random Forest implementation in this approach did not outperform the simpler Linear Regression model in terms of RMSE and R² score on this specific test set. This could be attributed to various factors, such as the chosen hyperparameters for the Random Forest model, the nature of the features used, or the inherent characteristics of the FD001 dataset. While Random Forests are powerful and often capture non-linear relationships effectively, their performance is dependent on proper tuning and data suitability. In this approach, for the evaluation metric focused on the last cycle's prediction, it appears the ensemble of decision trees did not yield superior accuracy compared to the linear approach. This highlights the importance of empirical evaluation when selecting models for a specific task and dataset.
The superior performance of the LSTM model, as evidenced by the lower RMSE and higher R² score, is likely attributable to its ability to learn from sequential data. By processing sequences of sensor readings over time, the LSTM can identify temporal dependencies and degradation trends that are crucial for accurate RUL prediction but might be missed by models that treat each data point independently (like Linear Regression and Random Forest applied per cycle). This makes LSTM models particularly well-suited for time-series forecasting tasks in predictive maintenance.
Hypothesis Evaluation Summary
This study investigated three hypotheses regarding Remaining Useful Life (RUL) prediction for aircraft engines using the FD001 dataset:
Hypothesis 1: FFT features from sensor 12 reduce RMSE by ≥10% over baseline LSTM.
Models: Baseline LSTM vs. LSTM + FFT (Sensor 12).
Performance:
LSTM: RMSE = 20.87, R² = 0.75
LSTM + FFT: RMSE = 20.43, R² = 0.76
Conclusion: Hypothesis not supported. Adding FFT features from sensor 12 resulted in a minor RMSE reduction (approx. 2.1%), not meeting the 10% threshold.
Hypothesis 2: A CNN-LSTM hybrid outperforms a pure LSTM model with RMSE < 20 cycles on FD001.
Models: Pure LSTM vs. CNN-LSTM Hybrid.
Performance:
LSTM: RMSE = 20.87, R² = 0.75
CNN-LSTM: RMSE = 23.94, R² = 0.67
Conclusion: Hypothesis not supported. The CNN-LSTM hybrid did not outperform the pure LSTM and neither model achieved an RMSE below 20 cycles.
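The report names the CNN-LSTM hybrid without detailing its configuration; a typical sketch under that assumption, in which Conv1D extracts local patterns before the LSTM models the sequence:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 24  # assumed per-cycle feature count; layer sizes illustrative
model = keras.Sequential([
    layers.Input(shape=(30, n_features)),
    layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),  # downsample before the recurrence
    layers.LSTM(64),
    layers.Dense(1),                   # RUL output
])
model.compile(optimizer="adam", loss="mse")
```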
Hypothesis 3: Model trained on early-life cycles provides a warning 30% earlier than midpoint-cycle trained models.
Models: LSTM trained on Early-Life Cycles vs. LSTM trained on Midpoint Cycles.
Performance & Analysis:
Mean Lead (Early-Life Model): NaN cycles
Mean Lead (Midpoint-Life Model): -18.43 cycles
Conclusion: Hypothesis could not be definitively supported or rejected due to the inability to calculate a meaningful mean lead time for the early-life model with the chosen parameters. The midpoint model showed a negative mean lead time, indicating late warnings.
Conclusion
Among the evaluated models and feature engineering techniques, the pure LSTM model demonstrated the best performance on the test set based on RMSE and R² score. The addition of FFT features from sensor 12 provided a slight improvement but did not meet the hypothesized threshold. The CNN-LSTM hybrid did not outperform the pure LSTM. The hypothesis regarding early vs. midpoint cycle training could not be conclusively tested with the current warning definition and parameters. The pure LSTM model's ability to process sequential data appears to be a key factor in its superior performance for this RUL prediction task.