Data Science capstone projects #19

Ekaterina Butyugina

The following blog post features data science projects from full-time students who have completed their program. Here are the incredible results they've achieved in so little time.
 

SIX: payment transaction volumes forecasting

Students: Alžbeta Bohiniková, Luis Miguel Rodríguez Sedano, Mukund Pondkule, Michael Flury

SIX connects financial markets in Switzerland and abroad, providing services through its four business units: stock exchanges, securities services, banking services, and financial information.

The goal of the project was to build a fully automated pipeline to forecast future transactions while analyzing the business context behind the provided dataset.

Forecasting transaction volumes is crucial for SIX, as its payment processing revenue model is based on transaction fees earned each time a merchant receives a payment. The fee varies depending on the service that SIX performs (for example, direct transaction fees for standard payments, and increased fees for 3-D Secure checkout and fraud checks).

From the historical data of unique merchants, the students created a pipeline that outputs the total transaction volume for each month. This information is used to train three different time series models: Seasonal Autoregressive Integrated Moving Average (SARIMA), Exponential Triple Smoothing (ETS), and Prophet (a forecasting model by Facebook). The best model is then selected and used to forecast future transaction levels. Finally, the forecast results are stored, and as new monthly data comes in, the pipeline automatically compares the actual volumes with the previously forecasted ones.
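The "train several models, keep the best" step can be sketched as a holdout comparison. This is a minimal illustration, not the students' pipeline: the three forecasters below are simple stand-ins for SARIMA, ETS and Prophet, and the monthly volumes, the six-month holdout and the MAPE metric are all assumptions made for the example.

```python
from statistics import mean

def seasonal_naive(history, horizon, season=12):
    """Stand-in model: repeat the values observed one season earlier."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]

def historical_mean(history, horizon):
    """Stand-in model: forecast the average of all past months."""
    return [mean(history)] * horizon

def drift(history, horizon):
    """Stand-in model: extend the line through the first and last observation."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * (h + 1) for h in range(horizon)]

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return mean(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) * 100

def select_best_model(series, models, holdout=6):
    """Fit on everything but the last `holdout` months, score on the rest."""
    train, test = series[:-holdout], series[-holdout:]
    scores = {name: mape(test, fn(train, holdout)) for name, fn in models.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Made-up monthly volumes: a linear trend plus a December peak.
monthly_volumes = [1000 + 5 * t + (300 if t % 12 == 11 else 0) for t in range(26)]

candidate_models = {
    "seasonal_naive": seasonal_naive,
    "historical_mean": historical_mean,
    "drift": drift,
}

best, scores = select_best_model(monthly_volumes, candidate_models)
print(best, {name: round(score, 2) for name, score in scores.items()})
```

In a real pipeline the stand-in functions would be replaced by fitted SARIMA, ETS and Prophet models, while the selection logic could stay the same.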



graph



This chart shows the ETS model within the pipeline forecasting transaction volumes for the next year (blue line on the right), based on historical data from the last 26 months.

Through automatic monitoring, the pipeline created by Alžbeta, Luis, Mukund and Michael assesses existing models and evaluates their performance as new data is collected. Moreover, the algorithm could also be used to create new models when earlier prediction models are no longer accurate enough to support business decisions.
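One possible shape for such a monitoring check is sketched below. The month keys, the 10% error threshold and the three-month window are hypothetical choices for illustration; the post does not state which criteria the pipeline actually uses.

```python
from statistics import mean

def absolute_percentage_errors(forecasts, actuals):
    """Per-month percentage error for months that have both a forecast and an actual."""
    return {
        month: abs(actuals[month] - forecasts[month]) / abs(actuals[month]) * 100
        for month in sorted(forecasts)
        if month in actuals
    }

def needs_retraining(forecasts, actuals, threshold_pct=10.0, min_months=3):
    """Flag the model once the average error of the latest months exceeds the threshold."""
    errors = absolute_percentage_errors(forecasts, actuals)
    if len(errors) < min_months:
        return False  # not enough evidence yet
    recent = list(errors.values())[-min_months:]
    return mean(recent) > threshold_pct

# Made-up stored forecasts and the actual volumes that arrived later.
stored_forecasts = {"2023-01": 1000, "2023-02": 1050, "2023-03": 1100, "2023-04": 1150}
actual_volumes = {"2023-01": 1010, "2023-02": 1200, "2023-03": 1350}

print(needs_retraining(stored_forecasts, actual_volumes))
```

Months without an actual value yet (here "2023-04") are simply skipped until the data arrives.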


 

Rolos by Constructor: sport team fan base social media analytics

Students: Naemi Graf, Joana Duarte, Mihaela Cucui

Rolos by Constructor is a company specializing in Machine Intelligence consulting services in driverless mobility, robotics, and professional sports. They wanted to better understand how teams could utilize their social media presence to become more successful in growing and engaging their supporters.

The goals of the project included: 
  • Assess the social media presence of a football team
  • Segment their fan base 
  • Provide recommendations on how to grow and engage with their fans



Naemi, Joana and Mihaela selected Manchester United because of the team's strong social media presence. They chose Twitter as the platform because its data is easily accessible through an API. They also explored obtaining data from other social media platforms, but were unsuccessful due to API accessibility issues or a lack of useful information. Nonetheless, the group managed to collect tweets and some useful information through the Twitter API.

Manchester United has almost 34M followers on Twitter. However, it proved impossible to collect information from all users within the time available. Hence, the students focused on the tweets rather than going directly for the user information. Using the ManUnited hashtags, Naemi, Joana and Mihaela were able to collect approximately 1.4M tweets covering a period of six months. These tweets were generated by approximately 138 thousand unique users. Both the tweets and the users were the object of the analysis.

Using the data collected with snscrape, Tweepy and Twython, they performed text pre-processing and wrangling. They then performed sentiment analysis using TextBlob, and topic modelling and user segmentation with BERTopic. The students also explored other NLP techniques, such as LDA and word2vec, which proved less successful.
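As a rough illustration of the pre-processing step, the sketch below cleans a tweet with the standard library only and maps a polarity score to a label. The cleaning rules and the neutral band of 0.1 are assumptions, not the students' actual choices; the polarity value is expected to come from a tool such as TextBlob, which reports polarity in the range -1 to 1.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    text = NON_WORD_RE.sub(" ", text)  # remove remaining punctuation and emoji
    return " ".join(text.split())      # collapse repeated whitespace

def label_sentiment(polarity, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

print(clean_tweet("What a win!! #MUFC @ManUtd https://t.co/abc123"))
print(label_sentiment(0.6), label_sentiment(-0.4), label_sentiment(0.05))
```

Cleaned text of this kind can then be passed to the sentiment and topic models.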



graph_topic_modelling



The graph above shows different topics created with the BERTopic model. For example, on Aug 22, 2022, there was a game against Liverpool, and the largest spike in social media activity was about this match.

By using a variety of models and tools, the students were able to extract some interesting insights about the fanbase. They also discovered topics and words that were trending and generated the greatest engagement. Their model and work process can be used to perform similar analyses, specifically for social media.


 

Fluence: detecting performance anomalies in wind turbines

Students: Alexander Tsibizov, Eva Polakova, Jamison Proctor, Stefan Schultze

Fluence, a global leader in energy storage and digital applications for renewable energy, offers a wind power plant monitoring platform. This platform helps wind power plant operators understand plant performance and identify opportunities for improvement. The ability to automatically detect deviations from the normal operation of turbines in wind power plants is one of the most sought-after features of such a platform. However, understanding what “normal” operation is can be especially challenging, as not all wind turbines, nor all wind power plant locations, are created equal. What is normal operation for one location or turbine might represent less-than-optimal operation for another. Alexander, Eva, Jamison, and Stefan took on the challenge of identifying anomalies in wind turbine performance using wind turbine operational data produced during a single year.

The team began by deriving the normal operation for each turbine from the data provided. They created several methods for reaching this goal, based both on static analysis of the turbines’ historical data and on dynamically defining normal behavior over time. These approaches gave a baseline definition of “normal” upon which it was possible to build methods for anomaly detection. Nevertheless, these definitions were based on a subset of the data that was assumed to be “normal”, which could not be guaranteed.
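One common way to express a static baseline of this kind is an empirical power curve: bin the historical wind speeds and take the median power per bin. The sketch below illustrates that idea only; the bin width, the use of the median and the 20% tolerance are assumptions, and the numbers are made up.

```python
from collections import defaultdict
from statistics import median

def fit_power_curve(wind_speeds, powers, bin_width=1.0):
    """Median power per wind-speed bin, used as the 'normal' baseline."""
    bins = defaultdict(list)
    for ws, p in zip(wind_speeds, powers):
        bins[int(ws // bin_width)].append(p)
    return {b: median(values) for b, values in bins.items()}

def flag_deviations(wind_speeds, powers, curve, bin_width=1.0, tolerance=0.2):
    """True where power deviates from the baseline by more than `tolerance` (relative)."""
    flags = []
    for ws, p in zip(wind_speeds, powers):
        expected = curve.get(int(ws // bin_width))
        if not expected:
            # No baseline for this bin (or a zero baseline): leave the point unflagged.
            flags.append(False)
            continue
        flags.append(abs(p - expected) / expected > tolerance)
    return flags

# Made-up history (wind speed in m/s, power in kW) assumed to be normal operation.
hist_ws = [5.1, 5.4, 5.8, 8.0, 8.3, 8.9]
hist_power = [500, 510, 490, 1500, 1520, 1480]
baseline = fit_power_curve(hist_ws, hist_power)

# New observations: the second one produces far less than the baseline suggests.
print(flag_deviations([5.5, 8.2], [505, 900], baseline))
```

The median keeps the baseline reasonably stable even if a few abnormal points slip into the data assumed to be normal.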

Machine learning-based approaches were also investigated, as they can identify anomalies without an explicit definition of normal operation. Of these, a combination of isolation forest and K-means clustering showed promise. The isolation forest identified anomalous data, and the anomalous data were then clustered into two groups using K-means clustering. This approach helped identify anomalies where there was a clear, sustained change in operation mode. Unfortunately, it struggled to recognize anomalies during erratic operation or when multiple failure modes were present in the same dataset.
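The two-stage idea can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the team's implementation: the synthetic data, the two features (wind speed and normalised power), and the contamination rate of 0.1 are all invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

def detect_and_group_anomalies(X, contamination=0.1, random_state=0):
    """Stage 1: isolation forest flags anomalies. Stage 2: K-means splits them in two."""
    forest = IsolationForest(contamination=contamination, random_state=random_state)
    is_anomaly = forest.fit_predict(X) == -1  # fit_predict returns -1 for outliers
    groups = np.full(len(X), -1)              # -1 means "not an anomaly"
    if is_anomaly.sum() >= 2:
        kmeans = KMeans(n_clusters=2, n_init=10, random_state=random_state)
        groups[is_anomaly] = kmeans.fit_predict(X[is_anomaly])
    return is_anomaly, groups

# Synthetic data: columns are wind speed (m/s) and normalised power (0 to 1).
rng = np.random.default_rng(0)
ws_normal = rng.uniform(3, 12, 180)
normal = np.column_stack([ws_normal, ws_normal / 12 + rng.normal(0, 0.02, 180)])
low_output = np.column_stack([rng.uniform(8, 12, 10), np.full(10, 0.3)])
high_output = np.column_stack([rng.uniform(3, 6, 10), np.full(10, 0.9)])
X = np.vstack([normal, low_output, high_output])

is_anomaly, groups = detect_and_group_anomalies(X)
print(int(is_anomaly.sum()), np.unique(groups[is_anomaly]))
```

With real turbine data the features would usually be scaled first, since both models are sensitive to the relative ranges of the columns.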

Alexander, Jamison, Eva, and Stefan were also challenged to move beyond binary anomaly classification. Wind power plant operators are especially interested in differentiating between types of anomalies, as this information helps them intervene quickly and precisely. So, the students created approaches to identify two anomalous operation modes: curtailment and curve shift.

Curtailment occurs when the output of the turbine is manually reduced to better align with power grid demand. A curve shift, on the other hand, indicates that the wind speed sensor on the turbine is malfunctioning, making it seem as though the power output is either too high or too low at any given wind speed. Both anomaly types have distinct relationships with normal operation, and the students were able to create rules that detect when these anomalies are present.
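To make the distinction concrete, here is a minimal rule-based sketch. The rules, thresholds and the idealised linear power curve are hypothetical, chosen only to illustrate how a power plateau (curtailment) differs from a horizontal offset in wind speed (curve shift); they are not the rules the team built.

```python
def expected_power(ws, cut_in=3.0, rated_ws=12.0):
    """Hypothetical normalised power curve: linear ramp from cut-in to rated speed."""
    if ws <= cut_in:
        return 0.0
    if ws >= rated_ws:
        return 1.0
    return (ws - cut_in) / (rated_ws - cut_in)

def mean_abs_error(wind_speeds, powers, offset=0.0):
    """Average gap between observed power and the curve evaluated at ws + offset."""
    gaps = [abs(p - expected_power(ws + offset)) for ws, p in zip(wind_speeds, powers)]
    return sum(gaps) / len(gaps)

def is_curtailed(wind_speeds, powers, tol=0.03, min_share=0.3):
    """Curtailment rule: many points sit on a plateau below what the wind would allow."""
    cap = max(powers)
    capped = [p for ws, p in zip(wind_speeds, powers)
              if abs(p - cap) <= tol and expected_power(ws) > cap + tol]
    return len(capped) / len(powers) >= min_share

def estimate_shift(wind_speeds, powers, max_shift=3.0, step=0.25):
    """Wind-speed offset (m/s) that best aligns the observations with the curve."""
    n = int(max_shift / step)
    offsets = [k * step for k in range(-n, n + 1)]
    return min(offsets, key=lambda d: mean_abs_error(wind_speeds, powers, d))

def classify_window(wind_speeds, powers, shift_threshold=1.0):
    """Label a window of observations as curtailment, curve shift, or normal."""
    if is_curtailed(wind_speeds, powers):
        return "curtailment"
    if abs(estimate_shift(wind_speeds, powers)) >= shift_threshold:
        return "curve shift"
    return "normal"

speeds = [float(ws) for ws in range(4, 13)]
normal_power = [expected_power(ws) for ws in speeds]
curtailed_power = [min(p, 0.5) for p in normal_power]       # output capped at 50%
shifted_power = [expected_power(ws + 2.0) for ws in speeds]  # sensor reads 2 m/s low

for powers in (normal_power, curtailed_power, shifted_power):
    print(classify_window(speeds, powers))
```

Checking for curtailment first matters here, because a plateau can also be partly explained by a shifted curve.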







In the image above, the curve shift can be seen on the left and the curtailment on the right.

The team was pleased with their contribution to this topic and hopes that their work will add value to Fluence and its mission.


 

LEDCity: sensor calibration, evaluating temperature and humidity data

Students: Anita Gaj, Avinash Chekuru, Ling Yee Khor  

LEDCity is a startup company that offers unique lighting products that reduce energy costs by up to 90% compared to traditional lamps and up to 50% compared to conventional LEDs. This is accomplished by incorporating a series of sensors and a microcontroller into the lamps so that they can self-regulate their brightness in a decentralized manner. Additionally, a variety of humidity, temperature, and occupancy data is recorded so that customers can make informed decisions about cleaning, product storage, and office use, as well as receive alerts about unusual events such as flooding or fire. However, the operation of the lighting itself generates heat that affects the sensor readings, making the measurements inaccurate.

The goal of the project was to develop a data science pipeline to clean and analyze data from the humidity and temperature sensors in the lamps, and then to integrate a machine learning model that accurately predicts readings within a target error of 1 °C while correcting for any self-interference.
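The correction idea can be illustrated with a single-feature linear model. The sketch assumes that the sensor offset grows linearly with lamp brightness, which is a simplification made for this example, and all readings below are invented.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def correct_temperature(measured, brightness, slope, intercept):
    """Subtract the modelled self-heating offset from each raw reading."""
    return [m - (slope * b + intercept) for m, b in zip(measured, brightness)]

def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))

# Made-up calibration experiment: brightness in %, temperatures in °C.
brightness = [0, 20, 40, 60, 80, 100]
reference = [21.0, 21.2, 21.1, 21.4, 21.3, 21.5]   # independent thermometer
measured = [21.4, 22.4, 22.7, 23.8, 24.1, 25.2]    # lamp sensor, heated by the LEDs

offsets = [m - r for m, r in zip(measured, reference)]
slope, intercept = fit_line(brightness, offsets)
corrected = correct_temperature(measured, brightness, slope, intercept)

mae_before = mean_absolute_error(reference, measured)
mae_after = mean_absolute_error(reference, corrected)
print(round(mae_before, 2), round(mae_after, 2), mae_after <= 1.0)
```

For an honest estimate, the error would be measured on readings that were not used to fit the correction.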

Several analysis steps were integrated into the workflow: 
  • perform exploratory data analysis to process and clean time-series data from experiments
  • detect anomalies
  • identify potential systematic errors


Using a correlation matrix, the key features for the machine learning model were selected, and redundant ones, which could skew results, were omitted. Four different supervised machine learning models were then developed, based on PyCaret and traditional regression modelling approaches. These can be used to replicate the experiment or to allow the customer to predict the actual temperature and humidity.
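A correlation-based filter of this kind might look like the sketch below. The greedy strategy, the 0.95 threshold and the feature names are assumptions for illustration; the post does not describe the exact selection procedure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_redundant(features, target, threshold=0.95):
    """Keep features in order of relevance, skipping near-duplicates of kept ones."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Made-up readings; the Fahrenheit column duplicates the Celsius one.
sensor_temp_c = [22.0, 23.5, 21.0, 25.0, 24.0, 22.5]
features = {
    "sensor_temp_c": sensor_temp_c,
    "sensor_temp_f": [c * 9 / 5 + 32 for c in sensor_temp_c],
    "humidity": [40, 38, 45, 35, 42, 39],
}
reference_temp = [21.0, 22.1, 20.3, 23.2, 22.6, 21.4]

print(drop_redundant(features, reference_temp))
```

Only one of the two temperature columns survives, because they carry the same information.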

Finally, the model was integrated into a user-friendly Streamlit app. This allows customers to predict temperature and humidity for a single lamp or for many lamps at once.



comparison-predictions



The graph above compares the real measurements with the predictions of the machine learning model for test data and random unseen data, showing robust performance with a deviation of 0.2 degrees Celsius.

To summarize:
  • Anita, Avinash, and Ling Yee developed a data science pipeline to process lamp sensor data efficiently   
  • The robust supervised machine learning model predicts actual temperature and humidity data with 0.2 degrees Celsius deviation
  • The easy-to-use Streamlit app interface allows customers to predict true values remotely 


As a result, the overall productivity of the lamps was improved and overall costs were reduced, making the product more environmentally friendly. With ever-increasing global energy demand, limited resources, and the current global energy crisis, it is crucial to find innovative solutions or improve existing ones, and this AI-based digital solution contributes to that goal.
 

Thank you all for a fantastic time and a great project phase! Constructor Learning wishes all our Data Science graduates the best for their future.

Interested in reading more about Constructor Learning and tech-related topics? Then check out our other blog posts.
