Data Science Capstone projects batch #20 (Zurich) and #1 (Munich)

by Ekaterina Butyugina

[Image: Data Science students working]
We are delighted to present the remarkable projects completed by our Data Science students from batch #20 in Zurich and batch #1 in Munich. Despite only three months of training, these students achieved noteworthy results. Read on to discover more about their accomplishments.

Rolos: separating signal from noise in car racing communication

Students: Kai Hugenroth, Markus von der Luehe, Mounika Veeramalla, Kai Braeunig

Rolos aims to help scientists increase productivity and achieve exponential progress in research projects by leveraging the power of big data to address researchers' most common pain points and provide them with access to advanced technologies. The company's customer, NASCAR, has 45 radio channels for communication between drivers and engineers. However, monitoring all the channels is a resource-consuming task. Thus, an automated real-time radio analytics service is needed to provide better insight.

Mounika, Kai B., Kai H. and Markus set out to classify text messages and identify key messages in the radio communications of auto racing teams. Their system's goal is to help engineers focus on high-priority messages by distinguishing important information from noise. 

Rolos' and the students' approach involves recording radio communications in real time, transcribing every message, and applying NLP, machine-learning, and deep-learning techniques to classify the messages and identify the key ones. The result is a highly efficient system that lets engineers focus only on the messages that can affect the race: it correctly classifies 93% of all messages as high or low importance and catches 75% of all high-importance messages. See the picture below.


[Image: Message priority (Rolos)]
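The students' exact models are not described here, but a minimal sketch of this kind of priority classifier, assuming the transcripts are already available as labeled text (the file and column names below are hypothetical), could look like this:

```python
# Minimal sketch of a message-priority classifier on transcribed radio messages.
# Assumes a hypothetical CSV with columns "transcript" and "priority" (high/low);
# the actual Rolos pipeline and models are not shown here.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("radio_messages.csv")  # hypothetical file
X_train, X_test, y_train, y_test = train_test_split(
    df["transcript"], df["priority"],
    test_size=0.2, stratify=df["priority"], random_state=42,
)

# TF-IDF features over word n-grams feeding a linear classifier
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
clf.fit(X_train, y_train)

# Per-class precision/recall; recall on the high-importance class matters most here
print(classification_report(y_test, clf.predict(X_test)))
```

In such a setup, recall on the high-importance class is the metric to watch, since missing a critical message is costlier than flagging a harmless one.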


The benefits of this system are numerous. It increases safety by analyzing messages in real time, handles fast, unpredictable inputs with high classification performance, and reduces labor costs. It also provides more accurate information to race teams, enabling them to make informed decisions that can improve their performance.


[Image: Results and future implementation (Rolos)]


Future plans include collecting human feedback on the data, implementing a live scenario for real-time optimization, and collecting more training data to improve the system's accuracy.

 

SIX: The Merchant Master - A Data Extraction and Mining Tool

Students: Daphne Teh, Ziba Mirza, Ronja Harder 

The average person makes thousands of decisions every day. These decisions are often made intuitively and with limited information regardless of their importance. In the information age, however, decision makers have the potential to amass an abundance of data. This can empower them to make strategic, consequential decisions, but only if they have the necessary tools to do so. 

The team was given the mandate to do precisely this for SIX Group. SIX connects financial markets in Switzerland and abroad, providing services related to its four business units: stock exchanges, securities services, banking services, and financial information. As part of its goal to provide end-to-end information services to banks, SIX challenged the team to extract merchant names from a dataset of records, mine information about them, and categorize the merchants.

Given the quality of the raw data and the fact that most of these merchants were small and medium-sized enterprises, the project would be considered a success if the team managed to:
  1. Automate the extraction, categorization, and mining processes.
  2. Obtain metadata for 20% of the merchants in the list.
The team embarked on a three-pronged strategy and far exceeded expectations. This culminated in:
  1. 91.6% of the merchant names being successfully extracted.
  2. 54.5% of the merchant names yielding metadata.
  3. 66.3% of the merchants being categorized.
  4. One application that streamlined and automated the extraction, mining, and categorization processes and that could be reused for future datasets with different data points.
In order to achieve these results, the team leveraged a variety of tools including Natural Language Processing models, metadata platforms, and APIs. 

The first prong involved creating a processor to remove noise from the dataset. 
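The write-up does not spell out the processor itself, but a simplified sketch of such a cleaning step, assuming the raw records are free-text strings with transaction codes and stray tokens mixed in (all patterns below are hypothetical), might look like this:

```python
# Simplified sketch of a record-cleaning "processor" (illustrative only).
# The patterns below are hypothetical; the team's actual rules are not shown.
import re

NOISE_PATTERNS = [
    r"\b\d{4,}\b",        # long digit runs (card/terminal numbers)
    r"\b(pos|tx|ref)\b",  # common transaction tokens
    r"[^\w\s&\.\-]",      # stray punctuation
]

def clean_record(raw: str) -> str:
    """Strip hypothetical noise tokens and normalize whitespace."""
    text = raw.lower()
    for pattern in NOISE_PATTERNS:
        text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_record("POS 00482 Cafe Zurich 4421 REF 99812"))  # -> "cafe zurich"
```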

The second prong leveraged Google, OpenStreetMap, and the official trade registry of the Swiss government to mine data. These sources were selected to obtain, respectively, consumer-relevant data, geospatial data, and official information that could further be used for verification. The picture below shows the data-mining sources together with their success rates.


[Image: First, second and third prong]
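As an illustration of the kind of lookup the second prong relies on, the short sketch below queries the public OpenStreetMap Nominatim API for a single merchant name; the merchant string is hypothetical, and the team's actual sources, batching, and rate limiting are not reproduced.

```python
# Minimal sketch: look up a merchant name via the public OpenStreetMap
# Nominatim API to retrieve geospatial metadata (illustrative only).
import requests

def lookup_merchant(name: str, country_code: str = "ch"):
    """Return the first Nominatim match for a merchant name, or None."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": name, "format": "json", "countrycodes": country_code, "limit": 1},
        headers={"User-Agent": "merchant-enrichment-demo"},  # required by Nominatim's usage policy
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()
    return results[0] if results else None

hit = lookup_merchant("Sprüngli Zürich")  # hypothetical merchant string
if hit:
    print(hit["display_name"], hit["lat"], hit["lon"])
```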


The third prong used the output from various APIs for language detection and translation, combined with a curated list of context-specific brand names, to aggregate and assign categories to the merchant names.
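A toy version of that categorization step might look like the sketch below, which uses the langdetect library as a stand-in for the language-detection APIs and a tiny, hypothetical brand-to-category map:

```python
# Illustrative sketch of the categorization step: detect the language of a
# merchant string, then match it against a curated, context-specific brand list.
# langdetect stands in for the APIs the team used; the brand map is hypothetical.
from langdetect import detect

BRAND_CATEGORIES = {        # hypothetical curated list
    "migros": "Groceries",
    "sbb": "Transport",
    "sprüngli": "Food & Beverage",
}

def categorize(merchant: str):
    """Return (detected language, assigned category) for a cleaned merchant name."""
    language = detect(merchant)
    lowered = merchant.lower()
    for brand, category in BRAND_CATEGORIES.items():
        if brand in lowered:
            return language, category
    return language, "Uncategorized"

print(categorize("Sprüngli Bahnhofstrasse"))
```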
Finally, the team built an app on Streamlit that integrated these processes and provided SIX with additional analytics through visualizations such as a map showcasing the merchants' locations:


[Image: Location of activity]
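A stripped-down version of such a Streamlit page, assuming the enriched dataset already contains latitude/longitude and category columns (file and column names below are hypothetical), could be as short as:

```python
# Stripped-down Streamlit page plotting enriched merchants on a map.
# Column names ("lat", "lon", "category") and the input file are hypothetical.
import pandas as pd
import streamlit as st

st.title("Merchant Master – enriched merchants")

df = pd.read_csv("enriched_merchants.csv")  # hypothetical output of the pipeline
category = st.selectbox("Category", ["All"] + sorted(df["category"].dropna().unique()))
if category != "All":
    df = df[df["category"] == category]

st.map(df[["lat", "lon"]].dropna())  # st.map expects lat/lon columns
st.dataframe(df)
```

Running `streamlit run app.py` would serve this page locally with the filterable map and table.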


Through these steps, the team transformed a list of records into a highly enriched dataset and created a tool that can systematically produce clean, high-quality data on little-known companies and generate valuable insights.

 

NeatLeaf: image-based anomaly detection for an agricultural application

Students: Agustin Rojo Serrano, Michael Schlathölter, Jonas von Kobylinski, Cuong Huy Nguyen   

NeatLeaf helps indoor and greenhouse cultivators increase their yield by closely supervising the plants with advanced technology and AI. The robot Spyder drives around the cultivation facility and takes images of the plants three times a day. For crop training, an AI algorithm analyzes the images, looking for plant anomalies and flagging them when detected. Early detection of plant stress is crucial for fighting pests and other diseases, increasing the productivity of facilities through fewer crop failures and less severe incidents. Every image used to train the AI model goes through a labeling process in which experts note whether the plants show any diseases, which means a lot of labor and cost for the company. Therefore, image augmentation is used to artificially enlarge the database.

Michael, Jonas, Agustin, and Cuong had access to more than 13k images, an existing AI model in PyTorch to modify, and GPUs accessible through an SSH tunnel for running experiments. The goal was to explore the effects of noise on model performance.

Their approach involved looking at the different suggested noise types, as well as some other available quality-reduction techniques. Before applying them randomly to the images, proper limits needed to be set, as too much noise can hide the images' features. Because the anomaly frequencies are not evenly distributed, they also balanced the dataset by adding some images with rare labels multiple times.
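The exact noise types and limits are not listed here, but a sketch of capped Gaussian-noise augmentation combined with label-aware oversampling in PyTorch (all parameter values are illustrative) could look like this:

```python
# Sketch of capped noise augmentation and oversampling of rare labels in PyTorch.
# The noise cap and sampling weights are hypothetical, illustrative values.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

class CappedGaussianNoise:
    """Add Gaussian noise to an image tensor, capped so features stay visible."""
    def __init__(self, max_std: float = 0.05):
        self.max_std = max_std

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        std = torch.rand(1).item() * self.max_std  # random strength up to the cap
        return torch.clamp(img + torch.randn_like(img) * std, 0.0, 1.0)

def make_loader(dataset, label_counts, labels, batch_size=32):
    """Oversample samples carrying rare anomaly labels (weights are illustrative)."""
    weights = [1.0 / label_counts[label] for label in labels]  # rarer label -> higher weight
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```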


[Image: Unbalanced dataset]


The F1 score increased slightly, from 81.7% to 82.2%. Using SHAP gradients to visualize the labeling process shows that the noise-based models take larger areas into account when labeling, which makes them more confident in their choices.
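SHAP's gradient-based explainer can produce this kind of visualization for a PyTorch image model; a minimal sketch, where `model`, `background`, and `test_images` are placeholders for the trained classifier and two small preprocessed batches, looks roughly like this:

```python
# Minimal sketch of gradient-based SHAP attributions for a PyTorch image model.
# `model`, `background`, and `test_images` are placeholders (NCHW float tensors).
import numpy as np
import shap

def explain(model, background, test_images):
    model.eval()
    explainer = shap.GradientExplainer(model, background)  # expected-gradients explainer
    shap_values = explainer.shap_values(test_images)        # one array per output class
    # shap.image_plot expects channels-last arrays, so move the channel axis
    shap_nhwc = [np.moveaxis(s, 1, -1) for s in shap_values]
    images_nhwc = np.moveaxis(test_images.numpy(), 1, -1)
    shap.image_plot(shap_nhwc, images_nhwc)
```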


[Image: Anomaly detection for yellowing]


All in all, the AI is quite reliable at detecting anomalies, even ones as small as yellowing tips. It supports the human workforce, even if it does not replace it completely yet, and it can cover large areas.

 

ReoR20: Precipitation prediction using rain gauges and remote sensing

Students: Elizaveta Lakimenko, Lucas Pahlisch, Marco Ferrari, Alexej Khalilzada

Changes in precipitation patterns caused by climate change and increased urbanization in flood-prone areas have recently increased the frequency and severity of floods in many parts of the world [1].

ReoR20 is building the next generation of flood prediction models to help stakeholders better manage the risk of flood disasters at high spatial and temporal resolution. To achieve this, they are always in need of better weather input data to feed their model.

The objective of this project was to create a machine learning model that could predict rainfall occurring in any watershed of the conterminous United States from rain gauge data. Elizaveta, Alexej, Lucas, and Marco were provided the rain gauge measurements for days spanning from 2010 to 2021 and a map of the watersheds in which they were located. They were also given geographic information describing the watershed and the location of the rain gauges. Using this information, they were asked to predict radar precipitation measurements, which are considered a more reliable source of precipitation data.

After trying several approaches, they converged on a model that uses aggregated statistical information about each day for each watershed in the US. Additionally, each watershed was divided into five altitude zones, and the same aggregated statistics were calculated for each zone. This information was used as input to train an ensemble machine-learning model.
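A condensed sketch of that feature-building and modeling step, with hypothetical column names and a random forest standing in for the team's ensemble, could look like the following:

```python
# Condensed sketch: aggregate daily gauge statistics per watershed (and altitude
# zone), then train an ensemble model against radar precipitation.
# File names, column names, and the choice of model are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

gauges = pd.read_csv("gauge_measurements.csv")  # hypothetical: date, watershed_id, alt_zone, rain_mm
radar = pd.read_csv("radar_targets.csv")        # hypothetical: date, watershed_id, radar_mm

# Aggregate gauge statistics per watershed and day, split by altitude zone
agg = (gauges.groupby(["date", "watershed_id", "alt_zone"])["rain_mm"]
             .agg(["mean", "max", "std", "count"])
             .unstack("alt_zone", fill_value=0))
agg.columns = [f"zone{zone}_{stat}" for stat, zone in agg.columns]
agg = agg.fillna(0)  # std is undefined for zones with a single gauge

features = agg.reset_index().merge(radar, on=["date", "watershed_id"])
X = features.drop(columns=["date", "watershed_id", "radar_mm"])
y = features["radar_mm"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
print("R2:", r2_score(y_test, model.predict(X_test)))
```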





The team produced seven different models, one for each of the major watershed basins in the U.S., and performance differed from basin to basin. On average, an R² score of 0.55 was achieved. Overall, the model predicts well when a rain event will occur; the magnitude of the event is where it could still be improved.


[Image: Rain gauge data for one catchment: target diff]


[1] IPCC Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation (eds Field, C. B. et al.) (Cambridge Univ. Press, 2012).

Thank you everybody for a fantastic partnership and an amazing project period! We at Constructor Academy wish our Data Science grads the best of luck.



Take your career to the next level with Constructor Academy's Data Science bootcamp

Are you interested in a highly demanding, well-respected, and financially rewarding career? Look no further than the Data Science bootcamp offered by Constructor Academy.

The goal of the bootcamp is to teach you the techniques and technologies for processing real-world data, and it is offered in both full-time (12 weeks) and part-time (22 weeks) options. During the bootcamp, you will learn technologies such as machine learning, natural language processing (NLP), Python, deep learning, data visualization, and R.

Or start your journey with our free introduction to data science. Just click here to learn more.

Interested in reading more about Constructor Academy and tech related topics? Then check out our other blog posts.
