AUTOMATED IDENTIFICATION OF BUILD DISCUSSIONS ON MICROSERVICES SYSTEMS: AN EMPIRICAL STUDY
Keywords:
Microservices Architecture, Build Discussions, Machine Learning

Abstract
In recent years, microservices architecture has gained widespread popularity over traditional monolithic systems, largely due to its flexible development cycle and enhanced scalability. Software quality is a major concern in software development, and issues within a system can significantly degrade it. Microservices developers face several challenges in monitoring and managing issues such as failures, faults, and errors. These challenges often stem from a lack of evidence and understanding, hindering the effective implementation of quality practices in Microservices Architecture (MSA). The process of converting source code into an executable artifact is known as a build, and any problems that occur during this process are referred to as build issues. In the current literature, the methods available for identifying build issues in microservices are primarily qualitative or rely on manual research approaches; methods such as thematic analysis (TA) and grounded theory (GT) can be difficult to manage because of their complexity and time-consuming nature. To address this gap, we identify build-related discussions within existing microservices-based systems. We define a build discussion as a developer conversation, typically written in paragraph form, that addresses challenges and decisions related to the build process. This research focuses on identifying build issues, which often stem from poor management and dependency problems, using machine learning (ML) techniques. We applied ML and deep learning (DL) models to a manually curated dataset consisting of project discussions and annotations. Annotation identified 11,663 non-build discussions and 1,997 build-related discussions. The ML models, evaluated using k-fold cross-validation, achieved the following performance metrics: Precision 83.60%, Recall 72.34%, F-score 77.79%, AUC 80.44%, and G-Means 68.55%. Among the three baseline models, DeepM1 performed the best.
The validation survey further confirmed that build discussions identified through DeepM1 are beneficial in practice.
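As a point of reference for the metrics reported above, the following minimal sketch (not the authors' code) shows how Precision, Recall, F-score, and G-Means are typically derived from a binary confusion matrix, with class 1 denoting a build discussion and class 0 a non-build discussion. The counts used are hypothetical, for illustration only.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F1, and G-mean for a binary classifier.

    G-mean is the geometric mean of sensitivity (recall on the positive
    class) and specificity (recall on the negative class), a common
    choice for imbalanced datasets such as build vs. non-build discussions.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(recall * specificity)
    return precision, recall, f1, g_mean

# Hypothetical confusion-matrix counts, not the study's actual results.
p, r, f1, g = classification_metrics(tp=80, fp=20, tn=180, fn=20)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f} G-mean={g:.2f}")
```

Because build discussions are a minority class here (1,997 of 13,660), G-mean penalizes a model that achieves high accuracy simply by favoring the majority class.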