A COMPREHENSIVE STUDY OF K-MEANS AND K-NEAREST NEIGHBORS ALGORITHMS IN BIG DATA CONTEXT
Abstract
Businesses, the web and social media, government and non-government organizations, and other sources now produce huge amounts of data in real time. Analyzing this collection, popularly known as Big Data, so that an information retrieval system returns accurate and relevant results remains a challenging task, especially when the velocity is high. Efficient and scalable data processing techniques and platforms are still needed to handle such data and to retrieve relevant information accurately, in real time, from large collections available in structured and unstructured formats (such as the data generated during a pandemic like COVID-19). Various machine learning platforms have been established, each with unique features for handling the volume, velocity, and variety of Big Data. The common concerns of these platforms are computational speed and the amount of data they can handle. A comprehensive analysis of machine learning platforms is needed to help new researchers and application developers build applications that process data efficiently in a big data environment. In this research, K-Means clustering and K-Nearest Neighbors (KNN) classification are studied on some specific datasets. The algorithms are analyzed in terms of seven features: major contribution, performance measures, effectiveness measures, big data environment, parallel and distributed processing, tools used, and evaluation metrics. The results are based on a thorough analysis of the K-Means and KNN schemes. Overall, this study provides insight into recent trends in the big data environment using K-Means and KNN algorithms.













