Enhancing Breast Cancer Diagnosis Through Clustering: A Study of KMeans, Agglomerative, and Gaussian Mixture Models


Author : Nikhil Sanjay Suryawanshi

Volume/Issue : Volume 8 - 2023, Issue 7 - July

Google Scholar : https://tinyurl.com/5x5p9nau

Scribd : https://tinyurl.com/yk5v23xz

DOI : https://doi.org/10.38124/ijisrt/IJISRT23JUL2308

Abstract : In the evolving landscape of medical data analysis, clustering techniques play a pivotal role, particularly in deciphering intricate patterns within datasets, such as those linked to cancer diagnostics. With the continuous expansion and increasing complexity of healthcare data, there is a growing demand for effective clustering algorithms capable of extracting significant insights. Current trends underscore the necessity of carefully selecting the most appropriate clustering method to improve both the accuracy and interpretability of analytical results. In this paper, we conduct a comprehensive comparison of three prominent clustering algorithms - KMeans, Agglomerative Clustering, and Gaussian Mixture Model (GMM) - applied to a breast cancer dataset comprising features from Fine Needle Aspirates (FNA) of breast masses. Following thorough preprocessing and scaling of the features, we assess the performance of these clustering techniques using the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Score. The findings reveal that KMeans provides superior cluster separation and clarity relative to the other algorithms. This research emphasizes the critical role of algorithm selection based on specific dataset attributes and evaluation metrics, aiming to enhance the accuracy of clustering outcomes in breast cancer classification.

Keywords : Clustering, Breast Cancer, KMeans, Agglomerative Clustering, Gaussian Mixture Model (GMM), Fine Needle Aspirates (FNA), Silhouette Score, Calinski-Harabasz Score, Davies-Bouldin Score, Medical Data Analysis, Deep Embedded Clustering.
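The paper's own code is not reproduced on this page. As an illustration only, the pipeline the abstract describes - scale the FNA features, fit the three clustering models, and score each with the Silhouette, Calinski-Harabasz, and Davies-Bouldin indices - can be sketched with scikit-learn. The use of scikit-learn's bundled Wisconsin breast-cancer (FNA) dataset, two clusters, and fixed random seeds are assumptions, not details confirmed by the abstract:

```python
# Hedged sketch of the comparison described in the abstract, using
# scikit-learn's bundled Wisconsin breast-cancer (FNA) dataset as a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

# Preprocess: standardize the 30 FNA-derived features.
X = StandardScaler().fit_transform(load_breast_cancer().data)

# The three algorithms compared in the paper (2 clusters assumed,
# matching the benign/malignant structure of the data).
models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
    "GMM": GaussianMixture(n_components=2, random_state=0),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)  # all three estimators support fit_predict
    results[name] = {
        "silhouette": silhouette_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Higher Silhouette and Calinski-Harabasz values and lower Davies-Bouldin values indicate better-separated clusters, which is the basis on which the paper reports KMeans performing best.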

References :

  1. Liu, Y., Zhang, T., & Chen, W. (2022). A deep learning-based clustering framework for high-dimensional and noisy big data. IEEE Transactions on Knowledge and Data Engineering, 34(8), 3645-3659.
  2. Wang, Y., Saraswat, S. K., & Komari, I. E. (2023). Big data analysis using a parallel ensemble clustering architecture and an unsupervised feature selection approach. Journal of King Saud University - Computer and Information Sciences, 35(1), 270–282.
  3. Patel, R., Gupta, S., & Patel, H. (2022). Scalable big data clustering using advanced machine learning models on Apache Spark. Big Data Research, 27, 100
  4. Xie, J., et al. (2016). Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (pp. 478-487).
  5. Yang, J., et al. (2017). Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In International Conference on Machine Learning (pp. 3861-3870).
  6. Guan, Y., et al. (2021). Deep discriminative clustering analysis. In International Conference on Machine Learning (pp. 3864-3875).
  7. Bahmani, B., et al. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
  8. He, Y., et al. (2011). MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In 2011 IEEE 17th International Conference on Parallel and Distributed Systems (pp. 473-480).
  9. Campello, R. J., et al. (2022). Scalable density-based clustering: A data mining perspective. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(3), 1-38.
  10. Strehl, A., & Ghosh, J. (2002). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec), 583-617.
  11. Kashef, R., & Kamel, M. S. (2009). Cooperative clustering. Pattern Recognition, 42(10), 2324-2349.
  12. Peng, X., et al. (2017). Deep clustering via integrating sparse subspace clustering analysis and deep representation. Pattern Recognition Letters, 98, 74-83.
  13. Zhu, X., et al. (2023). Hybrid Meta-Clustering: A meta-learning approach to clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 1098-1113.
  14. Aggarwal, C. C., et al. (1999). PROCLUS: A technique for projective clustering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (pp. 94-104).
  15. Agrawal, R., et al. (1998). Automatic subspace clustering of high dimensional data for data mining applications (Vol. 27, No. 2, pp. 94-107). ACM.
  16. Moise, G., et al. (2009). HARP: Hybrid Approximate Recursive Partitioning for Clustering High-Dimensional Data. In Proceedings of the 2009 IEEE International Conference on Data Mining (pp. 878-883).
  17. Vidal, R., et al. (2022). Subspace-Constrained Clustering: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12), 8351-8366.
  18. Xu, X., et al. (2007). SCAN: A structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 824-833).
  19. Staudt, C. L., et al. (2016). Parallel clustering on big data. Computational Statistics & Data Analysis, 101, 52-67.
  20. Wang, Y., et al. (2023). Graph Convolutional Clustering: A Deep Learning Approach to Graph Clustering. In Proceedings of the 13th ACM International Conference on Web Search and Data Mining (pp. 861-869).
  21. Rendón, E., et al. (2011). Internal versus external cluster validation indexes. International Journal of Computers and Communications, 5(1), 27-34.
  22. Sips, M., et al. (2009). Interactive visual clustering. In Proceedings of the 14th International Conference on Information Visualisation (pp. 361-368).
  23. Kriegel, H. P., et al. (2022). Cluster Validation Techniques for Big Data. ACM Computing Surveys (CSUR), 55(1), 1-38.
  24. Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Kumar, S. (2019). Distributed Deep Clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9489-9498).
  25. Mukherjee, S., Asnani, H., Lin, E., & Kannan, S. (2019). Deep Adversarial Clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4991-5000).
  26. Braverman, V., Meyerson, A., Ostrovsky, R., Roytman, A., Shindler, M., & Tagiku, B. (2017). Streaming k-means on well-clusterable data. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 26-40).
  27. Shah, V., & Mitra, K. (2019). Online Deep Clustering. In Proceedings of the IEEE International Conference on Data Mining (pp. 1011-1016).
  28. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
  29. Dai, B. R., & Lin, I. C. (2012). Efficient mapreduce-based DBSCAN algorithm with optimized data partition. In Proceedings of the IEEE 5th International Conference on Cloud Computing (pp. 59-66).
  30. Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001). Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning (pp. 577-584).
  31. Shu, R., Chen, Y., Kumar, A., Ermon, S., & Poole, B. (2020). Unsupervised Disentangled Representation Learning For Interpretable Clustering. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 5792-5799).
  32. Srinivasan, B. V., & Orhobor, O. I. (2022). Explainable Clustering: Understanding and Explaining Cluster Structures. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 10451-10459).
  33. Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In International conference on machine learning (pp. 478-487).
  34. Jiang, Z., Zheng, Y., Tan, H., Tang, B., & Zhou, H. (2017). Variational deep embedding: An unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1965-1972).
  35. Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 132-149).
  36. Zhao, H., Ding, Z., & Fu, Y. (2017). Multi-view clustering of high-dimensional data using kernel-based co-regularized spectral clustering. Knowledge-Based Systems, 123, 84-97.
  37. Kulkarni, S., & Shaikh, A. (2019). Ontology-driven clustering with biological knowledge for gene expression data analysis. Bioinformatics, 35(14), 2480-2488.
  38. Sarkar, S., & Viswanath, P. (2019). Differentially private clustering using subspace approximation. In Proceedings of the 19th International Conference on Data Mining (pp. 1060-1065).
  39. Hu, P., & Liang, S. (2021). Multi-modal clustering: A survey. Neurocomputing, 456, 260-276.
  40. Briggs, C., Fan, Z., & Andras, P. (2020). Federated machine learning for wireless distributed computing resource allocation. IEEE Transactions on Cognitive Communications and Networking, 6(4), 1193-1206.
