Authors :
Nikhil Sanjay Suryawanshi
Volume/Issue :
Volume 8 - 2023, Issue 7 - July
Google Scholar :
https://tinyurl.com/5x5p9nau
Scribd :
https://tinyurl.com/yk5v23xz
DOI :
https://doi.org/10.38124/ijisrt/IJISRT23JUL2308
Abstract :
In the evolving landscape of medical data analysis, clustering techniques play a pivotal role, particularly in deciphering intricate patterns within datasets such as those linked to cancer diagnostics. With the continuous expansion and increasing complexity of healthcare data, there is a growing demand for effective clustering algorithms capable of extracting significant insights. Current trends underscore the necessity of carefully selecting the most appropriate clustering method to improve both the accuracy and interpretability of analytical results. In this paper, we conduct a comprehensive comparison of three prominent clustering algorithms, namely KMeans, Agglomerative Clustering, and the Gaussian Mixture Model (GMM), applied to a breast cancer dataset comprising features derived from Fine Needle Aspirates (FNA) of breast masses. After thorough preprocessing and scaling of the features, we assess the performance of these clustering techniques using the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Score. The findings reveal that KMeans provides superior cluster separation and clarity relative to the other algorithms. This research emphasizes the critical role of algorithm selection based on specific dataset attributes and evaluation metrics, aiming to enhance the accuracy of clustering outcomes in breast cancer classification.
Keywords :
Clustering, Breast Cancer, KMeans, Agglomerative Clustering, Gaussian Mixture Model (GMM), Fine Needle Aspirates (FNA), Silhouette Score, Calinski-Harabasz Score, Davies-Bouldin Score, Medical Data Analysis, Deep Embedded Clustering.
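As a rough, minimal sketch of the comparison described in the abstract (not the authors' original code), the Python snippet below fits KMeans, Agglomerative Clustering, and a Gaussian Mixture Model on the scikit-learn copy of the FNA-derived breast cancer dataset and reports the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores. The choice of two clusters, standard scaling, and the fixed random seed are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative sketch: compare three clustering algorithms on the
# FNA-derived breast cancer dataset using three internal validity scores.
# Parameter choices (2 clusters, StandardScaler, random_state=42) are
# assumptions for demonstration, not the paper's exact setup.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Load the FNA feature matrix and scale each feature to zero mean, unit variance.
X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Candidate clustering models, each configured for two clusters.
models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
    "GMM": GaussianMixture(n_components=2, random_state=42),
}

# Fit each model and report the three evaluation metrics on the scaled data.
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    print(
        f"{name}: "
        f"Silhouette={silhouette_score(X_scaled, labels):.3f}, "
        f"Calinski-Harabasz={calinski_harabasz_score(X_scaled, labels):.1f}, "
        f"Davies-Bouldin={davies_bouldin_score(X_scaled, labels):.3f}"
    )
```

Higher Silhouette and Calinski-Harabasz values and a lower Davies-Bouldin value indicate better-separated, more compact clusters, which is the basis on which the abstract reports KMeans performing best.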