WEI Feng, MA Long. Locality-sensitive hashing K-means algorithm for large-scale datasets[J]. Journal of Mine Automation,2023,49(3):53-62. DOI: 10.13272/j.issn.1671-251x.2022080018

Locality-sensitive hashing K-means algorithm for large-scale datasets

More Information
  • Received Date: August 04, 2022
  • Revised Date: March 09, 2023
  • Available Online: October 24, 2022
  • Efficient processing of large datasets is a key enabler of intelligent coal mine construction, for example in coal mine safety monitoring and intelligent mining. To address the insufficient clustering efficiency and accuracy of the K-means algorithm on large datasets, an efficient K-means clustering algorithm based on locality-sensitive hashing (LSH) is proposed. First, the sampling process is optimized using LSH, and a data grouping algorithm, LSH-G, is proposed, which divides the large dataset into subgroups and effectively removes noisy points. Second, LSH-G is used to optimize the subgroup division step of the density biased sampling (DBS) algorithm, yielding a data group sampling algorithm, LSH-GD, whose sample set reflects the distribution of the original dataset more accurately. Finally, the K-means algorithm is used to cluster the generated sample set, so that useful information is mined from large datasets with low time complexity. The experimental results show that the optimal cascade combination consists of 10 AND operations and 8 OR operations, which yields the smallest sum of squared errors of the class centers (SSEC). On the artificial dataset, compared with the K-means algorithms based on multi-layer simple random sampling (M-SRS), DBS, and grid density biased sampling (G-DBS), the K-means algorithm based on LSH-GD improves clustering accuracy by an average of 56.63%, 54.59%, and 25.34%, respectively, and clustering efficiency by an average of 27.26%, 16.81%, and 7.07%, respectively. On the UCI standard dataset, the K-means algorithm based on LSH-GD achieves the best SSEC and CPU time consumption (CPU-C).
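
The pipeline summarized in the abstract (LSH grouping, density biased sampling over the groups, then K-means on the sample) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a Gaussian (p-stable) hash family, uses only an AND combination of hashes within a single table (the OR stage across tables is omitted), treats buckets below a size threshold as noise, and gives each point in a group of size n a sampling weight of n^(-e), one common form of density biased sampling. All function names, parameters, and default values are hypothetical.

```python
import numpy as np

def lsh_group(X, n_and=4, width=2.0, seed=0):
    """Group rows of X by the AND of n_and Gaussian (p-stable) hashes.

    Assumption: h(x) = floor((a.x + b) / width); two points share a group
    only if all n_and hash values agree.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(X.shape[1], n_and))   # random projection directions
    b = rng.uniform(0.0, width, size=n_and)    # random offsets in [0, width)
    keys = np.floor((X @ a + b) / width).astype(np.int64)
    # Rows with identical key vectors receive the same integer group label.
    _, labels = np.unique(keys, axis=0, return_inverse=True)
    return labels.reshape(-1)

def density_biased_sample(labels, n_samples, e=0.5, min_size=2, seed=0):
    """Sample point indices with per-point weight (group size) ** (-e).

    Groups smaller than min_size are treated as noise and skipped
    (an illustrative stand-in for the noise removal in the abstract).
    """
    rng = np.random.default_rng(seed)
    sizes = np.bincount(labels)
    idx = np.flatnonzero(sizes[labels] >= min_size)
    w = sizes[labels[idx]].astype(float) ** (-e)
    n = min(n_samples, idx.size)
    return rng.choice(idx, size=n, replace=False, p=w / w.sum())

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        new = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

# Synthetic example: three dense blobs plus sparse uniform noise.
rng = np.random.default_rng(42)
blobs = [rng.normal(c, 0.3, size=(300, 2)) for c in ((0, 0), (5, 5), (0, 5))]
noise = rng.uniform(-10, 10, size=(50, 2))
data = np.vstack(blobs + [noise])

groups = lsh_group(data, n_and=4, width=2.0)
sample_idx = density_biased_sample(groups, n_samples=100, e=0.5, min_size=5)
centers = kmeans(data[sample_idx], k=3)
print(centers)
```

In this sketch, e = 0 reduces to uniform sampling over the retained points, while larger e shifts the sample toward sparser groups. K-means then runs only on the sample rather than on the full dataset, which is where the reduction in running time comes from.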