
Locality-sensitive hashing K-means algorithm for large-scale datasets

WEI Feng, MA Long

Citation: WEI Feng, MA Long. Locality-sensitive hashing K-means algorithm for large-scale datasets[J]. Journal of Mine Automation, 2023, 49(3): 53-62. doi: 10.13272/j.issn.1671-251x.2022080018

doi: 10.13272/j.issn.1671-251x.2022080018
Funding: National Key Research and Development Program of China (2021YFB3201905).
Author information:

WEI Feng (1981-), male, from Taiyuan, Shanxi, research fellow, M.S., currently engaged in research on intelligent coal mine safety monitoring systems. E-mail: weifeng829@163.com

Corresponding author:

MA Long (1989-), male, from Xiaoyi, Shanxi, assistant research fellow, Ph.D., whose main research interests are data mining and coal mine robotics. E-mail: 444647811@qq.com

  • CLC number: TD67


  • Abstract: Efficient processing strategies for large-scale datasets are a key enabler of intelligent coal mine construction, including intelligent safety monitoring and intelligent mining. To address the insufficient clustering efficiency and accuracy of the K-means algorithm on large-scale datasets, an efficient K-means clustering algorithm based on locality-sensitive hashing (LSH) is proposed. First, the sampling process is optimized with LSH: a data-group construction algorithm, LSH-G, partitions a large dataset into reasonable sub-groups and effectively removes noise points from the dataset. Second, LSH-G is used to optimize the sub-group partitioning step of density-biased sampling (DBS), yielding a data-group sampling algorithm, LSH-GD, whose sample sets reflect the distribution of the original dataset more faithfully. On this basis, the K-means algorithm clusters the generated sample set, so that useful data can be mined efficiently from large datasets at low time complexity. Experimental results show that a cascade of 10 AND operations and 8 OR operations is the optimal cascade combination, giving the smallest sum of squared errors of the cluster centers (SSEC). On artificial datasets, compared with K-means based on multi-layer random sampling (M-SRS), on DBS, and on grid density-biased sampling (G-DBS), the LSH-GD based K-means algorithm improves clustering accuracy by 56.63%, 54.59% and 25.34% on average, and clustering efficiency by 27.26%, 16.81% and 7.07% on average, respectively. On UCI standard datasets, the LSH-GD based K-means algorithm achieves the best SSEC and CPU time consumption (CPU-C).
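The abstract describes a three-stage pipeline: LSH-based grouping with noise removal (LSH-G), density-biased sampling over the resulting groups (LSH-GD), and K-means on the sample. The sketch below only illustrates that idea under assumed details; the hash family (p-stable random projections), the noise rule (drop very small buckets), the size-proportional sampling rule, and every parameter value are my assumptions rather than the paper's LSH-G/LSH-GD constructions, and the function names lsh_buckets and lsh_gd_sample are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def lsh_buckets(X, n_hashes=4, bucket_width=4.0, seed=0):
    """Group points by a p-stable (random-projection) hash signature.

    Each scalar hash is floor((a . x + b) / w); concatenating n_hashes of
    them plays the role of an AND cascade.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(X.shape[1], n_hashes))        # projection directions
    b = rng.uniform(0.0, bucket_width, size=n_hashes)  # random offsets
    keys = np.floor((X @ a + b) / bucket_width).astype(int)
    buckets = {}
    for i, key in enumerate(map(tuple, keys)):
        buckets.setdefault(key, []).append(i)
    return buckets


def lsh_gd_sample(X, sample_ratio=0.1, min_bucket=5, seed=0):
    """Illustrative stand-in for LSH-GD: density-biased sampling over LSH buckets.

    Buckets smaller than min_bucket are treated as noise and discarded; each
    remaining bucket contributes samples roughly in proportion to its size.
    """
    rng = np.random.default_rng(seed)
    sample_idx = []
    for idx in lsh_buckets(X, seed=seed).values():
        if len(idx) < min_bucket:      # assumed noise rule
            continue
        k = max(1, round(len(idx) * sample_ratio))
        sample_idx.extend(rng.choice(idx, size=k, replace=False))
    return X[np.asarray(sample_idx)]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy stand-in for a large dataset: 3 Gaussian clusters in 5 dimensions.
    centers = rng.uniform(-10.0, 10.0, size=(3, 5))
    X = np.vstack([c + rng.normal(scale=0.5, size=(20_000, 5)) for c in centers])

    sample = lsh_gd_sample(X, sample_ratio=0.1)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample)

    # SSEC-style score: within-cluster sum of squared errors of the learned
    # centers, evaluated on the full dataset rather than the sample.
    labels = km.predict(X)
    ssec = float(((X - km.cluster_centers_[labels]) ** 2).sum())
    print(f"sample size: {len(sample)}, SSE of centers on full data: {ssec:.2f}")
```

The final lines compute an SSEC-style score, i.e. the within-cluster sum of squared errors of the learned centers, evaluated on the full dataset rather than on the sample so that the quality of the sampling step, not just the fit to the sample set, is what gets measured.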

     

  • Figure 1. Probability distribution of a $(d_1, d_2, p_1, p_2)$-sensitive hash function
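For readers unfamiliar with the notation, the textbook LSH definition behind Figure 1 (standard terminology, not quoted from the paper's body) is: a hash family $\mathcal{H}$ is $(d_1, d_2, p_1, p_2)$-sensitive, with $d_1 < d_2$ and $p_1 > p_2$, if for any points $x, y$ and $h$ drawn uniformly from $\mathcal{H}$,

$$ d(x, y) \le d_1 \Rightarrow \Pr[h(x) = h(y)] \ge p_1, \qquad d(x, y) \ge d_2 \Rightarrow \Pr[h(x) = h(y)] \le p_2 . $$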

Figure 2. Noise variation of data points screened by LSH-G algorithm under different cascade numbers

Figure 3. Sample set obtained after 10% sampling

Figure 4. Impact of data size change on clustering accuracy

Figure 5. Impact of data dimension change on clustering accuracy

Figure 6. Impact of data class number change on clustering accuracy

Figure 7. Impact of data size change on clustering efficiency

Figure 8. Impact of data dimension change on clustering efficiency

Figure 9. Impact of data class number change on clustering efficiency

Table 1. Test dataset characteristics

No.  Dataset name         Data size  Dimensions  Classes
1    Dataset 1            20 000     5           3
2    Dataset 2            40 000     5           3
3    Dataset 3            60 000     5           3
4    Dataset 4            80 000     5           3
5    Dataset 5            120 000    5           3
6    Dataset 6            220 000    5           3
7    Dataset 7            40 000     10          3
8    Dataset 8            40 000     15          3
9    Dataset 9            40 000     20          3
10   Dataset 10           40 000     40          3
11   Dataset 11           40 000     5           6
12   Dataset 12           40 000     5           9
13   Dataset 13           40 000     5           12
14   Dataset 14           40 000     5           27
15   Pen-based            10 992     16          10
16   Shuttle              14 500     9           7
17   Magic                19 020     10          2
18   Letter Recognition   20 000     16          26
19   Statlog              58 000     9           5

Table 2. SSEC of the 19 sample sets under different cascade combinations (A: number of AND operations; O: number of OR operations)

A     O=1     O=2     O=3     O=4     O=5     O=6     O=7     O=8     O=9     O=10
1     11.625  11.080  10.325  9.685   9.072   8.586   7.941   7.313   6.841   6.309
2     10.954  10.606  9.852   9.258   8.695   8.181   7.529   6.997   6.358   5.879
3     10.107  9.758   9.355   8.857   8.581   7.885   7.284   6.622   6.021   5.514
4     9.282   9.041   8.758   8.586   8.299   7.602   7.085   6.498   5.884   5.392
5     8.354   7.911   7.689   7.326   7.281   6.989   6.651   6.185   5.646   5.218
6     7.059   6.882   6.711   6.632   6.498   6.286   5.989   5.714   5.327   4.881
7     5.944   5.801   5.693   5.389   5.154   4.997   4.727   4.589   4.301   4.082
8     4.736   4.688   4.512   4.387   4.097   3.827   3.581   3.158   2.929   2.897
9     3.935   3.559   3.332   3.201   3.023   2.835   2.606   2.483   2.298   2.255
10    3.103   3.022   2.915   2.801   2.684   2.537   2.463   2.142   2.197   2.205
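Table 2 sweeps A, the number of AND operations, and O, the number of OR operations, in the hash cascade. As context, under the standard AND-OR amplification construction (which the paper's cascade presumably follows, though the material above does not spell it out), if a single hash function collides for a pair at distance $d$ with probability $p(d)$, then concatenating $A$ hashes and OR-ing $O$ such groups gives a collision probability of

$$ P(d) = 1 - \bigl(1 - p(d)^{A}\bigr)^{O} , $$

which steepens the transition between near and far pairs. The abstract's optimal combination, $A = 10$ and $O = 8$, corresponds to the smallest SSEC entry in the table (2.142).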

Table 3. SSEC of datasets with different data sizes

Dataset No.   Data size   Algorithm 1   Algorithm 2   Algorithm 3   Algorithm 4
1             20000       0.0812        0.0686        0.0628        0.0527
2             40000       0.0832        0.0781        0.0611        0.0505
3             60000       0.0904        0.0815        0.0587        0.0463
4             80000       0.0926        0.0784        0.0501        0.0412
5             120000      0.1051        0.0801        0.0491        0.0369
6             220000      0.2057        0.0947        0.0452        0.0305

Table 4. Average SSEC of datasets with different dimensions

Dataset No.   Dimensions   Algorithm 1   Algorithm 2   Algorithm 3   Algorithm 4
2             5            0.0166        0.0156        0.0122        0.0101
7             10           0.0171        0.0165        0.0144        0.0111
8             15           0.0193        0.0215        0.0142        0.0095
9             20           0.0207        0.0257        0.0132        0.0082
10            40           0.0288        0.0360        0.0124        0.0059

Table 5. Average SSEC of datasets with different numbers of classes

Dataset No.   Classes   Algorithm 1   Algorithm 2   Algorithm 3   Algorithm 4
2             3         0.0277        0.0260        0.0204        0.0168
11            6         0.0315        0.0356        0.0178        0.0162
12            9         0.0448        0.0548        0.0173        0.0157
13            12        0.0591        0.0687        0.0198        0.0154
14            27        0.0798        0.0975        0.0251        0.0115

Table 6. CPU-C of datasets with different data sizes (unit: s)

Dataset No.   Data size   Algorithm 1   Algorithm 2   Algorithm 3   Algorithm 4
1             20000       4.782         3.964         3.728         3.697
2             40000       5.525         4.698         4.388         4.171
3             60000       6.413         6.422         6.132         5.981
4             80000       8.522         7.335         6.716         6.435
5             120000      12.606        9.318         7.931         7.386
6             220000      16.131        14.969        10.949        9.852

Table 7. CPU-C of datasets with different dimensions (unit: s)

Dataset No.   Dimensions   Algorithm 1   Algorithm 2   Algorithm 3   Algorithm 4
2             5            5.525         4.698         4.388         4.171
7             10           6.091         5.131         4.779         4.433
8             15           6.788         5.764         5.286         4.825
9             20           7.645         6.598         5.862         5.276
10            40           10.481        10.771        8.535         7.411

Table 8. CPU-C of datasets with different numbers of classes (unit: s)

Dataset No.   Classes   Algorithm 1   Algorithm 2   Algorithm 3   Algorithm 4
2             3         5.525         4.698         4.388         4.171
11            6         6.223         5.308         4.858         4.801
12            9         6.981         5.991         5.403         5.314
13            12        7.831         6.866         6.286         5.757
14            27        12.481        11.986        9.997         7.634

Table 9. SSEC and CPU-C of the 4 algorithms on UCI standard datasets

Dataset              Algorithm 1           Algorithm 2           Algorithm 3           Algorithm 4
                     SSEC      CPU-C/s     SSEC      CPU-C/s     SSEC      CPU-C/s     SSEC      CPU-C/s
Pen-based            6298.23   13.691      6717.08   13.083      5916.74   12.626      5495.61   11.764
Shuttle              2965.14   11.972      3117.25   10.836      2403.87   10.491      2214.66   10.037
Magic                2584.62   10.131      2615.54   9.451       2478.77   9.077       2398.23   8.235
Letter Recognition   1098.25   15.799      1311.87   14.082      1027.25   13.221      936.48    11.943
Statlog              2574.18   14.821      2724.83   13.187      2315.25   11.842      2080.54   10.294
Publication history
  • Received: 2022-08-05
  • Revised: 2023-03-10
  • Published online: 2022-10-25
