Research on Text Classification of Coal Mine Safety Hazards Based on SMOTE+ENN

    • Abstract: When deep learning classifiers are used to analyze coal mine safety hazard texts, two properties of real industry data degrade classification performance: the data are closed (not publicly accessible), and the risk categories are unevenly distributed. Poor classification in turn hampers enterprises' safety management decisions for the various risk points. To address this, the paper proposes a hybrid sampling method that combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbor (ENN) undersampling, and uses a Convolutional Neural Network (CNN) to predict the category of each safety hazard text. The case study uses 4,539 records from a coal company's safety risk list on the Safety Library Website (安全文库网), covering 2019 to 2020. First, the hazard texts are cleaned, tokenized, and vectorized, and SMOTE generates interpolated synthetic samples for the minority classes to balance the class sizes. Then, ENN undersamples the generated synthetic samples, removing anomalous and noisy samples. Finally, a CNN-based classifier is trained on the resampled texts and used for prediction. Experimental results show that, compared with traditional baseline sampling methods, the approach improves accuracy by 4% to 8% and F-Measure by 4% to 7%. This demonstrates its effectiveness and feasibility for multi-class imbalanced coal mine safety hazard text classification, and its practical value for coal mine safety management and hazard early warning.
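The resampling steps described in the abstract (SMOTE interpolation followed by ENN cleaning) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote` and `enn`, the neighbor count `k=3`, the majority-vote ENN rule, the choice to clean the whole resampled set rather than only the synthetic samples, and the toy 2-D points standing in for vectorized hazard texts are all assumptions made for illustration. The CNN classifier is omitted.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = (j for j in range(len(points)) if j != idx)
    return sorted(others, key=lambda j: euclidean(points[idx], points[j]))[:k]


def smote(X, y, k=3, rng=None):
    """Oversample every smaller class up to the size of the largest class."""
    rng = rng or random.Random(0)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if n == target or len(members) < 2:
            continue
        for _ in range(target - n):
            i = rng.randrange(len(members))
            j = rng.choice(nearest(i, members, min(k, len(members) - 1)))
            gap = rng.random()
            # The synthetic point lies on the segment between a minority
            # sample and one of its same-class nearest neighbors.
            X_out.append([a + gap * (b - a)
                          for a, b in zip(members[i], members[j])])
            y_out.append(label)
    return X_out, y_out


def enn(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbors."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def blob(rng, cx, cy, n):
    """Hypothetical stand-in for vectorized texts: n points around (cx, cy)."""
    return [[rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)] for _ in range(n)]


# Toy three-class imbalanced data set (40 / 10 / 6 samples).
rng = random.Random(42)
X = blob(rng, 0, 0, 40) + blob(rng, 3, 3, 10) + blob(rng, 0, 4, 6)
y = [0] * 40 + [1] * 10 + [2] * 6

X_bal, y_bal = smote(X, y, k=3, rng=rng)   # balance class sizes
X_clean, y_clean = enn(X_bal, y_bal, k=3)  # remove noisy or borderline samples

print("original:  ", sorted(Counter(y).items()))
print("after SMOTE:", sorted(Counter(y_bal).items()))
print("after ENN:  ", sorted(Counter(y_clean).items()))
```

In practice a library implementation such as `SMOTEENN` from imbalanced-learn would likely be preferable to this brute-force version, which computes all pairwise distances and is only suitable for small data sets.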

       
