Abstract:
In deep-learning-based classification of coal mine safety hazard texts, restricted access to real industry data and the imbalanced distribution of risk categories lead to poor classification performance, which undermines enterprises' safety management decisions about individual risk points. To address these problems, this paper proposes a hybrid sampling method that combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbor (ENN) undersampling, and uses a Convolutional Neural Network (CNN) to predict the categories of safety hazard texts. Using 4,539 records from a coal industry safety risk list published on the Safety Library Website between 2019 and 2020 as a case study, the method first cleans, tokenizes, and vectorizes the safety hazard texts. It then applies the SMOTE algorithm to interpolate synthetic samples for the minority classes, balancing the number of samples across classes. Next, it applies the ENN algorithm to undersample the generated synthetic samples, removing anomalous and noisy samples. Finally, a CNN-based classifier is trained on the resampled safety hazard texts and used to predict their categories. Experimental results show that, compared with traditional baseline sampling methods, the approach improves accuracy by 4%-8% and F-Measure by 4%-7%, demonstrating its effectiveness and feasibility for imbalanced coal mine safety hazard text classification. The method has practical value for coal mine safety management and hazard warning.
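The two resampling steps can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration, not the paper's implementation: the toy 2-D points stand in for vectorized hazard texts, the labels `common_hazard` and `rare_hazard` and the helper names `nearest`, `smote`, and `enn` are hypothetical, and ENN is applied to the whole resampled set (the common SMOTE-ENN formulation), which may differ from the exact variant used in the paper. The CNN classifier is omitted.

```python
import math
import random
from collections import Counter

def nearest(idx, X, k, candidates):
    """Indices of the k candidates closest to X[idx], excluding idx itself."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(X[idx], X[j]))
    return others[:k]

def smote(X, y, k=3, seed=0):
    """Oversample each smaller class up to the size of the largest class."""
    rng = random.Random(seed)
    X, y = [list(p) for p in X], list(y)
    counts = Counter(y)
    target = max(counts.values())
    for label, n in counts.items():
        members = [i for i, lab in enumerate(y) if lab == label]
        if len(members) < 2:
            continue  # interpolation needs at least two real samples
        for _ in range(target - n):
            i = rng.choice(members)                    # a real minority sample
            j = rng.choice(nearest(i, X, k, members))  # one of its same-class neighbours
            gap = rng.random()                         # position along the segment i -> j
            X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
            y.append(label)
    return X, y

def enn(X, y, k=3):
    """Drop every sample whose label disagrees with the majority of its k neighbours."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        votes = Counter(y[j] for j in nearest(i, X, k, everyone))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (assumed): a dense majority cluster, a small minority cluster,
# and one mislabeled majority point sitting inside the minority cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0, 0.5], [0.5, 0], [1, 0.5],
     [5.3, 5.3],                      # noisy "common_hazard" sample
     [5, 5], [5, 6], [6, 5]]          # genuine "rare_hazard" samples
y = ["common_hazard"] * 9 + ["rare_hazard"] * 3

X_bal, y_bal = smote(X, y)            # step 1: balance the classes
X_clean, y_clean = enn(X_bal, y_bal)  # step 2: remove noisy samples

print("original:  ", dict(Counter(y)))
print("after SMOTE:", dict(Counter(y_bal)))
print("after ENN:  ", dict(Counter(y_clean)))
```

In this toy setup SMOTE raises the minority class to the majority size, and ENN then discards the mislabeled point at `[5.3, 5.3]` because all of its nearest neighbours belong to the other class. For real experiments, a maintained implementation such as `SMOTEENN` from the imbalanced-learn package would be a more robust choice than this hand-rolled sketch.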