Multi-label classification for mineral deposit description texts based on machine learning

ZHAO Kai; YE Dan

doi:10.12075/j.issn.1004-4051.20241054

Abstract

To classify mineral deposit description texts by multiple labels efficiently and accurately, and to make it easier to extract fine-grained knowledge from large volumes of text, a purpose-built labeled dataset and machine learning models are needed. Using 17 content labels, such as geographic location, metallogenic zone, and orebody geology, 13,411 sentences from Geology of Mineral Resources in China: Overview of Typical Mineral Deposits were manually annotated to build a multi-label classification dataset for mineral deposit description texts. The multi-label classification process is divided into three steps: tokenization (splitting text into feature units), text vectorization, and classification. Different methods are applied at each step, yielding 30 machine learning classification models, whose classification performance is tested and compared on the labeled dataset. The experimental results show that fine-tuned BERT combined with an FNN classifier achieves a weighted F1 score of up to 0.91, outperforming the other models; TextCNN combined with a K-nearest neighbors classifier reaches a weighted F1 score of up to 0.80; and a TF-IDF bag-of-words model combined with an FNN classifier reaches a weighted F1 score of up to 0.76. When the methods in the other steps are the same, models that use characters as tokens achieve relatively higher weighted F1 scores. Machine learning models based on fine-tuned BERT can replace or assist manual multi-label classification of mineral deposit description texts. Machine learning models that use the TF-IDF bag of words are more interpretable and can be used to optimize manual classification methods.
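The three-step pipeline and the weighted F1 metric can be illustrated with a minimal sketch. This is not the authors' code. It uses only the Python standard library and assumes character-level tokens, a smoothed TF-IDF bag of words, and a simple majority-vote K-nearest neighbors classifier, a pairing chosen here for brevity (the abstract does not say whether this exact combination is among the 30 evaluated models). The toy sentences, their labels, and all function names are hypothetical, and weighted F1 is assumed to mean the support-weighted mean of per-label F1 scores.

```python
import math
from collections import Counter

def char_tokens(text):
    """Step 1: split a sentence into lower-cased single-character feature units."""
    return [ch for ch in text.lower() if not ch.isspace()]

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(doc, idf):
    """Step 2: L2-normalised TF-IDF bag-of-words vector, stored as a sparse dict."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(v * b.get(tok, 0.0) for tok, v in a.items())

def knn_predict(vec, train_vecs, train_labels, k=1):
    """Step 3: predict a label when more than half of the k nearest neighbours carry it."""
    order = sorted(range(len(train_vecs)),
                   key=lambda i: cosine(vec, train_vecs[i]), reverse=True)[:k]
    votes = Counter(lab for i in order for lab in train_labels[i])
    return {lab for lab, c in votes.items() if c * 2 > k}

def weighted_f1(true_sets, pred_sets):
    """Per-label F1 averaged with weights proportional to each label's support."""
    total = sum(len(t) for t in true_sets)
    if total == 0:
        return 0.0
    score = 0.0
    for lab in set().union(*true_sets):
        tp = sum(1 for t, p in zip(true_sets, pred_sets) if lab in t and lab in p)
        fp = sum(1 for t, p in zip(true_sets, pred_sets) if lab not in t and lab in p)
        fn = sum(1 for t, p in zip(true_sets, pred_sets) if lab in t and lab not in p)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += f1 * (tp + fn) / total
    return score

# Hypothetical toy data; the label names are borrowed from the abstract's examples.
train = [
    ("The deposit is located in the northern part of the county.",
     {"geographic location"}),
    ("The orebody is lenticular and strikes northeast.",
     {"orebody geology"}),
    ("The deposit lies in the county within the metallogenic belt.",
     {"geographic location", "metallogenic zone"}),
    ("The metallogenic belt extends along the craton margin.",
     {"metallogenic zone"}),
]
train_docs = [char_tokens(text) for text, _ in train]
train_labels = [labels for _, labels in train]
idf = fit_idf(train_docs)
train_vecs = [tfidf(doc, idf) for doc in train_docs]

# Classify one unseen (also hypothetical) sentence with its single nearest neighbour.
query = tfidf(char_tokens("The orebody is lenticular and dips southeast."), idf)
print(knn_predict(query, train_vecs, train_labels, k=1))
```

Each step is a separate function, so swapping character tokens for word tokens, or the K-nearest neighbors vote for another classifier, changes only one function. This mirrors how the abstract describes combining methods per step to obtain different models.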
