Abstract:
Efficient and accurate multi-label classification of mineral deposit description texts, which reduces the difficulty of extracting fine-grained knowledge from large volumes of text, requires purpose-built labeled datasets and machine learning models. Using 17 content labels, such as geographic location, metallogenic zone, and orebody geology, 13 411 sentences from Geology of Mineral Resources in China: Overview of Typical Mineral Deposits were manually classified to construct a labeled dataset for multi-label classification of mineral deposit description texts. The multi-label classification process is divided into three steps: tokenization, text vectorization, and classification computation. Different methods are adopted at each step, yielding 30 machine learning classification models, whose performance is evaluated and compared on the labeled dataset. The experimental results show that: fine-tuned BERT combined with FNN achieves a weighted F1 score of 0.91, outperforming the other models; TextCNN combined with a K-nearest neighbors classifier achieves a weighted F1 score of 0.80; TF-IDF bag of words combined with FNN achieves a weighted F1 score of 0.76; and, when the methods in the other steps are the same, models that use characters as tokens have relatively higher weighted F1 scores. Machine learning models based on fine-tuned BERT can be used to replace or assist manual multi-label classification of mineral deposit description texts. The machine learning model using TF-IDF bag of words is highly interpretable and can be used to optimize manual classification methods.
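
The abstract reports weighted F1 scores without defining the metric. The sketch below shows one common definition for multi-label data: the F1 score is computed for each label separately, then averaged with each label weighted by its support (the number of sentences that truly carry it). That the authors used exactly this weighting is an assumption. The function name `weighted_f1` and the three toy sentences are hypothetical and are not taken from the dataset.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-label F1 scores.

    y_true, y_pred: lists of label sets, one set per sentence.
    Assumes the common "weighted" averaging convention; the paper
    does not state which convention it uses.
    """
    # Labels with zero support get zero weight, so only true labels matter.
    labels = sorted(set().union(*y_true))
    weighted_sum = 0.0
    total_support = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # sentences that truly carry this label
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Hypothetical toy data: three sentences, three of the label names
# mentioned in the abstract.
y_true = [
    {"geographic location"},
    {"orebody geology", "metallogenic zone"},
    {"orebody geology"},
]
y_pred = [
    {"geographic location"},
    {"orebody geology"},
    {"orebody geology", "geographic location"},
]

score = weighted_f1(y_true, y_pred)
print(round(score, 3))  # 0.667 for this toy data
```

In this toy case "orebody geology" has a support of 2 and a per-label F1 of 1.0, so it contributes half of the total weight, while the missed "metallogenic zone" label contributes 0. Frequent labels therefore dominate a weighted score, which is worth keeping in mind when comparing the 0.91, 0.80, and 0.76 figures above.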