Vietnamese News Classification based on BoW with Keywords Extraction and Neural Network
Toan Pham Van
Framgia Inc. R&D Group
13F Keangnam Landmark 72 Tower
Plot E6, Pham Hung, Nam Tu Liem, Ha Noi
pham.van.toan@framgia.com
Ta Minh Thanh
Dept. of Network Technology
Le Quy Don Technical University
236 Hoang Quoc Viet, Cau Giay, Ha Noi
thanhtm@mta.edu.vn
Abstract
Text classification (TC) is a primary application of Natural Language Processing (NLP). While many research efforts exist for classifying text documents using methods such as Random Forest, Support Vector Machines, and Naive Bayes, most are applied to English. Research on Vietnamese text classification remains limited. This paper proposes methods to address Vietnamese news classification problems using a Vietnamese news corpus. By employing Bag of Words (BoW) with keyword extraction and Neural Network approaches, a machine learning model was trained that achieved an average accuracy of approximately 99.75%. The study also analyzes the merits and demerits of each method to identify the best one for this task.
Keywords: Vietnamese Keywords Extraction, Vietnamese News Categorization, Text Classification, Neural Network, SVM, Random Forest, Natural Language Processing.
I. Introduction
Text classification is a machine learning problem that involves labeling a text document with categories from a predefined set. The goal is to build a system that can automatically label incoming news stories with a topic from a set of categories $C = \{c_1, \ldots, c_m\}$. With advancements in hardware, TC has become a crucial subfield of NLP.
This paper applies popular multiclass classification algorithms such as Naive Bayes, Random Forest, and multiclass SVM to Vietnamese text and compares their accuracy with a custom Neural Network. A key challenge in processing Vietnamese compared to English is word boundary identification, as Vietnamese word boundaries are not always space characters. The process of recognizing these linguistic units is called word segmentation, which is a critical step in text preprocessing. Inaccurate word segmentation leads to low accuracy in keyword extraction and, consequently, to wrong classification. After keyword extraction, a dictionary is created and used to train the classification model.
II. Related Works
A. Text Classification
TC assigns documents to one or more predefined categories. Modern TC methods use a predefined corpus for training. Features are extracted for each text category, and a classifier estimates similarities between texts to predict the category. State-of-the-art methods for English processing include Naive Bayes (NB), Support Vector Machine (SVM), and Convolutional Neural Network (CNN).
B. Vietnamese Corpus
While standard corpora like Reuters and 20 Newsgroups are available for English, Vietnamese datasets are often restricted and small. This research uses a comprehensive Vietnamese corpus created by Vu Cong Duy and colleagues, which was constructed from four well-known Vietnamese online newspapers. The dataset contains a training set of 33,759 documents and a testing set of 50,373 documents across 10 main topics.
C. Keyword Extraction
Keyword extraction is a vital technique for text classification. It involves finding distinctive, non-stop-word terms and ordering them by frequency. This paper uses the top ten keywords to calculate a keyword score and build a dictionary of keywords from the corpus.
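The paper does not spell out the exact formula of the keyword score, so the following Python sketch only illustrates the frequency-ranking part of the idea: after segmentation and stop-word removal, count the remaining tokens and keep the ten most frequent ones. The function name `top_keywords` and the toy documents are illustrative, not taken from the paper.

```python
from collections import Counter

def top_keywords(segmented_docs, stop_words, k=10):
    """Return the k most frequent non-stop-word tokens across the given documents."""
    counts = Counter()
    for tokens in segmented_docs:
        counts.update(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(k)]

# Toy pre-segmented documents (multi-syllable words joined by underscores).
docs = [["học_sinh", "đạt", "giải", "và", "học_sinh"],
        ["giáo_viên", "và", "học_sinh", "thi", "quốc_gia"]]
print(top_keywords(docs, stop_words={"và"}))  # e.g. ['học_sinh', 'đạt', ...]
```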
D. Feature Selection
- Bag of Words (BoW) approach: This is a common method for representing text documents, where a document is described as a set of words with their associated frequencies, independent of word order (a minimal sketch follows this list).
- Word Segmentation: A robust word segmentation method is crucial for document classification in Vietnamese. The study uses vnTokenizer for this purpose.
- Stop-words Removal: Common words that carry no class-specific information (e.g., "và", "bị") are removed. A manually collected list of about 2,000 stop words was used.
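As a minimal sketch of these three steps combined, the snippet below assumes documents have already been word-segmented (the paper uses vnTokenizer) with multi-syllable words joined by underscores, and uses scikit-learn's CountVectorizer to remove stop words and build the BoW frequency matrix; the two example sentences and the two-entry stop-word list are only illustrative stand-ins for the corpus and the ~2,000-word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Documents after word segmentation (illustrative; compound words joined by "_").
segmented_docs = [
    "học_sinh giỏi đạt giải quốc_gia",
    "đội_tuyển bóng_đá thắng trận chung_kết",
]
stop_words = ["và", "bị"]  # stand-in for the manually collected ~2,000-word list

vectorizer = CountVectorizer(
    tokenizer=str.split,   # tokens are already segmented, just split on whitespace
    token_pattern=None,    # silence the unused-pattern warning
    stop_words=stop_words,
)
X = vectorizer.fit_transform(segmented_docs)  # documents x vocabulary count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
```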
III. Text Classification Methods
After preprocessing the text and extracting numeric features from the BoW representation, supervised learning algorithms are applied.
A. Random Forest
Random Forest (RF) is a classifier consisting of a collection of tree-structured classifiers. For classification problems, each tree casts a vote and the forest outputs the class receiving the most votes; aggregating many randomized trees in this way improves prediction accuracy and controls over-fitting.
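A minimal scikit-learn sketch of this step, where the tiny arrays stand in for the BoW count matrix and topic labels; the value of n_estimators is an assumption, since the paper does not report hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the BoW matrix (rows = documents, columns = dictionary words).
X = np.array([[2, 0, 1, 0], [0, 3, 0, 1], [1, 0, 2, 0], [0, 1, 0, 2]])
y = np.array([0, 1, 0, 1])  # topic indices

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
print(rf.predict(X))  # each tree votes; the forest returns the majority class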
B. SVM
Support Vector Machines (SVMs) work by determining the optimal hyperplane that best separates the different classes. For multiclass problems, the classifier maps a feature vector to a label by finding the class that has the highest similarity score.
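A comparable sketch for the multiclass SVM, again on toy BoW counts; LinearSVC trains one linear classifier per class (one-vs-rest) and picks the class whose decision score is highest, which matches the description above. The data and regularization constant are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy BoW counts and three topic labels (illustrative only).
X = np.array([[2, 0, 1, 0], [0, 3, 0, 1], [1, 0, 2, 0], [0, 1, 0, 3]])
y = np.array([0, 1, 0, 2])

svm = LinearSVC(C=1.0)               # one-vs-rest linear SVM
svm.fit(X, y)
scores = svm.decision_function(X)    # one score per class
print(scores.argmax(axis=1))         # predicted label = class with highest score
```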
C. Neural Network (NN)
The proposed Neural Network architecture consists of neurons that receive a set of inputs (the BoW feature vector) and use a set of weights to compute an output. This study employs a multi-layered feed-forward neural network with six hidden layers using the tanh activation function, optimized with stochastic gradient descent. The input layer corresponds to the BoW feature vector, and the output layer represents the document's label vector.
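The paper does not publish its network code, so the snippet below is only a rough sketch of those stated choices (six hidden layers, tanh activation, SGD) using scikit-learn's MLPClassifier; the layer width of 128, the learning rate, and the toy data are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy BoW features and topic labels (illustrative stand-ins for the real corpus).
X = np.array([[2, 0, 1, 0], [0, 3, 0, 1], [1, 0, 2, 0], [0, 1, 0, 2]])
y = np.array([0, 1, 0, 1])

nn = MLPClassifier(
    hidden_layer_sizes=(128,) * 6,   # six hidden layers; the width is assumed
    activation="tanh",               # tanh activation as described above
    solver="sgd",                    # stochastic gradient descent
    learning_rate_init=0.01,
    max_iter=500,
    random_state=0,
)
nn.fit(X, y)
print(nn.predict(X))
```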
IV. Result
The classification models were evaluated using precision, recall, and F1-score. The proposed keyword extraction with BoW method (KEBOW) was compared against the N-gram method and other machine learning algorithms like SVM and Random Forest. The results showed that the KEBOW feature selection method was more effective than other methods on the same dataset.
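As a small illustration of these evaluation metrics, scikit-learn's classification_report computes per-class precision, recall, and F1 from predicted versus true labels; the label strings below are made up for the example.

```python
from sklearn.metrics import classification_report

# Illustrative gold and predicted topics for a handful of test documents.
y_true = ["thể thao", "giáo dục", "pháp luật", "thể thao", "giáo dục"]
y_pred = ["thể thao", "giáo dục", "thể thao", "thể thao", "giáo dục"]

print(classification_report(y_true, y_pred, zero_division=0))
```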
The Neural Network's performance was compared with other algorithms, as shown in the table below.
TABLE I: Accuracy Comparison Result
| Dataset | SVM | Random Forest | SVC | Neural Network |
|---|---|---|---|---|
| 10 Topics Dataset | 0.9652 | 0.9921 | 0.9922 | 0.9975 |
| 27 Topics Dataset | 0.9780 | 0.9925 | 0.9965 | 0.9969 |
V. Conclusion and Future Works
The research proposed a new neural network architecture that achieved an average accuracy of 99.75% for Vietnamese text classification, outperforming methods like SVM and Random Forest on the same dataset. This result confirms the effectiveness of the proposed feature selection method combining keyword extraction and BoW.
Identified limitations include:
- The stop-words list was built subjectively.
- The corpus has ambiguities between topics.
- Word segmentation is limited by a third-party library.
Future work will focus on improving the Neural Network's accuracy, addressing preprocessing disadvantages, and incorporating more semantic and contextual features.
Application of Research
The results of this research were applied in Viblo, a technical knowledge-sharing service, to automatically classify posts upon publication.
References
[1] B. Alexander, S. Thorsen, "A sentiment-based chat bot," 2013.
[2] R. J. Mooney, L. Roy, "Content-based book recommending using learning for text categorization," Proc. of the 5th ACM Conference on Digital Libraries, ACM, 2000.
[3] D. Dinh, V. Thuy, "A maximum entropy approach for Vietnamese word segmentation," International Conference on Research, Innovation and Vision for the Future, IEEE, 2006.
[4] D. Dien, H. Kiem, N. V. Toan, "Vietnamese Word Segmentation," Proc. of the 6th Natural Language Processing Pacific Rim Symposium, Tokyo, Japan, pp. 749-756, 2001.
[5] Y. Yang, X. Liu, "A re-examination of text categorization methods," Proc. of the 22nd Annual International ACM SIGIR Conference, Berkeley, pp. 42-49, August 1999.
[6] F. Sebastiani, "Machine learning in automated text categorisation: a survey," Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, 1999.
[7] Y. Yang, "Expert network: effective and efficient learning from human decisions in text categorization and retrieval," Proc. of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, Dublin, IE, pp. 13-22, 1994.
[8] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. of ECML-98, 10th European Conference on Machine Learning, no. 1398, pp. 137-142, 1998.
[9] X. Zhang, J. Zhao, Y. LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, 2015.
[10] H. V. C. Duy, et al., "A comparative study on Vietnamese text classification methods," International Conference on Research, Innovation and Vision for the Future, 2007.
[11] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys (CSUR), vol. 34, no. 1, pp. 1-47, 2002.
[12] H. Nguyen, H. Nguyen, T. Vu, N. Tran, K. Hoang, "Internet and Genetics Algorithm-based Text Categorization for Documents in Vietnamese," Proc. of the 4th IEEE International Conference on Computer Science, Research, Innovation and Vision of the Future, 2005.
[13] D. Gunawan, et al., "Automatic Text Summarization for Indonesian Language Using TextTeaser," IOP Conference Series: Materials Science and Engineering, vol. 190, no. 1, 2017.
[14] L. N. Minh, et al., "VNLP: an open source framework for Vietnamese natural language processing," Proc. of the Fourth Symposium on Information and Communication Technology, 2013.
[15] L. Breiman, "Random forests," UC Berkeley TR567, 1999.
[16] V. Vapnik, "Estimation of Dependences Based on Empirical Data," Springer, 1982.
[17] C. Cortes, V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[18] K. Crammer, Y. Singer, "On the algorithmic implementation of multiclass kernel-based vector machines," Journal of Machine Learning Research, pp. 265-292, 2001.
[19] G. Ou, Y. L. Murphey, "Multi-class pattern classification using neural networks," Pattern Recognition, vol. 40, no. 1, pp. 4-18, 2007.
[20] X. Yin, et al., "A flexible sigmoid function of determinate growth," Annals of Botany, vol. 91, no. 3, pp. 361-371, 2003.
[21] X. Glorot, A. Bordes, Y. Bengio, "Deep sparse rectifier neural networks," Proc. of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[22] L. Bottou, "Large-scale machine learning with stochastic gradient descent," Proc. of COMPSTAT 2010, pp. 177-186, 2010.
[23] B. Karlik, A. V. Olgac, "Performance analysis of various activation functions in generalized MLP architectures of neural networks," International Journal of Artificial Intelligence and Expert Systems, vol. 1, no. 4, pp. 111-122, 2011.
[24] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[25] A. M. Salih, et al., "Modified extraction 2-thiobarbituric acid method for measuring lipid oxidation in poultry," Poultry Science, vol. 66, no. 9, pp. 1483-1488, 1987.