Effect of stemming on text similarity for Arabic language at sentence level

Bibliographic Details
Title: Effect of stemming on text similarity for Arabic language at sentence level
Authors: Anwer Mustafa Hilal, Hikmat A. M. Abdeljaber, Mohammad Alhawarat
Source: PeerJ Computer Science
PeerJ Computer Science, Vol 7, p e530 (2021)
Publisher Information: PeerJ, 2021.
Publication Year: 2021
Subject Terms: Word embedding, General Computer Science, Computer science, Data Mining and Machine Learning, 02 engineering and technology, computer.software_genre, Lemmatization, Naive Bayes classifier, Similarity (network science), Artificial Intelligence, Stemming, Machine learning, 0202 electrical engineering, electronic engineering, information engineering, Semantic text similarity, tf–idf, business.industry, Natural language processing, Lemmatisation, TF-IDF, 020206 networking & telecommunications, QA75.5-76.95, Natural Language and Speech, Computational Linguistics, Support vector machine, Stochastic gradient descent, Electronic computers. Computer science, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Sentence
Description: Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Ten algorithms are used in total, comprising several Arabic light and heavy stemmers as well as lemmatization algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity under different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves better Pearson correlation results. The best results were attained with the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best improvement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, two stemmers, the Khoja heavy stemmer and the AlKhalil light stemmer, produce worse results than the original text.
ISSN: 2376-5992
Open Access: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::05cb1f6c5b64f7d8d33b32b5a0146ad5
https://doi.org/10.7717/peerj-cs.530
Rights: OPEN
Accession Number: edsair.doi.dedup.....05cb1f6c5b64f7d8d33b32b5a0146ad5
Database: OpenAIRE
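
The description above outlines a pipeline of stemming, similarity scoring and evaluation by Pearson correlation. The following is a minimal Python sketch of that idea, not the authors' method. The affix lists and the function `toy_light_stem` are hypothetical stand-ins for a real light stemmer such as ARLSTem or Farasa, plain bag-of-words cosine similarity is assumed in place of the paper's features and trained models, and the sentence pair is invented for illustration.

```python
import math
from collections import Counter

# Hypothetical affix lists for a toy light stemmer. Real Arabic stemmers
# (ARLSTem, Farasa, Tashaphyne, ...) apply far richer rules than this.
PREFIXES = ("وال", "بال", "ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def toy_light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def bag_of_words(sentence, stem):
    """Whitespace-tokenize a sentence and count terms, optionally stemmed."""
    words = sentence.split()
    if stem:
        words = [toy_light_stem(w) for w in words]
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_similarity(s1, s2, stem):
    """Similarity of two sentences with or without the toy stemmer."""
    return cosine(bag_of_words(s1, stem), bag_of_words(s2, stem))


def pearson(xs, ys):
    """Pearson correlation between predicted and reference scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Invented example pair: the same sentence with and without the
    # definite article on its nouns.
    s1 = "الولد يلعب في الحديقة"
    s2 = "ولد يلعب في حديقة"
    print("raw    :", sentence_similarity(s1, s2, stem=False))
    print("stemmed:", sentence_similarity(s1, s2, stem=True))
```

For this pair the raw similarity is 0.5, because only two of the four surface forms match, while the stemmed similarity is 1.0 once the definite article and the feminine ending are stripped. In an evaluation like the one described, `pearson` would be applied to the predicted scores and the gold similarity scores of a test set, once for each preprocessing variant.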