Effect of stemming on text similarity for Arabic language at sentence level
Title: | Effect of stemming on text similarity for Arabic language at sentence level |
---|---|
Authors: | Anwer Mustafa Hilal, Hikmat A. M. Abdeljaber, Mohammad Alhawarat |
Source: | PeerJ Computer Science, Vol 7, p e530 (2021) |
Publisher Information: | PeerJ, 2021. |
Publication Year: | 2021 |
Subject Terms: | Word embedding, General Computer Science, Computer science, Data Mining and Machine Learning, 02 engineering and technology, computer.software_genre, Lemmatization, Naive Bayes classifier, Similarity (network science), Artificial Intelligence, Stemming, Machine learning, 0202 electrical engineering, electronic engineering, information engineering, Semantic text similarity, tf–idf, business.industry, Natural language processing, Lemmatisation, TF-IDF, 020206 networking & telecommunications, QA75.5-76.95, Natural Language and Speech, Computational Linguistics, Support vector machine, Stochastic gradient descent, Electronic computers. Computer science, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Sentence |
Description: | Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Ten algorithms are used in this study, comprising several Arabic light and heavy stemmers as well as lemmatization algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayesian (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves enhanced Pearson correlation results. The best results were attained when using the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence text similarity for the Arabic language. However, two stemmers make results worse than those for the original text: the Khoja heavy stemmer and the AlKhalil light stemmer. |
ISSN: | 2376-5992 |
Open Access: | https://explore.openaire.eu/search/publication?articleId=doi_dedup___::05cb1f6c5b64f7d8d33b32b5a0146ad5 https://doi.org/10.7717/peerj-cs.530 |
Rights: | OPEN |
Accession Number: | edsair.doi.dedup.....05cb1f6c5b64f7d8d33b32b5a0146ad5 |
Database: | OpenAIRE |
---|
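
The evaluation idea summarized in the description (score sentence pairs with and without stemming, then compare the scores to gold labels using Pearson correlation) can be sketched in a few lines. This is a minimal illustration only, not the authors' pipeline: the affix lists and the `light_stem` function are hypothetical stand-ins for real stemmers such as ARLSTem or Farasa, the similarity measure is a plain bag-of-words cosine, and no SVM, SGD or NB model is trained.

```python
import math
from collections import Counter

# Hypothetical affix lists, for illustration only. Real light stemmers
# (ARLSTem, Farasa, ...) apply far richer rule sets.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def light_stem(token):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_similarity(s1, s2, stem=False):
    """Whitespace-tokenize both sentences, optionally stem, then compare."""
    t1, t2 = s1.split(), s2.split()
    if stem:
        t1 = [light_stem(t) for t in t1]
        t2 = [light_stem(t) for t in t2]
    return cosine(t1, t2)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```

To compare stemmed and unstemmed text, one would compute `sentence_similarity` for every pair in a gold-scored data set under both settings and pass each score list, together with the gold scores, to `pearson`.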