
Xhosa Single-Document Text Summarization Using an Advanced Xhosa Stemmer Algorithm
Zukile Ndyalivana, Zelalem Shibeshi
Department of Computer Science
University of Fort Hare, P. O. Box X1314, Alice 5700, RSA
Tel: +27 40 6022464, Fax: +27 40 6022464
Email: {zndyalivana, zshibeshi}

Abstract – The rapid spread of electronic textual information has made it cumbersome for users to find pertinent information that is of practical use to them. Textual information can take the form of scientific abstracts, movie previews, reviews, and news articles. Summarizing these texts can help users access their content faster and more reliably. However, performing this task manually is difficult and time consuming. Automatic text summarization is a dependable solution to this problem. It is an approach that selects the substantial content of a document and connects it into a short summary. In the literature, weight-based, foci-based, and machine learning techniques have been suggested. In this paper, the authors propose automatic summarization of Xhosa news articles. The proposed method extracts relevant sentences, which conveys the main idea of the input document in concise form. Sentences are ranked by assigning a weight value to the individual words of each sentence, and the highest-ranked sentences are extracted. From these, a meaningful summary of the given text is produced.


Index Terms – Automatic text summarization, single document summarization, sentence extraction, isiXhosa language.

I, Zukile Ndyalivana, declare that the contents of this document belong to me. This document has never been submitted or published.
The growth of textual information available electronically has made it difficult for users to obtain information that is potentially of interest to them. In this kind of situation, users are subjected to information overload. In almost every language, text in any specific domain is written in full detail, so users are obliged to read unnecessary details that they are not interested in.

Xhosa text readers are also susceptible to this issue. Many domains produce large volumes of textual information, which require summarization to save readers' time. Examples include the large volumes of news texts and online news articles produced by media agencies, reports from government offices, etc.

Today, newspapers and other news releases in the language reach readers from many sources. A large number of media organizations and presses release news in electronic and non-electronic form. Given the lack of automatic text summarization services for the Xhosa language that could reduce the time readers spend searching and reading, it can be shown that readers spend more time than they should going through content that they are not even interested in. The work presented in this paper serves as a contribution towards developing natural language processing applications for the isiXhosa language.

Specifically, this work extends the scope of text summarization research by exploring its usefulness for the isiXhosa language. We focus on techniques built with the Natural Language Toolkit (NLTK), making use of its tokenization modules. Term frequency and sentence position were used to assign weights to the sentences to be extracted for the summary. An advanced stemmer for the isiXhosa language, which strips words to their root form, was also used.

IsiXhosa, as it is referred to by its speakers, is one of the major South African languages and is spoken in certain provinces of the country. It is one of the country's eleven official languages. According to Stats SA, approximately 18% of the entire population speaks Xhosa. It is widely spoken in the eastern and western parts of the country. The language is closely related to isiZulu and is the second most widely spoken language in the country. Niesler, Louw, and Roux (2005) stated that, since 1857, the term 'Bantu' has been studied in language lessons.

IsiXhosa is used as a medium of instruction from the lower grades to the higher grades in schools throughout the Eastern Cape and Western Cape regions. The language is taught as a subject in many schools in the Eastern Cape. There is an adequate body of written work, newspapers online and in print, and religious documents published and available in the language.

IsiXhosa is one of the Nguni languages, like siSwati and Ndebele, which share similar phonetic structures. Because of this, these languages form part of a larger class referred to as the Bantu languages. IsiXhosa has numerous dialects, for example Mpondo and Gcaleka, and it has a Latin-based orthography. One of its notable features is that it is a tonal language (Pascoe, M., and Smouse, M. 2012).

Research on text summarization can be traced back to the 1950s, when the first extractive system was developed by Luhn H. P. (1958). He proposed that words appearing many times in a text give a good idea of the content of the document, although some words appear very often without being content bearing. He therefore tried to cut off these words by determining a fixed threshold. Luhn's idea was acknowledged and used in many automated information-processing systems. His system is a domain-specific single-document summarizer for technical articles. It uses features such as term filtering and word frequency (low-frequency words are removed). Sentences are weighted by the significant terms they contain.

Edmondson H. P (1969) extended the work of Luhn H. P. (1958). Edmondson carefully outlined the principles of human extraction and noticed that the location of a sentence in a text gives some clue about the importance of the sentence.
As a result, the author suggested word frequency, cue phrases, title and heading words, and sentence location as extraction features. Like the system of Luhn H. P. (1958), Edmondson's system is single-document and domain specific (it deals specifically with technical articles). Furthermore, the system outputs an extract in the form of a summary.

GreekSum is an automatic text summarizer created by Pachantouris G. and Dalianis H. (2005) for the Greek language. It is built on the architecture of SweSum (Dalianis, n.d.), a text summarizer for Swedish. According to Pachantouris G. and Dalianis H. (2005), a few changes had to be made to accommodate the differences between Greek and Swedish. A language-independent version of SweSum called Generic (without a Greek keyword dictionary) and the modified version of the summarizer for Greek, called GreekSum, were compared. The authors carried out a subjective evaluation, in which they found that using the Greek keyword dictionary in GreekSum gave a substantial improvement in performance (a 16 percent improvement compared to the system that did not use a dictionary).

The work by Hassel M. (1999) is an attempt to create an automatic text summarizer for the Persian language. It is an electronic text summarizer, mainly for Persian, and is based entirely on the Swedish text summarizer SweSum. The Persian summarizer summarizes Persian newspaper text/HTML in Unicode format. FarsiSum uses the same structure as SweSum (Dalianis, H., et al. 2003) with the exception of the lexicons; a few changes were made in SweSum to make it suitable for Persian texts in Unicode format. A simple stop-list is used to filter and identify the keywords in the text.

Baxendale, P. (1958) also gives insight into a feature that helps in identifying important sentences in certain parts of a document: sentence position. The author examined 200 paragraphs and found that in 85% of the paragraphs the topic sentence came first, whereas in 7% of them the topic sentence came last. Baxendale, P. (1958) thus concluded that a precise way of selecting the topic sentence would be to select one of these two.

At present, no research in automatic text summarization has been put forward for South African languages, and in particular for Xhosa text in different domains using different methods.
This work serves as a contribution towards developing natural language processing applications for South African languages. Specifically, it extends the scope of text summarization research by investigating its application to South African languages. The techniques used in this study are term frequency and the positional value of a sentence, together with language-dependent lexicons (stop words and stemming). A small corpus of 200 news items was collected and prepared.


This study proposes the development of an extractive text summarizer for isiXhosa. An iterative methodology was used to achieve this. As this research was carried out on isiXhosa news text summarization, the structure of the documents was investigated. The documents were also used for testing. To complete this task, a literature review was carried out covering books, journal articles, and a few websites that publish their news in isiXhosa.
A. Preparation of Xhosa Corpus
A corpus was prepared for the evaluation of the isiXhosa text summarizer. This was necessary because there is no previous work on developing a corpus in isiXhosa, specifically a news corpus. The corpus comprises 200 news items from different sources. To collect the Xhosa news corpus, electronic versions of the articles were downloaded and converted to plain text format.

The summarizer was developed in the Python programming language. The Natural Language Toolkit (NLTK) library was used for text analysis. Python was chosen because the authors wanted to make use of the NLTK modules, which are well suited to text analysis.

Feature                                    Amount
Number of articles                         200
Number of words                            50901
Total number of sentences in the file      3289
Total number of characters in the file     50901
Total number of paragraphs in the file     1301

B. Summarization method
According to Kaili M. & Pilleriin M. (2005), research on summarization methods has, ever since its beginnings, largely depended on text extraction to produce a summary. Many summarization methods can be used to characterize text summarization.
This study adopts the extraction method for summarizing a single Xhosa news text. The extraction technique simply extracts sentences as they are from the text document and displays the summary to the reader. This type of method does not require deep linguistic analysis to construct the summary, nor do the sentences need to be rewritten.
Important sentences in a text to be summarized can be weighted using the cue phrases a sentence contains, the sentence location, and whether the sentence contains the most frequent words in the document. The sentences that attain the highest weight through a well-organized combination of extraction features are then selected, and a summary is constructed.
C. Evaluation method
Summarization evaluation is subjective by nature. It is not easy to ascertain whether a summary is good or bad. It is necessary to use both human evaluation methods and automatic (machine-based) evaluation methods when evaluating summaries.

The evaluation method used in this research is an intrinsic method. It evaluates the summary both subjectively and objectively. In the subjective evaluation, the authors look closely at aspects of linguistic quality, such as informativeness and how coherent the summary is. The Greek text summarizer (Pachantouris G. and Dalianis H., 2005) adopted the subjective evaluation method. For linguistic quality, the researchers and experts also look at the readability and fluency of the summary.

Objective evaluation, on the other hand, looks specifically at the performance of the system itself: is the system able to extract salient sentences? The performance of the system is measured using the standard metrics of recall and precision.
The evaluators of the summary (the researcher and experts) look at both the system summary and the reference summary and judge the relevancy of both summaries.
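As a rough illustration of the objective evaluation, precision and recall can be computed at the sentence level by comparing the sentences selected by the system with those in the reference summary. The sketch below assumes exact sentence matching, which the paper does not specify; the function name and the toy sentence identifiers are hypothetical and are not taken from the authors' implementation.

```python
def precision_recall(system_sentences, reference_sentences):
    """Sentence-level precision and recall of a system summary.

    Assumes sentences are compared by exact string match (an illustrative
    simplification; the paper does not state the matching rule).
    """
    system = set(system_sentences)
    reference = set(reference_sentences)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: identifiers stand in for real sentences.
system_summary = ["s1", "s2", "s4", "s7"]
reference_summary = ["s1", "s3", "s4"]
precision, recall = precision_recall(system_summary, reference_summary)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four system sentences also occur in the reference summary, so precision is 2/4 and recall is 2/3.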

Using predefined guidelines, the judges allocate a score on a predefined scale to each summary under evaluation.

The judges thus assign quantitative scores to the summaries based on various qualitative features such as content, fluency, etc. At this stage, both human evaluation methods and automatic evaluation methods are used.

The Xhosa text summarizer, named IsiXhoSum, is based on the work of Pascoe and Smouse (2012) and Luhn H. P. (1958), with modifications made to suit the requirements of the isiXhosa language. The principles are not quite identical, and many changes and improvements were made. The principles and modifications are applied by creating a Xhosa corpus (a list of text files).

The Xhosa corpus is read using PlaintextCorpusReader, a class in NLTK used for reading unprocessed text or data. The stop word list has been loaded into nltk_data and can be imported using the nltk.corpus module. For the summarizer to support isiXhosa text summarization, significant changes had to be made to the stemmer, and a new stop word list for isiXhosa had to be created.

A. Xhosa stop-word list
Stop words are words that carry no content. They play no significant role in a given text document. These words are not supposed to be stemmed. The common stop word list comprises conjunctions, prepositions, articles, and the so-called particles. The stop word list was compiled by the authors with the help of a few websites on Xhosa grammar.

Word          Meaning
Ngaphezulu    above
Ngaphantsi    under
Ngaphambi     before
emva          back
kwaye         and
phambi kwe    before the
emva kwe      behind the
Ukuze         so that
Kufuphi       nearby
Phakathi      inside
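Stop word removal itself is a simple filtering step. The following is a minimal sketch that uses only the single-word entries from the sample table above (the multi-word entries are left out, and the authors' full list is larger). The function name and the example tokens are hypothetical.

```python
# Single-word stop words from the sample table; the full list is larger.
XHOSA_STOP_WORDS = {
    "ngaphezulu", "ngaphantsi", "ngaphambi", "emva", "kwaye",
    "ukuze", "kufuphi", "phakathi",
}


def remove_stop_words(tokens, stop_words=XHOSA_STOP_WORDS):
    """Return the tokens that are not in the stop word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical token sequence, used only to exercise the filter.
tokens = ["Abantwana", "kwaye", "abazali", "emva", "kwesikolo"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Comparing lowercased tokens against a set keeps the lookup fast and makes the filter insensitive to capitalization at the start of a sentence.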

B. Xhosa stemmer
In this work, the authors used a lightweight stemmer for isiXhosa. This stemmer was originally developed by Nogwina, M; Shibeshi Z; & Mali, Z. (2014). The stemmer originally stemmed only nouns, but some simple enhancements were made for better coherence in the summary. It now also handles a selected range of Xhosa verbs, because nouns and verbs give meaning to the whole text document. The stemmer used in this research is strongly influenced by the Porter stemmer and includes a predefined list of nouns and verbs. The original stemmer was developed strictly in Java; for ease of access, it has been converted to the Python programming language (Nogwina, M; Shibeshi Z; & Mali, Z. 2014).

Word            Stem        Meaning
isiXhosana      isiXhosa    Xhosa
umlambokazi     umlambo     river
isikhomokazi    isikhomo    gender
indlwana        indlw       house
isimbonono      isimbono    habit
ixhegokazi      ixhego      old man
intokazi        into        thing, something


The Xhosa text summarizer adopts sentence extraction as its basis. The process takes three major steps: (i) pre-processing, (ii) sentence ranking, and (iii) summary generation.

A. Pre-Processing
Like other text summarizers, IsiXhoSum also uses pre-processing tasks to prepare the document for processing. The pre-processing step includes tokenization, stop word removal, and stemming.

Using a stemmer, a word is split into its stem and affix. A stripped affix may be replaced by another affix or by white space, according to the rule it matches. The design of a stemmer depends heavily on the specific language and therefore requires substantial linguistic knowledge of that language. A typical simple stemming algorithm eliminates suffixes using a list of frequent suffixes, whereas a more advanced one uses morphological knowledge to derive a stem from a word. Since the Nguni languages are highly inflectional, stemming is a vital tool when calculating word frequencies.
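The simple approach described above, eliminating suffixes from a list of frequent suffixes, can be sketched as follows. The suffix list, the minimum stem length, and the function name are assumptions made for illustration only. This is not the authors' stemmer, which relies on its own rules and a predefined list of nouns and verbs.

```python
# Illustrative suffix list only (assumed for this sketch).
SUFFIXES = ("kazi", "ana")


def strip_suffix(word, suffixes=SUFFIXES, min_stem_length=3):
    """Remove the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for word in ["ixhegokazi", "intokazi", "indlwana", "kwaye"]:
    print(word, "->", strip_suffix(word))
```

The minimum stem length guards against reducing short words to nothing. A more advanced stemmer would replace this single check with morphological rules.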

B. Sentence ranking
After a document has gone through all the pre-processing steps, it is broken down into a set of sentences, which are then ranked. Ranking takes two important features into account: term frequency and sentence position.

Term frequency is the frequency with which a keyword occurs in a particular text document. It is the earliest known method used for automatic text summarization since research began in this area. The method is based on the notion that the most relevant sentences are those that contain the largest number of the most frequent words in the document (Luhn H. P. 1958). Luhn H. P. (1958) further states that this holds only when stop words are excluded from the document.

With the TF (term frequency) method, each sentence s is assigned an importance value (score), IVs.
IVs is simply the total importance score of the sentence and is based on tf, the term frequency of its words.
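One plausible reading of the importance value is the sum, over the words of a sentence, of each word's frequency in the whole document. The sketch below implements that reading. It is an assumption rather than the authors' exact formula, and the function name and the toy document are hypothetical.

```python
from collections import Counter


def importance_values(sentences):
    """Score each sentence as the sum of document-level term frequencies.

    `sentences` is a list of token lists that have already been
    stop-word filtered and stemmed.
    """
    tf = Counter(token for sentence in sentences for token in sentence)
    return [sum(tf[token] for token in sentence) for sentence in sentences]


# Hypothetical pre-processed document with three sentences.
doc = [["umntu", "funda"], ["funda", "incwadi", "funda"], ["incwadi"]]
scores = importance_values(doc)
print(scores)
```

In the toy document "funda" occurs three times, "incwadi" twice, and "umntu" once, so the second sentence, which contains the frequent words most often, receives the highest score.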

The positional value (score) of a sentence s is calculated so that the first sentence of a text receives the highest score and takes the first position, while the last sentence receives the lowest score and comes last. The positional value is joined with the term frequency score, so that two parameters are used for sentence ranking.
As a result, the total importance value (score) of a given sentence s, TIVs, depends on a constant multiplicative factor c.
The value of c is 2 for the first sentence of the first paragraph and 1.6 for the first sentence of each of the other paragraphs. All other sentences are weighted by the term frequency score alone. TIVs is thus the total significance score of a sentence and is based on term frequency and positional value.
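The description of c suggests that the positional factor scales the term frequency score. A minimal sketch under that assumption is given below: TIV = c * IV, with c = 1 for sentences that are weighted by term frequency alone. The multiplicative form and the function name are assumptions; only the values 2 and 1.6 come from the text.

```python
def total_importance(iv, paragraph_index, sentence_index):
    """Combine a term frequency score with a positional factor c.

    Assumed reading: TIV = c * IV, with c = 2 for the first sentence of the
    first paragraph, c = 1.6 for the first sentence of any other paragraph,
    and c = 1 (term frequency only) for all remaining sentences.
    """
    if sentence_index == 0:
        c = 2.0 if paragraph_index == 0 else 1.6
    else:
        c = 1.0
    return c * iv


print(total_importance(10, 0, 0))  # first sentence of the first paragraph
print(total_importance(10, 1, 0))  # first sentence of a later paragraph
print(total_importance(10, 1, 2))  # any other sentence
```

Indices are zero-based here, so `sentence_index == 0` marks the first sentence of a paragraph.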

C. Summary Generation

After the sentences are ranked, a summary is produced by selecting the K top-ranked sentences according to their scores; the value of K is provided by the user. The selected sentences are then reordered according to the order in which they appear in the text. For example, if a selected sentence is the first in the original text, it will appear first in the summary.
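The selection and reordering step can be sketched in a few lines: rank the sentence indices by score, keep the top K, and sort the kept indices to restore document order. The function name, the placeholder sentences, and the scores are hypothetical.

```python
def generate_summary(sentences, scores, k):
    """Pick the k highest-scoring sentences and return them in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore the original order
    return [sentences[i] for i in chosen]


sentences = ["S0", "S1", "S2", "S3"]  # placeholders for real sentences
scores = [8.0, 3.0, 9.5, 5.0]         # hypothetical total importance values
summary = generate_summary(sentences, scores, k=2)
print(summary)
```

Although S2 has the highest score, S0 comes first in the output because it appears earlier in the original text.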

D. Our Algorithm
Firstly, the input file should be in .txt format
Split the text into paragraphs and sentences
Remove non-letter characters
Tokenize words using the tokenize module (Natural Language Toolkit)
Remove stop words using a Xhosa stop word list
Stem words using a lightweight Xhosa stemmer
Rank the individual terms using the above formulas.
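The first steps of the algorithm above (splitting into paragraphs and sentences, removing non-letter characters, and tokenizing) can be sketched with the standard library alone. The authors use the NLTK tokenize module; the regular expressions below are a simplified stand-in chosen for this illustration, and the function name and sample text are hypothetical.

```python
import re


def preprocess(text):
    """Split text into paragraphs, sentences, and lowercase word tokens.

    Returns a list of paragraphs; each paragraph is a list of sentences;
    each sentence is a list of tokens.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        tokenized = []
        for sentence in sentences:
            letters_only = re.sub(r"[^A-Za-z\s]", " ", sentence)
            tokenized.append(letters_only.lower().split())
        result.append(tokenized)
    return result


sample = "Molo mhlaba. Unjani namhlanje?\n\nNdiphilile, enkosi!"
structure = preprocess(sample)
print(structure)
```

Keeping the paragraph structure in the output makes it possible to tell later which sentences open a paragraph, which the positional weighting needs.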

To test our Xhosa text summarizer, a collection of 200 news documents was gathered from an online Xhosa newspaper. The documents were downloaded and saved in text file format. The authors consider only one reference summary per document in the corpus for evaluation. A system-generated summary is evaluated by comparing it to the reference summary. A fixed summary percentage of 50% is used for automatic summarization, which reduces the text to half of its original length.

FIGURE 1 shows the interface of the Xhosa text summarizer and is followed by TABLE 4, which shows the results of the English text summarizer. TABLE 5 shows the results from our text summarizer, IsiXhoSum. FIGURE 2 shows the relevancy of our system relative to manually summarized text and to the summaries made by the English text summarizer.


Text ID    Original Length    Summary Length    Summary Ratio
Text 1     1422               638               55.1
Text 2     1472
Text 3     2954               853               71.1
Text 4     1547               866               44.0
Text 5     1555
Text 5     1874               814               56.6
Text 6     2044               865               57.7
Text 7     2282               829               63.7
Text 8     1656               864               47.8
Text 9     2285               558               75.6
Text 10    1865               899               51.8
Text 11    2034               595               70.7
Text 12    2171               807               62.8
Text 13    2584               938               63.7
Text 14    1422               638               55.1
Text 15    1472

Text ID    Original Length    Summary Length    Summary Ratio
Text 1     2572               855               66.7
Text 2     2160               909               57.9
Text 3     2574               855               66.7
Text 4     2166               226               89.5
Text 5     4359               1587              63.5
Text 5     2279               598               73
Text 6     3046               661               78.2
Text 7     1650               329               80.0
Text 8     2280               932               59.1
Text 9     2040               706               65.3
Text 10    1862               836               55.1
Text 11    1864               232               87.5
Text 12    1549               173               88.8
Text 13    4070               424               89.5
Text 14    2915               191               93.4
Text 15    2572               855               66.7
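The summary ratio column in the tables above appears consistent, up to rounding, with the percentage by which the summary is shorter than the original text. This is an inference from the listed values rather than a definition given by the authors, and the function name below is hypothetical.

```python
def summary_ratio(original_length, summary_length):
    """Percentage by which the summary is shorter than the original text."""
    return round((1 - summary_length / original_length) * 100, 1)


# Rows taken from the first results table.
print(summary_ratio(1422, 638))
print(summary_ratio(1547, 866))
```

For the rows with lengths 1422 and 638, and 1547 and 866, this reproduces the listed ratios of 55.1 and 44.0.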


This study makes use of the extraction method for isiXhosa text summarization. Sentences are extracted according to their weight while maintaining their original order. The first sentence is kept on the notion that every first sentence carries some significance and should therefore be given first priority.

The summarization method used is extraction based. When important sentences are extracted, it is possible that one sentence contains a proper noun and the following sentence contains a pronoun that refers back to it.

In this scenario, if the system selects the second sentence when constructing a summary and omits the first one, the meaning of that sentence is lost. This problem is not unique to this study; it is a major problem in the field of automatic text summarization. Addressing it is part of our future work.

This work is based on the research undertaken within the Telkom CoE in ICTD supported in part by Telkom SA, Tellabs, Saab Grintek Technologies, Easttel and Khula Holdings, THRIP, GRMDC and National Research Foundation of South Africa (UID: 86108). The opinions, findings and conclusions or recommendations expressed here are those of the authors, and none of the above sponsors accepts any liability whatsoever in this regard.


Statistics South Africa. (2012). Statistics South Africa | The South Africa I Know, The Home I Understand. [Online]. Accessed 30 Jul. 2018.

Baxendale, P. (1958). Machine-made index for technical literature – an experiment. IBM Journal of Research Development, 2(4):354-361.

Dalianis, H., M. Hassel, J. Wedekind, D. Haltrup, K. de Smedt and T.C. Lech. (2003), “Automatic text summarization for the Scandinavian languages.” In Holmboe, H. (ed.) Nordisk Sprogteknologi.

Edmondson H. P (1969), “New Methods in Automatic Extracting,” Computing, vol. 16, no. 2, pp. 264-285.

Hassel M. (1999) “Farsi Sum – A Persian text summarizer,” Cognitive Science, pp. 2-4.

Kaili M. & Pilleriin M. (2005), “ESTSUM – Estonian newspaper texts summarizer”, Proceedings of The Second Baltic Conference on Human Language Technologies, pp. 311-316.

Luhn H. P. (1958), “The Automatic Creation of Literature Abstracts”, pp. 159-165.

Niesler, T., Louw, P., & Roux, J. (2005). Phonetic analysis of Afrikaans, English, Xhosa, and Zulu using South African speech databases. Southern African Linguistics and Applied Language Studies, 23(4), 459–474.

Nogwina, M; Shibeshi Z; & Mali, Z. (2014). Towards Developing a Stemmer for the isiXhosa language. SATNAC Conference 2014, 31 August -03 September 2014, Boardwalk, Port Elizabeth, Eastern Cape, South Africa.

Pachantouris G. and Dalianis H. (2005), “GreekSum A Greek Text Summarizer,” Word Journal of the International Linguistic Association, pp. 1-45.

Pascoe, M., & Smouse, M. (2012). Masithethe: speech and language development and difficulties in isiXhosa. South African medical journal = Suid-Afrikaanse geneeskunde. 102(6), 469–71.

Zukile Ndyalivana received his Honors degree in 2014 from the University of Fort Hare and has just completed his Master of Science degree at the same institution. His research interests include Natural Language Processing (NLP), Social Media, and Web development.
