Building the International Corpus of Arabic (ICA)
Bibliotheca Alexandrina (BA) is one of the international Egyptian organizations that play a significant role in disseminating culture and knowledge and supporting scientific research. It initiated a leading project to build the “International Corpus of Arabic (ICA)”, an ambitious attempt to build a representative corpus of the Arabic language as it is used all over the Arab world, with the aim of supporting research on the Arabic language.
The ICA is a step-by-step guide to creating and analyzing Arabic linguistic corpora. Once finished, the analyzed version of the ICA will be the first planned analyzed corpus available as a linguistic resource for researchers. It is also the first systematic investigation of national varieties within the Arabic language, which should prove very useful for linguists who believe that their theories and descriptions of Arabic should be based on real, rather than contrived, data.
Alansary et al. started collecting the ICA in 2006. It should ultimately contain 100 million words. The samples are written Modern Standard Arabic (MSA), selected from a wide range of sources that represent a wide cross-section of regional variety in the Arabic language.
In collecting a representative corpus of the Arabic language, our main focus was to cover the same genres from different sources from all Arab countries. Therefore, the ICA includes:
1. Diverse sources: newspapers, web articles, books, etc.
Some of these sources are divided into sub-sources. For example, the source “Press” is divided into “Newspapers”, “Electronic Press” and “Magazines”, and “Magazines” is further divided into “General” and “Specialized” magazines.
2. Diverse genres: Literature, Politics, Sciences, etc.
Some genres are also divided into sub-genres. For example, the genre “Literature” is divided into “Prose”, “Poetry” and “Studies of Linguistics and Literature”, and the sub-genre “Prose” is further divided into “Novels”, “Short Stories”, “Child Stories” and “Plays”.
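The two hierarchies described above can be pictured as nested trees. The following is a minimal sketch, assuming a plain nested-dictionary representation; only the subdivisions named in the text are listed, and the names `SOURCES`, `GENRES` and `leaf_paths` are illustrative and not part of the ICA software.

```python
# Illustrative only: a nested-dict view of the categories named in the text.
# The real ICA taxonomy contains more sources and genres than shown here.
SOURCES = {
    "Press": {
        "Newspapers": {},
        "Electronic Press": {},
        "Magazines": {"General": {}, "Specialized": {}},
    },
}

GENRES = {
    "Literature": {
        "Prose": {
            "Novels": {},
            "Short Stories": {},
            "Child Stories": {},
            "Plays": {},
        },
        "Poetry": {},
        "Studies of Linguistics and Literature": {},
    },
}


def leaf_paths(tree, prefix=()):
    """Yield every path from a top-level category down to a leaf category."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


if __name__ == "__main__":
    for path in leaf_paths(GENRES):
        print(" > ".join(path))
```

Each leaf path, such as Literature > Prose > Novels, identifies one of the most specific categories a text can be filed under.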
The following are some of the criteria we bore in mind when collecting the required data:
1. Different sources and genres should be weighted in proportion to how common they are.
2. The number of categories the corpus should contain, the number of texts in each category and the number of words in each sample should all be weighed.
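To make the first criterion concrete, the sketch below splits a word budget across categories in proportion to integer weights, using a largest-remainder rule so that the parts always add up to the total. The function name and the example weights are hypothetical; the text does not state the actual proportions used for the ICA.

```python
def allocate_words(total_words, weights):
    """Split total_words across categories in proportion to integer weights.

    Words lost to integer division are handed out one at a time to the
    categories with the largest remainders (largest-remainder method).
    """
    total_weight = sum(weights.values())
    quotas, remainders = {}, {}
    for name, weight in weights.items():
        quotas[name], remainders[name] = divmod(total_words * weight, total_weight)
    leftover = total_words - sum(quotas.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas


if __name__ == "__main__":
    # Hypothetical weights, chosen only to illustrate the arithmetic.
    example_weights = {"Press": 5, "Books": 3, "Web articles": 2}
    print(allocate_words(100_000_000, example_weights))
```

With these made-up weights, a 100-million-word target would be divided 50/30/20 between the three sources.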
In designing the ICA, we tried to arrive at a design that would make searching within the corpus as economical and easy as possible. The design chosen was to break up the corpus into the different sources (books, newspapers, etc.), and subsequently break up these sources into the various genres (Literature, Sciences, etc.). In addition, a careful record of a variety of variables is kept with every text: when and where the text was written and published, its source and its genre.
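One way to picture this design is a small metadata record per text, indexed first by source and then by genre. The sketch below is an assumption about how such a record might look; the field names (`text_id`, `country`, `year`) and the function `build_index` are hypothetical and do not describe the actual ICA storage format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical per-text metadata, mirroring the variables named above."""
    text_id: str
    source: str   # e.g. "Newspapers"
    genre: str    # e.g. "Politics"
    country: str  # where the text was written and published
    year: int     # when the text was written and published


def build_index(records):
    """Group text ids by source, then by genre, as in the corpus hierarchy."""
    index = defaultdict(lambda: defaultdict(list))
    for record in records:
        index[record.source][record.genre].append(record.text_id)
    return index


if __name__ == "__main__":
    # Placeholder records, not real ICA data.
    records = [
        TextRecord("t1", "Newspapers", "Politics", "Egypt", 2006),
        TextRecord("t2", "Books", "Literature", "Lebanon", 2005),
        TextRecord("t3", "Newspapers", "Politics", "Morocco", 2007),
    ]
    index = build_index(records)
    print(index["Newspapers"]["Politics"])
```

Indexing by source first and genre second means a search restricted to one branch of the hierarchy only has to visit the texts filed under that branch.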
The ICA software has been built to upload ICA texts and to help researchers who need reliable data about the Arabic language interrogate the corpus. It provides an overview of the corpus (showing the document’s position in the hierarchy of the ICA corpus) and presents all the corresponding information on request.
It also provides different search options and search methods. Search options enable the user to select the area within the corpus in which to search for the required information: the current document, the current location within the corpus hierarchy or the whole corpus. Another search option enables the user to select how the results are displayed: in context or separately.
The ICA software also provides different search methods: Exact Match, Wildcard or Regular Expression.
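The three search methods and the two display options can be illustrated with Python's standard `re` and `fnmatch` modules. This is a minimal sketch over a whitespace-tokenized sentence, not the ICA software's implementation; the function names, the `context` parameter and the sample sentence are assumptions made for the example. The search scope (document, hierarchy location or whole corpus) is modeled simply by which token list is passed in.

```python
import fnmatch
import re


def compile_query(query, method):
    """Turn a query into a compiled pattern for one of the three methods."""
    if method == "exact":
        return re.compile(re.escape(query))
    if method == "wildcard":
        # fnmatch.translate converts shell-style wildcards (*, ?) to a regex.
        return re.compile(fnmatch.translate(query))
    if method == "regex":
        return re.compile(query)
    raise ValueError(f"unknown search method: {method}")


def search(tokens, query, method="exact", context=0):
    """Return matching tokens, or each match with `context` words either side."""
    pattern = compile_query(query, method)
    hits = []
    for i, token in enumerate(tokens):
        if pattern.fullmatch(token):
            if context:
                window = tokens[max(0, i - context): i + context + 1]
                hits.append(" ".join(window))
            else:
                hits.append(token)
    return hits


if __name__ == "__main__":
    tokens = "قرأ الولد كتابا في المكتبة".split()
    print(search(tokens, "كتابا"))                # exact match
    print(search(tokens, "*كت*", "wildcard"))     # wildcard
    print(search(tokens, "ال.+", "regex"))        # regular expression
    print(search(tokens, "كتابا", context=1))     # match shown in context
```

Passing `context=0` corresponds to displaying results separately, while a positive value shows each hit with its neighboring words.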
The stem-based approach (concatenative approach) has been adopted as the linguistic approach to analyzing the ICA. The Buckwalter morphological analyzer has been chosen to analyze the ICA since it provides a wide range of information, such as Lemma, Vocalization, Part of Speech (POS), Gloss, Prefix(es), Stem, Word class, Suffix(es), Number, Gender, Definiteness and Case. The analysis takes place in three stages:
1. Disambiguating word senses: in this stage, the suitable analysis for each word is chosen according to its context.
2. Modifying and adding linguistic information: in this stage the ICA team either manually corrects some information in Buckwalter’s output, such as Gender, Number and Definiteness, according to the words’ morphosyntactic properties, or adds information that exceeds the scope of Buckwalter’s analysis, such as the features (NOUN_PROP) for proper nouns, (ADV_T) for time adverbs, (ADV_P) for place adverbs and (ADV_M) for manner adverbs. Root information and Word Pattern are also added for more accuracy.
3. Manually analyzing unanalyzed words: words are analyzed manually according to their contexts, in the same manner as if they had been analyzed automatically, when:
1. Buckwalter's analyzer does not provide any solution. This usually happens with colloquial words, loan words or commonly used non-Arabic words.
2. Buckwalter's analyzer does not provide the solution suitable for the word’s context.
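The decision logic of the stages above can be sketched as follows. This does not call the Buckwalter analyzer and does not reproduce its output format: the `Analysis` record, its fields and the `resolve` function are hypothetical, and the candidate analyses are assumed to have been produced elsewhere. In the ICA workflow the suitability judgment is made by the annotators from the context; here it is stood in for by a callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Analysis:
    """Hypothetical, reduced record of one candidate analysis of a word."""
    lemma: str
    pos: str
    gloss: str


def resolve(
    candidates: Sequence[Analysis],
    is_suitable: Callable[[Analysis], bool],
) -> Tuple[Optional[Analysis], bool]:
    """Pick the first candidate that fits the context.

    Returns (analysis, needs_manual). needs_manual is True when the analyzer
    offered no solution at all, or none that suits the word's context.
    """
    for candidate in candidates:
        if is_suitable(candidate):
            return candidate, False
    return None, True


if __name__ == "__main__":
    # Placeholder candidates for an ambiguous word, for illustration only.
    candidates = [
        Analysis("lemma_a", "NOUN", "gloss a"),
        Analysis("lemma_b", "VERB", "gloss b"),
    ]
    chosen, manual = resolve(candidates, lambda a: a.pos == "VERB")
    print(chosen, manual)
    print(resolve([], lambda a: True))  # no solution: flagged for manual analysis
```

Both conditions listed above end in the same outcome in this sketch: the word is flagged and left for manual analysis.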