Problem definition:
The problem of identifying proper names is particularly difficult for Arabic, since:
- Non-vocalization: Ordinary texts omit short vowels, which produces a high degree of ambiguity. In theory, only the Koran and children’s books are fully vowelled.
- Lack of capitalization: Names in Arabic do not start with capital letters, so we cannot mark them in the text by looking at the first letter of a word.
- Delimitation problems: These stem from the lack of information about unknown words within named entities (NEs), from antonomastic usage (a proper name is replaced by a descriptive phrase, or the reverse), and from homonyms, which increase ambiguity when trying to mark NE constituents.
Output: A database containing all the names found in the articles.
Approach:
First, we mark the phrases that might include names; second, we build graphs to represent the words in these phrases and the relationships between them; third, we apply rules to find the names.
1. Mark the phrases that might include names:
To tag the name phrases, we look for keywords and special verbs in the text. We assume that a name is no more than three words away from the keyword or special verb, and that the longest name is seven words. We therefore mark 10 words (3 + 7) to the left and 10 words to the right of the keyword or special verb to identify the name phrases.
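The marking step can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name mark_name_phrases, the trigger set, and the English placeholder tokens are all hypothetical, and "left" and "right" are approximated as "before" and "after" in token order.

```python
def mark_name_phrases(tokens, triggers, window=10):
    """Collect the context around every keyword or special verb.

    tokens   -- list of words in reading order
    triggers -- set of keywords/special verbs (hypothetical contents)
    window   -- words kept on each side (10 = 3 gap words + 7 name words)
    Returns a list of (trigger_index, words_before, words_after).
    """
    phrases = []
    for i, tok in enumerate(tokens):
        if tok in triggers:
            before = tokens[max(0, i - window):i]   # clipped at text start
            after = tokens[i + 1:i + 1 + window]    # clipped at text end
            phrases.append((i, before, after))
    return phrases


# Placeholder tokens; a real run would use tokenized Arabic text.
tokens = ["yesterday", "the", "president", "john", "smith", "visited", "doha"]
for index, before, after in mark_name_phrases(tokens, {"president"}, window=2):
    print(index, before, after)
```

A fixed window keeps the step simple at the cost of pulling in words that are not part of any name; the later graph and rule stages are what filter those out.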
2. Build graphs:
We use directed graphs to represent the words found in the name phrases, the relative frequency (weight) of each, and the relationship between them. The nodes in the graph represent the words and the edges represent the relationships (weight) between them. The relationship (weight) between two nodes represents the number of times these two words are mentioned next to each other in the name phrase.
For the organization, event, and location classes we built one graph for each keyword. For the people class we built one graph for all keywords, because the name of a given person might be connected to different titles in different articles.
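A minimal sketch of such a graph, assuming raw counts stand in for the weights and that an edge points from a word to the word that follows it in reading order. The function name build_graph and the sample phrases are hypothetical.

```python
from collections import defaultdict


def build_graph(phrases):
    """Build a weighted directed graph from marked name phrases.

    phrases -- list of word lists (one list per marked phrase)
    Returns (node_weight, edge_weight):
      node_weight[word]      = how often the word occurs in the phrases
      edge_weight[(a, b)]    = how often b directly follows a
    """
    node_weight = defaultdict(int)
    edge_weight = defaultdict(int)
    for words in phrases:
        for word in words:
            node_weight[word] += 1
        # Adjacent pairs become directed edges a -> b.
        for a, b in zip(words, words[1:]):
            edge_weight[(a, b)] += 1
    return dict(node_weight), dict(edge_weight)


# Placeholder phrases; one graph per keyword would be built this way.
nodes, edges = build_graph([["john", "smith"], ["john", "smith", "said"]])
print(nodes)
print(edges)
```

For the people class, the same function would be called once on the phrases of all keywords together; for the other classes, once per keyword.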
3. Apply rules:
After we find a candidate name, we apply the following formula to confirm it.
| Name | * Weight ( Name ) > R1
Where:
- | Name |: length of the name (number of words forming the name).
- Weight (Name): number of times the name appears in the text.
- R1: a threshold.
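The confirmation rule translates directly into code. In this sketch the function name confirm_name is hypothetical, and the value used for R1 is an arbitrary illustration, since the text does not give one.

```python
def confirm_name(name_words, weight, r1):
    """Apply |Name| * Weight(Name) > R1.

    name_words -- list of words forming the candidate name
    weight     -- number of times the name appears in the text
    r1         -- threshold (value is an assumption, not from the source)
    """
    return len(name_words) * weight > r1


# A two-word name seen 3 times scores 6; a one-word name seen 3 times scores 3.
print(confirm_name(["john", "smith"], 3, r1=5))
print(confirm_name(["john"], 3, r1=5))
```

Multiplying length by frequency means that longer names need fewer occurrences to be confirmed than short ones.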
Name Classifier:
After we find the name we classify it with respect to its major class and its subclass.
- Major class: people, organization, location, product, etc.
- Sub-class: president, mister, commander, professor, bank, store, city, state, camp, etc.
We use the following two conditions to classify the names:
pos (Name | KWi) >= R2
and
(pos (Name | KWi)) / (pos (Name | KWi) + neg (Name | KWi)) >= R3
Where:
pos (Name | KWi): number of times the name is found attached to the keyword KWi.
neg (Name | KWi): number of times the name is found attached to keywords other than KWi.
R2, R3: thresholds.
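The two conditions can be checked together as in the sketch below. The function name classify, the sample counts, and the threshold values are hypothetical; neg is computed as the total attachments of the name minus pos, which follows from the definitions above.

```python
def classify(counts, r2, r3):
    """Return the keywords whose class the name is assigned to.

    counts -- dict mapping keyword -> times the name is attached to it
    r2     -- minimum absolute count (assumed value)
    r3     -- minimum share of all attachments (assumed value)
    """
    total = sum(counts.values())
    accepted = []
    for keyword, pos in counts.items():
        neg = total - pos  # attachments to every other keyword
        if total > 0 and pos >= r2 and pos / (pos + neg) >= r3:
            accepted.append(keyword)
    return accepted


# A name seen 8 times with "president" and 2 times with "minister".
print(classify({"president": 8, "minister": 2}, r2=3, r3=0.7))
```

The first condition guards against classifying on too little evidence; the second requires the keyword to dominate among all keywords the name appears with.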
Results:
The new technique has been tested on 500 articles from the Al-Raya newspaper (2003), published in Qatar. The module identified 335 names, missed 92 names, and extracted 8 names mistakenly.
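As a rough reading of these counts, precision and recall can be derived as below. This assumes that the 335 identified names are all correct and that the 8 mistaken extractions are counted separately; the text does not say which, and the derived figures are not reported in the source. The function name precision_recall is hypothetical.

```python
def precision_recall(correct, missed, spurious):
    """Compute precision and recall from raw extraction counts.

    correct  -- names correctly identified
    missed   -- names present in the text but not found
    spurious -- strings extracted as names by mistake
    """
    precision = correct / (correct + spurious)
    recall = correct / (correct + missed)
    return precision, recall


# Assumption: 335 correct, 92 missed, 8 spurious.
p, r = precision_recall(335, 92, 8)
print(round(p, 3), round(r, 3))  # 0.977 0.785
```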
Reference:
Saleem Abuleil, 2003. "Extracting Names From Arabic Text For Question-Answering Systems". Chicago State University, MIS Department.