Tuesday, 11 September 2012

Artificial Intelligence and Basic English: A New Approach to Translation

I find that one of the greatest challenges of our time is communication. Once communication between individuals is achieved, however, the next step is understanding. English is the predominant language for human interaction across the world today: 55% of Internet content is in English, and most top universities, business transactions, entertainment, and scientific and technological publications use English. At the technical level, telecommunications is one of the fastest-growing industries.

Many translation tools are available, but they are in general of very low quality because of the complexity of correctly deriving equivalences between words and phrases in two distinct natural languages. Translation is difficult for numerous reasons, including:

the lack of one-to-one word correspondences among languages
the existence in every language of homonyms
the fact that natural grammars are idiosyncratic

Natural grammars do not conform to an exact set of rules that would facilitate direct, word-for-word substitution. It is toward a computational "understanding" of these idiosyncrasies that many artificial intelligence research efforts have been directed, and their limited success testifies to the complexity of the problem. An alternative is to interact in a language which is widely understood and which many people wish to learn, even if only at a basic conversational level, in order to interact and be entertained, as is the case with English. The difficulty that then arises is how to assimilate complex material when only a colloquial level of knowledge is available.

Chinese writing, for instance, possesses more than 40,000 mainly ideographic signs, but knowledge of 4,000 is enough for most purposes. Insofar as it is phonetic, Chinese writing is also monosyllabic, for the very good reason that the words of the language consist of only one syllable. The resulting large number of homophones made it important to have signs that distinguished between them, and so the script avoided being purely phonetic. Even in this case, an early simplification, such as the one performed by James Yen in 1923, resulted in a selection of 1,200 of the traditional characters to form what can be called Basic Chinese, enabling illiterate people to read in this system after four months' work. A later refinement by Yuan Chao produced a system of about 2,500 of the traditional characters which, it was claimed, can cover basically all of the language. The Japanese resolved the basic linguistic problem by adding hiragana. Children are taught 1,200 of the 40,000 symbols, which often contain a Chinese root and suffixes.

Another attempt at devising a simplified version of a language is Basic English, proposed by Charles K. Ogden in the 1920s. The fact that it is possible to say almost everything we normally wish to say with 850 words makes Basic English extremely attractive. With the addition of 100 words required for any general field such as the sciences, and 50 internationally recognized words, a total of 1,000 words enables successful communication.

Imagine now that, instead of translating a complex, technical text into your own language, you are able to simplify its vocabulary so that you can understand the words and, therefore, the essence of the content. This is another approach to translation, which should be referred to as conversion, since the vocabulary is converted into a subset of the same language. This is how I see the future of translation: converting content into a reduced-vocabulary representation of the same language in order to simplify it and make it understandable.
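To make the idea of conversion concrete, here is a minimal sketch of a word-level pass. Everything in it is an assumption for illustration: the names `BASIC_WORDS`, `REWRITES` and `convert` are hypothetical, the word list is a tiny stand-in rather than Ogden's actual vocabulary, and a real converter would also have to deal with grammar, word senses and capitalization.

```python
import re

# Hypothetical toy data: a tiny stand-in, NOT Ogden's actual word list.
BASIC_WORDS = {"the", "doctor", "will", "look", "at", "your", "get"}

# Hypothetical hand-written rewrites from harder words to basic phrases.
REWRITES = {
    "physician": "doctor",
    "examine": "look at",
    "obtain": "get",
}

def convert(text, basic_words=BASIC_WORDS, rewrites=REWRITES):
    """Swap known hard words for basic phrases.

    Returns the converted text and the list of words that are neither
    basic nor covered by a rewrite, so they can be handled separately.
    """
    unresolved = []

    def swap(match):
        word = match.group(0)
        key = word.lower()
        if key in basic_words:
            return word            # already in the reduced vocabulary
        if key in rewrites:
            return rewrites[key]   # replace with the basic phrase
        unresolved.append(word)    # keep the word, but report it
        return word

    converted = re.sub(r"[A-Za-z']+", swap, text)
    return converted, unresolved

converted, unresolved = convert("The physician will examine your arm.")
print(converted)    # The doctor will look at your arm.
print(unresolved)   # ['arm']
```

The words reported as unresolved are the ones that would need a definition rather than a direct substitution.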

Clearly, where complex or ambiguous material is being turned from English into a reduced-vocabulary representation, there will be some loss of semantic content. However, material of a legal, business, scientific and technological nature is normally produced in a way that seeks to be both precise and clear, and is therefore amenable to a reduced-vocabulary representation. On the Internet, on the other hand, if we consider scientific and technological words, the required vocabulary comes closer to 100,000 words and is therefore well beyond the capacity of the average English-as-a-foreign-language student. Hence, using Basic English to (re-)define those words allows people with a basic knowledge of English to understand almost 100,000 words while knowing only 1,000 of them.
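Redefining a large vocabulary in basic words only helps if the definitions themselves stay inside the basic list. The following is a small sketch of such a check, under stated assumptions: `BASIC_WORDS` is a hypothetical stand-in for the 1,000-word list, and `words_outside_basic` is an illustrative helper, not part of any existing tool.

```python
import re

# Hypothetical stand-in for the 1,000-word list; not Ogden's actual vocabulary.
BASIC_WORDS = {"a", "the", "way", "plants", "make", "food", "from", "light"}

def words_outside_basic(definition, basic_words=BASIC_WORDS):
    """Return the words of a definition that are not in the basic list,
    in order of first use and without repeats."""
    outside = []
    for word in re.findall(r"[a-z']+", definition.lower()):
        if word not in basic_words and word not in outside:
            outside.append(word)
    return outside

print(words_outside_basic("the way plants make food from light"))  # []
print(words_outside_basic("the way plants synthesize food"))       # ['synthesize']
```

A definition that returns an empty list can be read by anyone who knows the basic words; any other result points to the words that still need rewording.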

Simplish has implemented an automatic translation tool, based on converting Standard English into Basic English, so that a user with even a basic conversational level of English can understand English content, however complex. More complex scientific words are explained in footnotes wherever they occur, using these 1,000 basic words. Simplish can be used for free to process texts of fewer than 500 words, whereas registered users can process files of up to 25,000 words, have some space for personal files on the server, and add words to a personal dictionary so that the system can adapt to each user's level of knowledge.
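The footnote idea can be illustrated with a short sketch. This is not Simplish's implementation, which is not described here; `GLOSSARY` and `add_footnotes` are hypothetical names, and the single glossary entry is a hand-written example.

```python
import re

# Hypothetical glossary: each definition is written by hand in simple words.
GLOSSARY = {
    "photosynthesis": "the way plants make food from light",
}

def add_footnotes(text, glossary=GLOSSARY):
    """Mark every glossary word with a footnote number and append the notes."""
    notes = []     # footnote lines, in order of first appearance
    numbers = {}   # glossary word -> footnote number

    def mark(match):
        word = match.group(0)
        key = word.lower()
        if key not in glossary:
            return word
        if key not in numbers:
            numbers[key] = len(numbers) + 1
            notes.append(f"[{numbers[key]}] {key}: {glossary[key]}")
        return f"{word}[{numbers[key]}]"

    body = re.sub(r"[A-Za-z]+", mark, text)
    if not notes:
        return body
    return body + "\n\n" + "\n".join(notes)

print(add_footnotes("Photosynthesis needs light."))
# Photosynthesis[1] needs light.
#
# [1] photosynthesis: the way plants make food from light
```

A personal dictionary, as mentioned above, could be modelled in this sketch by removing from the glossary the words a given user already knows, so that those words are no longer footnoted.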