The process of localization
Localization requires three steps: linguistic analysis, basic localization and advanced application development. Linguistic analysis is required to unambiguously define the language conventions and norms that are to be modelled by technology. As implied, basic localization caters only to the rudimentary needs of end-users, including input and output of text in a local language. However, to give comprehensive access to novice users and illiterate populations, or to assist in content development in a local language, more advanced applications need to be developed. Further details are given in this section.
Successful language computing is largely dependent on good linguistic analysis based on cultural conventions. Very precise definitions are required for all relevant linguistic phenomena. However, for many languages in Asia Pacific, linguistic details are either incomplete or unavailable. Moreover, relevant cultural conventions are rarely documented. This poses a significant obstacle to localization and requires the involvement of indigenous expertise.
The initial linguistic details, which have to be agreed and standardized for basic localization, include (but are not limited to) the following: the writing system3 and character set used by the language for its publishing needs; the ordering of these characters; cultural conventions for representing numbers, time and the calendar; and translation of common terms used in the software interface. This has to be done by the appropriate language or cultural authorities at the national level. Experience shows that debate4 is inevitable in this process of standardization. It is important that the discussions and solutions be based on linguistic merit and not be driven by technology constraints, although all discussions must involve both linguists and technologists, the latter to challenge any ambiguities in the proposals from a technical perspective.
In addition, a detailed linguistic analysis of the script, speech and grammar of the language is required for advanced application development. The analysis encompasses the sound system of the language and its acoustic details, word and phrase structures, and the representation of meaning in the language. These details need to be clearly documented for eventual implementation, as further explained in this section.
Standardization and basic localization support
Once the discussions on the writing system and basic language details are finalized at the national level, the next step is to derive the relevant standards for computing and subsequently develop computer software and hardware to enable local language input and output based on these standards. At the minimum, encoding, keyboards (and input methods), fonts (and rendering engines), definition of cultural conventions (for time, calendar and numbers) and interface translation must be enabled. Once defined, the keyboard, font and locale support must be incorporated in the operating systems (for example, Linux, Sun Unix, Microsoft Windows, IBM AIX, Apple Mac OS and others) and at least the basic applications, including word processors (for example, Emacs, GEdit, KEdit, Open Office, Word), e-mail clients (for example, Thunderbird, Outlook), Web browsers (for example, Firefox, Internet Explorer), chat software and the like, according to end-user requirements. These steps are briefly discussed below.
As computers can only manipulate numbers and not characters, to process a language each character in it has to be assigned a unique number.5 This process is called encoding. The process can be done in a non-standard way by arbitrarily assigning numbers to different letters in the language.6 However, non-standard encoding inhibits data sharing across multi-user applications, including Web access, e-mailing and chatting. Therefore, the encoding should be done through the international standard ISO 10646 or Unicode.7 If the Unicode standard does not support a language, or only partially supports it, this standard should be enhanced by submitting a proposal, channelled through appropriate national bodies, to add new characters.8
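The idea that each character is assigned a unique number can be illustrated with a short Python sketch (the Bengali letter is chosen arbitrarily for illustration): the character has a Unicode code point, and a standard encoding such as UTF-8 turns it into the bytes that are actually stored and transmitted, losslessly in both directions.

```python
ch = "\u0995"                  # BENGALI LETTER KA
cp = ord(ch)                   # the character's unique number (code point)
assert cp == 0x0995            # U+0995

encoded = ch.encode("utf-8")   # bytes actually stored or transmitted
assert encoded == b"\xe0\xa6\x95"

decoded = encoded.decode("utf-8")
assert decoded == ch           # the round-trip is lossless
print(f"U+{cp:04X}")           # U+0995
```

Because every system agrees on the same number-to-character assignment, data encoded this way can be exchanged across the Web, e-mail and chat applications without corruption.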
Even if standardization is achieved, there still remains a large repository of information based on arbitrary encodings. Thus, in addition to standardization, additional file-mapping applications need to be developed that will allow the legacy or concurrent content in other encodings to be converted to the standardized encoding.
Keyboard and input method
Once the character set is standardized, keyboard mapping—that is, the placement of characters on the keyboard—needs to be defined. This mapping can be facilitated by extending existing keyboard layouts or by doing character frequency analysis.9 Some languages require complex input methods. For example, because it is not possible to put the thousands of Chinese symbols on the keyboard, different methods based on strokes, Latin character transliteration and handwriting recognition are used to input text in Chinese (Wikipedia 2006; Hussain et al. 2005). Input methods must be consistently defined and openly standardized to allow users to type in the same way across all computing systems.
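The transliteration-based input methods mentioned above can be sketched as a longest-match-first lookup over typed Latin sequences. The key-to-character mapping below is invented for illustration and is not a real standardized layout:

```python
# Hypothetical romanization-based input method: typed Latin sequences are
# replaced by local-script characters, matching the longest key first so
# that "kh" is not mistaken for "k" followed by "h".
KEYMAP = {
    "kh": "\u0996",  # 'kh' -> BENGALI LETTER KHA
    "k":  "\u0995",  # 'k'  -> BENGALI LETTER KA
    "a":  "\u0986",  # 'a'  -> BENGALI LETTER AA
}

def transliterate(typed: str) -> str:
    out, i = [], 0
    keys = sorted(KEYMAP, key=len, reverse=True)  # longest match first
    while i < len(typed):
        for key in keys:
            if typed.startswith(key, i):
                out.append(KEYMAP[key])
                i += len(key)
                break
        else:                      # no mapping: pass the character through
            out.append(typed[i])
            i += 1
    return "".join(out)

print(transliterate("kha"))  # KHA followed by AA
```

Real input methods add candidate selection, backspace handling and context rules, but the longest-match principle is common to many of them.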
Fonts and rendering
Fonts for languages are required for on-screen display and printing. Simpler writing systems like the Latin and Cyrillic scripts (which are used for most languages spoken in the Americas and Europe) have been modelled by earlier font formats, such as TrueType Fonts (TTF). However, most scripts used for languages in Asia Pacific are more complex due to their cursive nature and context-sensitive character shaping and positioning (Hussain 2004) and therefore require enhanced font formats, such as OpenType Fonts (OTF).10 Once fonts are created for a language, computer software is used to display them on screen, in a process called rendering. Complex writing systems, such as the Nastalique writing style for the Urdu language (Hussain 2004), require a sophisticated rendering engine capable of displaying the font.
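Context-sensitive shaping means that the same letter takes a different glyph depending on its neighbours. The following deliberately simplified sketch computes only the abstract positional form for each letter of a fully joining cursive word; a real rendering engine would use these positions to select actual glyph outlines from an OpenType font, and would also handle non-joining letters and mark positioning:

```python
# Simplified model of the positional analysis a shaping engine performs
# for cursive scripts: a letter's form depends on whether it begins,
# continues or ends the connected sequence. Assumes every letter joins,
# which is an illustrative simplification.
def positional_forms(word_length: int) -> list[str]:
    if word_length == 1:
        return ["isolated"]
    return ["initial"] + ["medial"] * (word_length - 2) + ["final"]

print(positional_forms(1))  # ['isolated']
print(positional_forms(4))  # ['initial', 'medial', 'medial', 'final']
```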
Locale
The locale for a language contains information about the local language and the cultural representation of time, the calendar, numbers and other related information normally visible on computer screens. For example, the date stamp '4/1/2005' usually included with e-mail messages represents '1st of April 2005' in the USA but '4th of January 2005' in the UK. Thus, to define and interpret this information correctly, the language and region of the locale must be clearly declared. In addition to time, date and digit conventions, the locale also defines the order in which words in the language are sorted (collation), which is very important for many applications, for example, developing a voter list or a telephone directory.
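The date-stamp ambiguity can be demonstrated directly in Python: the same string yields two different dates depending on which regional convention is used to parse it.

```python
from datetime import datetime

stamp = "4/1/2005"

# US-style convention: month/day/year
us = datetime.strptime(stamp, "%m/%d/%Y")
# UK-style convention: day/month/year
uk = datetime.strptime(stamp, "%d/%m/%Y")

print(us.strftime("%d %B %Y"))  # 01 April 2005
print(uk.strftime("%d %B %Y"))  # 04 January 2005
```

This is exactly the ambiguity a declared locale removes: once the software knows the region, only one of the two format strings applies.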
The locale may be defined by filling in a given template and submitting it to the Common Locale Data Repository (CLDR) managed by the Unicode Consortium. The locale for each language in each country is defined separately to capture cultural variations, such as bn-BD and bn-IN for the Bengali language (bn) spoken in Bangladesh (BD) and India (IN), respectively. Many Asia Pacific countries have not developed or registered their language locales with the CLDR.
Local language interface
Imagine giving a Nepali speaker a computer that is configured for use in the Japanese language. Such a computer would be impossible for the Nepali speaker to operate because he or she cannot comprehend the words and phrases displayed on the screen. For the majority of users in Asia Pacific who do not understand a foreign language, words and phrases like 'save', 'print', 'edit', 'file' and the like need to be translated and displayed in the local language on the computer screen. About 5,000 words comprise the basic glossary needed to represent menu items for operating systems and basic applications. However, to completely localize all help files and error messages, careful translation of more than 300,000 phrases may be required.
Translating a glossary is challenging because there are many words that do not have local language equivalents, such as the word 'cursor'. Either such words are transliterated or new senses of the existing local language words need to be formulated. This creative exercise requires language experts who are proficient in the use of computers, a rare combination of skills in the developing Asia Pacific region.
Once translated, the basic glossary should be verified by language authorities, published as a national standard (for example, DzongkaLinux Team 2007) and supplied to vendors and organizations (for example, Debian, Red Hat, Microsoft, IBM, Apple) for incorporation within their platforms.
Basic application localization
Once the basic linguistic analysis is completed and localization support is developed, this support will need to be integrated at two levels. First, the support must be included in the basic operating system being used, for example, Linux, Microsoft Windows, Apple MacOS, IBM AIX, Sun Unix. The operating system would enable the encoding, allow the locale of the language to be defined, and allow the input and output methods to be used effectively. Interface translation in the operating system must also be enabled. Second, once the operating system is enabled, basic applications must be localized. These applications include word processors, e-mail clients, Web browsers, chat clients and other general and customized applications. However, this only provides basic access to trained users. For wider, more effective access for general users, advanced local language computing applications will also need to be developed.
Advanced language computing applications
Basic localization should not be the final goal because it does not completely meet the objective of giving end-users meaningful access to computing. Advanced language technology is required to further facilitate access for end-users and enable them to generate local language content. Advanced language computing requires in-depth speech and linguistic analysis as well as complex programming for implementation, drawing from the fields of phonetics, phonology, morphology, syntax, semantics, signal and speech processing, image processing, language processing, artificial intelligence and statistics. Moreover, a significant amount of local language resources is needed to develop these applications, as further explained in the following sections.
Language resources are required by advanced applications to create language models. These resources include, first, a list of words in the language tagged with minimal linguistic information (for example, part-of-speech [POS],11 gender, number).12 These word lists (or lexicons) are needed to develop applications like spelling checkers. Many applications also require a large amount of typed text in the language, called a language corpus. This is used to extract word frequencies, word collocations and other grammatical information for statistical language processing. A corpus of 10–100 million words from different text genres is required for different kinds of statistical modelling.
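Extracting word frequencies and simple collocations from a corpus is straightforward once the text is available; the toy corpus below stands in for the tens of millions of words a real system would use:

```python
from collections import Counter

# Toy corpus; real statistical modelling needs 10-100 million words.
corpus = "the cat sat on the mat the cat ran".split()

word_freq = Counter(corpus)                 # unigram frequencies
bigrams = Counter(zip(corpus, corpus[1:]))  # adjacent word pairs
                                            # (simple collocations)

print(word_freq.most_common(2))   # [('the', 3), ('cat', 2)]
print(bigrams[("the", "cat")])    # 2
```

Frequencies like these feed directly into spelling checkers (ranking correction candidates) and statistical language models (estimating which word sequences are likely).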
Part of the corpus must also be manually tagged with POS and other linguistic information to infer automatic models for processing text through machine learning13 techniques. For example, a text corpus manually tagged with POS is used to develop an automatic POS tagger. The POS tagger is used in almost all advanced applications, for example, to decide whether to stress the first or second syllable of a word like 'address'14 for a text-to-speech system. The Urdu language shows a similar variation, for example, in the word ulta ('upside down' vs. 'to turn upside down'). Another such critical system is word segmentation: in Asia Pacific, many languages like Chinese, Dzongkha, Khmer, Lao, Thai, Urdu and Burmese do not use spaces between words, which makes it difficult to determine word boundaries in typed text. The word boundaries have to be guessed based on advanced linguistic and statistical techniques. Solving this problem is fundamental for any further processing of these languages by machines, for example, line-wrapping in word processing or spell checking. In addition to tagging text corpora, the computational grammars of these languages need to be developed and documented.
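The simplest word segmentation approach is greedy longest-match against a lexicon. The sketch below uses an English lexicon with the spaces removed as a stand-in for a space-less script; real segmenters combine dictionaries with the statistical models described above to resolve ambiguous boundaries:

```python
# Greedy longest-match word segmentation for scripts written without
# spaces. The lexicon is a toy English stand-in for illustration.
LEXICON = {"word", "segment", "segmentation", "is", "hard"}

def segment(text: str) -> list[str]:
    words, i = [], 0
    max_len = max(map(len, LEXICON))
    while i < len(text):
        # Try the longest dictionary word starting at position i first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in LEXICON:
                words.append(candidate)
                i += length
                break
        else:                       # unknown character: emit it alone
            words.append(text[i])
            i += 1
    return words

print(segment("wordsegmentationishard"))
# ['word', 'segmentation', 'is', 'hard']
```

Greedy matching fails on genuinely ambiguous strings, which is precisely why statistical disambiguation over a tagged corpus is needed for production systems.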
Speech corpora are required for developing speech applications. These must be recorded for narrative and conversational speech over different channels, including microphone, telephone and mobile phone, covering a variety of speakers and dialects. Finally, script corpora need to be developed for script processing applications. These corpora must include large samples of different typefaces and handwriting styles, manually tagged along various linguistic dimensions.
Once the language resources are available, they can lead to the development of advanced applications, which can be broadly categorized into two sets: those which provide access to existing content and others which assist in generating new content in local languages.
Applications to provide access to information
As discussed, basic applications like word processors, e-mail clients, Web browsers and chat clients provide basic access to trained users once they are localized. However, additional applications can further enrich the computing experience. Most of the population in developing Asia is illiterate, and enabling computing in local languages still does not give this population access to the online information that is otherwise available to literate individuals. They need a speech interface that reads out online text to users (text-to-speech systems, or TTS), as well as technology to 'listen' to users (robust automatic speech recognition systems, or ASR) and to interpret their requests (language understanding systems). Also needed are search engines and advanced information retrieval (IR) systems that can sift through existing online data and seek out and display requested information. All these must be possible in local languages. While generic software programs with open licenses are already available, these programs have to be trained (and sometimes enhanced) for Asian languages.
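At the heart of the search and IR systems mentioned above is an inverted index, which maps each word to the documents containing it. A minimal sketch (the documents and tokenization are simplified for illustration; real systems add stemming, ranking and language-specific tokenization, including the word segmentation discussed earlier):

```python
# Minimal inverted index: maps each token to the set of documents that
# contain it, then answers conjunctive queries by set intersection.
documents = {
    1: "local language computing",
    2: "language resources for computing",
    3: "speech interface for users",
}

index: dict[str, set[int]] = {}
for doc_id, text in documents.items():
    for token in text.split():
        index.setdefault(token, set()).add(doc_id)

def search(query: str) -> set[int]:
    """Return ids of documents containing every query word."""
    results = [index.get(tok, set()) for tok in query.split()]
    return set.intersection(*results) if results else set()

print(sorted(search("language computing")))  # [1, 2]
```

For languages written without spaces, the `text.split()` step must be replaced by an automatic word segmenter, which is one reason segmentation is described above as fundamental to further processing.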
Once core technology like TTS, ASR and IR is enabled, it has to be integrated into Interactive Voice Response (IVR) and other dialogue-based systems to 'communicate' with end-users. As the core technologies are developed by a variety of vendors, standardized ways of integrating them need to be developed. There are ongoing standardization efforts: for example, the World Wide Web Consortium (W3C) is developing the Speech Synthesis Mark-up Language (SSML) and the Voice Extensible Mark-up Language (VoiceXML) to allow voice browsing, in addition to the widely used text browsing standard, the Hyper Text Mark-up Language (HTML). Voice browsing allows users to interact with a website and access its content through a speech interface. This can greatly enhance Web use in developing Asia, especially among illiterate and visually impaired communities.
Applications for content generation
Even if access is possible, it is still necessary to have relevant content available in local languages for end-users. At present very limited online content is available in the languages of Asia Pacific. There are three general ways to generate online content: (a) develop original content, (b) copy content from printed sources in local languages and (c) translate existing content in a foreign language. The localized common applications used for access, such as word processors, email clients, Web development tools and chatting software, may be used for content generation as well.
Although online content development is a slow process, script and language technology can accelerate it. And although there is little online content in Asia Pacific languages, there is a lot of printed content. Using Optical Character Recognition (OCR) systems, which scan printed documents and books and automatically convert the images to editable text, this printed material can be quickly transformed into searchable online content.
In addition to content in the local languages, there is also a large amount of universally useful content available in foreign languages, including English (35.2 per cent), Chinese (13.7 per cent) and Japanese (9 per cent) (Global Reach 2004). This content can also be translated to local languages quickly by developing automatic Machine Translation (MT) systems. Automatic translation, although not very accurate, provides access to content that is otherwise completely inaccessible. Automatic translation can be made more accurate with human assistance (where required) at a significantly lower cost compared to a completely manual translation.
TTS, ASR, OCR and MT are advanced applications that require considerable language resources and linguistic and computational analysis. These applications also require dedicated input from specialized human resources over a considerable period. An MT application could take a team of 10 linguists and computational linguists five years to develop.15 Usable TTS, ASR and OCR systems could take a team of 10 linguists, engineers and computational linguists three years each to develop. To mature and perfect these applications would require continuous focus for an even longer period.
Licensing is an additional problem with online content, even where technology is available for accelerated online publishing. Much of the available content is copyrighted, which makes it difficult to disseminate. Newer regimes that allow much more open use of content, such as Creative Commons, are emerging. Wikipedia, which makes information freely available in many languages, is an excellent outcome of these movements towards open content.