Regional and international organizations
Development of local language computing applications and content requires a sustained effort. Many regional and international organizations have been contributing to this development across Asia Pacific. These organizations are involved in: (a) standards development and (b) technology development. Moreover, there are many funding agencies in the region that are supporting local language computing development, notably the International Development Research Centre (IDRC) of Canada, Center of the International Cooperation for Computerization (CICC) of Japan, National Institute of Information and Communications Technology (NICT) of Japan, United Nations (through UNESCO and the UNDP-Asia Pacific Development Information Programme or APDIP) and Asia IT&C Grants by the European Union.
This section lists some of the major regional standards and technology development organizations supporting local language computing in Asia Pacific and explains the role they play in this context. National and regional initiatives need to develop liaisons with these organizations, for example by subscribing to the multiple online discussion forums that they maintain or by attending the regular meetings, conferences and special workshops organized by them. Where funds are required, the funding organizations listed provide such support.
Unicode Consortium
The Unicode Consortium develops the Unicode standard, the standard character encoding scheme for the multilingual Internet, which is kept synchronized with ISO/IEC 10646. The Consortium aims to provide standard encodings for all characters and symbols used in the scripts of all the world's languages (Unicode 2006). In addition, it publishes guidelines for collation, bidirectionality, reordering and line-breaking, which are fundamental to text processing for many Asian languages based on the Unicode standard. Even though legacy national and proprietary encodings are still in use, most nations across Asia Pacific are now switching to Unicode. Beyond encoding, the Consortium has recently begun collecting and maintaining locale data for all languages through the Common Locale Data Repository (CLDR) project.
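As a small illustration (not part of the original text), Python's standard unicodedata module can show what the Unicode standard guarantees: every character has a single, fixed code point and name, while different encodings merely serialize that same code point into different byte sequences.

```python
import unicodedata

# A Devanagari character used in Hindi and other South Asian languages.
ch = "\u0915"  # क

# Every character has a single standard code point and name in Unicode.
print(f"U+{ord(ch):04X}")      # U+0915
print(unicodedata.name(ch))    # DEVANAGARI LETTER KA

# The same code point serializes to different byte sequences depending
# on the encoding form, but the character's identity never changes.
print(ch.encode("utf-8"))      # b'\xe0\xa4\x95'
print(ch.encode("utf-16-be"))  # b'\t\x15'
```

This stability across encodings is what lets legacy national encodings be replaced by Unicode without losing information.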
World Wide Web Consortium (W3C)
W3C develops guidelines, standards and software for publishing multilingual online content. Its Internationalization Working Group is tasked with keeping these specifications multilingual. W3C maintains the HTML standard, which is used for creating multilingual Web pages. In addition, it is developing the SSML and VoiceXML standards, which are used for voice browsing, that is, accessing the Internet through speech. The organization is also developing multimodal content publishing standards for more effective Web accessibility, including access by people with disabilities.
Internet Corporation for Assigned Names and Numbers (ICANN)
Currently, Web access requires typing a Web address (also called a domain name or URL) in ASCII Latin characters. For populations that do not read English, this is one of the most significant hurdles to accessing online content. Web addresses, which are the key to entering the multilingual World Wide Web, should also be available in local languages. ICANN is responsible for the global coordination of Web addresses16 and recently introduced Internationalized Domain Names (IDNs) through RFCs 3454, 3490, 3491 and 3492, collectively called the IDN standards (ICANN 2006). IDNs allow Web addresses in local languages. However, because the domain name system is based on seven-bit ASCII, Unicode cannot be used directly, and multilingual IDNs are converted to ASCII Compatible Encoding (ACE) before an address is resolved. Still being debated is how to enable Top-Level Domains (TLDs) in local languages and who will control them (Butt 2006; Huston 2006). Because of this continuing controversy, independent systems have also been developed, for example by the China Internet Network Information Center (CNNIC). ICANN and IDNs are bound to play a critical role in making the multilingual Internet accessible.
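The ACE conversion the paragraph describes can be sketched with Python's built-in idna codec, which implements the Punycode-based scheme of RFC 3490/3492. The domain below is purely illustrative.

```python
# Python's built-in "idna" codec performs the ACE conversion of
# RFC 3490/3492: each non-ASCII label is Punycode-encoded and
# prefixed with "xn--" so the legacy ASCII-only DNS can resolve it.
unicode_domain = "bücher.example"  # illustrative domain, not a real site

ace = unicode_domain.encode("idna")
print(ace)                         # b'xn--bcher-kva.example'

# The transformation is reversible, so browsers can display the
# local-language form while the network sees only ASCII.
print(ace.decode("idna"))          # bücher.example
```

This is why IDNs work without any change to the underlying DNS infrastructure: only endpoints need to understand the encoding.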
International Organization for Standardization (ISO)
ISO develops the ISO/IEC 10646 (Unicode) standard jointly with the Unicode Consortium. Its technical committee TC 37 develops standards for 'Terminology and Other Language and Content Resources', including specifications for lexica, corpora and other language content. These language resource standards are still being discussed and finalized and are not yet in wide use. Other related standards include ISO 3166 for country codes and ISO 639 for language codes, which are used in locale definitions by Unicode within CLDR and by other organizations including W3C and ICANN. For example, ur_PK denotes the Urdu language locale as used in Pakistan.
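A locale identifier of the kind mentioned above simply joins an ISO 639 language code to an ISO 3166 country code. A minimal sketch, in which the code tables are a tiny illustrative subset of the two standards rather than a complete registry:

```python
# A locale identifier joins an ISO 639 language code with an
# ISO 3166 country code in the form <language>_<COUNTRY>.
# These tables are a small illustrative subset of both standards.
ISO_639 = {"ur": "Urdu", "zh": "Chinese", "th": "Thai"}
ISO_3166 = {"PK": "Pakistan", "CN": "China", "TH": "Thailand"}

def make_locale(language: str, country: str) -> str:
    """Compose a locale identifier such as 'ur_PK', validating
    both parts against the (subset) code tables."""
    if language not in ISO_639:
        raise ValueError(f"unknown ISO 639 code: {language!r}")
    if country not in ISO_3166:
        raise ValueError(f"unknown ISO 3166 code: {country!r}")
    return f"{language}_{country}"

print(make_locale("ur", "PK"))  # ur_PK: Urdu as used in Pakistan
print(make_locale("zh", "CN"))  # zh_CN: Chinese as used in China
```

Splitting language from country lets one language carry several regional conventions (dates, numbers, collation) without redefining the language code itself.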
Free and Open Source Software (FOSS) initiatives
Notable among software development initiatives for multilingual computing is the FOSS community, which provides internationalized software applications that allow rapid localization, all covered under open licenses.17 Most FOSS operating systems are based on Linux, are internationalized, and are being localized by different groups (for example, Debian, Red Hat and Ubuntu). Debian is currently being localized into more than 150 languages. OpenOffice.org, a complete suite of document productivity software, is being localized into 70 languages. The Mozilla project distributes the Firefox Web browser and the Thunderbird email client. Many more FOSS initiatives are available online, including software for chat, multimedia, Web development and databases.
Asian Federation of Natural Language Processing (AFNLP)
Academic research forums in linguistics and language processing have long existed in many countries in Asia, but regional discussion of Asian languages has been limited. The Association for Computational Linguistics (ACL) and its European chapter (EACL) have been providing common platforms for the Americas and Europe. A similar platform for Asia was created recently by bringing existing national organizations and conferences under a single regional umbrella, AFNLP. The federation is helping organize language computing research and development across Asia by providing a collaborative platform to share academic research and exchange innovative solutions for Asian languages. AFNLP holds a regular conference, the International Joint Conference on Natural Language Processing (IJCNLP); two such conferences have been held so far.
Language resources and vendor initiatives
Many organizations collect and distribute language resources that are essential for linguistic and computational research and for developing local language computing. The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes text and speech corpora, lexica and additional data for many languages, including Chinese, Arabic, Japanese, Hindi, Vietnamese, Tamil and Korean. The European Language Resource Association (ELRA) distributes similar resources for many Asian languages. Similarly, the Global Wordnet Association is developing lexical-semantic resources for many languages, and the South Asian Language Resource Center (SALRC) at the University of Chicago is developing a repository of lexical resources for South Asian languages. No formal centre for the collection and distribution of the language resources of Asia Pacific has been established; however, discussions on establishing an Asian Language Resource Network, similar to LDC and ELRA, are underway. Another language resource organization is the Summer Institute of Linguistics (SIL), a volunteer organization that has been documenting languages and their speaker populations for more than 50 years (see www.ethnologue.com).
The University of California at Berkeley has started the Script Encoding Initiative, which assists individuals and groups in identifying missing characters, for example from lesser-known languages, and helps them get these characters encoded in the Unicode standard.
Some corporations have also been involved in localization. IBM has developed a large repository of C++ and Java code called International Components for Unicode (ICU), available at http://icu.sourceforge.net/. Microsoft has restructured its localization policy and has started developing local language interfaces, called Language Interface Packs (LIPs), which are currently available for seven Asian languages. These efforts will help develop basic localization at least in the languages that have official status in Asian countries or are otherwise commercially viable (for example, languages spoken by large populations).
There is growing interest in localizing the mobile platform, but the effort has mostly been taken up by the handset manufacturers themselves, for example Nokia, Samsung and Sony. Text messaging based on the Unicode standard is increasingly available on these systems for many Asian languages. However, localization is driven mostly by commercial interests and therefore focuses on languages that promise revenues. Independent developers cannot localize these platforms for other languages because of proprietary platforms and the lack of open standards.18