Status of language technology
Many of the basic standards and applications have already been developed for most of the national languages in Asia Pacific. Many of these standards have been reviewed over time and now align with international standards. However, language computing has matured to different levels in these countries. This section summarizes the status of localization of national languages in different countries in Asia Pacific. Languages are grouped into five levels of maturity. The levels are qualitative at best, because a quantitative assessment is difficult: each country faces its own socio-economic, political and linguistic challenges. The comparison is based on the level of work on the national language and on research and development capacity in the areas of script, speech and language processing. A checklist of these applications for many national languages from the region is also provided in Table 1. For more information, see Sonlertlamvanich (2002), Tsujii (2005) and Hussain et al. (2005).
Note: The table lists a comparison for some of the applications. The comparison is qualitative, not quantitative, and is based on the current information available to the authors through the Internet and other sources (for example, Sonlertlamvanich 2002; Tsujii 2005; Hussain et al. 2005). The information has not been independently verified and therefore has some margin of error. Legend (for the year 2006): blank = minimal work; x = initial work started; xx = some work completed; xxx = much work completed.
Highly localized languages
The more developed countries in the region, including China, Japan and Korea, are leading the development and implementation of local language computing. These countries are very active in international standardization efforts and participate in relevant platforms and discussions. Most software is already localized into Mandarin Chinese, Japanese and Korean. Because basic localization and advanced applications, including TTS, ASR, OCR and MT, are already developed and available through the commercial sector, current research and development focuses on cutting-edge technology such as speech-to-speech translation. These countries have active academic bodies collaborating with the commercial sector, backed by governmental policy and support. Some of the organizations involved are Peking University, City University of Hong Kong, Academia Sinica in Taiwan, the National Institute of Information and Communications Technology (NICT) and Advanced Telecommunications Research Institute International (ATR) in Japan, and the Korea Advanced Institute of Science and Technology (KAIST) and Electronics and Telecommunications Research Institute (ETRI) in Korea. The commercial sector is also performing significant research and development, with companies including Sony, NEC, IBM, Nokia, Microsoft, Hewlett-Packard and Systran, among others.
Very localized languages
Thailand and India are also very active in local language computing. The National Electronics and Computer Technology Center (NECTEC) of the National Science and Technology Development Agency (NSTDA), along with Thai industry and academia, is leading the full localization of the Thai language. A Thai OCR system, a text-to-speech system and English-Thai MT are now available. The Thai Language Environment (TLE) project develops and maintains an open source Thai Linux distribution.
India also has a thriving language computing development sector. The Ministry of Science and Technology has created the Technology Development for Indian Languages (TDIL) department, which supports and coordinates active research on Hindi and many other constitutionally recognized languages through research centres at Indian universities and the Centre for Development of Advanced Computing (CDAC). In addition, the IndLinux group localizes Linux distributions in many languages (MIT 2006) and has released the Hindi version. However, commercial-grade applications for end-users are not fully developed and are not in wide use, owing to the complexity of the task and the country's language diversity (currently 22 official languages). Nevertheless, working models of TTS, MT, ASR and OCR are available for a few languages, including Hindi, Tamil and Marathi. Other language resources, including lexica and corpora, are also available. Government focus and a dynamic language policy are providing the right impetus, and India is seeing an emerging localization and language computing industry.
Moderately localized languages
Indonesia, Malaysia, Pakistan, Sri Lanka and Vietnam have fairly active academic research and development programmes and fairly mature standards and basic language applications, with reasonable work in advanced applications.
Research and development in Indonesia is being carried out by both the public and academic sectors. Basic resources and advanced applications are all being developed with advanced prototypes already released. Badan Pengkajian dan Penerapan Teknologi (BPPT) and the University of Indonesia are two organizations actively involved in this process. Most of the work is on Bahasa Indonesia.
Research in Malaysia started in 1987 through the KANTA project by CICC, which developed an MT system for Japanese, Malay, Chinese, Thai and Bahasa Indonesia. Universities, including Universiti Teknologi Malaysia and Universiti Sains Malaysia, are actively involved in research and development.
Localization in Sri Lanka is being led by the University of Colombo School of Computing for Sinhala and Tamil, with support and guidance from the ICT Agency of Sri Lanka. The open source community is also reasonably active through Sri Lanka's Linux User Group (LkLUG), which has made some progress on the development of a Sinhala Linux distribution.
In Vietnam, localization is being led by the Ministry of IT and is also being carried out in some universities. VietKey is an open source office productivity application available in Vietnamese. Work is also underway on advanced applications, such as ASR.
Pakistan has shown a promising focus on language computing (see the boxed case study below).
However, very limited development work is being carried out by the commercial sector in these countries, especially for advanced applications.
Somewhat localized languages
The national languages of countries like Bangladesh, Myanmar and Nepal belong to this category. In these countries there is an emerging realization of the importance of local language computing and focused public policy is starting to develop, integrate and align existing private initiatives. However, there is only limited work on advanced language computing applications.
Countries like Afghanistan, Lao PDR, Cambodia, Mongolia and Bhutan are also starting to develop basic localization standards and applications in their national languages.
Of the approximately 3,500 languages spoken in Asia Pacific, only about 30–40, or roughly 1 percent, are being localized. Small and developing language communities are left out because they have very limited capacity to perform indigenous localization and because commercial incentives are lacking. This problem is especially severe for countries with exceptionally high linguistic diversity, such as Papua New Guinea (820 languages) and Indonesia (737 languages). Localizing these languages will only be possible through long-term policy initiatives and collaborative effort between national, regional and international organizations.