Monday, October 10, 2016

"Wogbɛ Jɛkɛ" & Ghanaian language input support

Came across mention on Twitter of the Ghanaian play "Wogbɛ Jɛkɛ - A Tale of Two Men" but with the Ga words in the title written "Wogb3 j3k3."
In fact, looking at Twitter and at the web via a Google search, one notes both this workaround and the correct spelling, as well as the ASCIIfied version, "wogbe jeke."

7 vowels and a 5 vowel keyboard

Ga, a Ga-Dangme language of southernmost Ghana, has a complex vowel system, with seven vowels distinguished in its writing system: a, e, i, o, and u, plus ɛ ("open e") and ɔ ("open o"). The latter two are also used to write many other African languages such as Akan, Ewe, Mende, Bambara, and Lingala.1 (These characters, like a number of other Latin letters, are also in the International Phonetic Alphabet.)

Many fonts include ɛ and ɔ, but typing them is not facilitated by standard keyboards. There are keyboard layouts specially conceived for Ga (see below for a list), as well as for Akan, Ewe, and others. However, there apparently are not any keyboards to enable multilingual input - such as an Akan title included in a tweet in English. Or if there are, they are not widely used. Hence the resort to "3" for "ɛ" and ")" (the right parenthesis) for "ɔ."

In African Languages in a Digital Age (p. 61) I outlined several workarounds for text including extended Latin characters not supported in fonts or input systems, a summary that was a revision of something published a decade earlier.2 I had not, however, noted the use of numbers or symbols among the "substitution solutions." Ade Sawyerr, who has worked with Ga input issues, mentions observing these particular substitutions - "3" and ")" - as well as others, such as "rj" for the letter "ŋ" ("eng"), which is also used in Ga.

In any event, the resort in the mid-2010s to 3's and )'s to type words in languages like Ga, Akan, and Ewe that use them is evidence of missing input options on the devices used, or inconvenience of existing options, or perhaps lack of awareness of available keyboard apps on the part of users.
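To make the substitution pattern concrete, here is a minimal Python sketch that maps the "3" and ")" workarounds back to ɛ and ɔ. The rules are my own assumptions for illustration, not an existing tool: a "3" is converted only when it touches a letter, and a ")" only when a letter follows it, since a word-final ")" cannot be told apart from an ordinary closing parenthesis. Strings like "MP3" would still be mangled, so this is a heuristic at best.

```python
import re

# Assumed substitution table, taken from the workarounds described above.
SUBSTITUTIONS = {"3": "ɛ", ")": "ɔ"}

# [^\W\d_] matches any Unicode letter (a word character that is not a digit
# or underscore). A "3" must touch a letter on either side; a ")" must be
# followed by a letter, so ordinary closing parentheses are left alone.
_PATTERN = re.compile(r"(?<=[^\W\d_])3|3(?=[^\W\d_])|\)(?=[^\W\d_])")


def restore_open_vowels(text):
    """Replace look-alike stand-ins with the open vowels they represent."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0)], text)


print(restore_open_vowels("Wogb3 j3k3"))
```

That such a guessing step is needed at all underlines the point: the information is lost at input time, and only proper input support avoids the ambiguity.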

Some keyboard layouts for Ga

Over the last couple of decades, and especially since the availability of keyboard utilities like Keyman and Microsoft Keyboard Layout Creator (MSKLC), there have been many keyboard layouts developed for languages such as those of Ghana that have extended Latin orthographies. A full discussion is beyond this blog post, but generally speaking, keyboards incorporating characters not on the standard computer keyboards work either through changing key assignments (such as "q" is not used in Ga, so "ŋ" is substituted for it) or via a combination or sequence of key strokes. The solution with changed keys seems to be more common on mobile device applications, whereas both approaches are found in keyboard layouts used on computers.
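The two approaches can be sketched in a few lines of Python. This is only an illustration of the logic, not the format used by Keyman or MSKLC; the "q" reassignment comes from the example above, while the ";" prefix for key sequences is a hypothetical choice of mine.

```python
# Approach 1: a key is reassigned outright ("q" is not used in Ga).
REASSIGNED_KEYS = {"q": "ŋ"}

# Approach 2: a two-key sequence produces the extended character.
# The ";" prefix is hypothetical, chosen only for this sketch.
KEY_SEQUENCES = {";e": "ɛ", ";o": "ɔ", ";n": "ŋ"}


def type_with_reassignment(keys):
    """Map each key press through the reassignment table."""
    return "".join(REASSIGNED_KEYS.get(key, key) for key in keys)


def type_with_sequences(keys):
    """Scan the key presses, collapsing known two-key sequences."""
    output = []
    i = 0
    while i < len(keys):
        pair = keys[i:i + 2]
        if pair in KEY_SEQUENCES:
            output.append(KEY_SEQUENCES[pair])
            i += 2
        else:
            output.append(keys[i])
            i += 1
    return "".join(output)


print(type_with_sequences("Wogb;e j;ek;e"))
```

The trade-off is visible even here: reassignment costs one key press per character but takes a key away from other languages, while sequences keep every key available at the price of extra strokes to remember.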

Kasahorow Android keyboards menu selection
A selection of Ga keyboards:
There likely are others for Ga (and the closely related Dangme). There definitely are a number for other languages of Ghana such as Akan (or its varieties, Twi Ashanti, Twi Akuapem, and Fante), Ewe, and Dagaare.

However, more could be done to facilitate multilingual typing, so that one doesn't have to switch keyboards or keep track of key sequences to insert something like Wogbɛ Jɛkɛ in an English tweet, or say a Hausa word with a hooked letter in a text in Akan (hooked letters are not part of the Akan orthography). Could for example an extra line of keys be added to touchscreen keyboards - say on a Ghana English keyboard - with the extra characters needed for Ghanaian languages?

About "Wogbɛ Jɛkɛ"

Wogbɛ jɛkɛ is a Ga term with meanings of "we have come from far" and "our journey is still long." It is used in the title of two plays written by Chief Abdul Moomen Muslim about the historical events, beginning with "Wogbɛ Jɛkɛ: Birth of a Nation," which depicts pre-colonial history of what is now Ghana, and followed by "Wogbɛ Jɛkɛ: The Tale of Two Men," which is centered around the stories of J.B. Danquah and Kwame Nkrumah during Ghana's independence struggle.

1. Some Nigerian languages like Yoruba and Igbo instead use sub-dotted characters - ẹ and ọ - for these vowels.
2. Don Osborn, 2001, "The knotty problem of using African languages for e-mail and internet," Balancing Act News Update, 69.

Friday, September 30, 2016

Internationalizing computer science in Africa

Last year I posted on whether Unicode and internationalization (i18n) is included in any computer science curriculum in Africa. A recent comment to that post by Andre Schappo asking whether there are any organizations in Africa promoting internationalization of university curricula more generally offers another angle to approach this issue.

Part of Unicode charts for Ethiopic/Ge'ez
Andre's question follows a post on his blog about two organizations that promote internationalization of teaching curricula, one in the UK and the other in Australia. Depending on how one defines promotion of internationalization in higher education, one might add many other initiatives and consortia which seek in one way or another to develop and support international or global studies. The degree to which such efforts overlap with or might impact the content of computer science courses is an interesting question. In my limited experience, international/global studies mainly addresses disciplines in other areas (social sciences, humanities, certain applied disciplines). It certainly is worth asking how a program of internationalization at a university would apply to computer science and see how the discussion goes.

However, in the case of Africa - and also Asia - internationalization of the computer science curriculum would seem to follow as much from attention to localization as to international and global perspectives.

In any event, this issue of how Unicode and i18n figure in computer science instruction - worldwide as well as in Africa - is one that is important for technical and language planning reasons as well as for the same reasons that motivate attention to internationalization in higher education generally.

Thursday, September 08, 2016

International Literacy Day: Let them write!

One of the most common objections I have heard from international development colleagues about literacy training in African languages is "What will they read?" While it is true that relatively little is published in some African languages, and next to nothing in others, such a view has problems on several levels. For example, it's easier to learn in one's first language, literacy skills in one language facilitate learning other languages, and there is a cultural cost to always and only associating formal learning with a Europhone second language. But one of the most important responses, in my opinion, and one that I have offered as a primary defense of literacy in first languages of Africa, is that neo-literates* can write - maybe just a little, like a ledger, or maybe a lot, in stories that express and communicate in their own way.

So it is a pleasure to see the theme for this year's International Literacy Day (ILD; 8 September 2016): "Reading the Past, Writing the Future."

Are there examples of newly literate people in Africa writing in African languages? Yes of course. One is the Senegalese organization Associates in Research and Education for Development (ARED), which has actually published writing by its students. I have also heard of literacy students just writing with this new tool. There are certainly many more.

With the association of literacy with goals of "lifelong learning" - per the 2030 Agenda for Sustainable Development - there should be a way to support and encourage neo-literate writing in first languages on a wider and more systematic basis. Not just for fun, though hopefully at least that, but for adding many diverse voices to writing the future.

Additional notes

Two African organizations were recognized this year with the UNESCO Confucius Prize for Literacy (which, along with the King Sejong Literacy Prize, is awarded annually on ILD):
  • the South African Department of Basic Education’s ‘Kha Ri Gude Mass Literacy Campaign’
  • the Direction de l’alphabétisation et des langues nationales in Senegal for its ‘National Education Programme for Illiterate Youth and Adults through ICTs’
Both programs sound interesting. I'd like to know more about how the Senegalese program used its national languages (and which ones) in ICT.

For a very interesting discussion of ILD from Malawi, see Steve Sharra's blog, Afrika Aphukira: Literacy, Language and Power: Thoughts on International Literacy Day 2016

* "A neo literate is an individual who has completed a basic literacy training programme and has demonstrated the ability and willingness to continue to learn on his or her own using the skills and knowledge attained without the direct guidance of a literacy teacher." APPEAL - Training Materials for Continuing Education Personnel (ATLP-CE) - Volume 2: Post-Literacy Programmes (APEID - UNESCO, 1993, 112 p.)

Tuesday, September 06, 2016

VOA Hausa Digital Content Editor

The Voice of America (VOA) is hiring a Digital Content Editor for its Hausa service. Normally I do not post jobs on Beyond Niamey, but rather do so occasionally on the Facebook African languages group. In this case I am making an exception since it seems that the person hired by VOA will be in a position to possibly help the organization finally move its Hausa web content from an ASCIIfied version to the Boko orthography - a topic that has been discussed previously on this blog.

Links to the position announcement are below, but first a quick review of the issue. The Latin-based "Boko" alphabet for Hausa includes several modified letters (technically called "extended characters") that stand for sounds not represented in the alphabet as used in English, French or other European languages. Sometimes called "hooked letters" they include: ɓ ; ɗ ; ƙ ; and in Niger, ƴ - in Nigeria 'y is written for the same sound as the last one. The capital letter forms of the four hooked letters are Ɓ Ɗ Ƙ Ƴ.
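For readers who want to see what these letters are at the encoding level, the following short Python sketch lists their Unicode code points and names and confirms the lowercase-to-capital pairing. It uses only the standard library; the helper name `describe` is mine.

```python
import unicodedata

# The four lowercase hooked letters of the Boko alphabet, as listed above.
HOOKED_LOWER = "ɓɗƙƴ"


def describe(chars):
    """Return (character, code point, Unicode name, capital form) tuples."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch), ch.upper())
        for ch in chars
    ]


for row in describe(HOOKED_LOWER):
    print(*row, sep="\t")
```

Each hooked letter is a distinct character in Unicode, not a "b" or "d" with decoration added, which is why replacing it with the plain letter discards information.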

When VOA and other international radio services - notably BBC, CRI, and RDW - began websites for their respective Hausa services, the Unicode standard, which facilitates display of extended Latin characters and diverse writing systems on the internet, was not in widespread use (RFI added its Hausa service later). Evidently this was the reason for the resort to an ASCIIfied rendering of Hausa text (with b, d, k, and y instead of the hooked characters, which can change meanings): older systems then in use among the audience may not have been able to handle the Unicode-encoded hooked letters.
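What ASCIIfication does to the text can be shown with a small Python sketch. The mapping table is my reconstruction of the b, d, k, y substitution described above, and the function name is hypothetical.

```python
# Assumed ASCIIfication table: each hooked letter becomes its plain look-alike.
ASCIIFY = str.maketrans({
    "ɓ": "b", "ɗ": "d", "ƙ": "k", "ƴ": "y",
    "Ɓ": "B", "Ɗ": "D", "Ƙ": "K", "Ƴ": "Y",
})


def asciify(text):
    """Replace hooked letters with plain ASCII letters (a lossy operation)."""
    return text.translate(ASCIIFY)


# Two spellings that differ only in the hooked letter collapse into one.
print(asciify("ɗa") == asciify("da"))
```

Because the mapping is many-to-one, it cannot be undone automatically: once "ɗ" and "d" are both written "d", only a reader who knows the language can tell which was meant.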

That argument is losing credence, if it is not already meaningless. The number of systems in use old enough not to have Unicode fonts (now the norm, and the earliest of them were already in systems over a decade ago) must be very small. Moreover, all five international radio Hausa sites use UTF-8, an encoding of Unicode, so the hooked letters can be served and displayed.
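As a quick illustration of why UTF-8 removes the technical obstacle, this Python sketch shows that a hooked letter has no ASCII representation but encodes without difficulty in UTF-8. The helper names are mine.

```python
def fits_in_ascii(text):
    """Return True if every character in the text is plain ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def utf8_hex(text):
    """Return the UTF-8 bytes of the text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


print(fits_in_ascii("ɓ"), utf8_hex("ɓ"))
```

A page already served as UTF-8 can therefore carry ɓ, ɗ, ƙ, and ƴ with no change to its encoding; what remains is a matter of input and editorial practice.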

So what is the current state of use of the Boko orthography (with the hooked letters) on the five sites - VOA, BBC, CRI, RDW, and RFI? I used a new way of evaluating them - actually bringing back an old trick - which is to search just the letters on the sites with Google. The best way is to use Google advanced search, or just put a sequence like this in the search window of the usual Google page:

ƙ OR ɓ OR ɗ OR ƴ

This pulls up all pages on the site with at least one of these hooked letters. You can substitute the domain of the site you want to evaluate. My results were: BBC 16 pages; RDW 7 pages; VOA, CRI, and RFI all 0. Not impressive.
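For anyone wanting to repeat the check, here is a small Python sketch that builds such a query. It assumes the search is restricted to one site with Google's `site:` operator; `example.org` is a placeholder domain, not one of the sites evaluated, and the function names are mine.

```python
from urllib.parse import urlencode

# The hooked letters to look for, in the order used in the search above.
HOOKED_LETTERS = ["ƙ", "ɓ", "ɗ", "ƴ"]


def build_query(domain):
    """Build a site-restricted query matching any of the hooked letters."""
    return f"site:{domain} " + " OR ".join(HOOKED_LETTERS)


def search_url(domain):
    """Return a Google search URL with the query percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": build_query(domain)})


print(build_query("example.org"))
print(search_url("example.org"))
```

The result counts a search engine reports are approximate, so numbers obtained this way are best read as a rough indicator rather than an exact page count.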

What's holding them back? Inertia? Lack of a keyboard layout to easily type with the hooked letters? Lack of a spell checker for Hausa in Boko orthography?

In any event, the new Digital Content Editor for the VOA Hausa service would be in a position to make a significant contribution to that service's web content, with secondary effects on other Hausa language websites.

The position has two listings on the site: one for US citizens; and one for non-US citizens. (This sort of dual listing is normal; you see it also sometimes for internal candidates in an agency and for external candidates applying from outside the agency.) The position was announced today, 9/6/16, and closes 9/20/16.

Saturday, September 03, 2016

Facebook, ISOC, and A12n

In his recent visit to Lagos, Nigeria, Facebook founder and CEO Mark Zuckerberg indicated that Facebook will add more African language interfaces. Meanwhile, at the African Peering and Interconnection Forum (AfPIF2016) in Dar es Salaam, Tanzania, the Internet Society (ISOC) released a report entitled "Promoting Content in Africa," which highlights the importance of internet content in African languages for full access by Africans.

These two developments, concerning on the one hand localization of the software for a popular social media platform, and on the other hand the creation of content, highlight the dual aspects of Africanization (A12n) of information and communication technology in/for Africa. As these processes develop, it would be useful to find ways to integrate them as appropriate, and foster collaboration among organizations and individuals involved in either or both. (That was the intent of the African Network for Localisation, ANLoc, albeit with a focus mainly on the software and enabling aspects.)

It is possible, as the ISOC report notes, for content to be developed or translated in a language even when the software on which it is created is not localized in it. And that certainly would be the case for the less widely spoken languages, at least in the near term. However, the availability of software interfaces - whether for social media like Facebook or for production software - in at least the major African languages, would probably help even for the less-spoken ones.

Facebook sign-up in Hausa.
Facebook currently is available in the following African languages (links are to Wikipedia articles): Afrikaans; Arabic; Hausa; Kinyarwanda; Malagasy; Somali; Swahili; and Tamazight

One of the contributors to the ISOC report, Dawit Bekele, who is ISOC's African Bureau Director, was a participant in the PanAfrican Localisation Workshop in Casablanca, June 2005, and the Pan African Research on L10N Workshop & Localization Blitz in Marrakech, February 2007.

Wednesday, August 31, 2016

Missing "macrolanguages" of Africa

Screenshot from VOA's Kinyarwanda/Kirundi site
The Voice of America (VOA) recently had a job opening for "International Broadcaster (Multimedia) (Kirundi/Kinyarwanda)." Kirundi and Kinyarwanda are the mother tongues, national languages, and co-official languages in, respectively, Burundi and Rwanda. And they are mutually intelligible, with only minor differences, such that apparently a fluent speaker of either could work on a program serving speakers of both. But there is no term covering both - unless one counts the hyphenated Rwanda-Rundi - and no language coding category to cover material designed for use across the two.

This is a situation encountered with many languages in Africa, and one for which there is at least one potential solution - the neologism and language coding category "macrolanguage." There are actually some macrolanguages defined in Africa, but these are few, and as I discuss below, kind of accidental. Is it time to systematically identify (and code) macrolanguages in Africa?

What defines a language?

For most of us, the distinction between languages seems pretty straightforward. But beyond the most spoken international languages - those used officially by the United Nations or ones you are likely to see on a school curriculum - the situation is often more complex. Sometimes two or more closely related languages are so similar that their speakers can understand each other, but sometimes variations within one language can make understanding difficult. An earlier posting on this blog looked at the notion of "neighbor languages" in Scandinavia and Africa. A broader consideration of these issues by Columbia University's John McWhorter suggests that we're really all speaking dialects, some of which benefit from written forms, and one might add, status, resources, and policy support. There is some truth to the saying that "A language is a dialect with an army and a navy."

However, the issues of what to call a "language" and where to draw the boundaries between it and another "language" are still of practical importance for communication (standardization, references, ICT use) and planning (government, business, education). There are two broad approaches in linguistics to doing this, corresponding with the splitter/lumper (or joiner) approaches to categorizing:  one focusing more on distinctions, and the other focusing more on commonalities.

Without going too deeply into that discussion, which gets more complicated when accounting for issues of identity, names, written forms, and national boundaries, suffice it to say that in considering African languages, there are many situations where one encounters the splitter/lumper choice.

The major reference of languages in the world, Ethnologue, takes a more splitter approach, which means that speech varieties that are closely related and interintelligible may be classified as separate languages. It is their estimate of the number of languages in Africa (over 2000) that is most commonly cited, but there are other more conservative estimates. A good academic discussion of this issue entitled "How many languages are there in Africa?" was published in 2004 by Jouni Filip Maho (his estimate is under 1500).

What is a "macrolanguage"?

To make the story brief, the term "macrolanguage" is not a term that was used in linguistic description before the inauguration of the ISO 639-3 system for encoding all languages in the late 2000s. Since that system is based on Ethnologue's "splitter" data, a new category was needed to accommodate existing codes in the earlier less comprehensive parts of ISO 639 (1&2) that in many cases were more "lumper" in approach. The term macrolanguage was in effect a "shim," to borrow someone else's term, to fit the two systems together.

There are by my count 14 macrolanguages listed for Africa (names linked to the Ethnologue macrolanguage pages): Akan; Arabic; Dinka; Fulah; Gbaya; Grebo; Kalenjin; Kanuri; Kongo; Kpelle; Malagasy; Mandingo; Oromo; and Swahili. There could be others.
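In data terms, a macrolanguage is simply a code that groups other codes. The Python sketch below shows the idea with a deliberately partial table covering two of the macrolanguages listed above (Akan and Swahili, with their ISO 639-3 member codes); the lookup function is hypothetical. Kinyarwanda (`kin`) and Kirundi (`run`) appear in no group, which is the gap this post is about.

```python
# Partial, illustrative table: ISO 639-3 macrolanguage code -> member codes.
# Only two of the African macrolanguages are included in this sketch.
MACROLANGUAGES = {
    "aka": ["fat", "twi"],  # Akan: Fanti, Twi
    "swa": ["swc", "swh"],  # Swahili: Congo Swahili, Swahili (individual)
}


def macrolanguage_of(code):
    """Return the macrolanguage code containing this language, or None."""
    for macro, members in MACROLANGUAGES.items():
        if code in members:
            return macro
    return None


for code in ("twi", "swh", "kin", "run"):
    print(code, "->", macrolanguage_of(code))
```

Software that wants to serve Twi and Fanti speakers together has a code (`aka`) to hang that on; for Kinyarwanda and Kirundi there is nothing equivalent to look up.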

That brings us back to Kinyarwanda and Kirundi. How is the relationship between them different - more distant - than any of the above established macrolanguages? One difference, as mentioned above, is no common name to make it easy, and another is that they are dominant in different countries - perhaps analogous to the situation of Scandinavian languages?

Another curious situation is that of Mandingo, which includes several western Manding languages, but not Bambara and Jula (Dyula). Even if the latter two were considered too different from the other Manding tongues, they are close enough that one could localize software for the two together. Keep in mind also that the emerging literary standard N'Ko covers all Manding languages (in a different alphabet). Should the Mandingo macrolanguage be extended to include them all?

The four languages of southwestern Uganda - Kiga, Nkore, Nyoro, and Tooro - are close enough to be covered by Runyakitara, a proposed (but not encoded) standard which is being used in various ways, including at least some teaching and a localization of the Google interface. Should these four be considered a macrolanguage under perhaps that same name, thus finally providing a code for localization in Runyakitara?

And there are other examples around the continent that could be discussed.

What good would more macrolanguages do?

The first benefit of identifying more macrolanguages would be in language coding - the very environment in which the term was first used. The language of VOA's website for its Kinyarwanda/Kirundi service is coded as "rw" (Kinyarwanda) since there is no macrolanguage code covering both languages. Likewise, in many cases, the grouping of very close and mutually intelligible languages as a macrolanguage could facilitate localization of software and apps to serve larger populations - and those larger markets could make it more likely that such localization would be pursued and maintained.
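How a page's language gets "coded" can be checked by reading the `lang` attribute on its `<html>` element. The Python sketch below does this with the standard library's `html.parser`; the sample markup is made up for illustration and is not copied from the VOA site.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> start tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(html_text):
    """Return the language code declared on <html>, or None if absent."""
    finder = LangFinder()
    finder.feed(html_text)
    return finder.lang


# Hypothetical markup for a page serving both Kinyarwanda and Kirundi.
sample = '<html lang="rw"><head><title>Example</title></head><body></body></html>'
print(declared_language(sample))
```

The attribute takes a single language tag, so a service aimed at both audiences has to pick "rw" or "rn" (Kirundi); a shared macrolanguage code would give it a third, more accurate option.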

Another benefit would be to complement the tendency in language coding towards seeking more granularity, by recognizing natural groupings of languages (for more on this, see a message to the IETF-languages list last May). In effect, this would provide more balance between splitting and lumping/joining.

In the broader picture, identifying macrolanguages could have benefits for policymaking and program development involving languages within macrolanguage groups, by calling attention to the closely related languages. Especially where foreigners are involved, projects may overlook such relationships and the potential resources they may provide. For example, materials development for education and various communication needs might benefit from tapping efforts and resources in closely related languages.

(Minor edits and image added, 2 Sep. 2016)

Sunday, June 19, 2016

TED talks in African languages?

Of all the TED and TEDx talks - a genre of knowledge sharing that began in the 1980s but went "viral" with the possibilities offered by YouTube - have any been given in any African language? The question is not so easy to answer as I'll get to below, but the process of trying to answer it gives rise to other questions such as: Could a TED talk or a TEDx event be given in one or several African languages?

TED - "Ideas Worth Spreading"

TED, an acronym for Technology, Entertainment, Design, "is a global set of conferences run by the private nonprofit organization, Sapling Foundation." The idea of the conferences is sharing of ideas "usually in the form of short, powerful talks (18 minutes or less)."

The conferences have been held mainly in North America and Europe, with a handful in Asia and Latin America. One, in 2007, was held in Arusha, Tanzania with the theme, "Africa: The Next Chapter." Many, but not all, of the talks in these events become videos featured online.

The talks, which total some "2200+" according to the website, are apparently all given in English. (The program for the 2007 conference in Arusha is not available online to check.) Quite a number of talks are subtitled in other languages, as I'll discuss further on.

TEDx - "x = independently organized event"

TEDx events, of which there are several types, are licensed by TED but organized separately. The number of TEDx events around the world is not stated anywhere I looked, but one list includes 2967 events (number from the line count in my text editor), and a nice interactive map display includes some past events that are not on that list (I randomly checked some in Africa).

The total number of talks at these independent conferences must therefore be staggering. The drop-down list in the sidebar of the TEDx languages page lists 43 languages, of which the only African one is Arabic (to that extent, my first question in the opening paragraph above would be answered in the affirmative). However, given the large number of TEDxs that have been held in many diverse locations around the world, is it possible that there have been presentations in other languages not on that list?

From a rough count of TEDx events in Africa in 2015 on the map mentioned above, there were ~80 events, with well over half in diverse locations in sub-Saharan Africa. Were presentations in places like Kano, Nigeria; Dar es Salaam, Tanzania; and Addis Ababa, Ethiopia all English-only?

Subtitling of TED talks

According to the translation page on the TED site, there has been subtitling of talks in over 100 languages (the actual count on the page is 110, thanks again to copy-paste & line-count, but that number includes some varieties of the same languages, as well as English originals). The African languages among these, with their count of how many talks, include: Afrikaans (19); Amharic (13); Arabic (2091); Arabic, Algerian (9); Hausa (1); Igbo (1); Somali (20); and Swahili (33).
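A quick tally of these figures in Python shows how lopsided the distribution is. The numbers are the ones quoted above; the sums count subtitle entries per language, so a talk subtitled in two languages is counted twice.

```python
# Subtitled-talk counts per African language, as quoted from the TED page.
SUBTITLED = {
    "Afrikaans": 19,
    "Amharic": 13,
    "Arabic": 2091,
    "Arabic, Algerian": 9,
    "Hausa": 1,
    "Igbo": 1,
    "Somali": 20,
    "Swahili": 33,
}

total = sum(SUBTITLED.values())
arabic = sum(n for lang, n in SUBTITLED.items() if lang.startswith("Arabic"))
others = total - arabic

print(f"All listed languages: {total}")
print(f"Arabic varieties:     {arabic}")
print(f"All other languages:  {others}")
```

Set aside the two Arabic entries and fewer than a hundred subtitle sets remain across six languages, which gives a sense of how much room there is for the translation effort to grow.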

The one talk (in English) with Hausa subtitles - embedded below - was given in 2003, with the subtitles evidently added in 2008. It is worth noting that the Boko orthography is used, as you can see with the hooked consonants.

The one talk with Igbo subtitles does not appear to follow the standard orthography - the lack of subdot vowels is one giveaway, but also tone marks are absent. And there are untranslated English terms - the first instance I recall seeing of code-mixing in subtitles. The other language subtitles look polished, though I'm even less in the position to evaluate them.

TEDx talks, as noted above, come in various languages, and apparently some of them have same-language subtitling, although that term is not used (for example several dozen in French).

The translation/subtitling effort itself looks like a successful involvement of volunteer contributions for at least a number of languages.

TED or TEDx in African languages?

There are two ways to achieve more linguistic diversity relevant to Africa in TED talks. The first would be through expanding the translation program mentioned above. This might require some new approaches as the volunteer model may not work as well as in Northern countries. The benefit would be expanding access, particularly with some more widely spoken African languages.

The second would be to organize (more?) TEDx events that either allow presentations in African languages, or that explicitly invite presentations in one or more African language(s). This would seem to be an interesting way to bring in diverse presenters, and to develop recorded content that could be shared locally, nationally, or regionally (depending on the language demographics). Even for those without internet or mobile access to such TEDx recordings, it might be possible in some contexts to distribute video for TV and audio-only for national and community radio. And such content could of course be translated into other languages for wider dissemination.

Ideas for sharing, after all, can come in many languages.