Minh’s Notes

Human-readable chicken scratch

Tuesday, November 26th, 2013

Herding the world’s languages


Wikipedia in 2013
Wikipedia’s front door has changed little in nearly a decade.

The wall of languages at www.wikipedia.org happens to be one of the most frequently accessed series of bits on the Internet. It’s also a monument to multilingualism: a degree in modern languages may help you decipher a tenth of the page, but only after installing an assortment of obscure fonts you’ll never need for any other purpose.

Despite the page’s cognitive complexity, the whole setup is far simpler than any other portal you’ll ever visit, every bit as primitive as the design suggests. The front page of the world’s #6 website is nothing more than a hand-written, static HTML5 document that references one hand-written, dynamically minified stylesheet and one hand-written, dynamically minified JavaScript file, plus AJAX search suggestions. That’s it – no dynamic content, no analytics, no A/B testing, no special logged-in version. Everyone sees exactly the same content. When a language edition gets its thousandth article, it falls to a thankless volunteer administrator at the Wikimedia Meta-Wiki to notice the change and edit the portal manually. (Oh, and the minification was done by hand too until earlier this year.)


www.wikipedia.org is written like any Wikipedia article – almost.
Anyone can edit the portal’s temporary staging area. It’s up to administrators like me to deploy the edits.

The Wikimedia Foundation loves this old-school approach, because it saves a tremendous amount of bandwidth and gives the site a nice homegrown, organic feel, like that other minimalist product of San Francisco, craigslist. But doing the portal this way also has a high maintenance cost, so historically no one maintained it. I got so fed up with nagging administrators that I became one myself in 2006. Over the years, the portal has remained true to its Web 1.0 self. Aside from updates to the language lists and periodic code refactoring, the design has changed little in nearly a decade. On the technical side, support for Internet Explorer 5.5 for Windows was dropped only a few years ago, and IE 6 is still the baseline. Major design changes – say, sorting the top ten languages differently, or creating a new list for million-article-plus wikis – have required endless discussion or that dreaded Wikipedia tradition known as a poll.

Many of us have long wanted a more sophisticated way of allowing the user to select a language, or at least a more attractive one. But fear of the community at large has scuttled every radical departure from the current method of selecting a language edition, such as choosing from a map. It didn’t help that the portal’s purpose was misunderstood by the very people who could help: designers. A Lithuanian design agency made a splash last year with a redesign that, among other things, collapsed the sea of language links into a 16-pixel-tall, rainbow-colored strip along the top, giving access to just 15% of Wikipedia’s language editions (including Lithuanian, thankfully). The point was to maximize the space dedicated to search, supposedly the portal’s main function. I guess the colors were a concession to German Wikipedians who still wanted to know how close they were to beating the English Wikipedia in size.


German redesigned
Wikipedia Redefined: The design firm New proposed emphasizing search by making it harder for roughly two-thirds of Wikipedia’s users to find the wikis in their native languages.

In a perfect world, Wikipedia would know what language everyone prefers to read in and would immediately direct them to a portal in their language, with search right up front. But in a perfect world, we would just direct everyone to the Esperanto Wikipedia. Unfortunately, language selection, not search, must be the portal’s main function. The article counts are just the most obvious and transparent way to arrange the wikis, based on an ancient compromise. Don’t get me wrong: more emphasis on search would be a great idea – on each wiki’s front page. I did just that in a radical redesign of the Vietnamese Wiktionary’s front page a couple years ago.


The Vietnamese Wiktionary places search front and center.
The design of the Vietnamese Wiktionary’s front page emphasizes search. Dynamically rotating examples show the project’s breadth and encourage you to search for words in any language from the same search box. You just have to get to the Vietnamese Wiktionary first, which is why the multilingual portals must be multilingual.

I’ve updated the Wikipedia portal far more than anyone else in the seven years I’ve been an administrator. This fact gives me mixed feelings. On the one hand, it’s a unique role for a Web developer, but on the other, it’s time-consuming and extremely constrained. That role can be described as nothing more than “link herder”. In the past couple years, distractions from other projects and real life (and, I admit, sheer boredom) caused me to ignore the portal entirely. Others in the community continued to keep it updated and make improvements to the code, but deployments did slow a bit.

Recently, though, I was moved to pity for the portal. The famous “top ten” ring of languages around the puzzle ball had gotten a bit warped, probably the result of blind copy-pasting over the years. The grid of sister projects at the bottom had gotten misaligned, too. And the logos were all blurry on high-resolution screens.


The ten largest wikis formed quite an imperfect circle around Wikipedia’s puzzle ball logo.
The portal had some issues while I was gone.

After a little CSS-fu and a lot of patience with the image uploader, the same 2005 layout is now cleaner and a little more responsive. Also, in modern browsers, the search bar now supports 277 languages, up from the original 47, provided you use a localized browser or set your language preferences.
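The language matching behind that localized search bar can be sketched roughly as follows. This is an illustration in Python, not the portal's actual code (which would run as JavaScript in the browser): the `pick_label` function, the fallback rule of trimming subtags from the right, and the three sample labels are all assumptions.

```python
def pick_label(preferred, labels, default="en"):
    """Pick a search label for the first preferred language that has one.

    preferred: language tags in the reader's order of preference,
               e.g. ["vi-VN", "en"] (assumed to come from the browser).
    labels:    {lowercase language tag: localized label}.
    """
    for tag in preferred:
        tag = tag.lower()
        # Try the full tag first, then progressively shorter prefixes,
        # so "vi-VN" falls back to "vi".
        while tag:
            if tag in labels:
                return labels[tag]
            tag = tag.rpartition("-")[0]
    return labels[default]


# Sample labels for illustration only; the real list would be far longer.
SAMPLE_LABELS = {"en": "Search", "vi": "Tìm kiếm", "de": "Suche"}

print(pick_label(["vi-VN", "en"], SAMPLE_LABELS))
print(pick_label(["fr-CA"], SAMPLE_LABELS))
```

Falling back from a regional tag to its base language keeps the table of labels small, at the cost of ignoring regional differences in wording.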

Of course, one thing led to another, and soon I was trying to tackle the very tedium that caused me to drop out of sight for two years. Updating a portal was always a laborious process that included visiting each of the top ten wikis and all the wikis on the cusp of reaching an article count milestone. There was a page that listed all the article counts, but it too was updated only sporadically, the result of yet another manual process.

Earlier this year, the Foundation enabled Lua scripting on all its wikis, including Meta-Wiki. Advanced Wikipedia editors no longer had to write template code, the Turing-incomplete programming language that is to them what wikitext is to article writers. At around the same time, a community member developed a bot that automatically compiles up-to-date article counts every night. Changes like these are huge steps away from the static publishing world Wikipedia has always lived in.

This weekend, I wrote a Lua script that connects the dots, parsing the table of article counts and the portal HTML and identifying things that need to be updated. When there are major issues, like a language that needs to be promoted up to the next “bookshelf”, it displays these issues in a basic dashboard and adds the portal to a category that tracks urgent tasks for administrators. It’s essentially an automated test of the portal.
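The core of such a check can be sketched in a few lines. The sketch below is in Python rather than Lua, purely for illustration, and is not the module's actual code: the bookshelf thresholds, the `tier_for` and `find_promotions` names, and the sample counts are all assumptions.

```python
# Hypothetical bookshelf thresholds, largest first (assumed for this sketch).
TIERS = [1_000_000, 100_000, 10_000, 1_000, 100]


def tier_for(article_count):
    """Return the largest threshold the count has reached, or None."""
    for threshold in TIERS:
        if article_count >= threshold:
            return threshold
    return None


def find_promotions(article_counts, portal_tiers):
    """Find wikis that have outgrown the bookshelf they are listed on.

    article_counts: {language code: current article count}
    portal_tiers:   {language code: threshold of its current bookshelf}
    Returns a list of (language code, listed tier, earned tier),
    sorted by language code.
    """
    issues = []
    for code, count in article_counts.items():
        earned = tier_for(count)
        listed = portal_tiers.get(code)  # None if the wiki is not listed yet
        if earned is not None and (listed is None or earned > listed):
            issues.append((code, listed, earned))
    return sorted(issues, key=lambda issue: issue[0])


# Made-up language codes and counts, not real data.
counts = {"aa": 1_204, "bb": 99_950, "cc": 100_003}
listed = {"aa": 100, "bb": 10_000, "cc": 10_000}
print(find_promotions(counts, listed))
```

Here "aa" and "cc" would be flagged, while "bb" stays put because it is still just short of the next threshold. A fuller version would also need to extract both mappings from the article count table and the portal HTML, which this sketch takes as given.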


The Lua module’s dashboard currently lists several issues that need to be addressed in the portal code.
Looks like I have some work to do.

The next step is to generate the HTML entirely in Lua, but administrators will still be needed to manually deploy each automatically calculated change. I hope these changes will help the other administrators take a more active role in keeping the portal up-to-date and do so without introducing regressions. Someday, though, it’d be great if Wikipedia would be smarter about the first foot it puts forward.
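Generating one bookshelf of links from data might look something like the sketch below. It is written in Python rather than Lua and is only a guess at the shape of the problem: the `render_bookshelf` name, the markup, the URL pattern, and the sort by native name are assumptions, not the portal's real structure.

```python
from html import escape


def render_bookshelf(threshold, wikis):
    """Render one bookshelf as an HTML fragment.

    threshold: article count milestone shared by the wikis on this shelf.
    wikis:     list of (language code, native language name) pairs.
    """
    links = []
    # Sorting by native name is an assumption made for this sketch.
    for code, name in sorted(wikis, key=lambda wiki: wiki[1].casefold()):
        href = f"//{code}.wikipedia.org/"
        links.append(
            f'<li><a href="{escape(href)}" lang="{escape(code)}">'
            f"{escape(name)}</a></li>"
        )
    heading = f"<h2>{threshold:,}+</h2>"
    return "\n".join([heading, "<ul>", *links, "</ul>"])


print(render_bookshelf(1_000, [("vi", "Tiếng Việt"), ("eo", "Esperanto")]))
```

Escaping every interpolated value matters here because the language names would come from an editable wiki page rather than from trusted code.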


The name’s Minh Nguyen, though I style myself Minh Nguyễn, with all the wonderful diacritics. I’m a graduate of St. Columban, St. Xavier, and Stanford, and currently a software developer in the San Francisco Bay Area. Since March 2002, Minh’s Notes has been home to my occasional insights and frequent attempts at humor.