Hello, I’m Minh Nguyen (though I style myself Minh Nguyễn, with all the wonderful diacritics), a graduate of St. Columban School and St. Xavier High School and currently a sophomore at Stanford University. Passing by my dorm room, you might’ve seen me staring at the monitor, the monitor mutually staring back, as I type… click… type… click— blog

October 13, 2014

Among my many roles in the Wikipedia project, I play the part of historian. Not the kind who obsesses over Civil War battles and World War I artillery, building up infoboxes the size of the USS Enterprise. That’s History, uppercase. No, I add historical content to non-history articles – lowercase history. Most articles need lowercase history to provide essential context and flavor. It’s not enough to know how things are; we need to know how things got that way and how we found out about it.

Over the past three months, I more than doubled the prose in “Flag of Ohio”, mostly by elaborating on the circumstances around the flag’s adoption. The resulting text demonstrates the power of lowercase history to link diverse topics together, in this case, the state seal, the flags of Cincinnati and Cuba, and President Garfield. I even drew up a big GIF of the proper way to fold an Ohio flag, because GIF.

Folding the flag of Ohio
The flag of Ohio is officially folded in 17 steps, easier said than done.

Once in a while, there’s even a chance to advance scholarship on a topic. Scouring Google Books led me to long-forgotten accounts of an earlier Ohio flag. (It’s actually pretty boring, just a white rectangle with some details on it. I’m glad it never took off.) My sudden activity on that article attracted the attention of another editor, who gradually ate away at a factoid all my social studies teachers in school had repeated as fact: that Nepal and Ohio were the only country and state, respectively, with non-rectangular flags. In fact, there are plenty of counterexamples, from European naval ensigns to the Qing dynasty’s triangular Yellow Dragon Flag.

In another case from earlier this year, I finally quashed the silly misconception that phở is based on a French soup and even named after it. Apparently no one in the English-speaking world, not even the OED, had bothered to check with scholars fluent in Vietnamese to see whether the historical literature backed up that myth. (For the record, Cantonese speakers had much to do with the name, while the dish evolved from a Vietnamese water buffalo soup called xáo trâu. Eww?)

It’s more difficult for a Wikipedia editor to write about lowercase history than to write about the present, because Wikipedia has a stringent policy requiring “verifiable” sources. It’s easy to find websites, books, and reviews raving about phở and easy for another editor to double-check that source. But as soon as you start writing about lowercase history, you run up against all sorts of barriers: paywalls for year-old news articles, paywalls for decade-old news articles, ditto for century-old news articles that should’ve been out of copyright for generations.

Thankfully, Google (Books, Scholar, News Archive Search), HathiTrust, the Internet Archive, and various national library websites do provide access to a huge number of sources for free, if you happen to be looking for something in the right time period. If you’re looking into local or regional history, subscription databases offer even more. Depending on the state of their budget, your local library may provide access to one or two good subscription databases. If not, there’s The Wikipedia Library, but you have to apply for access.

Still, searching this wealth of sources can be difficult because OCR is nowhere near as good as you’d expect in 2014, and it’s virtually absent from older or foreign-language documents. So sometimes the best sources can only be found with some guesswork: what kind of publication would cover the topic and in what years? What appears to be an original source might turn out to be regurgitated from a decade earlier, in which case the investigation starts anew.

First State Flag
I came across this Enquirer blurb (subscription required) while searching for details on Ohio’s first flag. It nearly had me going, until I saw the date: April 1, 1905, three years after the familiar double-tailed flag was adopted. Does it qualify as an April Fool’s joke if the humor is a bit stale?

There’s also the problem of bias in historical sources. I came across a great deal of vitriol directed at the flags of Ohio and Cincinnati when they were introduced and came away thinking that they were poorly received at first. In fact, it wasn’t so lopsided, but of the subscription databases I had access to, the only one covering that time period was for a highly partisan Democratic newspaper. Both flags were introduced by Republicans. (These days, that paper, The Cincinnati Enquirer, has about as much edge as that former Ohio flag.) For the phở article, too, I had to remain mindful that some French- and Vietnamese-language sources were more interested in claiming the soup for their country than establishing the truth.

Lowercase history is the most inefficient, time-consuming way to expand an article but the most effective way to increase its quality. Very often, it forces you to square competing narratives and question the assumptions that underlie the contemporary description of a topic. It also builds the reader’s trust by increasing the number and variety of sources beyond the low-hanging fruit that anyone could find via Google search.

These days, at the English Wikipedia particularly, it’s easy to feel that all the good topics have been written about. But the truth is that most of those articles still have plenty of room to grow. If you toss out labels like amateur historian, I think you’d find that writing a coherent encyclopedia depends in large part on how many fields of study you can lowercase.

November 26, 2013

Wikipedia in 2013
Wikipedia’s front door has changed little in nearly a decade.

The wall of languages at www.wikipedia.org happens to be one of the most frequently accessed series of bits on the Internet. It’s also a monument to multilingualism: a degree in modern languages may help you decipher a tenth of the page, but only after installing an assortment of obscure fonts you’ll never need for any other purpose.

Despite the page’s cognitive complexity, the whole setup is far simpler than any other portal you’ll ever visit, every bit as primitive as the design suggests. The front page of the world’s #6 website is nothing more than a hand-written, static HTML5 document that references one hand-written, dynamically minified stylesheet and one hand-written, dynamically minified JavaScript file, plus AJAX search suggestions. That’s it – no dynamic content, no analytics, no A/B testing, no special logged-in version. Everyone sees exactly the same content. When a language edition gets its thousandth article, it falls to a thankless volunteer administrator at the Wikimedia Meta-Wiki to notice the change and edit the portal manually. (Oh, and the minification was done by hand too until earlier this year.)

www.wikipedia.org is written like any Wikipedia article – almost.
Anyone can edit the portal’s temporary staging area. It’s up to administrators like me to deploy the edits.

The Wikimedia Foundation loves this old-school approach, because it saves a tremendous amount of bandwidth and gives the site a nice homegrown, organic feel to it, like that other minimalist product of San Francisco, craigslist. But doing the portal this way also has a high maintenance cost, so historically no one maintained it. I got so fed up with nagging administrators that I became one myself in 2006. Over the years, the portal has remained true to its Web 1.0 self. Aside from updates to the language lists and periodic code refactoring, the design has changed little in nearly a decade. On the technical side, support for Internet Explorer 5.5 for Windows was dropped only a few years ago, and IE 6 is still the baseline. Major design changes – say, sorting the top ten languages differently, or creating a new list for million-article-plus wikis – have required endless discussion or that dreaded Wikipedia tradition known as a poll.

Many of us have long wanted a more sophisticated way of allowing the user to select a language, or at least a more attractive one. But fear of the community at large has scuttled every radical departure from the current method of selecting a language edition, ideas like choosing from a map. It didn’t help that the portal’s purpose was misunderstood among the very people who could help, designers. A Lithuanian design agency made a splash last year with a redesign that, among other things, collapsed the sea of language links into a 16-pixel-tall, rainbow-colored strip along the top for access to just 15% of Wikipedia’s language editions (including Lithuanian, thankfully). The point was to maximize the space dedicated to search, supposedly the portal’s main function. I guess the colors were a concession to German Wikipedians who still wanted to know how close they were to beating the English Wikipedia in size.

German redesigned
Wikipedia Redefined: The design firm New proposed emphasizing search by making it harder for roughly two-thirds of Wikipedia’s users to find the wikis in their native languages.

In a perfect world, Wikipedia would know what language everyone prefers to read in and would immediately direct them to a portal in their language, with search right up front. But in a perfect world, we would just direct everyone to the Esperanto Wikipedia. Unfortunately, language selection, not search, must be the portal’s main function. The article counts are just the most obvious and transparent way to arrange the wikis, based on an ancient compromise. Don’t get me wrong: more emphasis on search would be a great idea – on each wiki’s front page. I did just that in a radical redesign of the Vietnamese Wiktionary’s front page a couple years ago.

The Vietnamese Wiktionary places search front and center.
The design of the Vietnamese Wiktionary’s front page emphasizes search. Dynamically rotating examples show the project’s breadth and encourage you to search for words in any language from the same search box. You just have to get to the Vietnamese Wiktionary first, which is why the multilingual portals must be multilingual.

I’ve updated the Wikipedia portal far more than anyone else in the seven years I’ve been an administrator. This fact gives me mixed feelings. On the one hand, it’s a unique role for a Web developer, but on the other, it’s time-consuming and extremely constrained. That role can be described as nothing more than “link herder”. In the past couple years, distractions from other projects and real life (and, I admit, sheer boredom) caused me to ignore the portal entirely. Others in the community continued to keep it updated and make improvements to the code, but deployments did slow a bit.

Recently, though, I was moved to pity for the portal. The famous “top ten” ring of languages around the puzzle ball had gotten a bit warped, probably the result of blind copy-pasting over the years. The grid of sister projects at the bottom had gotten misaligned, too. And the logos were all blurry on high-resolution screens.

The ten largest wikis formed quite an imperfect circle around Wikipedia’s puzzle ball logo.
The portal had some issues while I was gone.

After a little CSS-fu and a lot of patience with the image uploader, the same 2005 layout is now cleaner and a little more responsive. Also, in modern browsers, the search bar now supports 277 languages, up from the original 47, provided you use a localized browser or set your language preferences.
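For the curious, matching a visitor's language preferences against a list of supported languages can be sketched roughly as follows. This is a minimal illustration in Python, not the portal's actual JavaScript; the function name `pick_language`, the fallback to English, and the sample language codes are all assumptions.

```python
def pick_language(preferences, supported, default="en"):
    """Return the first preferred language that is supported.

    preferences: language tags in the visitor's order of preference,
                 such as ["vi-VN", "fr"].
    supported:   set of lowercase language codes that have a localization.
    default:     hypothetical fallback when nothing matches.
    """
    for tag in preferences:
        tag = tag.lower()
        if tag in supported:
            return tag
        # Fall back from a regional tag ("vi-vn") to its base language ("vi").
        base = tag.split("-")[0]
        if base in supported:
            return base
    return default


if __name__ == "__main__":
    print(pick_language(["vi-VN", "fr"], {"en", "fr", "vi"}))
```

The base-language fallback is what lets a browser set to a regional variant still get a localized search bar.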

Of course, one thing led to another, and soon I was trying to tackle the very tedium that caused me to drop out of sight for two years. Updating the portal was always a laborious process that included visiting each of the top ten wikis and all the wikis on the cusp of reaching an article count milestone. There was a page that listed all the article counts, but it too was updated only sporadically, the result of yet another manual process.

Earlier this year, the Foundation enabled Lua scripting on all its wikis, including Meta-Wiki. Advanced Wikipedia editors no longer had to write template code, the Turing-incomplete programming language to article writers’ wikitext. At around the same time, a community member developed a bot that automatically compiles up-to-date article counts every night. Changes like these are huge steps away from the static publishing world Wikipedia has always lived in.

This weekend, I wrote a Lua script that connects the dots, parsing the table of article counts and the portal HTML and identifying things that need to be updated. When there are major issues, like a language that needs to be promoted up to the next “bookshelf”, it displays these issues in a basic dashboard and adds the portal to a category that tracks urgent tasks for administrators. It’s essentially an automated test of the portal.
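The core of a check like this can be sketched in a few lines. The version below is in Python rather than Lua, and it is only an illustration: the tier thresholds, the names `tier_for` and `find_promotions`, and the sample numbers are assumptions, and the real module reads its data from wiki pages instead of dictionaries.

```python
# Hypothetical "bookshelf" thresholds (minimum article counts), largest first.
TIERS = [1_000_000, 100_000, 10_000, 1_000, 100]


def tier_for(count):
    """Return the largest threshold that `count` meets, or 0 if none."""
    for threshold in TIERS:
        if count >= threshold:
            return threshold
    return 0


def find_promotions(article_counts, portal_tiers):
    """Compare fresh counts against the tiers the portal currently shows.

    article_counts: {language code: current article count}
    portal_tiers:   {language code: threshold the portal lists it under}
    Returns a sorted list of (language, listed tier, earned tier) tuples
    for every language that belongs on a higher shelf than it is on.
    """
    issues = []
    for lang, count in article_counts.items():
        listed = portal_tiers.get(lang, 0)
        earned = tier_for(count)
        if earned > listed:
            issues.append((lang, listed, earned))
    return sorted(issues)


if __name__ == "__main__":
    # Made-up sample numbers, for illustration only.
    counts = {"aa": 1_050_000, "bb": 190_000, "cc": 950}
    listed = {"aa": 100_000, "bb": 100_000, "cc": 100}
    for lang, old, new in find_promotions(counts, listed):
        print(f"{lang}: move from {old:,}+ to {new:,}+")
```

Anything the function returns is a candidate for the dashboard; an empty list means the portal and the counts agree.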

The Lua module’s dashboard currently lists several issues that need to be addressed in the portal code.
Looks like I have some work to do.

The next step is to generate the HTML entirely in Lua, but administrators will still be needed to manually deploy each automatically calculated change. I hope these changes will help the other administrators take a more active role in keeping the portal up-to-date and do so without introducing regressions. Someday, though, it’d be great if Wikipedia would be smarter about the first foot it puts forward.

November 5, 2013

AVIM has been in maintenance mode (read: an afterthought) for a couple years now, so I’m happy to announce that the code base is now hosted on GitHub instead of this website’s Subversion repository. Now input method geeks can easily tinker with Hiếu Đặng’s well-regarded (if opaque) input method engine, as well as my pioneering work to turn it into an application-wide service and support custom WYSIWYG editors.

The old Subversion repository isn’t going away, but new releases will be built from the Git repository going forward. With luck, there may even be new releases. Someday.

October 28, 2013

While the Wikimedia Foundation’s flagship project, the English Wikipedia, resists attempts to modernize the editing experience, the Vietnamese-language projects are moving full steam ahead.

Back in August, I pushed the Foundation to install VisualEditor at the Vietnamese Wikipedia ahead of schedule, giving us extra preparation time. Since then, we’ve translated the tool, written help pages, documented templates for use in VisualEditor, and addressed incompatibilities with Vietnamese input method editors. Given the positive reaction so far, I’m confident that it’ll be much more welcome here than at the English Wikipedia when the Foundation is finally ready to roll it out by default.

Wikipedia is the easy part. By contrast, Wiktionary relies on a painfully obfuscated syntax based on Wikipedia’s wikitext, but with a heavier reliance on templates. This syntax evolved in response to a fundamental technical limitation: whereas most wikis host unstructured prose, a dictionary like Wiktionary needs to hold structured data. So a user who has conquered Simonite’s example Wikipedia sentence will find themselves once again confounded by the English Wiktionary entry on “technology”:


From {{etyl|grc|en}} {{term|τεχνολογία|lang=grc|tr=tekhnologia||systematic treatment (of grammar)}}, from {{term|τέχνη|tr=tekhne|lang=grc||art}} + {{term|-λογία|lang=grc}}.

* {{a|RP}} {{IPA|/tɛkˈnɒlədʒi/}}, {{X-SAMPA|/tEk"nQl@dZi/}}
* {{a|GenAm}} {{IPA|/tɛkˈnɑlədʒi/}}, {{X-SAMPA|/tEk"nAl@dZi/}}


# {{context|uncountable|lang=en}} The organization of knowledge for practical purposes.

At least language purists can take heart that English won’t be so easily perverted. And yet, this is the stand the English Wiktionary took in favor of learnability and against the even more obfuscated system that Wiktionary’s other language editions adopted years ago. Witness the Vietnamese Wiktionary’s corresponding entry:

* [[Wiktionary:IPA|IPA]]: {{IPA|/tɛk.ˈnɒː.lə.dʒi/}} {{term|Anh}}, {{IPA|/tɛk.ˈnɑː.lə.dʒi/}} {{term|Mỹ}}

| lang = grc | term = τεχνολογία | rom = tekhnologia | meaning = ngữ pháp đầy đủ | from = {{etym-from
 | term = τέχνη | rom = tekhne | meaning = nghệ thuật
 | 2 term = -λογία

# [[kỹ thuật|Kỹ thuật]]; kỹ thuật [[học]].

Clearly, this syntax was a mistake. We adopted it on promises that machine-readability would encourage developers to support our wiki, but no one ever did. For years, even experienced Vietnamese Wikipedia editors have shied away from contributing to Wiktionary because of it. On the other hand, it allows us to keep an up-to-the-minute breakdown of entries by language, which is pretty handy.

To its credit, the English Wiktionary community does recognize the need to reduce complexity, so users wishing to start a new entry are offered a choice between two guided entry creators. The simpler option (login required) starts with some boilerplate wikitext and provides long-winded instructions for modifying it. It’s a serviceable, if-you-say-so experience for beginners. The more powerful option (login required) expects you to input the word’s ISO 639 language code, which is a nonstarter for ordinary folks. The English Wiktionary also provides a nifty tool for adding translations to an existing entry – provided it already contains at least one translation.

But I’m not convinced that the English Wiktionary is doing enough to make the site accessible to those who speak English, not wikitext, as a first language. Like Urban Dictionary, Wiktionary relies much more on casual contributors than Wikipedia. Consequently, the ideal form for creating a minimal entry would require no more than a single single-line textbox. Any more complexity and the casual contributor is much more likely to give up. Why spend ten minutes just to add a single sentence to the wiki?

The Vietnamese Wiktionary has a lot going against it, but here too we’ve made great improvements in the past year. Earlier this year, we simplified some of our most complex templates. Generating an IPA pronunciation guide for a Vietnamese word went from {{IPA|/{{VieIPA|đ|ơ|n}} {{VieIPA|g|i|ả|n}}/}} to simply {{vie-pron}}. This weekend, we turned on a brand-new entry creation tool, one that assumes no wiki expertise.

Creating a new entry at the Vietnamese Wiktionary.

The new tool walks you through the process of writing an entry, presenting the following steps, one at a time:

  1. Choose a language from the dropdown menu.
  2. Choose a part of speech from the dropdown menu.
  3. Enter a definition into the single-line textbox. As you type, an additional single-line textbox appears for another definition, if applicable. Same for synonyms and translations (for Vietnamese entries only).
  4. Click Continue. Each word in your definitions is automatically linked. (The tool checks for compound words that have Wiktionary entries.) The generated wikitext appears, in case you want to tweak anything.
  5. Click Save and be on your merry way.
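
The automatic linking in step 4 can be approximated with a greedy longest-match pass over the words of a definition. The sketch below is a guess at the general technique in Python, not the tool's actual code; the function name `link_definition`, the `max_words` cap, and the set of known entries are hypothetical, and it ignores punctuation and capitalization.

```python
def link_definition(definition, known_entries, max_words=4):
    """Wrap each word in [[...]], preferring the longest known compound.

    known_entries is a set of existing entry titles; max_words caps how
    many consecutive words are tried together as one compound.
    """
    words = definition.split()
    linked = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, shrinking down to two words.
        for size in range(min(max_words, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + size])
            if candidate in known_entries:
                linked.append(f"[[{candidate}]]")
                i += size
                break
        else:
            # No compound matched, so link the single word on its own.
            linked.append(f"[[{words[i]}]]")
            i += 1
    return " ".join(linked)


if __name__ == "__main__":
    print(link_definition("kỹ thuật học", {"kỹ thuật"}))
```

Checking compounds before single words matters for Vietnamese, where a multi-syllable word is written with spaces between its syllables.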

Give it a try (login required). The link goes to your personal sandbox, so no knowledge of Vietnamese is required.

Notice how each step comes with few or no instructions. That’s by design: most people either don’t bother reading instructions or get so bogged down in instructions that they quit. We learned this lesson last December, when we stripped most of the instructions and scary admonitions from the Vietnamese Wikipedia’s editing page, shortening the page by 30%, and the sky didn’t fall:

Creating a new article at the Vietnamese Wikipedia, before (left) and after (right).

Unlike the English Wiktionary’s tools, the Vietnamese Wiktionary’s new entry creation tool appears automatically when you happen upon a nonexistent entry. You can’t miss it.

It’s too soon to tell whether the new tool will attract more contributors. I’m hopeful, because creating entries is finally something everyone can get right, quickly. The new entries will also require less cleanup, thanks to a lack of boilerplate and the tool’s automatic linking features.

Of course, creating entries is just the beginning. We still need better tools for editing entries, which are still written in a horribly complex syntax. The first step, which I turned on by default just moments ago, is “ToT”, a dynamically updated table of contents beside the edit box:

Editing “lavar” at the Vietnamese Wiktionary, with ToT on the right. Clicking a heading in the sidebar selects the code that produces that heading. Try it out for yourself.

We can’t easily change the syntax, but we can give you instant feedback on your edits and help you navigate entries with less effort.
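
The heart of a sidebar like this is finding each heading in the edit box along with its character offsets, so that a click can select the matching code. Here is a minimal sketch in Python, assuming plain `== Heading ==` wikitext; the name `outline` is made up, and entries that produce their headings through templates would need an additional pattern.

```python
import re

# A wikitext heading is a line like "== Title ==" wrapped in 1 to 6 "=" signs.
HEADING = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def outline(wikitext):
    """Return (level, title, start, end) for each heading line.

    start and end are character offsets into `wikitext`, which is what an
    editor needs in order to select the heading's source.
    """
    return [
        (len(m.group(1)), m.group(2), m.start(), m.end())
        for m in HEADING.finditer(wikitext)
    ]


if __name__ == "__main__":
    text = "==A==\ntext\n===B===\n"
    for level, title, start, end in outline(text):
        print(level, title, repr(text[start:end]))
```

Rerunning the function whenever the text changes is the simplest way to keep the outline current; a real editor would likely throttle that work.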

ToT is based on a table of contents feature that the Foundation had originally intended to turn on for all wikis. They later backed off, no doubt due to pressure from the community.

I hope to bring some of these improvements over to the English Wiktionary once we get good data on their effectiveness at the Vietnamese Wiktionary. In the meantime, I can’t wait to see what the newbies come up with.

October 27, 2013

In the current issue of MIT Technology Review, Tom Simonite sounds the alarm over Wikipedia’s “decline”. The story of Wikipedia’s decline and/or demise has been published ad nauseam since the project’s inception, only for Wikipedia to grow in popularity still. However, Simonite doesn’t argue that Wikipedia’s days are numbered – in fact, quite the opposite. Instead, he claims that its potential for improvement in breadth and quality is being undermined by its core of experienced editors. It isn’t only red tape and “policy creep” that stunts the project, but also, more importantly, an unwillingness to make the site’s editing tools more accessible to the general public:

But in the topsy-turvy world of the encyclopedia anyone can edit, it’s not a fringe opinion that making editing easier is a waste of time. The characteristics of a dedicated volunteer editor—[Sue] Gardner lists “fussy,” “persnickety,” and “intellectually self-confident”—are not those that urge the acceptance of changes like Visual Editor [sic].

After the foundation made Visual Editor the default way to edit entries, Wikipedians rebelled and complained of bugs in the software. In September, a Request for Comment, a survey of the community, concluded that the new interface should be hidden by default. The foundation initially refused, but in September a community–elected administrator released a modification to Wikipedia’s code to hide Visual Editor. The foundation gave in. It made Visual Editor opt-in rather than opt-out—meaning that the flagship project to help newcomers is in fact invisible to newcomers, unless they dig through account settings to switch the new interface on.

As a hard-core Wikipedia editor, I live and breathe the “wikitext” syntax from which VisualEditor is supposed to shield new editors. So I can sympathize with editors who fear that novices, newly armed with VisualEditor, will run roughshod over the markup we painstakingly tweak. But as an administrator with responsibility for some of the Wikimedia Foundation’s other wikis, I understand the need to lower Wikipedia’s barriers to entry. There’s nothing inherent about encyclopedia writing that should require above-average computing skills.

It’s frankly no fun to write documentation on wikitext or the many templates that a typical Wikipedia article requires. Normal sites have a getting started guide that looks something like:

  1. Register – it’s free!
  2. Type something into a big box and hit save.
  3. Profit (socially)!

By contrast, prospective Wikipedians must learn the basics of an ad-hoc computer language that verges on Turing completeness, navigate confusing, long-winded notability guidelines – where’s the decision tree? – and potentially run through a gauntlet of social norms.

In time, documentation can be improved. But the bigger issue to me is that the community is unwilling to accept VisualEditor, which represents a huge step forward in usability. The Wikimedia Foundation has been quite transparent and accommodating in the tool’s development and roll-out. But perhaps part of the problem is that VisualEditor was developed the way software is usually developed: on a deadline. So there are bugs, certainly, but why throw out a solid version 1 in search of something perfect?

The paradigm of WYSIWYG editing is sound. It’s a basic computing skill, nowadays taught in primary school in place of cursive writing. Good people with good intentions, good ideas, and good writing skills may nonetheless be unable to grok wikitext. Why exclude them?

The new interface is attractive and totally optional. Experienced editors will always be able to edit wikitext. The controversy is only about making VisualEditor the default editor going forward. Wikipedia won’t grow its contributor base through appeals to civic duty alone; it has to show the world that editing is easy to get right, quickly. Relegating VisualEditor to a preferences page – itself in need of a revamp – creates unnecessary hoops around the site’s single most important function.

Wikipedia has plenty of time to improve its back-office procedures, but I fear this is the last chance to modernize editing. VisualEditor has been quite possibly the Wikimedia Foundation’s largest undertaking, and it’s far and away better than any community-developed wiki editor to date. But after seeing it acrimoniously dispatched by the community, will the Foundation’s donors ever again support an attempt at making Wikipedia usable?

Onerous barriers to entry contradict Wikipedia’s ethos of accessibility. Without accessibility, Wikipedia is just another website.



This weblog is licensed under a Creative Commons License.
