Human-readable chicken scratch 801 times since March 2002
Hello, I’m Minh Nguyen (though I style myself Minh Nguyễn, with all the wonderful diacritics), a graduate of St. Columban School and St. Xavier High School and currently a sophomore at Stanford University. Passing by my dorm room, you might’ve seen me staring at the monitor, the monitor mutually staring back, as I type… click… type… click— blog…
Years ago, I started to collect government-issued road maps and atlases, procuring them for free mostly by stopping by roadside welcome centers and signing their guestbooks. Cartotourism requires a bit of tact: you don’t just waltz in and demand a government handout; you have to feign profound interest in the captivating state you’ve just entered. (Under the “Purpose of Visit” column: “Just passing through.”)
Admittedly it’s a bit perverse that I would care so much about the free map amid all the signs proudly advertising free coffee, Coke, or orange juice. But evidently I’m not alone. For me, the maps proved useful during road trips, even after a GPS device displaced the family radar detector. After hours of counting cows and spotting barn ads along the most remote stretches of I-65, even the highway department could somehow keep me entertained. Something about the way they managed to cram so many names and symbols onto one large sheet of paper.
This summer, the collection grew to 58 specimens issued by 27 states, four national parks, four counties, plus Ontario and the former Metro Toronto. Some are nearly 30 years old and have the tears to prove it. The collection sports two official bike maps, a beautiful “agritourism” map, and a completely bilingual map. (Ontario’s is half in French; Louisiana’s wishes it were.) Naturally, the two maps of Texas are by far the largest in my possession. Over the years, I’ve also lost a few maps, including one that proclaimed, “There’s More Than Corn in Indiana!” Indeed: I picked it up at a rest stop nestled amid soybean fields.
Occasionally, I try to do something more interesting with the collection than keep it in a burgeoning shoebox. This time, I made it into a single U.S. map, fashioning states out of the maps they issued. It’s a map made of maps:
You’ll notice that the arrangement is rather uneven. My collection is heavily skewed towards the Southeast, mostly because I traversed it almost annually during my childhood, but also because the West and New England are quite stingy when it comes to maps. I must’ve discarded California’s map; it was just a page in a travel guidebook. And the only “welcome center” I could find in Rhode Island was a Mobil station selling Mobil maps.
Perhaps a more interesting project would be to spread out all these maps and stitch together a mosaic of the U.S. It’ll have to wait until I can find enough floor space to unfurl Texas.
Today is my last day at Apple. (It’s a fruit company – heard of it?) That mostly means no more product giveaways to this blog’s most insightful commenters. In a little over three years, no one ever qualified, sorry. It also means the Xcode team has one fewer engineer to help sort through fan mail. Apparently they’re called “bug reports” outside Cupertino, which explains the… expressivity I’d see sometimes. I have a lot to get used to.
As for where I’m going, that’ll be the topic of a later note, following the same protocol whereby your bank sends you your PIN in one envelope followed by an explanation of that PIN in another envelope after you’ve misplaced the first. All I can say is it has little to do with the startup idea I had back in 2009.
Among my many roles in the Wikipedia project, I play the part of historian. Not the kind who obsesses over Civil War battles and World War I artillery, building up infoboxes the size of the USS Enterprise. That’s History, uppercase. No, I add historical content to non-history articles – lowercase history. Most articles need lowercase history to provide essential context and flavor. It’s not enough to know how things are; we need to know how things got that way and how we found out about it.
Once in a while, there’s even a chance to advance scholarship on a topic. Scouring Google Books led me to long-forgotten accounts of an earlier Ohio flag. (It’s actually pretty boring, just a white rectangle with some details on it. I’m glad it never took off.) My sudden activity on that article attracted the attention of another editor, who gradually ate away at a factoid all my social studies teachers in school had repeated as fact: that Nepal and Ohio were the only country and state, respectively, with non-rectangular flags. In fact, there are plenty of counterexamples, from European naval ensigns to the Qing dynasty’s triangular Yellow Dragon Flag.
In another case from earlier this year, I finally quashed the silly misconception that phở is based on a French soup and even named after it. Apparently no one in the English-speaking world, not even the OED, had bothered to check with scholars fluent in Vietnamese to see whether the historical literature backed up that myth. (For the record, Cantonese speakers had much to do with the name, while the dish evolved from a Vietnamese water buffalo soup called xáo trâu. Eww?)
It’s more difficult for a Wikipedia editor to write about lowercase history than to write about the present, because Wikipedia has a stringent policy requiring “verifiable” sources. It’s easy to find websites, books, and reviews raving about phở and easy for another editor to double-check that source. But as soon as you start writing about lowercase history, you run up against all sorts of barriers: paywalls for year-old news articles, paywalls for decade-old news articles, ditto for century-old news articles that should’ve been out of copyright for generations.
Thankfully, Google (Books, Scholar, News Archive Search), HathiTrust, the Internet Archive, and various national library websites do provide access to a huge number of sources for free, if you happen to be looking for something in the right time period. If you’re looking into local or regional history, subscription databases offer even more. Depending on the state of their budget, your local library may provide access to one or two good subscription databases. If not, there’s The Wikipedia Library, but you have to apply for access.
Still, searching this wealth of sources can be difficult because OCR is nowhere near as good as you’d expect in 2014, and it’s virtually absent from older or foreign-language documents. So sometimes the best sources can only be found with some guesswork: what kind of publication would cover the topic and in what years? What appears to be an original source might turn out to be regurgitated from a decade earlier, in which case the investigation starts anew.
There’s also the problem of bias in historical sources. I came across a great deal of vitriol directed at the flags of Ohio and Cincinnati when they were introduced and came away thinking that they were poorly received at first. In fact, it wasn’t so lopsided, but of the subscription databases I had access to, the only one covering that time period was for a highly partisan Democratic newspaper. Both flags were introduced by Republicans. (These days, that paper, The Cincinnati Enquirer, has about as much edge as that former Ohio flag.) For the phở article, too, I had to remain mindful that some French- and Vietnamese-language sources were more interested in claiming the soup for their country than establishing the truth.
Lowercase history is the most inefficient, time-consuming way to expand an article but the most effective way to increase its quality. Very often, it forces you to square competing narratives and question the assumptions that underlie the contemporary description of a topic. It also builds the reader’s trust by increasing the number and variety of sources beyond the low-hanging fruit that anyone could find via Google search.
These days, at the English Wikipedia particularly, it’s easy to feel that all the good topics have been written about. But the truth is that most of those articles still have plenty of room to grow. If you toss out labels like amateur historian, I think you’d find that writing a coherent encyclopedia depends in large part on how many fields of study you can lowercase.
The wall of languages at www.wikipedia.org happens to be one of the most frequently accessed series of bits on the Internet. It’s also a monument to multilingualism: a degree in modern languages may help you decipher a tenth of the page, but only after installing an assortment of obscure fonts you’ll never need for any other purpose.
The Wikimedia Foundation loves this old-school approach, because it saves a tremendous amount of bandwidth and gives the site a nice homegrown, organic feel, like that other minimalist product of San Francisco, craigslist. But doing the portal this way also has a high maintenance cost, so historically no one maintained it. I got so fed up with nagging administrators that I became one myself in 2006. Over the years, the portal has remained true to its Web 1.0 self. Aside from updates to the language lists and periodic code refactoring, the design has changed little in nearly a decade. On the technical side, support for Internet Explorer 5.5 for Windows was dropped only a few years ago, and IE 6 is still the baseline. Major design changes – say, sorting the top ten languages differently, or creating a new list for million-article-plus wikis – have required endless discussion or that dreaded Wikipedia tradition known as a poll.
Many of us have long wanted a more sophisticated way of allowing the user to select a language, or at least a more attractive one. But fear of the community at large has scuttled every radical departure from the current method of selecting a language edition, ideas like choosing from a map. It didn’t help that the portal’s purpose was misunderstood among the very people who could help, designers. A Lithuanian design agency made a splash last year with a redesign that, among other things, collapsed the sea of language links into a 16-pixel-tall, rainbow-colored strip along the top for access to just 15% of Wikipedia’s language editions (including Lithuanian, thankfully). The point was to maximize the space dedicated to search, supposedly the portal’s main function. I guess the colors were a concession to German Wikipedians who still wanted to know how close they were to beating the English Wikipedia in size.
In a perfect world, Wikipedia would know what language everyone prefers to read in and would immediately direct them to a portal in their language, with search right up front. But in a perfect world, we would just direct everyone to the Esperanto Wikipedia. Unfortunately, language selection, not search, must be the portal’s main function. The article counts are just the most obvious and transparent way to arrange the wikis, based on an ancient compromise. Don’t get me wrong: more emphasis on search would be a great idea – on each wiki’s front page. I did just that in a radical redesign of the Vietnamese Wiktionary’s front page a couple years ago.
I’ve updated the Wikipedia portal far more than anyone else in the seven years I’ve been an administrator. This fact gives me mixed feelings. On the one hand, it’s a unique role for a Web developer, but on the other, it’s time-consuming and extremely constrained. That role can be described as nothing more than “link herder”. In the past couple years, distractions from other projects and real life (and, I admit, sheer boredom) caused me to ignore the portal entirely. Others in the community continued to keep it updated and make improvements to the code, but deployments did slow a bit.
Recently, though, I was moved to pity for the portal. The famous “top ten” ring of languages around the puzzle ball had gotten a bit warped, probably the result of blind copy-pasting over the years. The grid of sister projects at the bottom had gotten misaligned, too. And the logos were all blurry on high-resolution screens.
After a little CSS-fu and a lot of patience with the image uploader, the same 2005 layout is now cleaner and a little more responsive. Also, in modern browsers, the search bar now supports 277 languages, up from the original 47, provided you use a localized browser or set your language preferences.
Of course, one thing led to another, and soon I was trying to tackle the very tedium that caused me to drop out of sight for two years. Updating the portal was always a laborious process that included visiting each of the top ten wikis and all the wikis on the cusp of reaching an article count milestone. There was a page that listed all the article counts, but it too was updated only sporadically, the result of yet another manual process.
This weekend, I wrote a Lua script that connects the dots, parsing the table of article counts and the portal HTML and identifying things that need to be updated. When there are major issues, like a language that needs to be promoted up to the next “bookshelf”, it displays these issues in a basic dashboard and adds the portal to a category that tracks urgent tasks for administrators. It’s essentially an automated test of the portal.
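For the curious, the core of that kind of check can be sketched in a few lines. This is not the actual Lua script, just a rough Python illustration of the idea: compare each wiki’s current article count against the “bookshelf” it is listed under and flag any mismatch. The threshold values, the function names `tier_for` and `find_stale_entries`, and the placeholder language codes are all assumptions made up for the example; parsing the real table and portal HTML is left out.

```python
# Hypothetical "bookshelf" thresholds, largest first (assumed for illustration).
TIERS = [1_000_000, 100_000, 10_000, 1_000, 100]


def tier_for(count):
    """Return the highest threshold that `count` meets, or None if it meets none."""
    for threshold in TIERS:
        if count >= threshold:
            return threshold
    return None


def find_stale_entries(article_counts, portal_tiers):
    """List (language, listed tier, expected tier) wherever the portal is out of date.

    article_counts: language code -> current article count
    portal_tiers:   language code -> tier the portal currently lists it under
    """
    issues = []
    for lang, count in sorted(article_counts.items()):
        expected = tier_for(count)
        listed = portal_tiers.get(lang)  # None if the portal omits the language
        if listed != expected:
            issues.append((lang, listed, expected))
    return issues


if __name__ == "__main__":
    # Placeholder data, not real article counts.
    counts = {"aa": 10_250, "bb": 999, "cc": 120}
    portal = {"aa": 1_000, "bb": 100}
    for lang, listed, expected in find_stale_entries(counts, portal):
        print(f"{lang}: listed under {listed}, should be under {expected}")
```

A dashboard like the one described would then only need to render whatever `find_stale_entries` returns and raise a flag when the list is non-empty.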
The next step is to generate the HTML entirely in Lua, but administrators will still be needed to manually deploy each automatically calculated change. I hope these changes will help the other administrators take a more active role in keeping the portal up-to-date and do so without introducing regressions. Someday, though, it’d be great if Wikipedia would be smarter about the first foot it puts forward.
AVIM has been in maintenance mode (read: an afterthought) for a couple years now, so I’m happy to announce that the code base is now hosted on GitHub instead of this website’s Subversion repository. Now input method geeks can easily tinker with Hiếu Đặng’s well-regarded (if opaque) input method engine, as well as my pioneering work to turn it into an application-wide service and support custom WYSIWYG editors.
The old Subversion repository isn’t going away, but new releases will be built from the Git repository going forward. With luck, there may even be new releases. Someday.