Minh’s Notes

Human-readable chicken scratch

Minh Nguyễn
September 29th, 2008


Mười lũy thừa một trăm

Back in May, I remarked that VDict’s English↔Vietnamese machine translation service was too good at churning out sometimes incomprehensible Vietrish. I also pointed out that the major Web translation services, such as Babel Fish or Google Translate, hadn’t gotten around to supporting Vietnamese. Today, Google has – VDict now piggybacks on their service – and my first thought was to try and break it.

Another welcome

Once again, the Vietnamese Wikipedia’s opening paragraph:

Hoan nghênh bạn đã đến với Wikipedia tiếng Việt! Đây là bách khoa toàn thư có nội dung mở và thuộc sở hữu cộng đồng. Dự án được bắt đầu từ tháng 10 năm 2003 do công sức đóng góp của nhiều người ở khắp mọi nơi, bạn cũng có thể tham gia. Hiện giờ chúng ta có 86.752 thành viên (có tài khoản), nhưng mới chỉ đóng góp được 58.022 bài thôi. Rất mong sự tham gia tích cực của bạn!

Which roughly translates to:

Welcome; you’ve arrived at the Vietnamese Wikipedia! This is an open-content encyclopedia belonging to the community. The project began in October 2003, thanks to the efforts of many contributors worldwide; you can join in too. Currently, we have 86,752 members (with accounts) who’ve contributed only 58,022 articles. We really look forward to your active participation!

Surprisingly, Google Translate gets it mostly right:

Welcome to Wikipedia, the free encyclopedia! This is the encyclopedia content open and owned communities. The project was started from tháng 10 [“October”], 2003 by the contribution of many people everywhere, you can also participate. Now we have 86,752 members (of accounts), but contribute only be 58,022 items only. We hope the active participation of you!

Sure, it’s pretty ungrammatical, but at least they didn’t start rambling on about medication, like VDict did. Still, Google passed my little test because they rely on sophisticated statistical analysis techniques to determine which English phrases typically go with each Vietnamese phrase. Rather than simply looking individual words up in a dictionary and pumping out the matching words, the smart folks at Google seem to take into account the kinds of phrases actually in use on the Internet and normalize them, so that no matter the source language, Google internally represents each sentence the same way.
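The difference between word lookup and phrase lookup can be sketched in a few lines. This is a toy under loud assumptions: the tables `WORDS` and `PHRASES` and both function names are made up for illustration, and a greedy longest-match lookup is only a stand-in for whatever statistical weighting Google really does.

```python
# Hypothetical toy tables; illustrative glosses, not Google's data.
WORDS = {"hệ": "system", "mặt": "face", "trời": "sky", "trăng": "moon"}
PHRASES = {("hệ", "mặt", "trời"): "Solar System", ("mặt", "trăng"): "Moon"}

def word_by_word(text):
    # Dictionary approach: translate each word in isolation.
    return " ".join(WORDS.get(w, w) for w in text.lower().split())

def phrase_based(text):
    # Phrase approach: prefer the longest known phrase at each position.
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        for n in range(len(words) - i, 0, -1):
            chunk = tuple(words[i:i + n])
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase starts here, so fall back to the single word.
            out.append(WORDS.get(words[i], words[i]))
            i += 1
    return " ".join(out)

print(word_by_word("Hệ Mặt Trời"))  # system face sky
print(phrase_based("Hệ Mặt Trời"))  # Solar System
```

A real system would pick among many candidate phrases by how often each pairing shows up in existing translations, rather than from a hand-written table.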

Regained in re-translation

This statistical technique usually allows Google’s translation to at least sound remotely relevant. But it also makes spotting errors more difficult. Case in point, a definition of the Moon:

Mặt Trăng (tiếng Latinh: Luna, ký hiệu: ☾) là vệ tinh tự nhiên duy nhất của Trái Đất và là vệ tinh tự nhiên lớn thứ năm trong Hệ Mặt Trời.

Quite straightforwardly, it means:

The Moon (Latin: Luna, symbol: ☾) is the Earth’s only natural satellite and the fifth-largest natural satellite in the Solar System.

But here’s what Google thinks it means:

Moon (Latin: Luna, symbols: ☾) is a natural satellite of only Earth and the satellite is the natural largest in the Torah.

First of all, English has this tricky feature where moving an adverb like “only” around the sentence actually changes the sentence’s meaning. But quibbles aside, I wouldn’t think to look in the Torah (the first five books of the Bible) for the Moon. I’d just look up. Incidentally, the fact that the Moon is the fifth-largest moon was lost in translation.

Now, the experts used to always caution against using machine translation tools. They also advised that we reverse-translate anything we find using those tools, just to see how much gets lost in translation:

Moon (Latin: Luna, ký hiệu: ☾) là một vệ tinh tự nhiên của Trái đất và chỉ là các vệ tinh tự nhiên lớn nhất trong Hệ Mặt Trời.

Apparently, not much. This reverse-translation hides the various translation mistakes we saw before, because every word in the sentence above, when placed in exactly the same context as that sentence, will always have a 1:1 correspondence with a word in the target language. In other words, since “abc1def” can only ever translate to “vwx2y&z” and is the only bit of text that can, “vwx2y&z” can only ever translate to “abc1def”. So if you’re using Google to translate into a language you don’t know so well, you don’t really know how well or how poorly Google’s doing.
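Here is a minimal sketch of why a round trip can look clean even when the forward translation is wrong. It assumes a strictly one-to-one phrase table, which is my simplification, and the tables `VI_TO_EN` and `EN_TO_VI` are hypothetical, with one entry wrong on purpose.

```python
# Hypothetical one-to-one phrase table; one entry is deliberately wrong.
VI_TO_EN = {
    "Mặt Trăng": "Moon",
    "Trái Đất": "Earth",
    "Hệ Mặt Trời": "Torah",  # should be "Solar System"
}

# A one-to-one table can be inverted, mistakes included.
EN_TO_VI = {en: vi for vi, en in VI_TO_EN.items()}

def translate(phrases, table):
    # Look up each phrase; every phrase has exactly one counterpart.
    return [table[p] for p in phrases]

source = ["Mặt Trăng", "Trái Đất", "Hệ Mặt Trời"]
english = translate(source, VI_TO_EN)
round_trip = translate(english, EN_TO_VI)

print(english)               # ['Moon', 'Earth', 'Torah']
print(round_trip == source)  # True
```

The wrong entry survives the trip out and back untouched, so the reverse translation tells you nothing about it.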

Counting oddly

So now, for kicks, a little stress test. Below we have a series of numbers spelled out, with all their idiosyncrasies:

0–30. Số không, một, hai, ba, bốn, năm, sáu, bảy, tám, chín, mười, mười một, mười hai, mười ba, mười bốn, mười lăm, mười sáu, mười bảy, mười tám, mười chín, hai mươi, hai mươi mốt, hai mươi hai, hai mươi ba, hai mươi tư, hai mươi lăm, hai mươi sáu, hai mươi bảy, hai mươi tám, hai mươi chín, ba mươi.

Một trăm (100), một trăm lẻ một (101), một trăm lẻ năm (105), một trăm hai mươi mốt (121), ba trăm (300).

Một ngàn (1.000), một ngàn lẻ một (1.001), hai ngàn rưởi (2.500).

Một vạn. Một vạn (10.000), một vạn lẻ một (10.001). Một triệu (1.000.000). Một tỷ (1.000.000.000).

Một nửa (½), một phần tư (¼), một phần trăm (1/100 hoặc 1%).

Translated, with annotations where incorrect:

0-30. Zero, one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen different [where’d that “different” come from?], eighteen, nine of ten [woah, going fractional suddenly], twenty, twenty-one, twenty-two, twenty-three, from twenty [mistaking tư (“four”) for từ (“from”)], twenty-five, twenty-six, twenty-seven, twenty-eight, twenty-nine, thirty.

One hundred (100), an odd one hundred (101), a hundred odd years (105), a two hundred and eleven (121), three hundred (300). [An odd quirk of the Vietnamese counting system is that lẻ (“odd”) precedes the ones digit when a number above 100 has a zero in the tens place.]

One thousand (1,000), an odd one thousand (1,001), ruoi two thousand (2,500).

A van. A ten thousand (10,000) [một vạn gives different results when alone], an odd one thousand (10,001). One million (1,000,000). One billion (1,000,000,000).

One-half (½), a quarter (¼), one percent (1 / 100 or 1%).

Counting is just one of those things that Google will have to hard-code into their translation software to get completely right. Statistical techniques won’t really cut it, because – well, when’s the last time anyone spelled out “eighty-six thousand, seven hundred fifty-two”?
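To give a sense of what hard-coding might look like, here is a minimal sketch of a rule-based reader for spelled-out Vietnamese numbers. The function name `parse_vietnamese_number` and the word tables are my own illustration, not anything Google uses. It only covers the patterns in the test above (lẻ, rưởi, vạn, and the mốt/tư/lăm variants), and it assumes every multiplier such as trăm or ngàn has an explicit digit in front of it.

```python
import unicodedata

def nfc(text):
    # Normalize so composed and decomposed diacritics compare equal.
    return unicodedata.normalize("NFC", text)

DIGITS = {nfc(k): v for k, v in {
    "không": 0, "một": 1, "mốt": 1, "hai": 2, "ba": 3, "bốn": 4, "tư": 4,
    "năm": 5, "lăm": 5, "sáu": 6, "bảy": 7, "tám": 8, "chín": 9,
}.items()}
LARGE = {nfc(k): v for k, v in {
    "ngàn": 10**3, "nghìn": 10**3, "vạn": 10**4, "triệu": 10**6, "tỷ": 10**9,
}.items()}
FILLERS = {nfc(w) for w in ("lẻ", "linh")}    # mark an empty tens place
HALVES = {nfc(w) for w in ("rưởi", "rưỡi")}   # "and a half" of the last unit
TEN, TENS, HUNDRED = nfc("mười"), nfc("mươi"), nfc("trăm")

def parse_vietnamese_number(text):
    total = 0      # finished groups (thousands, millions, ...)
    current = 0    # group still being read
    last_unit = 1  # most recent multiplier, needed for "rưởi"
    for word in nfc(text.lower()).split():
        if word in DIGITS:
            current += DIGITS[word]
        elif word == TEN:
            current += 10
        elif word == TENS:
            ones = current % 10
            current = current - ones + ones * 10  # "hai mươi": 2 becomes 20
            last_unit = 10
        elif word == HUNDRED:
            current *= 100
            last_unit = 100
        elif word in LARGE:
            total += current * LARGE[word]
            current = 0
            last_unit = LARGE[word]
        elif word in HALVES:
            current += last_unit // 2
        elif word in FILLERS:
            continue
        else:
            raise ValueError(f"unknown number word: {word}")
    return total + current

print(parse_vietnamese_number("hai mươi tư"))    # 24
print(parse_vietnamese_number("hai ngàn rưởi"))  # 2500
print(parse_vietnamese_number("tám mươi sáu ngàn bảy trăm năm mươi hai"))  # 86752
```

A fixed table like this never confuses tư with từ, which is exactly the kind of slip a purely statistical guess made above.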

The post title, by the way, is how you’d spell out “ten to the hundredth power” – a googol – in Vietnamese.