Language Learning via Parallel Corpora

by David · December 1, 2017

After spending a lot of time building a solid foundation of grammar and vocabulary for the Couch to Korean Challenge, I was itching to put it all together and explore the Korean language beyond the textbook sentences and vocabulary flashcards. Don’t get me wrong, I love textbooks and flashcards, especially when they offer good examples of full sentences that help demonstrate grammar and vocabulary in context. But textbooks usually contain somewhat limited sentences, such as “Do you eat Korean food often?”

Likewise, I have been using the flashcard app Anki extensively to study the card deck Korean Grammar Sentences by Evita, which has over 2000 sentences in both Korean and English, with audio and grammar notes. This set contains a fairly comprehensive set of sentences demonstrating Korean grammar points, but it also contains many impractical, nonsense sentences like “I made jam from peanuts.”

Dual-language Books

There is nothing like coming up against actual language “in the wild” to really learn how the theoretical is put into practice, how the grammar and vocabulary are used in real sentences written by native speakers of a language. Good (written) sources for language in the wild can be newspapers, short stories, and novels. Alas, reading full newspapers or novels in a new language is often beyond my nascent capabilities. I would need to spend hours on each page, puzzling out the grammatical constructs in the complex sentences and looking up all the unfamiliar words in a dictionary. So one of my favorite resources for reading in a new language is dual-language books. These are books that contain the same text, side-by-side, in two languages. For example, I have a book of Spanish-language stories by well-known authors, where the left-hand page contains Spanish and the right-hand page contains the corresponding English. Similarly, I have a copy of Shel Silverstein’s poetry book Where the Sidewalk Ends, with each page containing an English-language poem as well as the same poem translated into Persian. I find that reading the two versions in parallel is an excellent language learning strategy (although I would not recommend poetry as an easy medium for learning a new language!). As I read each sentence (or paragraph if I’m feeling ambitious), I do my best to parse and understand the text in the language I am learning, then I can immediately refer to the corresponding English to fill in (and correct) anything I miss. I can also make note of grammar points and words that I need to practice, and then I can set aside time to review these between reading sessions. This is way more efficient (and fun) than trying to painstakingly puzzle out the text with a dictionary and grammar book.

The Rosetta Stone

In the world of linguistics (and especially computational linguistics), a dual-language book is known as a parallel text, and parallel texts are key resources for many types of linguistic analysis. The most famous parallel text is perhaps the Rosetta Stone (the rock, not the software company), which linguists used to solve the puzzle of Egyptian hieroglyphs. A large number of texts together is known as a corpus (from the Latin corpus, “body”), with the plural form corpora (NEVER “corpuses”), and corpus linguistics involves the linguistic analysis of large numbers of texts.

Parallel corpora, large collections of texts manually translated into multiple languages, are an essential ingredient in training modern-day machine translation systems (such as Google Translate and Microsoft Bing Translator), which themselves have become important resources for human language learning. In my day-to-day language learning, I rely on Google Translate much more often than any actual dictionary. The most widely-used parallel corpora for machine translation research have traditionally been those from Government organizations that produce manual translations of all their proceedings. For example, one of the earliest and most successful parallel corpora for machine translation research was the Hansard French/English corpus, which contains the official records of the Canadian Parliament for several years in the 1990s. A similar corpus from the 1990s contains UN documents in English, French, and Spanish. A larger corpus from the United Nations contains more languages (Spanish, Russian, French, English, Mandarin Chinese, and Arabic) and spans more years. These corpora are not nearly as interesting to read as Cervantes or Shel Silverstein, but computers don’t care about interesting. Machine translation systems can devour millions of parallel texts and, in the process, extract (or “learn”) patterns and vocabulary across languages.

My personal parallel corpus processing proceeds much slower than a computer, but in order to continue my progress on the Couch to Korean Challenge I decided it was time to move beyond textbooks and flashcards and finally read something fun and relatively long. Unfortunately, I couldn’t find any dual-language Korean-English books to work through. So I had to improvise and create my own dual-language resources from monolingual books. Luckily my favorite source of free resources, the Public Library, has a full shelf of Korean-language fiction ranging from original books from Korea to translations of Hemingway and The Hunger Games. I started by checking out the Korean-language versions of two books I own and have recently read in English: A Man Called Ove by Fredrik Backman and Harry Potter and the Sorceror’s Stone by J.K. Rowling. Reading the (simulated) dual-language book involved sitting with both the Korean and English versions (of one of the novels) open, and stepping through a paragraph at a time; this is known as “text alignment” in the machine translation world.

Reading two separate books in parallel was not nearly as convenient as reading an actual dual-language book, because text alignment was a constant challenge. The two versions of each novel had different text sizes and numbers of pages, so it was often difficult to find and match up the same text across the two versions. It did help that I was familiar with the plot and characters from having already read the English novels in full. Over the course of a few weeks I did successfully read about 75 pages of each of Harry Potter and Ove. I also added to the mix a book that I haven’t yet read in English: The Giver by Lois Lowry. In the process of reading parts of these three novels, I encountered many grammar patterns and vocabulary items in context. I had ostensibly already learned these in my earlier textbook/flashcard study, but stumbling upon them in the wild was much more fun and rewarding. As I read I also added a couple hundred new words to my Quizlet card deck for vocabulary review.

The Couch to Korean Challenge thus rolls on. I continue to build vocabulary and to get comfortable with more complex sentence structures, and this will only get easier as I read more. But to date all my focus has been on written Korean. I haven’t yet even attempted to speak Korean. While this has all been part of the training plan, with just 4 months remaining before our planned trip to Korea, it is finally time to move on from the written word, to embrace humility, and to start speaking some Korean!