Saturday, September 13, 2014

Easy parallel corpora from Wikipedia

We're off to a busy start to the semester, and between co-teaching a new (for me) class, proposals, project work, and students returning from internships, I haven't had much capacity for extracurricular writing.

But I wanted to post a link to some scripts I just pushed to GitHub that build a parallel corpus by extracting the titles from the interlanguage links on Wikipedia. I've found Wikipedia title pairs to be a surprisingly useful resource on a number of occasions (great coverage of interesting languages and scripts, a good license for data use and distribution), and I imagine others will as well.
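To give a rough idea of what "extracting title pairs" can look like, here is a minimal sketch in Python. It is not the code from the scripts mentioned above. It assumes you have already parsed two things out of a Wikipedia dump: a mapping from page id to source-language title, and interlanguage link rows shaped like the MediaWiki `langlinks` table (page id, target language code, target title). The function name `title_pairs` and its parameters are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical row shape, loosely modeled on the MediaWiki `langlinks`
# table: (ll_from page id, ll_lang language code, ll_title target title).
LangLink = Tuple[int, str, str]


def title_pairs(
    page_titles: Dict[int, str],
    langlinks: Iterable[LangLink],
    target_lang: str,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_title, target_title) pairs for one target language."""
    for page_id, lang, target_title in langlinks:
        # Keep only links into the requested language, and skip empty titles.
        if lang != target_lang or not target_title:
            continue
        source_title = page_titles.get(page_id)
        # Skip links whose source page is not in our title mapping.
        if source_title is None:
            continue
        # Dump titles often use underscores in place of spaces.
        yield source_title.replace("_", " "), target_title.replace("_", " ")
```

Calling `title_pairs(pages, links, "ru")` would then give an iterator of English/Russian title pairs that can be written out one pair per line, for example as tab-separated text.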

Unknown said...

This could be a potential resource for our multilingual entity project!