We're off to a busy start of the semester, and between co-teaching a new (for me) class, proposals, project work, and students returning from internships, I haven't had much capacity for extracurricular writing.
But I wanted to post a link to some scripts I just pushed to GitHub that build a parallel corpus by extracting the titles from the interlingual links on Wikipedia. I've found Wikipedia title pairs to be a surprisingly useful resource on a number of occasions (great coverage of interesting languages and scripts, and a good license for data use and distribution), and I imagine others will as well.
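For readers who want a feel for the idea, here is a minimal sketch of turning interlanguage links into title pairs. It is not the scripts mentioned above. It assumes the input is an already-parsed JSON response in the shape returned by the MediaWiki API for `action=query&prop=langlinks`; the function name `title_pairs` and the sample data are made up for illustration.

```python
def title_pairs(response, target_lang):
    """Collect (source title, target title) pairs for one target language.

    `response` is assumed to be a parsed MediaWiki API reply for
    action=query&prop=langlinks. Both reply layouts are handled:
    pages keyed by page id (default) or pages as a list (formatversion=2).
    """
    pages = response.get("query", {}).get("pages", {})
    if isinstance(pages, dict):
        pages = pages.values()

    pairs = []
    for page in pages:
        # Pages without any interlanguage links have no "langlinks" key.
        for link in page.get("langlinks", []):
            if link.get("lang") != target_lang:
                continue
            # The linked title is under "*" in the default format
            # and under "title" with formatversion=2.
            target = link.get("title", link.get("*"))
            if target:
                pairs.append((page["title"], target))
    return pairs


# Hand-written sample in the default reply layout (illustrative only).
sample = {
    "query": {
        "pages": {
            "1": {
                "title": "Tokyo",
                "langlinks": [
                    {"lang": "ja", "*": "東京都"},
                    {"lang": "de", "*": "Tokio"},
                ],
            },
            "2": {
                "title": "Berlin",
                "langlinks": [{"lang": "ja", "*": "ベルリン"}],
            },
            "3": {"title": "Page without links"},
        }
    }
}

# Write the pairs as tab-separated lines, one pair per line.
for source, target in title_pairs(sample, "ja"):
    print(f"{source}\t{target}")
```

Working from a database dump rather than the API would likely be more practical for a full corpus, but the pairing logic would be much the same: one source title, one target-language title, one line.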
1 comment:
This could be a potential resource for our multilingual entity project!