Any bilingual (French-English) corpus extractor recommended?
Thread poster: BOLDXPRESS
BOLDXPRESS
Canada
English to French
Mar 25, 2016

Dear forum members,

I'm wondering if someone can recommend a website or software that can help me build a corpus, ideally a bilingual corpus in a specific domain (astronomy, law, medical).
For example, I want to have around 15,000 words (French and English) in context.

Thanks for your help


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
You might try these two: Mar 25, 2016

1. https://www.sketchengine.co.uk/
2. http://www.farkastranslations.com/eu_translation_memories.php

Michael


 
FarkasAndras
English to Hungarian
Custom job Mar 25, 2016

I presume you want a sentence-aligned corpus (each FR sentence paired with the corresponding EN sentence).
Do you have the texts? It sounds like you don't, so you'll either need to find a preexisting aligned corpus/TM/database or somebody who will collect texts that meet your criteria and align them for you.
The best collection of free preexisting aligned corpora is this: http://opus.lingfil.uu.se/

If you can't find preexisting aligned corpora, you need an aligner to make your own. There are many out there; search the forums. I wrote one of them, called LF Aligner. Of course, you need to find and collect the texts first, which could require other software. Then, after alignment, you might want to filter out certain low-quality segments, and you probably want to do some manual checks and corrections to fix errors and possibly redo parts of the alignment. 15,000 words is not a lot, so it's not that big a job, but it could take a while to work out the process itself if you want to do it yourself. It all depends on whether you have the texts, how picky you are about which texts will work and how high-quality you need the final corpus to be. If your input texts are crap and you need perfect results, you will tear your hair out.
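
To give an idea of the post-alignment filtering step, here is a minimal Python sketch. It assumes the aligned segments are already loaded as (EN, FR) string pairs; the function name `filter_segments` and the thresholds are illustrative assumptions, not part of LF Aligner or any other tool, and they would need tuning on real data.

```python
def filter_segments(pairs, min_chars=3, max_ratio=2.5):
    """Keep aligned (en, fr) pairs that pass a few simple sanity checks.

    The thresholds are illustrative guesses, not recommended values.
    """
    kept = []
    for en, fr in pairs:
        en, fr = en.strip(), fr.strip()
        # Drop pairs where either side is empty or very short.
        if len(en) < min_chars or len(fr) < min_chars:
            continue
        # Drop pairs whose lengths differ a lot; a large ratio often
        # points to a misalignment.
        ratio = max(len(en), len(fr)) / min(len(en), len(fr))
        if ratio > max_ratio:
            continue
        # Drop untranslated segments (both sides identical).
        if en.lower() == fr.lower():
            continue
        kept.append((en, fr))
    return kept


sample = [
    ("The court dismissed the appeal.", "La cour a rejeté l'appel."),
    ("Page 3", ""),
    ("Yes.", "Conformément aux dispositions du présent règlement, "
             "la Commission adopte les mesures nécessaires."),
    ("ISO 9001", "ISO 9001"),
]
clean = filter_segments(sample)
print(len(clean), "of", len(sample), "pairs kept")
```

Checks like these only catch the obvious problems; the manual review is still needed for anything subtler.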

Regarding finding texts in your fields: law is easy if you're not extremely picky about specific types of law; medical is a wee bit harder but still fairly easy if you're not too particular (see EMEA on OPUS). Astronomy is tougher. I'm not sure where one would find large amounts of EN-FR astronomy texts. Perhaps there are bilingual Canadian astronomy journals, but that's a long shot. It would be easy to collect sentences or documents that mention astronomy-related subjects by filtering them out of larger general collections, but that might not yield an awfully high-quality corpus.
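
As a rough illustration of filtering a general collection by subject, here is a short Python sketch. It assumes the general corpus is already available as (EN, FR) sentence pairs; the term list and the helper names are made up for the example, and a real term list would have to be much longer.

```python
import re

# Hypothetical English keyword list; extend it for real use.
ASTRO_TERMS_EN = ["galaxy", "galaxies", "telescope", "orbit", "comet", "nebula"]


def build_pattern(terms):
    """Compile a case-insensitive, whole-word pattern from a term list."""
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def select_domain_pairs(pairs, pattern):
    """Keep the pairs whose English side mentions at least one term."""
    return [(en, fr) for en, fr in pairs if pattern.search(en)]


def count_words(pairs):
    """Return (English words, French words), split on whitespace."""
    en_words = sum(len(en.split()) for en, _ in pairs)
    fr_words = sum(len(fr.split()) for _, fr in pairs)
    return en_words, fr_words


general = [
    ("The telescope was pointed at a distant galaxy.",
     "Le télescope était pointé vers une galaxie lointaine."),
    ("The contract enters into force on 1 May.",
     "Le contrat entre en vigueur le 1er mai."),
    ("The comet has a highly elliptical orbit.",
     "La comète a une orbite très elliptique."),
]
astro = select_domain_pairs(general, build_pattern(ASTRO_TERMS_EN))
en_words, fr_words = count_words(astro)
print(len(astro), "pairs,", en_words, "EN words,", fr_words, "FR words")
```

The word count makes it easy to see how far the selection is from a target such as 15,000 words. Keyword matching will also pull in sentences that only mention a term in passing, which is one reason the resulting corpus may not be of very high quality.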


If you can't/don't want to do the legwork, I take jobs like this for a fee. Not sure if anyone else offers this kind of service, frankly. It's kind of a niche activity that I happen to have an interest in. You could call it a hobby that occasionally generates income. Michael linked my website above, some info is available there.

[Edited at 2016-03-25 10:30 GMT]


 
BOLDXPRESS
Canada
English to French
TOPIC STARTER
Thanks for your input Mar 26, 2016

Thanks Andras,


What I actually meant is a site where I can find a large collection of bilingual texts (corpora). My goal is to align them later, but I first need to find the collection of texts in a specialized domain. It doesn't really matter which domain it is.

Thanks a lot


 
FarkasAndras
English to Hungarian
Well... Mar 28, 2016

If you read back your first post, you'll see you asked for a "corpus extractor" and not for texts. If you expect to get help, it pays for you to be clear and comprehensible. Best of luck.

[Edited at 2016-03-29 07:28 GMT]


 
expressisverbis
Portugal
Member (2015)
English to Portuguese
Have you tried Tradooit.com? Mar 28, 2016

BOLDXPRESS wrote:

Thanks Andras,


What I actually meant is a site where I can find a large collection of bilingual texts (corpora). My goal is to align them later, but I first need to find the collection of texts in a specialized domain. It doesn't really matter which domain it is.

Thanks a lot


TradooIT is a computer-assisted translation suite that includes a translation memory, a terminology bank, a bilingual concordancer, a text alignment tool, a pretranslation tool and a Word add-in.
I find it very useful:
http://www.tradooit.com/index.php
You can select your files and align them using the "Import/Align your files" function.


 
Reed James
Chile
Member (2005)
Spanish to English
Synchroterm Mar 29, 2016

You do have to pay some money for this software, but it takes care of a lot of actions that can be a pain with the other free term extraction tools I've seen. It depends on your priorities and how often you need to extract terms.
