Euro Commission multilingual TM for the Acquis Communautaire publicly accessible
Thread poster: Clive Phillips
Clive Phillips
Clive Phillips  Identity Verified
United Kingdom
Local time: 07:41
Member (2009)
German to English
+ ...
Apr 29, 2012

Apologies if the following has already been posted. (I searched in vain.)

"Since November 2007 the European Commission's Directorate-General for Translation has made its multilingual Translation Memory for the Acquis Communautaire, DGT-TM, publicly accessible in order to foster the European Commission’s general effort to support multilingualism, language diversity and the re-use of Commission information.

The Acquis Communautaire is the entire body of European legislation, comprising all the treaties, regulations and directives adopted by the European Union (EU). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation has been translated into 22 official languages. As a result, the Acquis now exists as parallel texts in the following 22 languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish. For the 23rd official EU language, Irish, the Acquis is not translated on a regular basis; which is why DGT-TM does not include data in Irish.
...
The first version of DGT-TM was released in 2007 and included documents published up to the year 2006. The currently latest version of DGT-TM (released in April 2012, but referred to as DGT-TM-2011), contains additional documents published from 2004 to 2010. While the alignments between TUs and their translations were verified manually for DGT-TM-2007, the TUs in DGT-TM-2011 were aligned automatically. The data format is the same for both releases."

For details and to download, see http://langtech.jrc.it/DGT-TM.html#Download.


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 08:41
English to Hungarian
+ ...
Well known Apr 29, 2012

It's been posted dozens of times. When you want to see if something has been discussed in the forum, google something like this:
site:proz.com langtech.jrc.it
or:
site:proz.com/forum "dgt tm"
or:
site:proz.com acquis TM corpus

etc.


 
EHI (X)
EHI (X)
Local time: 08:41
this is indeed old news May 1, 2012

....but it would be interesting to know whether they will release new content sometime in the future.

 
Clive Phillips
Clive Phillips  Identity Verified
United Kingdom
Local time: 07:41
Member (2009)
German to English
+ ...
TOPIC STARTER
New content released on 13.04.2012 - surely not old news? May 2, 2012

Lutz, FarkasAndras,

The new release was on 13 April 2012. Over 66% is content not publicly accessible previously.

The data have tripled in size, compared with the 2007 release.

According to Juliette Scott's blog http://wordstodeeds.com/tag/dgttm/ :

"Whereas the first version released in 2007 included documents published up to 2006, the current files go up to 2010. Just to make things really clear (!), the data goes to 2010, the files are called DGT-TM-2011 and they were released in April 2012!!"

Clive


 
EHI (X)
EHI (X)
Local time: 08:41
Thanks Clive! May 2, 2012

That's good to know. Thanks for the link to the blog.

 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 08:41
English to Hungarian
+ ...
New May 2, 2012

Clive Phillips wrote:

Lutz, FarkasAndras,

The new release was on 13 April 2012. Over 66% is content not publicly accessible previously.


Oh, I see. You didn't mention the fact that there is a new release in your thread title or the original post, and I didn't read it thoroughly enough to notice the "released in April 2012" bit. I actually opened the JRC site as well but didn't notice the news there, either. Anyway, thanks for the heads up.

It's interesting to know that the DGT-TM 2007 was a manually corrected alignment - I always thought it was autoaligned. They also say they'll release new data every year - now that they've switched to pure automatic alignment, that's obviously very easy to do. Manually checking ~20 million segments must have taken months and months and months even for a sizable team.

[Edited at 2012-05-02 06:29 GMT]


 
Stefan de Boeck (X)
Stefan de Boeck (X)  Identity Verified
Belgium
Local time: 08:41
English to Dutch
+ ...
any pointers out there May 2, 2012

any pointers out there on how to turn this thing into something that can actually be used?

I’ve turned 2004_1 and 2010_4 into a single TM and am staring at some 100,000 TUs.

the lot of it promises over 20,000,000 TUs...

it seems too good to simply ignore, and too big to actually use.

or is it?


 
Clive Phillips
Clive Phillips  Identity Verified
United Kingdom
Local time: 07:41
Member (2009)
German to English
+ ...
TOPIC STARTER
Any pointers? May 3, 2012

Stefan
I'll leave it for others to respond to you. I have postponed a download until I replace my creaking ageing PC!

In [email protected] , Safetex has commented as follows:

"This massive data base is for me a big disappointment.
Almost 2 million TU's for English to French but once you manage to take out duplicates using Olifant, only half are left (not easy to do as the TU's are too numerous to handle in one go and so has to be done many times on individual files).

But the problems don't stop there
1 Many sentences remain where the source language is 'different' cos of a comma, space, date etc.
2 The alignment tool clearly hasn't worked. Here is a typical example:
"Analyseur de signaux" (3): appareil capable de mesurer et d'afficher les propriétés fondamentales de chaque composante de fréquence d'un signal multifréquences.
"CE" is equivalent to "computing element".
"CEP" (circle of equal probability) (7) is a measure of accuracy; the radius of the circle centred at the target, at a specific range, in which 50 % of the payloads impact.
3 Hundred of thousands of 'useless' TU's where one or two would give us the answer e.g. Article 12, paragraphe 10 = Article 12(10)
The above is given thousands of times cos of changes in numbers and it is the same for many other entries.

With my naked eye, I reckon that if I had the time, I could take out manually another 60 - 70% of what is left at the moment to get just as an effective TM and one which is not too big to manipulate. But that means taking out around 500,000 TU's manually (by flagging and then deleting them)

I'd be interested to hear what others think."
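
The clean-up described in the quoted comment (exact duplicates, plus entries that differ only in a number, a space or letter case) could be approximated along these lines. This is only a sketch: it assumes the TUs have already been read into (source, target) string pairs, and `normalise` and `dedupe` are hypothetical helper names, not part of Olifant or any CAT tool.

```python
import re

_DIGITS = re.compile(r"\d+")
_SPACES = re.compile(r"\s+")


def normalise(segment: str) -> str:
    """Collapse whitespace, replace every run of digits with '#', ignore case."""
    collapsed = _SPACES.sub(" ", segment.strip())
    return _DIGITS.sub("#", collapsed).casefold()


def dedupe(pairs):
    """Keep the first (source, target) pair for each normalised key."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = (normalise(source), normalise(target))
        if key not in seen:
            seen.add(key)
            kept.append((source, target))
    return kept
```

With this key, "Article 12, paragraphe 10 = Article 12(10)" and the same pair with other numbers collapse into one entry. Whether that is desirable depends on how the TM is used; punctuation differences are deliberately left alone here.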


 
Stefan de Boeck (X)
Stefan de Boeck (X)  Identity Verified
Belgium
Local time: 08:41
English to Dutch
+ ...
the wet season May 3, 2012

Clive Phillips wrote:
I have postponed a download until I replace my creaking ageing PC...


Well, very much of it seems to be about fishing anyway...
Save it for a rainy day.

Cheers!


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 08:41
English to Hungarian
+ ...
Empty ramblings May 3, 2012

Clive Phillips wrote:

Stefan
I'll leave it for others to respond to you. I have postponed a download until I replace my creaking ageing PC!

In [email protected] , Safetex has commented as follows:

"This massive data base is for me a big disappointment.
Almost 2 million TU's for English to French but once you manage to take out duplicates using Olifant, only half are left (not easy to do as the TU's are too numerous to handle in one go and so has to be done many times on individual files).

But the problems don't stop there
1 Many sentences remain where the source language is 'different' cos of a comma, space, date etc.
2 The alignment tool clearly hasn't worked. Here is a typical example:
"Analyseur de signaux" (3): appareil capable de mesurer et d'afficher les propriétés fondamentales de chaque composante de fréquence d'un signal multifréquences.
"CE" is equivalent to "computing element".
"CEP" (circle of equal probability) (7) is a measure of accuracy; the radius of the circle centred at the target, at a specific range, in which 50 % of the payloads impact.
3 Hundred of thousands of 'useless' TU's where one or two would give us the answer e.g. Article 12, paragraphe 10 = Article 12(10)
The above is given thousands of times cos of changes in numbers and it is the same for many other entries.

With my naked eye, I reckon that if I had the time, I could take out manually another 60 - 70% of what is left at the moment to get just as an effective TM and one which is not too big to manipulate. But that means taking out around 500,000 TU's manually (by flagging and then deleting them)

I'd be interested to hear what others think."


These complaints don't make a lot of sense, to be honest. Yes, inevitably, there is a fair bit of garbage in TMs like this. Yes, there are misaligned segments in there. If I had to make a guess based on the couple of hundred hours I've spent working with autoaligned EU documents, I'd say their ratio is in the 2% range. Yes, there are repetitions and near-repetitions. (Obviously, the 60-70% figure is grossly overestimated.) Why should anyone care about any of that? Nobody reads a TM of two million TUs, so why would I mind that there are 15000 segments that say "Article 10"? I just do concordance searches, find the term I'm looking for and go on my merry way. Whatever doesn't come up in lookups/concordance searches is irrelevant.
If the full TM is too large for your CAT+hardware, just use a smaller subset of the data.


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 07:41
Member (2009)
Dutch to English
+ ...
I agree with András. May 3, 2012

I simply cut the big 1-point-sth GB Dutch-English TM produced by TMXtract.exe into two equal halves, using a text editor (UltraEdit), and then imported them into memoQ.

Michael

[Edited at 2012-05-03 20:07 GMT]
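
For anyone who would rather not cut a file of that size by hand in a text editor, the same split can be sketched in Python. This is a rough illustration, not what TMXtract or memoQ do internally: `split_tmx`, the output naming and the `tus_per_file` parameter are all made up for the example, and it assumes a well-formed TMX with a single `<header>` and `<body>`.

```python
import xml.etree.ElementTree as ET


def _write_part(root_attrib, header, tus, out_path):
    """Write one TMX file containing the original header and the given <tu>s."""
    root = ET.Element("tmx", root_attrib)
    if header is not None:
        root.append(header)
    ET.SubElement(root, "body").extend(tus)
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)


def split_tmx(src_path, out_prefix, tus_per_file):
    """Stream src_path and write consecutive chunks of <tu> elements."""
    root_attrib, header, src_body = None, None, None
    chunk, written = [], []
    for event, elem in ET.iterparse(src_path, events=("start", "end")):
        if event == "start":
            if root_attrib is None:
                root_attrib = dict(elem.attrib)  # attributes of the <tmx> root
            if elem.tag == "body":
                src_body = elem
            continue
        if elem.tag == "header":
            header = elem
        elif elem.tag == "tu":
            chunk.append(elem)
            if len(chunk) == tus_per_file:
                out_path = f"{out_prefix}_{len(written) + 1}.tmx"
                _write_part(root_attrib, header, chunk, out_path)
                written.append(out_path)
                chunk = []
                if src_body is not None:
                    src_body.clear()  # drop units that are already on disk
    if chunk:
        out_path = f"{out_prefix}_{len(written) + 1}.tmx"
        _write_part(root_attrib, header, chunk, out_path)
        written.append(out_path)
    return written
```

Each part is a complete TMX file with its own header, so the parts can be imported separately. The output is always written as UTF-8, whatever the encoding of the input.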


 
Richard Hill
Richard Hill  Identity Verified
Mexico
Local time: 01:41
Member (2011)
Spanish to English
TMXtract May 25, 2012

Rather than using a text editor, I ran TMXtract three times, extracting the first 6 files, the next 6 and then the last 13, as Xbench couldn't handle it all in one go. I then joined them into one TM in Studio and also joined the resulting TM with all my other TMs to make a massive Autosuggest dictionary. All working fine.

 
MikeTrans
MikeTrans
Germany
Local time: 08:41
Italian to German
+ ...
What to download exactly if you have already the 2007 release? Jun 8, 2012

Hello,

thank you, Clive, for bringing to our attention the new release upload.
I downloaded the 2007 release years ago and I'm very satisfied with both the technical side of getting the content into a workable TMX and, of course, the quality of the content itself.

To avoid downloading the content twice, I want to make sure:

Vol_2004_1
...
Vol_2006_5

... all this is new content, not yet contained in the 2007 Release found in
http://optima.jrc.it/Acquis/DGT_TU_1.0/data/

Is this right? So, to update my 2007 release, do I have to download the entire list, even though I would only use 3 languages into German?

Thanks very much for clarifying!
Greets,
Mike

[Edited at 2012-06-08 12:02 GMT]


 
Clive Phillips
Clive Phillips  Identity Verified
United Kingdom
Local time: 07:41
Member (2009)
German to English
+ ...
TOPIC STARTER
What to download exactly if you have already the 2007 release? Jun 8, 2012

Hi Mike,
I will leave it to others better informed and more experienced than myself, to respond to your query.
Clive


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 08:41
English to Hungarian
+ ...
Check Jun 8, 2012

MikeTrans wrote:

Hello,

thank you, Clive, for bringing to our attention the new release upload.
I downloaded the 2007 release years ago...

So, to update my 2007 release, do I have to download the entire list, even though I would only use 3 languages into German?

It's possible (though unlikely) that they updated the 1958-2007 files as well as adding new material. I would download the whole new release and compare the files to the old release. Right click/Properties and see the file size in bytes. If the file sizes for the old stuff are exactly the same in the new release down to the byte, then they made no changes.
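
If there are too many files to check by hand, the size comparison can be scripted. A minimal sketch, assuming the old and new releases sit in two separate folders and use the same file names; `compare_releases` and the folder arguments are hypothetical:

```python
from pathlib import Path


def compare_releases(old_dir, new_dir):
    """For every file name present in both folders, report whether sizes match."""
    old_sizes = {
        p.name: p.stat().st_size for p in Path(old_dir).iterdir() if p.is_file()
    }
    result = {}
    for p in Path(new_dir).iterdir():
        if p.is_file() and p.name in old_sizes:
            result[p.name] = p.stat().st_size == old_sizes[p.name]
    return result
```

Files that exist only in one folder are left out of the result. Identical sizes are a strong hint rather than proof; comparing checksums of the two files would be a stricter test.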


 