How to extract content from SGML to create TMX file | Thread poster: gianghl1983
| gianghl1983 Vietnam Local time: 05:20 English to Vietnamese
Dear ProZ users, I recently received a bilingual text in SGML, shown below (about 40,000 English-Vietnamese sentences). In the sample below, I have replaced the < and > symbols with [[[ and ]]] respectively. I want to put the sentences in my TM but could not find a way to convert this file to TMX. Does anyone know a solution for this? Thank you!
------------------------------------------
[[[doc id='N0001']]]
[[[head]]]
[[[title]]]What is a Fenqing ?[[[/title]]]
[[[corpus url='http://code.google.com/p/evbcorpus/']]]EVBCorpus[[[/corpus]]]
[[[author email='[email protected]']]]Quoc-Hung Ngo, Werner Winiwarter[[[/author]]]
[[[citation]]]Quoc-Hung Ngo, Werner Winiwarter, (2012). "Building an English-Vietnamese Bilingual Corpus for Machine Translation", International Conference on Asian Language Processing 2012 (IALP 2012), pp. 157-160, Ha Noi, Vietnam[[[/citation]]]
[[[/head]]]
[[[text]]]
[[[spair id='1']]]
[[[s id='en1']]]What is a Fenqing ?[[[/s]]]
[[[s id='vn1']]]Fenqing là gì ?[[[/s]]]
[[[/spair]]]
[[[spair id='2']]]
[[[s id='en2']]]Fenqing is a Chinese word which literally means " angry youth " .[[[/s]]]
[[[s id='vn2']]]Fenqing là một từ tiếng Hoa mà nghĩa đen là " thanh niên phẫn nộ " .[[[/s]]]
[[[/spair]]]
[[[spair id='3']]]
[[[s id='en3']]]This word has many translations in English such as cynical youth , young nationalists , hysterical youth and angry young men .[[[/s]]]
[[[s id='vn3']]]Từ này có nhiều cách dịch sang tiếng Anh như là thanh niên hoài nghi , thanh niên theo chủ nghĩa dân tộc , thanh niên cuồng loạn và thanh niên tức giận .[[[/s]]]
[[[/spair]]]
[[[spair id='4']]]
....
[[[/text]]]
[[[/doc]]]
| | | Michael Beijer United Kingdom Local time: 23:20 Member (2009) Dutch to English + ... a few options | Sep 16, 2015 |
Extract every line from your file that contains "s id='vn" and save it as a new txt file. This is your Vietnamese half. Extract every line from your file that contains "s id='en" and save it as a new txt file. This is your English half. This can easily be done in EmEditor (using the Filter Toolbar). Combine the two files into a single tab-delimited txt file. There are many ways to do this. I have an EmEditor macro for this, but there is also a little tool in LF Aligner's "grab bag" (available on SourceForge) that can do this. Or just copy and paste them both into a new file. Then use e.g. the open source Heartsome TMX Editor to convert this into a TMX (Tools > Convert to TMX). Or, let me do it for you (for £30/hour)
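For anyone without EmEditor, the same two steps (split into halves, then join as tab-delimited text) can be sketched in Python. This is only a minimal sketch under assumptions: it expects the real file to use < and > (the [[[ and ]]] were only substituted for posting here), it assumes the `s id='enN'` / `s id='vnN'` pattern from the sample holds throughout, and the function names are made up for illustration.

```python
import re

# Matches one sentence element such as <s id='en12'>text</s>.
# Tag and attribute names are taken from the sample in the first post.
S_TAG = re.compile(r"<s id='(en|vn)(\d+)'>(.*?)</s>", re.DOTALL)


def split_halves(sgml_text):
    """Return (english, vietnamese) sentence lists in document order."""
    english, vietnamese = [], []
    for lang, _number, text in S_TAG.findall(sgml_text):
        target = english if lang == "en" else vietnamese
        target.append(text.strip())
    return english, vietnamese


def to_tab_delimited(english, vietnamese):
    """Join the two halves line by line, separated by a tab."""
    if len(english) != len(vietnamese):
        # A mismatch means a sentence is missing on one side.
        raise ValueError("halves differ in length; check the source file")
    return "\n".join(f"{en}\t{vn}" for en, vn in zip(english, vietnamese))


# Hypothetical two-line sample in the real (angle-bracket) notation.
SAMPLE = """<spair id='1'>
<s id='en1'>What is a Fenqing ?</s>
<s id='vn1'>Fenqing là gì ?</s>
</spair>"""
```

Reading the file as UTF-8 (for example `open(path, encoding="utf-8").read()`) and passing the text to `split_halves` should give the two halves; the tab-delimited result can then go through any of the TMX converters mentioned above. Sentences that themselves contain a tab would need extra handling.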
[Edited at 2015-09-16 09:59 GMT] | | | Solution, sort of | Sep 16, 2015 |
You'll have to mark-down it first, I suppose. I used MDEdit for it, but there are dozens of (free) mark-down apps. You'll then have to convert the resulting bitext to TMX, which you can do in CafeTran, probably in other CAT tools as well. I hope somebody comes up with an easier solution, because it looks like it'll need some more editing. Cheers, Hans (who loves problems, other people's problems) | | | MS Word text to table convert | Sep 16, 2015 |
Michael Beijer wrote: Extract every line from your file that contains "s id='vn" and save it as a new txt file. This is your Vietnamese half. Extract every line from your file that contains "s id='en" and save it as a new txt file. This is your Vietnamese [sic. English] half. This can easily be done in EmEditor (using the Filter Toolbar).
I love the simple MS Word text-to-table converter.
1. First, I convert the source (Vietnamese?) part into a table in MS Word.
2. I insert a column to the right, then copy and paste the English text into the cells of that column.
3. I select the table, convert it to text, and save it as text (Unicode encoding).
4. I make a translatable Trados bilingual file (tab-delimited format, with the target preset as confirmed).
5. Then I get the TM or TMX as desired.
Soonthon L. | |
2nl (X) Netherlands Local time: 00:20
Use a good editor that can delete lines that contain a certain search string (e.g. TextWrangler for Mac; I'm positive that similar editors exist for other operating systems). Delete all lines that contain "garbage", so that you keep what's valuable:
[[[doc id='N0001']]]
[[[text]]]
[[[spair id='1']]]
[[[s id='en1']]]What is a Fenqing ?[[[/s]]]
[[[s id='vn1']]]Fenqing là gì ?[[[/s]]]
[[[/spair]]]
[[[spair id='2']]]
[[[s id='en2']]]Fenqing is a Chinese word which literally means " angry youth " .[[[/s]]]
[[[s id='vn2']]]Fenqing là một từ tiếng Hoa mà nghĩa đen là " thanh niên phẫn nộ " .[[[/s]]]
[[[/spair]]]
[[[spair id='3']]]
[[[s id='en3']]]This word has many translations in English such as cynical youth , young nationalists , hysterical youth and angry young men .[[[/s]]]
[[[s id='vn3']]]Từ này có nhiều cách dịch sang tiếng Anh như là thanh niên hoài nghi , thanh niên theo chủ nghĩa dân tộc , thanh niên cuồng loạn và thanh niên tức giận .[[[/s]]]
[[[/spair]]]
[[[spair id='4']]]
....
[[[/text]]]
[[[/doc]]]
Perform many Find and Replace operations to convert the tags between [[[ and ]]] to their "corresponding" TMX tags:
[TMX sample: the tags were not preserved in this copy of the thread; only the segment text survives]
What is a Fenqing ?
Fenqing là gì ?
Fenqing is a Chinese word which literally means " angry youth " .
Fenqing là một từ tiếng Hoa mà nghĩa đen là " thanh niên phẫn nộ " .
This word has many translations in English such as cynical youth , young nationalists , hysterical youth and angry young men .
Từ này có nhiều cách dịch sang tiếng Anh như là thanh niên hoài nghi , thanh niên theo chủ nghĩa dân tộc , thanh niên cuồng loạn và thanh niên tức giận .
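As an alternative to doing the tag conversion by Find and Replace, the TMX can be generated from the sentence pairs directly. The sketch below uses Python's standard `xml.etree.ElementTree` and builds a TMX 1.4 skeleton (`tmx` > `header` + `body` > `tu` > `tuv` > `seg`). It is not the poster's method, and several things in it are assumptions: the function name, the `creationtool` value, and the language codes `en` and `vi` (the corpus itself labels Vietnamese as `vn`; use whatever code your CAT tool expects).

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serialises as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="vi"):
    """Build a TMX 1.4 tree from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sgml2tmx-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "sgml",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text  # escaping is automatic
    return ET.ElementTree(tmx)
```

Writing the result with `build_tmx(pairs).write("corpus.tmx", encoding="utf-8", xml_declaration=True)` should give a file that TMX-aware tools can import; the output file name is just an example.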
[Edited at 2015-09-16 18:08 GMT] | | | Michael Beijer United Kingdom Local time: 23:20 Member (2009) Dutch to English + ... I don't see how this will work. | Sep 16, 2015 |
Meta Arkadia wrote: You'll have to mark-down it first, I suppose. I used MDEdit for it, but there are dozens of (free) mark-down apps. You'll then have to convert the resulting bitext to TMX, which you can do in CafeTran, probably in other CAT tools as well. I hope somebody comes up with an easier solution, because it looks like it'll need some more editing. Cheers, Hans (who loves problems, other people's problems) I don't see how this will work. If you remove all the markup, you have also removed the info you need to convert it into its two languages. The text in your screenshot is all run together. How are you going to turn that into vn-en? Michael | | | I like these puzzles too... | Sep 16, 2015 |
... maybe try this if you have Studio. Create a filetype for this SGML... simple, with two rules:
//s (always translatable)
//* (Don't translate)
Then open the file in Studio and save it. Now you have an SDLXLIFF with source/target repeated in the source column only throughout the file. Now use the SDLXLIFF Converter for MSOffice (installed with Studio since 2011) and convert the SDLXLIFF to Excel. Now you have an Excel file with an ID column, a source column (populated) and a target column (and an empty notes column). Use this formula in the target column:
=IF(ISEVEN(A3),B3,"")
This will look at the ID column (column A) and check whether it's an even number or not. If it is, then it will copy the contents of the cell. If it's an odd number, it puts nothing. Once you have done this, copy the formula down the spreadsheet. Now copy all of column C (target column) and paste as plain text to remove the formulas. Now you have a spreadsheet with every other row containing source on the left and target on the right. So filter on the target column and sort in alphabetical order. Now just delete all the rows with nothing in the target. Now you have a simple spreadsheet you can drag into the Glossary Converter and convert to TMX. Worked nicely, and easily, with your sample text. Regards Paul SDL Community Support | |
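The even/odd pairing that the Excel formula performs can also be sketched outside a spreadsheet. The following is a rough Python illustration only, assuming the exported rows alternate source then target with consecutive IDs (odd ID for English, the next even ID for Vietnamese), as the description above suggests; `pair_alternating` is a made-up name.

```python
def pair_alternating(rows):
    """Turn [(id, text), ...] with alternating source/target rows into pairs.

    Assumes odd IDs carry the source sentence and the following even ID
    carries its translation.
    """
    if len(rows) % 2 != 0:
        raise ValueError("odd number of rows; a sentence has no partner")
    pairs = []
    # Step through the rows two at a time: (source row, target row).
    for (src_id, src_text), (tgt_id, tgt_text) in zip(rows[0::2], rows[1::2]):
        if src_id % 2 != 1 or tgt_id != src_id + 1:
            raise ValueError(f"unexpected IDs {src_id} and {tgt_id}")
        pairs.append((src_text, tgt_text))
    return pairs
```

The ID check is there so that a missing or duplicated row stops the run instead of silently shifting every later pair by one.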
Bitext and regex | Sep 16, 2015 |
Michael Beijer wrote: I don't see how this will work. If you remove all the markup, you have also removed the info you need to convert it into its two languages. The text in your screenshot is all run together. How are you going to turn that into vn-en? It looks like bitext, so the lot should be proceeded by the code. Unfortunately, I don't know how to do that, especially not for Vietnamese. We'll have to ask Andras. SDL Community wrote: This will look at the ID column (column A) and check if's an even number or not. If it is then it will copy the contents in the cell. If it's an odd number it puts nothing. With my first try, I got something similar, if not easier: However, I encountered encoding problems that I didn't want to try to solve because I used Mac-only apps nobody else uses, and my Vietnamese is lousy at that. How did you get rid of all those superfluous spaces? Did you regex them away, or did they disappear automagically? What are they doing there anyway? Cheers, Hans | | |
Meta Arkadia wrote: proceeded by the code. Unfortunately, I don't know how to do that, especially not for Vietnamese ... and if I edit it, it'll have to be vetted again. They don't trust me over here. And right they are. H. | | | gianghl1983 Vietnam Local time: 05:20 English to Vietnamese TOPIC STARTER Thank you all! | Sep 16, 2015 |
Michael Beijer wrote: Extract every line from your file that contains "s id='vn" and save it as a new txt file. This is your Vietnamese half. Extract every line from your file that contains "s id='en" and save it as a new txt file. This is your English half. This can easily be done in EmEditor (using the Filter Toolbar). Convert the two files into a tab-delimited txt file. There are many ways to do this. I have an EmEditor macro for this, but there is also a little tool in LF Aligner's "grab bag" (available on sourceforge) that can do this. Or just copy paste them both into a new file. Then use e.g. the open source Heartsome TMX editor to convert this into a TMX (Tools > Convert to TMX). Or, let me do it for you (for £30/hour) [Edited at 2015-09-16 09:59 GMT]
Thank you all. I followed Michael Beijer's method with EmEditor and could easily extract the text content into separate Vietnamese and English files. Many thanks! | | | I didn't have to ;-) | Sep 16, 2015 |
Meta Arkadia wrote: How did you get rid of all those superfluous spaces? Did you regex them away, or did they disappear automagically? What are they doing there anyway? There weren't any in my Excel file. Finished TM in Studio looks OK too... no encoding issues. Regards Paul SDL Community Support | |
Platary (X) Local time: 00:20 German to French + ... A little Macro... | Sep 16, 2015 |
to clean the DOC:
Sub RemoveTags()
    ' Delete every <...> tag in the active document using a Word wildcard search.
    Dim MyRange As Range
    Set MyRange = ActiveDocument.Range
    With MyRange.Find
        ' In wildcard mode, \< and \> match literal angle brackets
        ' and * matches any run of characters between them.
        Do While .Execute(FindText:="(\<*\>)", _
                MatchWildcards:=True, _
                Wrap:=wdFindStop, Forward:=True) = True
            MyRange.Delete
        Loop
    End With
End Sub
Then you have a "clean" text: just remove (or correct) what you want (if needed). Select the text and convert it into a table. Use one of the tools already mentioned to create a TMX. Done!
What is a Fenqing ?
EVBCorpus
Quoc-Hung Ngo, Werner Winiwarter
Quoc-Hung Ngo, Werner Winiwarter, (2012). "Building an English-Vietnamese Bilingual Corpus for Machine Translation", International Conference on Asian Language Processing 2012 (IALP 2012), pp. 157-160, Ha Noi, Vietnam
What is a Fenqing ?
Fenqing là gì ?
Fenqing is a Chinese word which literally means " angry youth " .
Fenqing là một từ tiếng Hoa mà nghĩa đen là " thanh niên phẫn nộ " .
This word has many translations in English such as cynical youth , young nationalists , hysterical youth and angry young men .
Từ này có nhiều cách dịch sang tiếng Anh như là thanh niên hoài nghi , thanh niên theo chủ nghĩa dân tộc , thanh niên cuồng loạn và thanh niên tức giận .
Regards
[Edited at 2015-09-16 20:07 GMT] | | | Michael Beijer United Kingdom Local time: 23:20 Member (2009) Dutch to English + ... I ♥ EmEditor | Sep 16, 2015 |
gianghl1983 wrote: Michael Beijer wrote: Extract every line from your file that contains " s id='vn" and save it as a new txt file. This is your Vietnamese half. Extract every line from your file that contains " s id='en" and save it as a new txt file. This is your English half. This can easily be done in EmEditor (using the Filter Toolbar). Convert the two files into a tab-delimited txt file. There are many ways to do this. I have an EmEditor macro for this, but there is also a little tool in LF Aligner's "grab bag" (available on sourceforge) that can do this. Or just copy paste them both into a new file. Then use e.g. the open source Heartsome TMX editor to convert this into a TMX ( Tools > Convert to TMX). Or, let me do it for you (for £30/hour) [Edited at 2015-09-16 09:59 GMT] Thank you all, I followed Michael Beijer method with EmEditor and I can easily extract text content separately into Vietnamese and English. Many thanks! Yes, that Filter Toolbar in EmEditor is priceless. Note that it also allows you to filter negatively. So much quicker and easier than messing around with Macros. The whole thing would take maybe 5 minutes in EmEditor. | | | Yes there are | Sep 17, 2015 |
SDL Community wrote: There weren't any in my excel file. Finished TM in Studio looks ok too And I only have those encoding issues in a crazy editor, and I'm sure I can avoid them, but I'll have to find out how.* It's worth the trouble because of all the (AppleScript) goodies. But those spaces are everywhere... Anyway, Michael's solution works. No need to think. Good. *EDIT: Open as *.rtf does the trick. Cheers, Hans
[Edited at 2015-09-17 02:28 GMT]