Pages in topic:   [1 2] >
Counting words in a txt file within quotation marks
Thread poster: Afew
Afew  Identity Verified
Kazakhstan
Local time: 04:48
English to Kazakh
Feb 9, 2012

Hello fellow translators,

I have a txt file with software strings in it to be localized. It looks like:

#command some text "text to be localized" // comment

I want to count the words within quotation marks. Is there any way to do it, except manual counting?

I tried importing the txt file into MS Excel, but the file does not seem to be consistently delimited, so the words I need may end up in different columns.

Any help will be much appreciated.


 
Philip Lees  Identity Verified
Greece
Local time: 01:48
Greek to English
A job for Perl Feb 9, 2012

Give the file to somebody you know who uses the Perl programming language, and ask them to run this:

perl -i.bak -pe "s/^.+?\"//; s/\".+$//" yourfilename

That will remove everything from your file except the parts in quotes (the original file will be renamed as yourfilename.bak). You can then count the words in the new file.

This assumes that all the lines have the same format.
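
For anyone without Perl to hand, the same idea can be sketched in Python instead (a swapped-in language, not the one-liner above). This is only a minimal sketch: it assumes at most one quoted string per line and no escaped quotes inside it, the function names are made up for illustration, and it counts whitespace-separated words directly rather than rewriting the file.

```python
import re

# Text between the first pair of double quotes on a line.
# Assumption: one quoted string per line, no escaped quotes inside it.
QUOTED = re.compile(r'"([^"]*)"')


def quoted_text(line):
    """Return the text inside the first pair of quotes, or None if there is none."""
    match = QUOTED.search(line)
    return match.group(1) if match else None


def count_quoted_words(lines):
    """Count whitespace-separated words inside quotes across all lines."""
    total = 0
    for line in lines:
        text = quoted_text(line)
        if text is not None:
            total += len(text.split())
    return total


sample = [
    '#command some text "text to be localized" // comment',
    '// a line that contains only a comment',
    '#command other text "another string"',
]
print(count_quoted_words(sample))  # 4 words + 0 + 2 words = 6
```

To try it on a real file, an open file object can be passed in place of sample, for example open("yourfilename", encoding="utf-8"); the encoding here is an assumption about the file. Counts from a CAT tool may differ slightly, since such tools split words by their own rules.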


 
Afew  Identity Verified
Kazakhstan
Local time: 04:48
English to Kazakh
TOPIC STARTER
Some strings are different Feb 9, 2012

Thanks Philip,

Unfortunately, some lines contain only comments and there are lines that contain #command... and "text" but no comments.

I was able to count the number of quotation marks in Excel using the COUNTIF function, but that was useless, since some lines have whole sentences in quotation marks.


 
Amit Evron  Identity Verified
Vietnam
Local time: 05:48
Spanish to English
+ ...
Send it over Feb 9, 2012

If it's not confidential and if the file isn't too big, feel free to send it over and I'll write a quick perl script. Shouldn't take more than 5 minutes. Just send me a message through Proz and I'll reply with my e-mail address.

 
Tony M
France
Local time: 00:48
Member
French to English
+ ...
SITE LOCALIZER
Paste into Word Feb 9, 2012

Haven't tested it, but why not try this:

Select all your text and paste it into Word (etc.)

Do a 'replace all' on the " (careful to get the right character!), replacing with (say) Tab

Select all and convert text to table, using the character you replaced above (e.g. Tab) as the delimiter.

This should enable you to get a column that just has your text to be translated in, and you can take it from there

If you have any lines with no " " at all, they should just appear all in the first column.

Theoretically at least, you ought to be able to reverse the process at the end...

One proviso: one has to assume that each line does end with a Return character or similar; if necessary, you might need to go through and replace whatever the end-of-line delimiter is with something that will work in Word for the conversion to table.
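
The same split-on-quotes idea can also be sketched in a few lines of Python rather than Word. This is a rough sketch with hypothetical function names, assuming the quote character is the straight " and that quotes are balanced and never escaped. Splitting a line on " leaves the text outside quotes at even positions and the text inside quotes at odd positions, much like the table columns described above.

```python
def quoted_fields(line):
    """Split on the straight double quote; odd-numbered pieces were inside quotes.

    Assumption: quotes are balanced and never escaped.
    """
    pieces = line.rstrip("\n").split('"')
    return pieces[1::2]


def words_in_quotes(lines):
    """Total whitespace-separated words found inside quotes."""
    return sum(len(field.split()) for line in lines for field in quoted_fields(line))


lines = [
    '#command some text "text to be localized" // comment',
    '// comment only, no quotes',
    'label "OK" hint "Press to confirm"',
]
for line in lines:
    print(quoted_fields(line))
print(words_in_quotes(lines))  # 4 + 0 + 1 + 3 = 8
```

A line with no quotes gives an empty list, which matches the "everything stays in the first column" case.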


 
Philip Lees  Identity Verified
Greece
Local time: 01:48
Greek to English
Should still work Feb 9, 2012

Nurzhan Nagashbekov wrote:
Unfortunately, some lines contain only comments and there are lines that contain #command... and "text" but no comments.


I think my script should still work with a small modification (for the comment only lines), but as Amit has kindly offered to take it on I'm happy to hand over to him.


 
Afew  Identity Verified
Kazakhstan
Local time: 04:48
English to Kazakh
TOPIC STARTER
Thanks for suggestions! Feb 9, 2012

Amit Evron wrote:

If it's not confidential and if the file isn't too big, feel free to send it over and I'll write a quick perl script. Shouldn't take more than 5 minutes. Just send me a message through Proz and I'll reply with my e-mail address.


It is confidential


 
Afew  Identity Verified
Kazakhstan
Local time: 04:48
English to Kazakh
TOPIC STARTER
This may work... Feb 9, 2012

Tony M wrote:

Haven't tested it, but why not try this:

Select all your text and paste it into Word (etc.)

Do a 'replace all' on the " ....



Thanks Tony, I will try your method.


 
Jaroslaw Michalak  Identity Verified
Poland
Local time: 00:48
Member (2004)
English to Polish
SITE LOCALIZER
Okapi Rainbow Feb 9, 2012

I think the best option would be to use Okapi Rainbow, especially if you expect more such work from the client. Basically, it would allow you to extract the text you require (using regular expressions) and then calculate the wordcount.

Trados 2007 also has an option to import text based on regular expressions. You have to use a separate application Filter Settings for this. After the import you just analyze the resulting ttx file as usual.

I realize that having to learn regular expressions might seem daunting, but if you plan to translate such texts it will be a sensible investment of your time...


 
FarkasAndras  Identity Verified
Local time: 00:48
English to Hungarian
+ ...
CAT Feb 9, 2012

I fervently hope that you'll be using a CAT for this job. The localization of SW strings requires strict formatting consistency and there are a lot of repetitions etc., so it's really not the job you'd want to do by typing over the original.
Now, if you do use a CAT, just do the word count there.
Studio has the required capabilities (i.e. you can specify regex rules that separate the translatable text from the rest), and the Studio package also comes with a specialized sw localization tool (Passolo). Of course there are lots of other tools that'll work, too.

The more interesting question is: who is in charge of this project? Isn't there a PM/client who sorts these things out before you get involved?


 
Philip Lees  Identity Verified
Greece
Local time: 01:48
Greek to English
Try this Feb 9, 2012

I had a few minutes to spare, so I set up this:

http://quote.writewords.eu/

If you paste your text in the box and click Submit, it should return you only the stuff that's between quotes.


 
FarkasAndras  Identity Verified
Local time: 00:48
English to Hungarian
+ ...
perl regex Feb 9, 2012

Philip Lees wrote:

Give the file to somebody you know who uses the Perl programming language, and ask them to run this:

perl -i.bak -pe "s/^.+?\"//; s/\".+$//" yourfilename

That will remove everything from your file except the parts in quotes (the original file will be renamed as yourfilename.bak). You can then count the words in the new file.

This assumes that all the lines have the same format.


It also assumes that there is only one pair of quotes in one line and that there are no escaped quotes inside quoted strings. It'll fail with lines like this:
StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"
And it doesn't skip lines that have no translatable content at all.

Also, .+? is better written as .* and the " may very well be the last character on the line so .+$// should be .*$//.

So, I'd rewrite your one-liner as:
perl -i.bak -pe "s/^.*\"(.*)\".*$/$1/" yourfilename

...but this still doesn't handle the problem cases I mentioned above.
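You could do this (untested) to delete lines that don't contain any quoted string. Note that it uses -n with an explicit print rather than -p, because under -p a line skipped with next would still be printed:

perl -i.bak -ne "print if s/^.*\"(.*)\".*$/$1/" yourfilename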

... but the bottom line is, it's still only usable if the input file is "simple". You could add negative lookahead/lookbehind to cater for escaped quotes inside the quoted strings etc. to make it work and then somehow adapt it for multiple strings per line, but it starts to get tricky there, and you need to see the input file (or know its spec) to take a reasonable stab at solving the problem.
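
To make the problem cases concrete, here is a minimal sketch in Python (not Perl, and not tested against any real input file) that handles several quoted strings per line and backslash-escaped quotes inside them. It assumes that a quote inside a string is always escaped with a single backslash; the function name is made up for illustration.

```python
import re

# A quoted string: an opening quote, then any run of "backslash + any character"
# or "a character that is neither a quote nor a backslash", then a closing quote.
# Assumption: quotes inside strings are escaped with a single backslash.
QUOTED = re.compile(r'"((?:\\.|[^"\\])*)"')


def quoted_strings(line):
    """Return every quoted string on the line, with escaped quotes unescaped."""
    return [m.replace('\\"', '"') for m in QUOTED.findall(line)]


line = r'StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"'
strings = quoted_strings(line)
print(strings)
print(sum(len(s.split()) for s in strings))  # 8 words + 1 word = 9
```

Lines without any quoted string simply give an empty list, so they add nothing to the count. Whether this is good enough still depends on the real file format.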

[Edited at 2012-02-09 10:54 GMT]


 
Afew  Identity Verified
Kazakhstan
Local time: 04:48
English to Kazakh
TOPIC STARTER
Initial stage of the project Feb 9, 2012

I am at the very beginning of the project and just wanted to know what the word count is for now. I will definitely try regex. Thanks!

 
Philip Lees  Identity Verified
Greece
Local time: 01:48
Greek to English
Nobody's perfect Feb 9, 2012

FarkasAndras wrote:

It also assumes that there is only one pair of quotes in one line and that there are no escaped quotes inside quoted strings. It'll fail with lines like this:
StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"



Oh, sure, it breaks in lots of cases, as does the simpler match I used on the web version:

/"(.+?)"/

I am well aware of the pitfalls of text parsing, which is why I added the caveat about all lines having the same format as the example provided.

As this is not a Perl or a regex forum, I'll leave it at that.


 
Ambrose Li  Identity Verified
Canada
Local time: 18:48
English
+ ...
simpler perl code Feb 9, 2012

I think this single line of perl should suffice:

perl -nle 'print $1 if /#command\s+"([^"]*)"/'

This assumes that a double quotation mark can't occur inside the pair of double quotation marks that marks the string to be translated. Usually this is not the case, and (assuming that " is escaped with a single backslash) the perl needed will more likely be

perl -nle 'print $1 if /#command\s+"((?:\\"|[^"])*)"/'

Of course, if escaping of quotation marks occurs but is not signalled by backslashes then the perl code needed will be different.

ETA: The above assumes that continuations don’t occur. If continuations do occur the above won’t work and one-liner solutions might not be sufficient…
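
If continuations do occur, one way to cope is to join the continued lines before matching. The sketch below is in Python rather than Perl and rests on assumptions the thread does not confirm: that a line ending in a backslash continues on the next line, that translatable strings sit on #command lines (possibly after some other text), and that quotes inside strings are escaped with a backslash. All names are illustrative only.

```python
import re

# #command, optional non-quote text, then a quoted string in which a quote
# may be escaped with a backslash (assumed format, see the notes above).
COMMAND_STRING = re.compile(r'#command[^"\n]*"((?:\\.|[^"\\])*)"')


def join_continuations(lines):
    """Merge each line that ends in a backslash with the line after it."""
    logical, pending = [], ""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash and keep collecting
        else:
            logical.append(pending + line)
            pending = ""
    if pending:  # input ended in the middle of a continuation
        logical.append(pending)
    return logical


def command_strings(lines):
    """Return the quoted string of every #command line; other lines are skipped."""
    found = []
    for line in join_continuations(lines):
        match = COMMAND_STRING.search(line)
        if match:
            found.append(match.group(1).replace('\\"', '"'))
    return found


sample = [
    '#command title "Open file"',
    '// see the "Open" dialog (comment only, not counted)',
    '#command hint "Press \\"Browse\\" to \\',
    'pick a file" // comment',
]
strings = command_strings(sample)
print(strings)
print(sum(len(s.split()) for s in strings))  # 2 words + 6 words = 8
```

Anchoring on #command means that quotes inside comment-only lines are ignored, which the plain quote-matching approaches cannot do.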

[Edited at 2012-02-09 19:03 GMT]


 