Counting words in a txt file within quotation marks (Translator resources)


Thread poster: Afew

Afew

Kazakhstan
Local time: 04:48
English to Kazakh

Feb 9, 2012

Hello fellow translators,

I have a txt file with software strings in it to be localized. It looks like:

#command some text "text to be localized" // comment

I want to count the words within quotation marks. Is there any way to do it other than counting manually?

I tried importing the txt file into MS Excel, but it seems the file is not consistently delimited, so the words I need end up in different columns.

Any help will be much appreciated.

Philip Lees

Greece
Local time: 01:48
Greek to English

A job for Perl

Feb 9, 2012

Give the file to somebody you know who uses the Perl programming language, and ask them to run this:

perl -i.bak -pe "s/^.+?\"//; s/\".+$//" yourfilename

That will remove everything from your file except the parts in quotes (the original file will be renamed as yourfilename.bak). You can then count the words in the new file.

This assumes that all the lines have the same format.
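For readers without Perl at hand, here is a rough Python sketch of what the two substitutions above do, followed by a word count. The function name and the sample line are illustrative only, and the sketch assumes one quoted string per line with some text after the closing quote.

```python
import re

def text_in_first_quotes(line):
    """Mimic the two Perl substitutions on a single line (no trailing newline)."""
    # Drop everything up to and including the first quotation mark.
    line = re.sub(r'^.+?"', '', line, count=1)
    # Drop the next quotation mark and everything after it.
    line = re.sub(r'".+$', '', line, count=1)
    return line

sample = '#command some text "text to be localized" // comment'
extracted = text_in_first_quotes(sample)
word_count = len(extracted.split())
print(extracted, word_count)
```

A line with no quotation marks at all passes through unchanged, so comment-only lines would inflate the count unless they are filtered out first.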

Afew

Kazakhstan
Local time: 04:48
English to Kazakh

TOPIC STARTER

Some strings are different

Feb 9, 2012

Thanks Philip,

Unfortunately, some lines contain only comments and there are lines that contain #command... and "text" but no comments.

I was able to count the number of quotation marks in Excel using the COUNTIF function, but it was useless, since there are lines with whole sentences in quotation marks.

Amit Evron

Vietnam
Local time: 05:48
Spanish to English
+ ...

Send it over

Feb 9, 2012

If it's not confidential and if the file isn't too big, feel free to send it over and I'll write a quick perl script. Shouldn't take more than 5 minutes. Just send me a message through Proz and I'll reply with my e-mail address.

Tony M
France
Local time: 00:48
Member
French to English
+ ...

SITE LOCALIZER

Paste into Word

Feb 9, 2012

Haven't tested it, but why not try this:

Select all your text and paste it into Word (etc.)

Do a 'replace all' on the " (careful to get the right character!), replacing with (say) Tab

Select all and convert text to table, using the character you replaced above (e.g. Tab) as the delimiter.

This should enable you to get a column that contains just the text to be translated, and you can take it from there.

If you have any lines with no " " at all, they should just appear all in the first column.

Theoretically at least, you ought to be able to reverse the process at the end...

One proviso: one has to assume that each line does end with a Return character or similar; if necessary, you might need to go through and replace whatever the end-of-line delimiter is with something that will work in Word for the conversion to table.
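The same "treat the quotation mark as a delimiter" idea can be sketched in a few lines of Python. This is only an illustration: the function name and sample lines are made up, and it assumes plain straight quotes with no escaped quotes inside the strings.

```python
def quoted_fields(line):
    """Split on the quotation mark and keep the pieces that sit between a pair."""
    parts = line.split('"')
    # Pieces at odd positions are inside quotes; stopping before the last
    # piece ignores text after an unmatched opening quote.
    return parts[1:len(parts) - 1:2]

lines = [
    '#command some text "text to be localized" // comment',
    '// comment only',
    '#command other "Open file"',
]
total = sum(len(field.split()) for line in lines for field in quoted_fields(line))
print(total)
```

Lines without any quotation marks contribute nothing, which matches the "everything lands in the first column" behaviour of the table conversion.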

Philip Lees

Greece
Local time: 01:48
Greek to English

Should still work

Feb 9, 2012

Nurzhan Nagashbekov wrote:
Unfortunately, some lines contain only comments and there are lines that contain #command... and "text" but no comments.

I think my script should still work with a small modification (for the comment-only lines), but as Amit has kindly offered to take it on, I'm happy to hand over to him.

Afew

Kazakhstan
Local time: 04:48
English to Kazakh

TOPIC STARTER

Thanks for suggestions!

Feb 9, 2012

Amit Evron wrote:

If it's not confidential and if the file isn't too big, feel free to send it over and I'll write a quick perl script. Shouldn't take more than 5 minutes. Just send me a message through Proz and I'll reply with my e-mail address.

It is confidential

Afew

Kazakhstan
Local time: 04:48
English to Kazakh

TOPIC STARTER

This may work...

Feb 9, 2012

Tony M wrote:

Haven't tested it, but why not try this:

Select all your text and paste it into Word (etc.)

Do a 'replace all' on the " ....

Thanks Tony, I will try your method.

Jaroslaw Michalak

Poland
Local time: 00:48
Member (2004)
English to Polish

SITE LOCALIZER

Okapi Rainbow

Feb 9, 2012

I think the best option would be to use Okapi Rainbow, especially if you expect more such work from the client. Basically, it would allow you to extract the text you require (using regular expressions) and then calculate the word count.

Trados 2007 also has an option to import text based on regular expressions. You have to use a separate application Filter Settings for this. After the import you just analyze the resulting ttx file as usual.

I realize that having to learn...

FarkasAndras

Local time: 00:48
English to Hungarian
+ ...

CAT

Feb 9, 2012

I fervently hope that you'll be using a CAT tool for this job. The localization of SW strings requires strict formatting consistency and there are a lot of repetitions etc., so it's really not the job you'd want to do by typing over the original.
Now, if you do use a CAT tool, just do the word count there.
Studio has the required capabilities (i.e. you can specify regex rules that separate the translatable text from the rest), and the Studio package also comes with a specialized sw localization tool (Passolo). Of course there are lots of other tools that'll work, too.

The more interesting question is: who is in charge of this project? Isn't there a PM/client who sorts these things out before you get involved?

Philip Lees

Greece
Local time: 01:48
Greek to English

Try this

Feb 9, 2012

I had a few minutes to spare, so I set up this:

http://quote.writewords.eu/

If you paste your text in the box and click Submit, it should return you only the stuff that's between quotes.

FarkasAndras

Local time: 00:48
English to Hungarian
+ ...

perl regex

Feb 9, 2012

Philip Lees wrote:

Give the file to somebody you know who uses the Perl programming language, and ask them to run this:

perl -i.bak -pe "s/^.+?\"//; s/\".+$//" yourfilename

That will remove everything from your file except the parts in quotes (the original file will be renamed as yourfilename.bak). You can then count the words in the new file.

This assumes that all the lines have the same format.

It also assumes that there is only one pair of quotes in one line and that there are no escaped quotes inside quoted strings. It'll fail with lines like this:
StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"
And it doesn't skip lines that have no translatable content at all.

Also, .+? is better written as .* and the " may very well be the last character on the line so .+$// should be .*$//.

So, I'd rewrite your one-liner as:
perl -i.bak -pe "s/^.*\"(.*)\".*$/$1/" yourfilename

...but this still doesn't handle the problem cases I mentioned above.
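You could do this (untested) to drop lines that don't contain any quoted string. Note the switch from -p to -n: under -p, a line skipped with next is still printed unchanged, so it is safer to print only when the substitution succeeds:

perl -i.bak -ne "print if s/^.*\"(.*)\".*$/$1/" yourfilename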

... but the bottom line is, it's still only usable if the input file is "simple". You could add negative lookahead/lookbehind to cater for escaped quotes inside the quoted strings etc. to make it work and then somehow adapt it for multiple strings per line, but it starts to get tricky there, and you need to see the input file (or know its spec) to take a reasonable stab at solving the problem.
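As a hedged illustration of the "escaped quotes and several strings per line" case, here is a small Python sketch run on the example line above. It assumes that a quote inside a string is always escaped with a single backslash; the names are hypothetical and the real file format may well differ.

```python
import re

# Opening quote, then any run of backslash-escaped characters or characters
# that are neither a quote nor a backslash, then the closing quote.
QUOTED = re.compile(r'"((?:\\.|[^"\\])*)"')

def quoted_strings(line):
    """Return every quoted string on the line, with \\" turned back into "."""
    return [s.replace('\\"', '"') for s in QUOTED.findall(line)]

line = r'StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"'
strings = quoted_strings(line)
words = sum(len(s.split()) for s in strings)
print(strings, words)
```

Both strings are picked up, and the escaped quotes stay inside the first one instead of cutting it short.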

[Edited at 2012-02-09 10:54 GMT]

Afew

Kazakhstan
Local time: 04:48
English to Kazakh

TOPIC STARTER

Initial stage of the project

Feb 9, 2012

I am at the very beginning of the project and just wanted to know what the word count is for now. I will definitely try regex. Thanks!

Philip Lees

Greece
Local time: 01:48
Greek to English

Nobody's perfect

Feb 9, 2012

FarkasAndras wrote:

It also assumes that there is only one pair of quotes in one line and that there are no escaped quotes inside quoted strings. It'll fail with lines like this:
StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"

Oh, sure, it breaks in lots of cases, as does the simpler match I used on the web version:

/"(.+?)"/

I am well aware of the pitfalls of text parsing, which is why I added the caveat about all lines having the same format as the example provided.

As this is not a Perl or a regex forum, I'll leave it at that.
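For anyone curious how the simple match behaves on the problem line quoted above, this short Python sketch applies the same pattern. It is purely illustrative and says nothing about how the web version is actually implemented.

```python
import re

line = r'StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"'

# Same idea as /"(.+?)"/: grab the shortest run between two quotation marks.
fragments = re.findall(r'"(.+?)"', line)
print(fragments)
```

The first string is cut at the escaped quote, and the word Browse inside it is lost because it falls between two matches.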

Ambrose Li

Canada
Local time: 18:48
English
+ ...

Simpler Perl code

Feb 9, 2012

I think this single line of perl should suffice:

perl -nle 'print $1 if /#command\s+"([^"]*)"/'

This assumes that a double quotation mark can’t occur inside the pair of double quotation marks that delimits the string to be translated. Usually this is not the case, and (assuming that " is escaped with a single backslash) the Perl needed will more likely be

perl -nle 'print $1 if /#command\s+"((?:\\"|[^"])*)"/'

Of course, if escaping of quotation marks occurs but is not signalled by backslashes then the perl code needed will be different.

ETA: The above assumes that continuations don’t occur. If continuations do occur the above won’t work and one-liner solutions might not be sufficient…

[Edited at 2012-02-09 19:03 GMT]
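A Python sketch along the same lines, going straight to a word count, might look like this. It rests on several assumptions that are not confirmed by the thread: translatable lines start with #command, other text may sit between #command and the opening quote (as in the sample line from the first post), quotes inside a string are escaped with a backslash, and there are no continuation lines.

```python
import re

# Assumed format: '#command', optional unquoted text, then the quoted string.
PATTERN = re.compile(r'^#command\b[^"]*"((?:\\"|[^"])*)"')

def count_words(lines):
    """Total the words in the first quoted string of every #command line."""
    total = 0
    for line in lines:
        match = PATTERN.match(line)
        if match:
            total += len(match.group(1).split())
    return total

sample = [
    '#command some text "text to be localized" // comment',
    '// a comment mentioning "something quoted"',
    '#command other "Save \\"as\\" copy"',
]
print(count_words(sample))
```

Because the pattern is anchored on #command, quoted text that appears only in comments is ignored. To run it on a real file, pass an open file object, for example count_words(open("strings.txt", encoding="utf-8")), where strings.txt is a placeholder name.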
