VASmalltalk – ICU and CharsetDetection

I have not posted anything about ICU for a while. In the meantime I have switched to ICU 49, and most of the tests are green. Some changed from yellow to green; some stayed yellow because I would expect a different result, but the community sees it differently.

In this post I would like to show how to do charset detection for a given string. To summarize: the results are only hints, each with a confidence value from 0 to 100. In addition to the charset, the system also tries to detect the language of the text. Sometimes that works pretty well, and it is said that longer texts give more accurate language guesses.

I only want to show the high-level Smalltalk API here:

| results aStream |
"Ask ICU for all charsets that could match the string."
results :=
    'Hello Marten. This is an english text, but I do not tell you this' icuAllMatchingCharsets.

"Print one match per line and show the result on the Transcript."
aStream := WriteStream on: String new.
results do: [ :each | each printOn: aStream. aStream cr ].
Transcript show: aStream contents

gives the result set:

CharsetMatch (language=[en],  name = [ISO-8859-1], confidence = [51])
CharsetMatch (language=[hu],  name = [ISO-8859-2], confidence = [46])
CharsetMatch (language=[tr],  name = [ISO-8859-9], confidence = [18])
CharsetMatch (language=[],  name = [UTF-8], confidence = [10])
CharsetMatch (language=[ja],  name = [Shift_JIS], confidence = [10])
CharsetMatch (language=[zh],  name = [GB18030], confidence = [10])
CharsetMatch (language=[ja],  name = [EUC-JP], confidence = [10])
CharsetMatch (language=[ko],  name = [EUC-KR], confidence = [10])
CharsetMatch (language=[zh],  name = [Big5], confidence = [10])
CharsetMatch (language=[ar],  name = [IBM420_ltr], confidence = [4])

And if we want to check a German text:

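| results aStream |
"Same call as before, this time with a German sample."
results :=
    'Hallo Marten. Das ist ein deutscher Text' icuAllMatchingCharsets.

"Print one match per line and show the result on the Transcript."
aStream := WriteStream on: String new.
results do: [ :each | each printOn: aStream. aStream cr ].
Transcript show: aStream contents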

gives the result set:

CharsetMatch (language=[de],  name = [ISO-8859-1], confidence = [97])
CharsetMatch (language=[tr],  name = [ISO-8859-9], confidence = [45])
CharsetMatch (language=[hu],  name = [ISO-8859-2], confidence = [30])
CharsetMatch (language=[],  name = [UTF-8], confidence = [10])
CharsetMatch (language=[ja],  name = [Shift_JIS], confidence = [10])
CharsetMatch (language=[zh],  name = [GB18030], confidence = [10])
CharsetMatch (language=[ja],  name = [EUC-JP], confidence = [10])
CharsetMatch (language=[ko],  name = [EUC-KR], confidence = [10])
CharsetMatch (language=[zh],  name = [Big5], confidence = [10])
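
Because the matches are only hints, a caller will usually want to pick the most confident one and decide whether it is good enough to act on. The following is a minimal sketch of that idea, not part of the shown API: it assumes that a CharsetMatch answers the hypothetical accessors #confidence (a number) and #name (a String), since the post only shows the printed form. The threshold of 50 is an arbitrary value chosen for illustration.

```smalltalk
"Sketch only: #confidence and #name are ASSUMED accessors on CharsetMatch.
 Check the class in MSKICUApp for the real selectors."
| results best |
results := 'Hallo Marten. Das ist ein deutscher Text' icuAllMatchingCharsets.

"Find the match with the highest confidence without relying on result order."
best := results
    inject: nil
    into: [ :top :each |
        (top isNil or: [ each confidence > top confidence ])
            ifTrue: [ each ]
            ifFalse: [ top ] ].

"Only trust the guess above an arbitrary threshold (50 here)."
(best notNil and: [ best confidence >= 50 ])
    ifTrue: [ Transcript show: 'Probably ', best name; cr ]
    ifFalse: [ Transcript show: 'No confident charset guess'; cr ]
```

A threshold like this matters for cases such as the English sample above, where the top two candidates (51 and 46) are close together and the guess should be treated with more caution than the German one.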

This service is available with version 08.05.01-49.01.02 of MSKICUApp.
