VASmalltalk – ICU: Searching

Searching for substrings within strings is perhaps one of the most wanted features in computer systems. Searching in UTF-8 or UTF-16 text while respecting collation rules or locales can be pretty difficult.

With version 0.0.30 I added more wrappers around the search API, and you can now search Unicode strings pretty easily.

Perhaps the simplest search code looks like the first example below. It shows a typical Smalltalk search statement, but now as a Unicode search using the current MSKLocale (on my machine “de_DE”):

| text search |
	
text := 'The quick brown fox jumped over the lazy fox'.
search := 'fox'.
	
self assert: (text asUnicodeString indexOf: search startingAt: 1) = 17.
self assert: (text asUnicodeString indexOf: search startingAt: 18) = 42	

In addition, you may search for all occurrences of the search string by executing:

'The quick brown fox jumped over the lazy fox' 
       asUnicodeString allMatchesOf: 'fox'

which returns a collection of match entries. Each match entry is an array holding the found index, the matched size and the matched text.
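
As a minimal sketch of how such a result could be consumed, the following loop prints the three slots of each match entry to the Transcript. It assumes that the returned collection understands #do: and that each entry answers #at: for slots 1 to 3 (index, size, text) in the order described above; the variable name matches is only illustrative.

```smalltalk
| matches |

matches := 'The quick brown fox jumped over the lazy fox'
	asUnicodeString allMatchesOf: 'fox'.

"each entry: slot 1 = found index, slot 2 = matched size, slot 3 = matched text"
matches do: [:each |
	Transcript
		show: 'index: ', (each at: 1) printString;
		show: ' size: ', (each at: 2) printString;
		show: ' text: ', (each at: 3) printString;
		cr].
```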

Special needs in comparing (sorting) and searching always lead one to instances of UCollator. In a previous posting I showed a comparison example using the German words “bücher” and “buecher”. Both are equal, IF you want to sort via DIN 5007-2. Considering this problem in the #allMatchesOf: method brings us to the following test case:

| text search result anUCollator compare1 compare2 |
	
text := 'The quick brown bücher jumped over the lazy buecher'.
search := 'bücher'.
	
result := text asUnicodeString 
		allMatchesOf: search 
		using: (anUCollator := UCollator collatorForDIN500702) 
		closing: false.

self assert: (result size = 2).

"here we compare the texts found with the search text"
compare1 := (result first at: 3) equalTo: search using: anUCollator.
compare2 := (result last at: 3) equalTo: search using: anUCollator.
	
anUCollator close.

"compare found indexes"
self assert: ((result first at: 1) = 17).
self assert: ((result last at: 1) = 45).

"the texts found do NOT have the same size"
self assert: ((result first at: 2) = 6).
self assert: ((result last at: 2) = 7).

"and the texts themselves are only equal due to DIN ..."
self assert: (compare1).
self assert: (compare2).	
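
The test case above closes the collator by hand, so the close would be skipped if an error occurred before that line. A hedged variant could wrap the search in the standard #ensure: block protocol. This is only a sketch: it assumes that closing: false leaves the collator open for the caller, as the test case suggests by using and closing it afterwards.

```smalltalk
| text result anUCollator |

text := 'The quick brown bücher jumped over the lazy buecher'.
anUCollator := UCollator collatorForDIN500702.

"ensure: runs the closing block even if the search signals an error"
[result := text asUnicodeString
	allMatchesOf: 'bücher'
	using: anUCollator
	closing: false]
		ensure: [anUCollator close].

self assert: (result size = 2).
```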