PHP6, Unicode and TextIterator features
I’ve just installed the latest PHP6 dev version and decided to test its famous new feature, PHP Unicode support. I won’t explain anything new about PHP6, Unicode or TextIterator; these are just the things I discovered while testing these features.
So the first thing to do is to enable PHP6 Unicode in the php.ini file.
    ;;;;;;;;;;;;;;;;;;;;
    ; Unicode settings ;
    ;;;;;;;;;;;;;;;;;;;;
    unicode.semantics = on
    unicode.runtime_encoding = utf-8
    unicode.script_encoding = utf-8
    unicode.output_encoding = utf-8
    unicode.from_error_mode = U_INVALID_SUBSTITUTE
    unicode.from_error_subst_char = 3f
Because I’m French, I regularly run into problems with string manipulation because of accented characters.
So my first test was the strlen function with Unicode enabled…
    $word = "être";
    echo "Length: ".strlen($word);
WONDERFUL, we get this: “Length: 4”. It put a smile on my face… but it was just the beginning!
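For comparison, here is a minimal sketch of the same check without unicode.semantics, on a released PHP version. It assumes the script is saved as UTF-8 and that the mbstring extension is loaded (both are assumptions, not part of the PHP6 test above). In that setup strlen counts bytes, so the accented letter counts twice, while mb_strlen counts characters:

```php
<?php
// Assumptions: UTF-8 source file, mbstring extension available.
$word = "être";

$byteLength = strlen($word);             // counts bytes: "ê" takes 2 bytes in UTF-8
$charLength = mb_strlen($word, 'UTF-8'); // counts characters

echo "Bytes: " . $byteLength . "\n";
echo "Characters: " . $charLength . "\n";
```

This is the kind of difference that makes accents painful when the string functions are not Unicode-aware.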
My second test was with TextIterator, a new SPL iterator for PHP6…
    $word = "être";
    foreach (new TextIterator($word, TextIterator::CHARACTER) as $character) {
        var_inspect($character);
    }
The output:
    unicode(1) “ê” { 00ea }
    unicode(1) “t” { 0074 }
    unicode(1) “r” { 0072 }
    unicode(1) “e” { 0065 }
This looks great: we will be able to get the Unicode value of each character, and much more…
Characters look great, but words look even better…
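A rough sketch of how similar output can be produced without TextIterator or var_inspect, assuming UTF-8 input and the mbstring extension. The helper name inspect_characters is hypothetical, and it splits on code points, which may differ from what TextIterator::CHARACTER does for combining sequences:

```php
<?php
// Hypothetical helper (not a PHP6 function): returns each character of a
// UTF-8 string together with its UTF-16 code units in hex.
function inspect_characters(string $text): array
{
    $result = [];
    // The "u" modifier makes PCRE split on UTF-8 characters, not bytes.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $character) {
        $utf16 = mb_convert_encoding($character, 'UTF-16BE', 'UTF-8');
        $result[] = [$character, bin2hex($utf16)];
    }
    return $result;
}

foreach (inspect_characters("être") as [$character, $hex]) {
    echo $character . " { " . $hex . " }\n";
}
```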
    $sentences = "Bonjour, nous sommes Français ! Aïe :)";
    foreach (new TextIterator($sentences, TextIterator::WORD) as $word) {
        var_inspect($word);
    }
And we get this:
    unicode(7) “Bonjour” { 0042 006f 006e 006a 006f 0075 0072 }
    unicode(1) “,” { 002c }
    unicode(1) “ ” { 0020 }
    unicode(4) “nous” { 006e 006f 0075 0073 }
    unicode(1) “ ” { 0020 }
    unicode(6) “sommes” { 0073 006f 006d 006d 0065 0073 }
    unicode(1) “ ” { 0020 }
    unicode(8) “Français” { 0046 0072 0061 006e 00e7 0061 0069 0073 }
    unicode(1) “ ” { 0020 }
    unicode(1) “!” { 0021 }
    unicode(1) “ ” { 0020 }
OK, spaces and punctuation are treated as words too… why not; maybe there is a constant in TextIterator to get only real words…
When we call var_inspect on a character or a word, we get the encoding, the number of characters and the Unicode value of each of them. So if we do this:
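Until such a constant turns up, the filtering can be done by hand. A minimal sketch, assuming a "real" word is any token containing at least one Unicode letter or digit (that definition and the helper name is_real_word are assumptions, not TextIterator features):

```php
<?php
// Hypothetical helper: a token counts as a "real" word if it contains
// at least one Unicode letter (\p{L}) or digit (\p{N}).
function is_real_word(string $token): bool
{
    return preg_match('/[\p{L}\p{N}]/u', $token) === 1;
}

// Tokens copied from the word-iteration output above.
$tokens = ["Bonjour", ",", " ", "nous", " ", "sommes", " ", "Français", " ", "!", " "];

// array_values reindexes the result after filtering.
$words = array_values(array_filter($tokens, 'is_real_word'));

echo implode("|", $words) . "\n";
```

The same test could be applied inside the foreach loop over the TextIterator, skipping every token that fails it.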
echo "\u0046\u0072\u0061\u006e\u00e7\u0061\u0069\u0073";
we get this: “Français”. Wonderful, once again.
String manipulation is one thing, but PHP 6 also enables sentence manipulation with TextIterator!
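On released PHP versions (7.0 and later) the closest equivalent is the braced \u{...} escape, which produces UTF-8 bytes in a double-quoted string. A small sketch under that assumption, not a PHP6 feature:

```php
<?php
// PHP 7.0+ code point escapes: each \u{XXXX} becomes the UTF-8 encoding
// of that code point inside a double-quoted string.
$text = "\u{0046}\u{0072}\u{0061}\u{006e}\u{00e7}\u{0061}\u{0069}\u{0073}";

echo $text . "\n";
```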
    $sentences = "Bonjour, nous sommes Français";
    $word_break = new TextIterator($sentences, TextIterator::WORD);
For the last word:
    $word_break->preceding($word_break->last());
    echo $word_break->current();
The first:
    $word_break->first();
    echo $word_break->current();
And the third word of the sentence, for example:
    $word_break->first();
    $word_break->next(3);
    echo $word_break->current();
To close this first look at PHP6 and Unicode, I wanted to test one of the features I saw at a PHP conference in Paris: str_transliterate. This function gives you a version of your text written in a different script, by mapping the sounds of the letters from one script to another.
    $name = "Antoine Ughetto";
    $jap = str_transliterate($name, 'Latin', 'Katakana');
    echo str_transliterate($jap, 'Any', 'Latin');
Oh yeah, my name in Japanese (アントイネ ウグヘット) sounds like “antoine uguhetto”.
All of this was very interesting (for me), but it is not really easy to test features without the PHP documentation (thankfully there is ReflectionClass).
Thanks to Andrei Zmievski for his blog posts, which helped me with my tests…
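A comparable round trip can be sketched with the Transliterator class from the intl extension, which is built on ICU. This assumes intl is loaded and that the ICU IDs Latin-Katakana and Any-Latin are available; the helper name katakana_round_trip is hypothetical, and the exact output may differ from str_transliterate:

```php
<?php
// Hypothetical helper: converts a Latin-script name to Katakana and back.
// Returns [katakana, latin], or null when intl/ICU cannot do the conversion.
function katakana_round_trip(string $name): ?array
{
    if (!class_exists('Transliterator')) {
        return null; // intl extension not loaded
    }
    $toKatakana = Transliterator::create('Latin-Katakana');
    $toLatin = Transliterator::create('Any-Latin');
    if ($toKatakana === null || $toLatin === null) {
        return null; // transliterator ID not known to this ICU build
    }
    $katakana = $toKatakana->transliterate($name);
    if ($katakana === false) {
        return null;
    }
    $latin = $toLatin->transliterate($katakana);
    if ($latin === false) {
        return null;
    }
    return [$katakana, $latin];
}

$result = katakana_round_trip("Antoine Ughetto");
if ($result !== null) {
    echo $result[0] . " sounds like " . $result[1] . "\n";
}
```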

