StringUtils / Character sets

Marc Bennewitz (private)
Hi all,

I wrote a test implementation that makes it simpler to work with
different character sets using iconv, mbstring, or native code.

The goal was to let users decide which PHP extension to use to
handle different character sets.
The test implementation is a very simple, fast, and extensible wrapper
for iconv/mbstring/native.
(https://github.com/marc-mabe/zf2/blob/string/library/Zend/Stdlib/StringUtils.php)
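
For readers who do not follow the link, here is a minimal sketch of what such a wrapper could look like. It is not the code from the linked StringUtils.php: the interface, the class names, and the pickAdapter() function are hypothetical and only illustrate the idea of one small API in front of mbstring, iconv, and native string functions.

```php
<?php
// Hypothetical sketch; names and structure are assumptions, not the linked API.
interface StringAdapter
{
    public function strlen(string $str): int;
}

final class MbStringAdapter implements StringAdapter
{
    private string $charset;

    public function __construct(string $charset)
    {
        $this->charset = $charset;
    }

    public function strlen(string $str): int
    {
        return mb_strlen($str, $this->charset);
    }
}

final class IconvAdapter implements StringAdapter
{
    private string $charset;

    public function __construct(string $charset)
    {
        $this->charset = $charset;
    }

    public function strlen(string $str): int
    {
        $length = iconv_strlen($str, $this->charset);
        if ($length === false) {
            throw new RuntimeException("iconv cannot handle charset {$this->charset}");
        }
        return $length;
    }
}

final class NativeAdapter implements StringAdapter
{
    // Counts bytes, so it is only correct for single-byte charsets.
    public function strlen(string $str): int
    {
        return strlen($str);
    }
}

// Pick the first usable backend; the preference order is an assumption.
function pickAdapter(string $charset): StringAdapter
{
    if (extension_loaded('mbstring')) {
        return new MbStringAdapter($charset);
    }
    if (extension_loaded('iconv')) {
        return new IconvAdapter($charset);
    }
    if (strtoupper($charset) === 'ASCII') {
        return new NativeAdapter();
    }
    throw new RuntimeException("No adapter available for charset {$charset}");
}
```

With mbstring or iconv loaded, pickAdapter('UTF-8')->strlen("ß") counts one character, whereas the native strlen() counts the two bytes of the UTF-8 encoding.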

I also wrote a simple benchmark that shows the mbstring adapter is faster
than iconv, even when wrapped in an adapter:
(https://gist.github.com/2938899)

$ php stringutils-bench.php
native (▒): 0.0067460536956787
NativeAdapter (): 0.035496950149536
IconvNative (ß): 0.03082799911499
IconvAdapter (ß): 0.038977146148682
MbStringNative (ß): 0.0065720081329346
MbStringAdapter (ß): 0.010815858840942
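
The gist is not reproduced here. As a rough sketch of how such a comparison could be timed, assuming a simple loop around microtime(): the bench() helper, the iteration count, and the choice of strlen-style calls are assumptions and may differ from the actual script. Note that calling through a closure adds overhead of its own, much like the adapter layer does.

```php
<?php
// Hypothetical timing helper; not the code from the gist.
function bench(string $label, callable $fn, int $iterations = 100000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    $elapsed = microtime(true) - $start;
    printf("%-10s %.6f s\n", $label, $elapsed);
    return $elapsed;
}

$str = "\xC3\x9F"; // "ß" encoded as UTF-8

// Native strlen() counts bytes, so it is fast but wrong for UTF-8.
bench('native', function () use ($str) {
    return strlen($str);
});

if (extension_loaded('iconv')) {
    bench('iconv', function () use ($str) {
        return iconv_strlen($str, 'UTF-8');
    });
}

if (extension_loaded('mbstring')) {
    bench('mbstring', function () use ($str) {
        return mb_strlen($str, 'UTF-8');
    });
}
```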


Example 1:
$stringAdapter = StringUtils::getAdapterByCharset("UTF-8");
$stringAdapter->strlen("ß");
// ...

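Example 2 (fall back to ASCII for single-byte charsets):
try {
    $length = StringUtils::getAdapterByCharset($charset)->strlen($str);
} catch (Exception $e) {
    if (!StringUtils::isSingleByteCharset($charset)) {
        throw $e; // no safe fallback for multibyte charsets
    }
    // Single-byte charset: the byte count equals the character count
    $length = StringUtils::getAdapterByCharset("ASCII")->strlen($str);
}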
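
The fallback in Example 2 relies on the fact that in a single-byte charset every character occupies exactly one byte, so a byte-counting adapter gives the right length. A small sketch of that reasoning follows. The function isSingleByteCharsetSketch() and its charset list are hypothetical and incomplete; they are not the implementation behind StringUtils::isSingleByteCharset().

```php
<?php
// Hypothetical, incomplete lookup; the real list would be longer.
function isSingleByteCharsetSketch(string $charset): bool
{
    static $singleByte = [
        'ASCII', 'US-ASCII', 'ISO-8859-1', 'ISO-8859-15',
        'CP1252', 'WINDOWS-1252', 'KOI8-R',
    ];
    return in_array(strtoupper($charset), $singleByte, true);
}

$latin1 = "\xDF";     // "ß" in ISO-8859-1: one byte, one character
$utf8   = "\xC3\x9F"; // "ß" in UTF-8: two bytes, one character

var_dump(strlen($latin1)); // int(1), byte count matches character count
var_dump(strlen($utf8));   // int(2), byte count overstates the length
```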

What do you think about this?

Greetings
Marc

Re: StringUtils / Character sets

DeNix
Hi Marc,
I think you did a great job, but now it looks like a full-featured
component, so I'm not sure it belongs in Stdlib anymore.
Whether or not this component is accepted, I think there should be a
unified way to work with multibyte strings throughout ZF2.

Denis

On 16.06.2012 2:02, Marc Bennewitz wrote:

> Hi all,
> [...]


Re: StringUtils / Character sets

jeremiah
This post has NOT been accepted by the mailing list yet.
In reply to this post by Marc Bennewitz (private)
On Jun 15, 2012, at 3:03 PM, Marc Bennewitz (private) [via Zend Framework Community] wrote:

> Hi all,
>
> I did some test implementation working more simple with different
> character sets using iconv / mbstring and native code.

I am just relaying this comment for a colleague who is not on the list. This is outside my expertise. Here is his comment:

Performance is much less important for handling UTF-8 than knowing the limitations, especially those of mbstring. mbstring is faster because it only handles a small set of European languages plus some common Japanese characters. And why are you not comparing against the gold standard, which is the intl extension?

Re: StringUtils / Character sets

Marc Bennewitz (private)
In reply to this post by DeNix
So far I have only done some tests to get a very fast and extensible API.
What it should be named and where it should belong can be debated later ;)

@jeremiah
Your post hasn't been accepted. I only noticed your comment via Nabble.

> I am just relaying this comment for a colleague who is not on the
> list. This is outside my expertise. Here is his comment:
>
> Performance is much less important for handling UTF-8 than knowing the
> limitations, especially those of mbstring. mbstring is faster because it
> only handles a small set of European languages plus some common Japanese
> characters. And why are you not comparing against the gold standard,
> which is the intl extension?

Performance is very important, because if you have to handle
different character sets you need to wrap each string function, unless
you want to hard-code a single extension.
The intl extension is for internationalization; that's not the same as
working with different character sets.
The mbstring extension doesn't handle languages, it handles character
sets, and it supports a somewhat different set of character sets than iconv.
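
One way to see that difference on a given installation is to probe both extensions for a charset name. The sketch below is only an illustration: the supportedBy() function is hypothetical, mb_list_encodings() does not include alias names, and the iconv probe assumes that iconv_strlen() returns false for a charset the underlying iconv library does not know.

```php
<?php
// Hypothetical helper: report which loaded extensions accept a charset name.
function supportedBy(string $charset): array
{
    $result = [];

    if (extension_loaded('mbstring')) {
        // Canonical names only; aliases are not covered by this check.
        $known = array_map('strtoupper', mb_list_encodings());
        $result['mbstring'] = in_array(strtoupper($charset), $known, true);
    }

    if (extension_loaded('iconv')) {
        // iconv has no listing function in PHP, so probe with a call;
        // the warning for an unknown charset is suppressed with @.
        $result['iconv'] = @iconv_strlen('', $charset) !== false;
    }

    return $result;
}

var_dump(supportedBy('UTF-8'));
```

An adapter factory could use a check like this to choose a backend per charset instead of hard-coding one extension.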

Greetings
Marc

On 17.06.2012 22:45, Denis Portnov wrote:

> Hi Marc,
> [...]
