Zend_Locale_UTF8 updated, some new ideas

Re: Zend_Locale_UTF8 updated, some new ideas

GavinZend
I believe Thomas is closely following the team's past decisions:

   
http://www.nabble.com/forum/ViewPost.jtp?post=6867072&framed=y&skin=16154

Especially see the links in that post above (copied below) that define
scope:

    http://www.nabble.com/Zend_Seach_Lucene-tf2315524s16154.html#a6490854

Today, the community discussed the possible use of iconv to handle
simple UTF8 string manipulation tasks:

    http://framework.zend.com/wiki/display/ZFDEV/Wildfire+Jabber+Server

I will start a new discussion thread for this, and the conclusion may
affect the scope defined above.

Cheers,
Gavin

Willie Alberty wrote:

> I've been following this thread with great interest, and even
> attempted to join in the conversation a couple of times. However, it
> seems I am completely missing the point of the discussion...
>
> André, Ahmed, and I would like to see Zend_Locale_UTF8 do more
> Unicode-aware things than it does now. This would make it useful
> outside of a Locale-only context. However, Gavin reminded us of the
> prior direction from Zend that said Zend_Locale_UTF8 is to be
> essentially a private helper class, to be used only for the explicit
> needs of Zend_Locale. Fine.
>
> But in reading your last response, it seems as though you either don't
> see the need for Zend_Locale_UTF8 or don't want it:
>
> On Oct 18, 2006, at 11:36 AM, Thomas Weidner wrote:
>
>> 1.) Zend_Locale_Format handles the input string, stripping
>> separators, changing fraction and negative sign.
>> So our input string is normalized. This is already implemented.
>
> So you already have a comprehensive table of Unicode characters that
> represent the decimal and thousands separators, as well as the
> fraction and negative signs for every language supported by
> Zend_Locale_Format?
>
>> 2.) Zend_Locale_Format calls Zend_Locale_UTF8 for converting the
>> normalized value to local signs.
>> So we have a normalized string with local signs.
>
> So you already have a comprehensive table of Unicode characters that
> are numeric digits? How are you able to identify which characters are
> digits, which are delimiters, and which are white space? If you
> already know what characters are digits, why would you need
> Zend_Locale_UTF8 at all? Just use the same tables for conversion that
> you're using for parsing.
>
>> 3.) Zend_Locale_Format localizes the returned string, adding
>> separators, negative and fraction signs.
>> This is also already implemented.
>
> Again, this implies in-depth knowledge of the character sets involved
> for every language, including knowledge of which characters are
> encoded in one-, two-, and three-bytes. Otherwise, you would not be
> able to reliably insert a decimal separator at the correct location in
> the byte stream.
>
>> 4.) In Zend_Measure_Numbers there will be added some functions as
>> toArabic, fromArabic, toChinese, fromChinese and so on...
>> So we could convert numbers locale aware to other number formats.
>> A conversion for the roman, binary, octal, hexadecimal, decimal and
>> some other number formats are already implemented there.
>
> Again, it sounds like all of the functionality you need is already
> implemented elsewhere.
>
> Can you be more specific with the functions you *do* need
> Zend_Locale_UTF8 to perform? After reading through this thread again,
> and factoring in the Zend direction from Gavin, I think having this
> class around is unnecessary.
>
> (André - If this turns out to be true, don't despair... I think there
> is a great need for Unicode manipulation classes in PHP 5. In fact, I
> have an explicit need in some of the work I'm planning for Zend_Pdf.
> They might just need to live outside of Zend_Locale to survive. If the
> adoption rate of PHP 5 by hosting providers is any indication, PHP 6
> is still several years away from being practical, which means Unicode
> classes in the framework are unquestionably valuable.)
>
> --
>
> Willie Alberty, Owner
> Spenlen Media
> [hidden email]
>
> http://www.spenlen.com/
>
>

--
Cheers,
Gavin

Which ZF List?
=================
Everything, except the topics below: [hidden email]

Authorization, Authentication, ACL, Access Control, Session Management
[hidden email]

Tests, Caching, Configuration, Environment, Logging
[hidden email]

All things related to databases
[hidden email]

Documentation, Translations, Wiki Manual / Tutorials
[hidden email]

Internationalization & Localization, Dates, Calendar, Currency, Measure
[hidden email]

Mail, MIME, PDF, Search, data formats (JSON, ...)
[hidden email]

MVC, Controller, Router, Views, Zend_Request*
[hidden email]

Community Servers/Services (shell account, PEAR channel, Jabber)
[hidden email]

Web Services & Servers (HTTP, SOAP, Feeds, XMLRPC, REST)
[hidden email]


How to un/subscribe:  http://framework.zend.com/wiki/x/GgE


Re: Zend_Locale_UTF8 updated, some new ideas

GavinZend
In reply to this post by Thomas Weidner
Zend_Locale_UTF8 is purposely misnamed. We chose that name to
indicate it exists only to support the Zend_Locale classes, not because
Zend_Locale_UTF8 should know anything at all about localization (or
numbering systems).  Reference:
    http://www.nabble.com/Zend_Seach_Lucene-tf2315524s16154.html#a6490854

Regarding converting numbers to/from:

http://en.wikipedia.org/wiki/Roman_numerals
http://en.wikipedia.org/wiki/Indian_numbering_system
http://en.wikipedia.org/wiki/Japanese_numerals
etc.

I do not see any objections to including this functionality with
Zend_Locale*, only open questions about where to put the functions.  If
the logic is split into functions that use the CLDR and "understand"
locale/language/culture-specific numbering systems, and functions that
do not use the CLDR and do not understand specific numbering systems,
then we need only worry about where to put each function.

I'm not sure if this helps, but it makes sense to me to group related
functions that use the portions of the CLDR relating to numbers and
number systems into the same class.  Per past discussions (and links
to those discussions in recent emails), Zend_Locale_Utf8 should not
contain logic to perform localization or internationalization.  However,
the exact same functions to perform normalization and formatting of
numbers are still needed, and useful, but might instead be written in a
locale-specific class. Thomas has provided a proposed / partially
implemented hierarchy, including classes for containing these
normalization and formatting functions.

Cheers,
Gavin

Re: Zend_Locale_UTF8 updated, some new ideas

Willie Alberty
In reply to this post by GavinZend
On Oct 18, 2006, at 1:35 PM, Gavin Vess wrote:

> I believe Thomas is closely following the team's past decisions:
>
>    http://www.nabble.com/forum/ViewPost.jtp?post=6867072&framed=y&skin=16154
>
> Especially see the links in that post above (copied below) that
> define scope:
>
>    http://www.nabble.com/Zend_Seach_Lucene-tf2315524s16154.html#a6490854

I did review that post... It was in response to one of my  
messages. ;-) But the defined scope is incredibly vague:

     The expected value and usefulness of Zend_Locale_Utf8 is not doubted,
     but we must be careful to avoid requirements creep.  Previously, we
     agreed to allow UTF8 emulation functions (PHP functions written in pure
     PHP that support UTF8 strings) *only* for the functions absolutely
     required for Zend_Locale* classes to work.

There is no mention of which functions will be required, which has
led to a great deal of speculation. André speculated that
equivalents of intval() and floatval() might be needed. Ahmed and I
thought that was a good idea.

In the discussion that followed, we've only been arguing about what  
Zend_Locale_UTF8 should *not* be. I still haven't seen a concise  
description of what it *should* be.

> Today, the community discussed the possible use of iconv to handle  
> simple UTF8 string manipulation tasks:
>
>    http://framework.zend.com/wiki/display/ZFDEV/Wildfire+Jabber+Server

If the needs of Zend_Locale are limited to simple string  
manipulations, iconv would be a much more efficient solution.

I agree that full support for Unicode string handling within the  
framework should be provided by mbstring or PHP 6. But I believe that  
the framework could still benefit from some Unicode utility classes  
as there are a whole host of character attributes in the UCD that can  
be useful.

--

Willie Alberty, Owner
Spenlen Media
[hidden email]

http://www.spenlen.com/


UTF8 and PHP's iconv extension

GavinZend
In reply to this post by GavinZend
The following ZF components currently use iconv functions:

* Zend/Pdf/FileParser.php
* Zend/Pdf/Resource/Font/Standard/*.php
* Zend/Pdf/Resource/Font.php
* Zend/Search/Lucene/Field.php
* Zend/Service/Flickr.php
* Zend/XmlRpc/Client.php

http://www.php.net/manual/en/ref.iconv.php

Questions
==============
(1) Do the iconv functions actually work consistently in practice for
PHP 5.1.4+ on all major platforms with the UTF8 charset?
I have not yet found any reports indicating the iconv functions are
unstable, inconsistent, or unusable with UTF8 strings.
However, apparently Gentoo's default PHP 5.1.6 ebuild tries to build PHP
without libxml and without iconv, unless the "xml" and "iconv" USE flags
are enabled.

(2) Would adding "iconv" to the official list of requirements for the ZF
impose any practical burden on anyone?
The libxml extension requires iconv. Many things require libxml.  I have
not found any distro shipping PHP 5.1.4+ that does not include support
for the iconv functions.  The windows binary downloaded via php.net was
compiled with support for these functions.  The configure script that
ships with PHP 5.1.4+ includes "--with-iconv" by default.

(3) When needed for working with UTF8 strings, are there any reasons to
avoid using these iconv functions inside Zend_Locale and
Zend_Search_Lucene classes?
* iconv_strlen()
* iconv_strpos()
* iconv_strrpos()
* iconv_substr()
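
To make the question concrete, here is a minimal sketch of how these four functions behave on a UTF-8 string. The sample string and variable names are made up for illustration; the point is only that the iconv_* functions count and index characters, where strlen() counts bytes.

```php
<?php
// Hypothetical sample: "Grüße" is 5 characters but 7 bytes in UTF-8.
$text = "Gr\xC3\xBC\xC3\x9Fe";

$byteLength = strlen($text);                 // byte count
$charLength = iconv_strlen($text, 'UTF-8');  // character count

// Offsets and lengths are expressed in characters, not bytes.
$eszettPos  = iconv_strpos($text, "\xC3\x9F", 0, 'UTF-8'); // first "ß"
$lastEPos   = iconv_strrpos($text, 'e', 'UTF-8');          // last "e"
$firstThree = iconv_substr($text, 0, 3, 'UTF-8');          // "Grü"
```

Each call takes the charset explicitly, so the result does not depend on the iconv.internal_encoding setting of the host.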

Cheers,
Gavin

P.S.
$cleanedUTF8 = iconv("UTF-8", "UTF-8//IGNORE", $badUTF8);

Re: Zend_Locale_UTF8 updated, some new ideas

GavinZend
In reply to this post by Willie Alberty
Willie Alberty wrote:
> There is no mention of which functions will be required, which has
> led to a great deal of speculation. André speculated that equivalents
> of intval() and floatval() might be needed. Ahmed and I thought that
> was a good idea.
>
> In the discussion that followed, we've only been arguing about what
> Zend_Locale_UTF8 should *not* be. I still haven't seen a concise
> description of what it *should* be.
If Thomas cannot implement a function important to the "Locale"
project, without a supporting, low-level function to perform some
manipulation of UTF8 strings, then the function becomes a candidate for
Zend_Locale_Utf8.  The same situation exists for Zend_Search_Lucene and
Alexander.  We look to the Locale and Search teams to declare which UTF8
string manipulation/analysis functions they truly need.  Currently, I
encourage the entire community to review the iconv functions and
situation and write their thoughts about using these functions in reply
to our recent discussions on this list.

The original intent was to purposely define Zend_Locale_Utf8 as the
empty set of functions, and only add when absolutely required and
justified, per the criteria mentioned previously.  It seems that iconv
already provides three of the functions most probably needed
(strlen, strpos, substr).

However, there remains the possibility of a different component, a more
general purpose Zend_Utf8 that might live in the Laboratory, if someone
wishes to revive the original proposal.

Cheers,
Gavin


Re: UTF8 and PHP's iconv extension

Willie Alberty
In reply to this post by GavinZend
On Oct 18, 2006, at 3:37 PM, Gavin Vess wrote:

> The following ZF components currently use iconv functions:
>
> * Zend/Pdf/FileParser.php
> * Zend/Pdf/Resource/Font/Standard/*.php
> * Zend/Pdf/Resource/Font.php
> ...

Zend_Pdf primarily uses iconv() to convert a string from an arbitrary
character encoding (typically ISO-8859-1) to the Windows ANSI character
set (CP-1252) when preparing to draw text on a page. It also uses
iconv() when parsing TrueType font programs to extract strings such as
font name, copyright, etc.

> (1) Do the iconv functions actually work consistently in practice  
> for PHP 5.1.4+ on all major platforms with the UTF8 charset?

There have been no reported problems with iconv() in conjunction with  
Zend_Pdf. In addition to the ISO-8859-1 and CP-1252 character sets,  
the font parsing classes use UTF-16BE (2-byte big-endian encoding)  
extensively. Future text layout classes will also require UTF-16BE  
support.

> (2) Would adding "iconv" to the official list of requirements for  
> the ZF impose any practical burden on anyone?

Not to myself or any of my clients.

It should be noted that Zend_Pdf would be unusable without iconv. If it
cannot be made a requirement for the framework as a whole, it must be
listed as a requirement for Zend_Pdf.

--

Willie Alberty, Owner
Spenlen Media
[hidden email]

http://www.spenlen.com/


Re: Zend_Locale_UTF8 updated, some new ideas

Willie Alberty
In reply to this post by GavinZend
On Oct 18, 2006, at 6:27 PM, Gavin Vess wrote:

> The original intent was to purposely define Zend_Locale_Utf8 as the
> empty set of functions, and only add when absolutely required and
> justified, per the criteria mentioned previously.  It seems that
> iconv already provides three of the functions most probably
> needed (strlen, strpos, substr).

str_replace can also be easily implemented by using iconv_strpos and  
iconv_substr. I don't think that such a function should live in  
Zend_Locale_Utf8 though, as it would be character set-agnostic. It  
would probably be best to have it as a static utility method in  
Zend_Locale.
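
As a rough sketch of what I mean (the function name iconvStrReplace is hypothetical and nothing like it exists in the framework yet), a character-based replace needs only iconv_strlen, iconv_strpos and iconv_substr, and takes the charset as a parameter:

```php
<?php
/**
 * Hypothetical helper: replaces every occurrence of $search in $subject,
 * measuring offsets in characters of $charset rather than in bytes.
 */
function iconvStrReplace($search, $replace, $subject, $charset = 'UTF-8')
{
    $searchLength = iconv_strlen($search, $charset);
    if ($searchLength === false || $searchLength === 0) {
        return $subject; // invalid or empty search string: nothing to do
    }

    $subjectLength = iconv_strlen($subject, $charset);
    $result = '';
    $offset = 0;

    while ($offset < $subjectLength
        && ($position = iconv_strpos($subject, $search, $offset, $charset)) !== false
    ) {
        // Copy the text before the match, then the replacement.
        $result .= iconv_substr($subject, $offset, $position - $offset, $charset);
        $result .= $replace;
        $offset = $position + $searchLength;
    }

    // Append whatever follows the last match.
    return $result . iconv_substr($subject, $offset, $subjectLength - $offset, $charset);
}
```

For example, iconvStrReplace(',', '.', '1,5') would give '1.5'. Unlike the built-in str_replace() this sketch handles only a single search string, not arrays.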

> However, there remains the possibility of a different component, a  
> more general purpose Zend_Utf8 that might live in the Laboratory,  
> if someone wishes to revive the original proposal.

I am working on some layout classes for Zend_Pdf that will handle  
things like wrapping long lines of text, text alignment, font size  
and style changes on a single line, etc. The implementation of these  
classes will require a Unicode-based backing store for the strings.

In the current (half-done) implementation, I've created a  
Zend_Pdf_Text class along with several Unicode services helper  
classes. The Zend_Pdf_Text class stores an arbitrary amount of  
Unicode text, accepting source strings in any encoding. There is a  
subclass which allows attributes such as font, size, color,  
alignment, etc. to be placed on the string. The helper classes  
provide important character attributes from the UCD such as line  
break classes, text direction (left-to-right or right-to-left)  
classes, and bidi (bi-directional text) mirrored characters, which  
are required to properly lay out strings on the PDF page.

After watching the discussion here and looking more closely at  
André's implementation of Zend_Locale_UTF8, I think these classes  
would be more useful at a higher level:

Zend_String
--------------------
General-purpose Unicode string storage class. Would contain most of  
the string manipulation functions André has already implemented in  
Zend_Locale_Utf8 and whatever else from my Zend_Pdf_Text that would  
be useful. String objects are constructed from ordinary PHP strings
using any character encoding supported by iconv.

Zend_String_Attributed
--------------------
Extends Zend_String allowing attributes to be set on ranges of  
characters such as font size, color, alignment, etc. as well as any  
other user-defined attributes. Zend_String_Attributed objects would  
be used for advanced layout in Zend_Pdf. An attributed string class  
would also pave the way for RTF and Microsoft Word document generation.

Zend_Range
--------------------
Primitive range class, used by Zend_String_Attributed for setting  
character ranges. Has convenience functions to calculate unions,  
intersections, etc.

Zend_Unicode
--------------------
Static helper class which vends interesting information from the  
Unicode Character Database (UCD) such as character classes (e.g. is
the character numeric?), line break classes (for PDF layout), etc.  
This data comes from specialized Zend_Unicode_* objects which are  
loaded on-demand.


While PHP 6 will provide native support for Unicode strings, that  
release is still pretty far off (there is a lot of work remaining:  
http://www.php.net/~scoates/unicode/render_func_data.php). In  
addition, I don't think there are any plans for an attributed string  
class or utility functions that return data from the UCD.

More importantly, Unicode string support in PHP 6 will be enabled via  
an INI switch. It will be hard enough to get web hosting providers to  
offer PHP 6 at all. Fear of breaking their favorite control panel  
software or some esoteric extension they're using might mean getting  
native Unicode support will be next to impossible.

For these reasons, and for those applications that might need to  
interact with Unicode strings even in PHP 6, but with the native  
Unicode support disabled, I feel strongly that such classes are  
useful. I'd be happy to help lead this effort as I have an immediate  
need for this capability (in Zend_Pdf).

--

Willie Alberty, Owner
Spenlen Media
[hidden email]

http://www.spenlen.com/


Re: UTF8 and PHP's iconv extension

GavinZend
In reply to this post by GavinZend
Based on feedback from many, and current usage within the ZF, ZF
effectively already requires the iconv extension.  No objections,
significant problems, or issues have been found that might prevent our
use of PHP's iconv functions.  Therefore, use of these iconv functions
is encouraged, when needed.  A small note has been appended to the
draft coding standards:

http://framework.zend.com/wiki/x/PQ

Cheers,
Gavin

Gavin Vess wrote:

> The following ZF components currently use iconv functions:
>
> * Zend/Pdf/FileParser.php
> * Zend/Pdf/Resource/Font/Standard/*.php
> * Zend/Pdf/Resource/Font.php
> * Zend/Search/Lucene/Field.php
> * Zend/Service/Flickr.php
> * Zend/XmlRpc/Client.php
>
> http://www.php.net/manual/en/ref.iconv.php
>
> Questions
> ==============
> (1) Do the iconv functions actually work consistently in practice for
> PHP 5.1.4+ on all major platforms with the UTF8 charset?
> I have not yet found any reports indicating the iconv functions are
> unstable, inconsistent, or unusable with UTF8 strings.
> However, apparently Gentoo's default PHP 5.1.6 ebuild tries to build
> PHP without libxml and without iconv, unless the "xml" and "iconv" USE
> flags are enabled.
>
> (2) Would adding "iconv" to the official list of requirements for the
> ZF impose any practical burden on anyone?
> The libxml extension requires iconv. Many things require libxml.  I
> have not found any distro shipping PHP 5.1.4+ that does not include
> support for the iconv functions.  The windows binary downloaded via
> php.net was compiled with support for these functions.  The configure
> script that ships with PHP 5.1.4+ includes "--with-iconv" by default.
>
> (3) When needed for working with UTF8 strings, are there any reasons
> to avoid using these iconv functions inside Zend_Locale and
> Zend_Search_Lucene classes?
> * iconv_strlen()
> * iconv_strpos()
> * iconv_strrpos()
> * iconv_substr()
>
> Cheers,
> Gavin
>
> P.S.
> $cleanedUTF8 = iconv("UTF-8", "UTF-8//IGNORE", $badUTF8);

Re: UTF8 and PHP's iconv extension

Thomas Weidner
> Based on feedback from many, and current usage within the ZF, ZF
> effectively already requires the iconv extension.  No objections,
> significant problems, or issues have been found that might prevent our use
> of PHP's iconv functions.  Therefore, use of these iconv functions is
> encouraged, when needed.  A small note has been appended to the draft
> coding standards:

I already included iconv for Zend_Locale_Format within the functions
where I found problems.

From Zend_Locale's view there's no need for UTF8 anymore.

As discussed in another post, it would be nice to have a function to
convert between different number writing systems.
In my opinion this is for now the only functionality which is valuable
enough to be included.
The question is whether we want to include this in the framework or not.

Do we want to include the complete UTF8 classes for this functionality,
or do we wait until PHP6 for this...

Greetings
Thomas


Re: UTF8 and PHP's iconv extension

GavinZend
I have no objections to a number conversion system, provided the source
code is placed into appropriate Zend_Locale* classes.
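
For the simplest case, where one digit maps to one digit, such a conversion may need no UTF8 class at all. The sketch below maps Western digits to Arabic-Indic digits (U+0660 to U+0669) and back with strtr(). The function names are hypothetical and not an agreed API; systems such as Roman or Chinese numerals are not plain digit substitutions and would need real logic.

```php
<?php
// Hypothetical helper: map of '0'..'9' to the UTF-8 bytes of U+0660..U+0669.
function arabicIndicDigitMap()
{
    $map = array();
    for ($i = 0; $i <= 9; $i++) {
        $map[(string) $i] = "\xD9" . chr(0xA0 + $i);
    }
    return $map;
}

// Replaces Western digits with Arabic-Indic digits; other bytes are untouched.
function toArabicIndicDigits($number)
{
    return strtr((string) $number, arabicIndicDigitMap());
}

// Reverse direction: Arabic-Indic digits back to Western digits.
function fromArabicIndicDigits($text)
{
    $reverse = array();
    foreach (arabicIndicDigitMap() as $western => $local) {
        $reverse[$local] = (string) $western;
    }
    return strtr($text, $reverse);
}
```

Separators and signs are deliberately left alone here, since Zend_Locale_Format already handles those.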

Cheers,
Gavin

Thomas Weidner wrote:

> As discussed in another post, it would be nice to have a function to
> convert between different number writing systems.
> In my opinion this is for now the only functionality which is
> valuable enough to be included.
> The question is whether we want to include this in the framework or not.
>
> Do we want to include the complete UTF8 classes for this functionality,
> or do we wait until PHP6 for this...
>
> Greetings
> Thomas