[wg11] EXPRESS and P28 String Data Type issue
Ed Barkmeyer
edbark at nist.gov
Tue Sep 7 18:45:17 EDT 2004
Phil Spiby wrote:
> It has come to Eurostep's attention (whilst testing some of our P28
> production capabilities) that there is an issue with the domain of the
> string data type and the mapping of this data to P28 files.
>
> The definition of the string data type is:
> The string data type has as its domain sequences of characters. The
> characters that are permitted as part of a string value are those characters
> allocated to cells 08 to 0D and the graphic characters lying in the ranges
> 20 to 7E and A0 to 10FFFE of ISO/IEC 10646-1.
>
> This definition was adopted in TC1 when the EXPRESS language was harmonized
> with SGML and XML such that data in a CDATA section of SGML/XML was not
> corrupted when passed into an EXPRESS based data structure. It now seems
> that the P28 approach of mapping the EXPRESS string data type to a
> normalised string removes this capability, since all formatting characters
> (cells 08-0D) handled by a CDATA section and the EXPRESS data type are now
> lost.
>
> Basically the P28 mapping of data from EXPRESS is removing data from the
> exchange file (since it is converting any formatting characters into
> spaces).
The alignment was brief.
From REC-xml-1.0-20040204:
"2.2 Characters
... Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646. ... XML processors MUST accept
any character in the range specified for Char.
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
| [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
...
4.1 Character and Entity References
...
Well-formedness constraint: Legal Character
Characters referred to using character references MUST match the
production for Char."
With the definition of the EXPRESS STRING type given in EXPRESS Ed2,
quoted by Phil, it is *not possible* to represent all valid EXPRESS
STRING values in XML! In particular, the characters #8 (BS), #b (VT),
#c (FF) are not valid in XML documents. They are not valid as character
references (e.g. &#xb;), either, because all character references must be
to valid characters.
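
To make the gap concrete, here is a minimal Python sketch that encodes the Char production quoted above and tests the cells 08 to 0D that the EXPRESS definition admits. The names is_xml_char and express_formatting_cells are mine, for illustration only; they come from neither standard.

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Cells 08 to 0D, which the EXPRESS STRING definition permits.
express_formatting_cells = range(0x08, 0x0E)

# Code points that EXPRESS allows but XML 1.0 cannot carry at all.
not_representable = [cp for cp in express_formatting_cells if not is_xml_char(cp)]
print([hex(cp) for cp in not_representable])  # ['0x8', '0xb', '0xc']
```

Only BS, VT and FF fall out; HT, LF and CR pass the check.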
The remaining three 'formatting characters', HT (#9), LF (#a) and CR (#d),
are precisely the difference between xs:string and xs:normalizedString.
From REC-xmlschema-2-20010502:
"3.2.1 string
[Definition:] The string datatype represents character strings in XML.
The ·value space· of string is the set of finite-length sequences of
characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the
Char production from [XML 1.0 (Second Edition)].
...
3.3.1 normalizedString
[Definition:] normalizedString represents white space normalized
strings. The ·value space· of normalizedString is the set of strings
that do not contain the carriage return (#xD), line feed (#xA) nor tab
(#x9) characters. The ·lexical space· of normalizedString is the set of
strings that do not contain the carriage return (#xD) nor tab (#x9)
characters."
That is, the HT, CR and LF characters are not valid in a value of
xs:normalizedString, but they are valid characters in xs:string.
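
As an illustration of the data loss Phil describes, the following Python sketch mimics the whitespace replacement that a schema processor applies to an xs:normalizedString value, where each HT, LF and CR becomes a space. The function name replace_whitespace is hypothetical, and the sketch models only that replacement step, not a full schema validator.

```python
# Map HT, LF and CR to a single space each, as for xs:normalizedString.
_REPLACE_TABLE = {0x9: " ", 0xA: " ", 0xD: " "}

def replace_whitespace(value: str) -> str:
    """Return value with every tab, line feed and carriage return
    replaced by a space; all other characters are left untouched."""
    return value.translate(_REPLACE_TABLE)

express_value = "line 1\n\tline 2\r\n"
normalized = replace_whitespace(express_value)

# The length is unchanged, but the formatting is no longer recoverable.
print(repr(normalized))
print(normalized == express_value)
```

An xs:string mapping would leave express_value as it is, which is the point of Phil's complaint.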
Note: The actual handling of the verbatim formatting characters in a
document is quite complicated in XML (and has nothing to do with XML
schema), but they can always be inserted verbatim in CDATA sections, or
written as character references (e.g. &#xa;) elsewhere.
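
A small Python check of this, using the standard xml.etree.ElementTree parser, follows. The element name s and the helper element_text are made up for the example. CR is deliberately left out, because a literal CR is subject to XML end-of-line handling and would complicate the comparison.

```python
import xml.etree.ElementTree as ET

def element_text(xml_source: str) -> str:
    """Parse a one-element document and return its character content."""
    return ET.fromstring(xml_source).text

# HT and LF written as character references ...
via_reference = element_text("<s>a&#x9;b&#xa;c</s>")
# ... and the same characters inserted verbatim in a CDATA section.
via_cdata = element_text("<s><![CDATA[a\tb\nc]]></s>")
print(via_reference == via_cdata == "a\tb\nc")

# A reference to VT (#b) is not well-formed, so the parser rejects it.
try:
    element_text("<s>a&#xb;b</s>")
    vt_accepted = True
except ET.ParseError:
    vt_accepted = False
print(vt_accepted)
```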
I would agree with Phil that the default mapping of STRING should be to
xs:string rather than xs:normalizedString, in order to permit the tab,
carriage-return and line-feed/new-line characters to be included in a
STRING value. Now, the default should be to support the EXPRESS STRING
data type fully, but we simply can't do that. So the question is: Which
subtype of STRING should we map to by default -- the one with no
formatting characters at all, or the one with HT, CR and LF?
I would observe that it is probably rarely the intention of the modeler
to allow the formatting characters, but the intended character set
limitations are also rarely documented in the EXPRESS models. OTOH, I
would also observe that this is true of most XML objects whose datatype
is modeled as xs:string.
-Ed
--
Edward J. Barkmeyer Email: edbark at nist.gov
National Institute of Standards & Technology
Manufacturing Systems Integration Division
100 Bureau Drive, Stop 8264 Tel: +1 301-975-3528
Gaithersburg, MD 20899-8264 FAX: +1 301-975-4694
"The opinions expressed above do not reflect consensus of NIST,
and have not been reviewed by any Government authority."