[wg11] EXPRESS and P28 String Data Type issue

Tue Sep 7 18:45:17 EDT 2004

Phil Spiby wrote:

> It has come to Eurostep's attention (whilst testing some of our P28
> production capabilities) that there is an issue with the domain of the
> string data type and the mapping of this data to P28 files.
> 
> The definition of the string data type is:
> The string data type has as its domain sequences of characters. The
> characters that are permitted as part of a string value are those characters
> allocated to cells 08 to 0D and the graphic characters lying in the ranges
> 20 to 7E and A0 to 10FFFE of ISO/IEC 10646-1.
> 
> This definition was adopted in TC1 when the EXPRESS language was harmonized
> with SGML and XML such that data in a CDATA section of SGML/XML was not
> corrupted when passed into an EXPRESS based data structure. It now seems
> that the P28 approach of mapping the EXPRESS string data type to a
> normalised string removes this capability, since all formatting characters
> (cells 08-0D) handled by a CDATA section and the EXPRESS data type are now
> lost.
> 
> Basically the P28 mapping of data from EXPRESS is removing data from the
> exchange file (since it is converting any formatting characters into
> spaces).

The alignment was brief.

From REC-xml-1.0-20040204:

"2.2 Characters
... Legal characters are tab, carriage return, line feed, and the legal 
characters of Unicode and ISO/IEC 10646. ... XML processors MUST accept 
any character in the range specified for Char.

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
                              | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
...

4.1 Character and Entity References
...
Well-formedness constraint: Legal Character

Characters referred to using character references MUST match the 
production for Char."

With the definition of the EXPRESS STRING type given in EXPRESS Ed2, 
quoted by Phil, it is *not possible* to represent all valid EXPRESS 
STRING values in XML!  In particular, the characters #8 (BS), #b (VT), 
#c (FF) are not valid in XML documents.  They are not valid as character 
references (e.g. &#x0B;), either, because all character references must be 
to valid characters.
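
To make the mismatch concrete, here is a minimal Python sketch that tests the EXPRESS formatting cells (08 to 0D) against the Char production quoted above. The names is_xml_char and EXPRESS_FORMATTING_CELLS are illustrative only and come from neither specification.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Cells 08 to 0D, which the EXPRESS STRING definition permits
# (hypothetical name, used only for this illustration).
EXPRESS_FORMATTING_CELLS = range(0x08, 0x0E)

# Formatting characters allowed in an EXPRESS STRING but not in XML 1.0.
not_in_xml = [cp for cp in EXPRESS_FORMATTING_CELLS if not is_xml_char(cp)]

print([hex(cp) for cp in not_in_xml])  # prints ['0x8', '0xb', '0xc']
```

Only BS, VT and FF fall out; HT, LF and CR pass the Char test.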

The remaining three 'formatting characters', HT (#9), LF (#a) and CR (#d), 
are precisely the difference between xs:string and xs:normalizedString.

From REC-xmlschema-2-20010502:

"3.2.1 string

[Definition:]  The string datatype represents character strings in XML. 
The ·value space· of string is the set of finite-length sequences of 
characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the 
Char production from [XML 1.0 (Second Edition)].

...

3.3.1 normalizedString

[Definition:]   normalizedString represents white space normalized 
strings. The ·value space· of normalizedString is the set of strings 
that do not contain the carriage return (#xD), line feed (#xA) nor tab 
(#x9) characters. The ·lexical space· of normalizedString is the set of 
strings that do not contain the carriage return (#xD) nor tab (#x9) 
characters."

That is, the HT, CR and LF characters are not valid in a value of 
xs:normalizedString, but they are valid characters in xs:string.
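
As a rough illustration of what is lost, the following Python sketch checks membership in the normalizedString value space and applies the whiteSpace "replace" normalization that XML Schema Part 2 defines for that type (each tab, line feed and carriage return becomes a space). The function names are made up for this example, and this is not how any particular P28 tool is implemented.

```python
# The three characters excluded from the xs:normalizedString value space.
NORMALIZED_STRING_FORBIDDEN = "\t\n\r"


def in_normalized_string_value_space(value: str) -> bool:
    """True if value contains no HT, LF or CR."""
    return not any(ch in NORMALIZED_STRING_FORBIDDEN for ch in value)


def replace_whitespace(value: str) -> str:
    """Mimic whiteSpace='replace': map HT, LF and CR to a space each."""
    return value.translate({0x9: 0x20, 0xA: 0x20, 0xD: 0x20})


original = "line one\n\tline two"
normalized = replace_whitespace(original)

print(in_normalized_string_value_space(original))    # prints False
print(in_normalized_string_value_space(normalized))  # prints True
print(normalized == original)                        # prints False
```

The last line is Phil's point: once the value has been through the normalization, the original formatting cannot be recovered.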

Note: The actual handling of the verbatim formatting characters in a 
document is quite complicated in XML (and has nothing to do with XML 
schema), but they can always be inserted verbatim in CDATA sections and 
in character references (e.g. &#x0A;) otherwise.
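
A quick way to see this behaviour is to run a few fragments through an XML parser. The sketch below uses Python's standard xml.etree.ElementTree (backed by expat); the element name s and the helper functions are arbitrary choices for the example, and other parsers may report the error differently.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_source: str) -> str:
    """Parse a one-element document and return its character content."""
    return ET.fromstring(xml_source).text


def is_rejected(xml_source: str) -> bool:
    """True if the parser refuses the document as not well-formed."""
    try:
        ET.fromstring(xml_source)
    except ET.ParseError:
        return True
    return False


# HT, LF and CR written as character references survive parsing.
via_refs = parse_text("<s>a&#x9;b&#xA;c&#xD;d</s>")

# HT and LF written literally inside a CDATA section survive as well.
via_cdata = parse_text("<s><![CDATA[a\tb\nc]]></s>")

# VT is not a legal Char, so even a character reference to it is an error.
vt_rejected = is_rejected("<s>&#xB;</s>")

print(repr(via_refs))   # prints 'a\tb\nc\rd'
print(repr(via_cdata))  # prints 'a\tb\nc'
print(vt_rejected)      # prints True
```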

I would agree with Phil that the default mapping of STRING should be to 
xs:string rather than xs:normalizedString, in order to permit the tab, 
carriage-return and line-feed/new-line characters to be included in a 
STRING value.  Ideally, the default would be to support the EXPRESS STRING 
data type fully, but we simply can't do that.  So the question is: Which 
subtype of STRING should we map to by default -- the one with no 
formatting characters at all, or the one with HT, CR and LF?

I would observe that it is probably rarely the intention of the modeler 
to allow the formatting characters, but the intended character set 
limitations are also rarely documented in the EXPRESS models.  OTOH, I 
would also observe that this is true of most XML objects whose datatype 
is modeled as xs:string.

-Ed

-- 
Edward J. Barkmeyer                        Email: edbark at nist.gov
National Institute of Standards & Technology
Manufacturing Systems Integration Division
100 Bureau Drive, Stop 8264                Tel: +1 301-975-3528
Gaithersburg, MD 20899-8264                FAX: +1 301-975-4694

"The opinions expressed above do not reflect consensus of NIST,
  and have not been reviewed by any Government authority."