Discussion:
Problems with unicode
Ikke
2010-02-08 22:00:33 UTC
Hi everybody,

I'm trying to read data from an xml file I've downloaded from the
internet. Whenever I encounter a character like for instance "ä", it is
displayed as two strange characters.

After looking at the file with a hex editor, I've discovered that the
file starts with FFFE, and that each character in the text file takes up
two bytes (3C00 7200 6500 and so on).

To read the file, I've already tried the following two methods, each gave
me the funny characters instead of what I was expecting:

... method one ...
sl := TStringList.Create;
sl.LoadFromFile('c:\1.xml');
FreeAndNil(sl);

... method two ...
AssignFile(tf, 'c:\1.xml');
ReSet(tf);
while Not(eof(tf)) do
begin
ReadLn(tf, s);
end;
CloseFile(tf);

What do I need to do to read this file as plain text, and to get the
characters which are actually in the file (instead of funny
interpretations)?

Any help would be very much appreciated!

Thanks,

Ikke
Maarten Wiltink
2010-02-08 22:16:22 UTC
Post by Ikke
I'm trying to read data from an xml file I've downloaded from the
internet. Whenever I encounter a character like for instance "ä", it
is displayed as two strange characters.
It's encoded in UTF-8. Nothing strange about that.
Post by Ikke
After looking at the file with a hex editor, I've discovered that the
file starts with FFFE, and that each character in the text file takes
up two bytes (3C00 7200 6500 and so on).
Oh. Correction. It's encoded in UCS-2. Nothing strange about that, either.
Post by Ikke
To read the file, I've already tried the following two methods, each
... method one ...
sl := TStringList.Create;
sl.LoadFromFile('c:\1.xml');
FreeAndNil(sl);
... method two ...
AssignFile(tf, 'c:\1.xml');
ReSet(tf);
while Not(eof(tf)) do
begin
ReadLn(tf, s);
end;
CloseFile(tf);
Both methods assume that it's encoded in the current Windows codepage.
Probably windows-1252.
Post by Ikke
What do I need to do to read this file as plain text, and to get
the characters which are actually in the file (instead of funny
interpretations)?
XML is funny that way. It considers the file to contain _bytes_, not
characters, and what characters the bytes denote is determined by
an encoding. An XML file is serialised to text, this is then encoded
into bytes, which can be written to a file.

If the file starts FF FE, it's encoded in little-endian UCS-2 - a plain
16-bit identity encoding, available in Delphi as the WideString type.
Use TWideStringList for method one or declare s as WideString for
method two.
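
A minimal sketch of reading such a file without any ANSI conversion, assuming the whole file fits in memory and really is UTF-16 little-endian with a leading FF FE. The helper name LoadUtf16LeFile is hypothetical; TFileStream, Read and ReadBuffer come from the Classes unit:

```pascal
program ReadUtf16Le;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: loads a UTF-16LE file (BOM FF FE) into a WideString.
// The bytes are copied as-is; no codepage conversion takes place.
function LoadUtf16LeFile(const FileName: string): WideString;
var
  fs: TFileStream;
  bom: Word;
  byteCount: Int64;
begin
  Result := '';
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    bom := 0;
    // FF FE on disk reads back as the 16-bit value $FEFF on a little-endian CPU.
    if (fs.Read(bom, SizeOf(bom)) <> SizeOf(bom)) or (bom <> $FEFF) then
      raise Exception.Create('No UTF-16LE byte order mark found');
    byteCount := fs.Size - fs.Position;
    SetLength(Result, Integer(byteCount div SizeOf(WideChar)));
    if Length(Result) > 0 then
      fs.ReadBuffer(Result[1], Length(Result) * SizeOf(WideChar));
  finally
    fs.Free;
  end;
end;

var
  ws: WideString;
begin
  // Path taken from the original post.
  ws := LoadUtf16LeFile('c:\1.xml');
  WriteLn('Read ', Length(ws), ' UTF-16 code units');
end.
```

Splitting the resulting WideString into lines is left out here; the point is only that the two-byte characters arrive in the string unchanged.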

Guessing the encoding is a vital part of loading XML, and it isn't
always trivial or unambiguous. The XML spec says things about this.
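
As a rough illustration of the simplest part of that guessing, the following sketch looks only at a byte order mark. The function name GuessEncodingFromBom is hypothetical, and a file without a BOM (or a UTF-32 file, whose BOM also starts FF FE) would need further checks that are not shown:

```pascal
program SniffBom;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: names the encoding suggested by the first bytes.
// Only the common BOMs are checked; anything else is reported as unknown.
function GuessEncodingFromBom(const FileName: string): string;
var
  fs: TFileStream;
  buf: array[0..2] of Byte;
  n: Integer;
begin
  FillChar(buf, SizeOf(buf), 0);
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    n := fs.Read(buf, SizeOf(buf));
  finally
    fs.Free;
  end;
  if (n >= 3) and (buf[0] = $EF) and (buf[1] = $BB) and (buf[2] = $BF) then
    Result := 'UTF-8'
  else if (n >= 2) and (buf[0] = $FF) and (buf[1] = $FE) then
    Result := 'UTF-16LE'
  else if (n >= 2) and (buf[0] = $FE) and (buf[1] = $FF) then
    Result := 'UTF-16BE'
  else
    Result := 'unknown (no BOM)';
end;

begin
  // Path taken from the original post.
  WriteLn(GuessEncodingFromBom('c:\1.xml'));
end.
```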

Groetjes,
Maarten Wiltink
Ikke
2010-02-08 22:28:22 UTC
"Maarten Wiltink" <***@kittensandcats.net> wrote in news:4b708d37$0$22942$***@news.xs4all.nl:

<snip>
Post by Maarten Wiltink
If the file starts FF FE, it's encoded in little-endian UCS-2 - a
plain 16-bit identity encoding, available in Delphi as the WideString
type. Use TWideStringList for method one or declare as WideString for
method two.
Thanks Maarten, for your quick response!

However, I've tried to change my code but there's no difference.

tf : TextFile;
s : TWideString;
...
AssignFile(tf, 'c:\1.xml');
ReSet(tf);
while Not(eof(tf)) do
begin
ReadLn(tf, s);
ShowMessage(s);
end;
CloseFile(tf);

If I look at the output of the ShowMessage, the characters are still the
same, still not what is in the .xml file when I open it with notepad or
another editor.

As for the TWideStringList, I get an "undeclared identifier" when I
change TStringList to TWideStringList.

Do I need to do anything else to get this to work? Also, although the
file is named .xml and bears a similar structure to xml files, it isn't a
valid xml file (just plain text in an xml-like structure).

Thanks,

Ikke
Maarten Wiltink
2010-02-08 22:42:19 UTC
Post by Ikke
Post by Maarten Wiltink
If the file starts FF FE, it's encoded in little-endian UCS-2 - a
plain 16-bit identity encoding, available in Delphi as the WideString
type. Use TWideStringList for method one or declare as WideString for
method two.
[...]
Post by Ikke
However, I've tried to change my code but there's no difference.
tf : TextFile;
s : TWideString;
TWideString: no such type. Let's say you typed this instead of pasting
it.
Post by Ikke
AssignFile(tf, 'c:\1.xml');
ReSet(tf);
while Not(eof(tf)) do
begin
ReadLn(tf, s);
ShowMessage(s);
end;
CloseFile(tf);
If I look at the output of the ShowMessage, the characters are still
the same, still not what is in the .xml file when I open it with
notepad or another editor.
Perhaps TextFile assumes ANSI characters and they're converted without
your permission, I don't know. You might try a file of byte (because
that's really what it is) but you'd have to write your own ReadLn
replacement and that's a bit of a bother. On the other hand, nothing
would break if you loaded the entire file into a single string value.

ShowMessage isn't the best test, either; it takes an AnsiString parameter
and a WideString will get silently converted.
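
One way to display a WideString without that conversion is to call the wide Windows API directly. This is only a sketch for Windows; the sample text is made up for illustration, and MessageBoxW comes from the Windows unit:

```pascal
program ShowWide;

uses
  Windows;

var
  ws: WideString;
begin
  // Build the test text from a code point so the source file's own
  // encoding does not matter: U+00E4 is "a with diaeresis".
  ws := 'a-umlaut: ';
  ws := ws + WideChar($00E4);
  // MessageBoxW takes UTF-16 text, so nothing is converted to ANSI.
  MessageBoxW(0, PWideChar(ws), 'WideString test', MB_OK);
end.
```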
Post by Ikke
As for the TWideStringList, I get an "undeclared identifier" when I
change TStringList to TWideStringList.
It may not exist in your Delphi version. It may be in a unit you haven't
added to your uses clause.
Post by Ikke
Do I need to do anything else to get this to work? Also, although the
file is named .xml and bears a similar structure to xml files, it isn't
a valid xml file (just plain text in an xml-like structure).
'Plain text' usually means 7-bit ASCII, in which case you wouldn't have
this problem.

You want something that 'reads text' but guesses the encoding for you.
Unfortunately, that's a hard problem.

Groetjes,
Maarten Wiltink
Rudy Velthuis
2010-02-10 13:03:48 UTC
Post by Ikke
tf : TextFile;
s : TWideString;
...
AssignFile(tf, 'c:\1.xml');
ReSet(tf);
while Not(eof(tf)) do
begin
ReadLn(tf, s);
ShowMessage(s);
end;
CloseFile(tf);
Don't use the old Pascal routines, they only know 8-bit (ANSI) text. Use
TWideStringList (from the WideStrings unit) instead, and its
LoadFromFile method.
--
Rudy Velthuis http://rvelthuis.de

"About the use of language: it is impossible to sharpen a pencil
with a blunt axe. It is equally vain to try to do it with ten
blunt axes instead." -- Edsger W. Dijkstra
Hans-Peter Diettrich
2010-02-09 03:04:13 UTC
Post by Ikke
I'm trying to read data from an xml file I've downloaded from the
internet. Whenever I encounter a character like for instance "ä", it is
displayed as two strange characters.
After looking at the file with a hex editor, I've discovered that the
file starts with FFFE, and that each character in the text file takes up
two bytes (3C00 7200 6500 and so on).
That's UTF-16 encoding, known in Delphi as WideString or, since D2009,
as UnicodeString. I guess that you are using some older (Ansi) version
of Delphi?

I'd use an XML component to deal with XML files. Search for a TXML...
class in your online help.

DoDi
Ikke
2010-02-09 20:24:05 UTC
Hans-Peter Diettrich <***@aol.com> wrote in news:***@mid.individual.net:

<snip>
Post by Hans-Peter Diettrich
That's UTF-16 encoding, known in Delphi as WideString or, since D2009,
as UnicodeString. I guess that you are using some older (Ansi) version
of Delphi?
I'm using CodeGear Delphi 2007 for this project.
Post by Hans-Peter Diettrich
I'd use an XML component to deal with XML files. Search for a TXML...
class in your online help.
Thanks - I'll have a look at it! I fear it won't work though, as the file
isn't actually a valid xml file, just a webserver response in xml-like
structure.

One more (strange) thing I noticed, though... When I try to read the file
as a series of bytes, the unicode header isn't being read. Is that
normal?

Consider the following code:

fb : File of byte;
b : byte;
AssignFile(fb, 'c:\1.xml');
ReSet(fb);
while Not(eof(fb)) do
begin
Read(fb, b);
end;
CloseFile(fb);

If I view the file with a hex editor, these are the first bytes:
FFFE 3C00 7200 6500 7300 7000 2000 7300

When I run the above code, however, I get:
3C 72 65 73 70 20 73

I don't understand why the code above skips the first (or the first two)
byte(s)... Now I'm just reading the file, what if I were writing a copy
routine? Wouldn't this damage the header of the file?

Thanks,

Ikke
Maarten Wiltink
2010-02-09 21:58:21 UTC
[...]
Post by Ikke
Post by Hans-Peter Diettrich
I'd use an XML component to deal with XML files. Search for a TXML...
class in your online help.
Thanks - I'll have a look at it! I fear it won't work though, as the
file isn't actually a valid xml file, just a webserver response in
xml-like structure.
It may be a well-formed XML document fragment, which is something an
XML parser should have no trouble with.


[...]
Post by Ikke
FFFE 3C00 7200 6500 7300 7000 2000 7300
3C 72 65 73 70 20 73
One of them is lying. The hex editor may be trying to be smarter than
you. You *know* what that code does.
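
A small sketch that could help settle which of the two is accurate: it dumps the first bytes of the file through a TFileStream, which reads raw bytes. The buffer size of 16 is an arbitrary choice for illustration:

```pascal
program DumpHead;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  fs: TFileStream;
  buf: array[0..15] of Byte;
  n, i: Integer;
begin
  // Path taken from the original post.
  fs := TFileStream.Create('c:\1.xml', fmOpenRead or fmShareDenyWrite);
  try
    // Read returns the number of bytes actually read (may be fewer than 16).
    n := fs.Read(buf, SizeOf(buf));
  finally
    fs.Free;
  end;
  // Print each byte as two hex digits, in file order.
  for i := 0 to n - 1 do
    Write(IntToHex(buf[i], 2), ' ');
  WriteLn;
end.
```

If the hex editor's view is the true one, the output should begin with FF FE 3C 00.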

Groetjes,
Maarten Wiltink
Hans-Peter Diettrich
2010-02-10 08:49:58 UTC
Post by Maarten Wiltink
Post by Ikke
FFFE 3C00 7200 6500 7300 7000 2000 7300
3C 72 65 73 70 20 73
One of them is lying. The hex editor may be trying to be smarter than
you. You *know* what that code does.
In both cases you can't know what the code really does, unless you have
the (RTL...) source code, or single-step through the code.

I wonder what a hex *editor* should do when one tries to replace the
BOM with something different, if the displayed BOM were *not* part of
the file.

DoDi
Hans-Peter Diettrich
2010-02-10 08:39:48 UTC
Post by Ikke
I'm using CodeGear Delphi 2007 for this project.
Ah, I don't know about the internals of this specific version :-(
Post by Ikke
One more (strange) thing I noticed, though... When I try to read the file
as a series of bytes, the unicode header isn't being read. Is that
normal?
fb : File of byte;
b : byte;
AssignFile(fb, 'c:\1.xml');
[...]

File Of Byte should be the appropriate file type, but obviously...
Post by Ikke
FFFE 3C00 7200 6500 7300 7000 2000 7300
3C 72 65 73 70 20 73
I don't understand why the code above skips the first (or the first two)
byte(s)...
The file driver detected the BOM at the beginning of the file, and decided
to switch into UTF16-to-Ansi conversion mode.

I don't understand, either, why a file of BYTE should ever be subject to
Unicode/Ansi translation :-(
Post by Ikke
Now I'm just reading the file, what if I were writing a copy
routine? Wouldn't this damage the header of the file?
I don't use the old file I/O for several reasons; the new stream I/O
(TStream, TFileStream...) is much more transparent. These classes should
have an Encoding property in newer Delphi versions, which lets you
determine or set the desired encoding of text files/streams.
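
For the copy routine asked about above, a stream-based sketch might look like this. The procedure name CopyFileBytes and the target path are hypothetical; CopyFrom with a count of 0 copies the entire source stream, BOM included:

```pascal
program CopyBytes;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: copies a file byte for byte, with no text handling.
procedure CopyFileBytes(const SrcName, DstName: string);
var
  src, dst: TFileStream;
begin
  src := TFileStream.Create(SrcName, fmOpenRead or fmShareDenyWrite);
  try
    dst := TFileStream.Create(DstName, fmCreate);
    try
      // A count of 0 means "copy everything, starting at position 0".
      dst.CopyFrom(src, 0);
    finally
      dst.Free;
    end;
  finally
    src.Free;
  end;
end;

begin
  // Source path from the original post; the target name is made up.
  CopyFileBytes('c:\1.xml', 'c:\1_copy.xml');
  WriteLn('Copy finished');
end.
```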

IMO the file I/O is broken, when it does not definitely read bytes from
a file of bytes :-(

DoDi