On 2025-11-18 20:04, Johnny Billquist wrote:
On 2025-11-16 21:59, Lawrence D'Oliveiro wrote:
On 16 Nov 2025 20:19:12 GMT, Ted Nolan <tednolan> wrote:
Lack of utf-8 would be an issue for some things, but mostly not.
Without UTF-8, you could not have ??? or ??? or ?ñ? or those curly quotes.
Of course you could.
They exist just fine in Latin-1 (hmm, maybe not the quotes...).
But when transmitting you first have to say which charset you are going
to use, and then you are limited by it, and the recipient must have the
same map and be able to use it. Perhaps he has to use his own map
instead.
ISO 8859-1 ("Latin 1") is a special case. No mapping table is required
for conversion to Unicode, because all ISO 8859-1 codepoints map 1:1
onto the first 256 Unicode codepoints. This means any UTF can be applied
directly to ISO 8859-1 codepoints.
The MIME declaration "ISO-8859-1" includes the C0 and C1 control characters.
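
A minimal Python sketch of that 1:1 property (the example byte values
are my own choice, not from the post above):

latin1_bytes = bytes([0x41, 0xE9, 0xF1, 0xA4])   # 'A', 'é', 'ñ', '¤' in Latin-1

# Each Latin-1 byte value *is* the Unicode code point, so no table is needed.
text = ''.join(chr(b) for b in latin1_bytes)
assert text == latin1_bytes.decode('iso-8859-1')

# Any UTF can then be applied directly to those code points.
print(text.encode('utf-8'))      # b'A\xc3\xa9\xc3\xb1\xc2\xa4'
print(text.encode('utf-16-le'))  # b'A\x00\xe9\x00\xf1\x00\xa4\x00'
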
charset="utf-ebcdic" is an encoding using variable lengths for all of
the codepoints in the Unicode character set. In UTF-EBCDIC an encoding
very similar to UTF-8 encodes Unicode codepoints five bits at a time
into EBCDIC. Codepoints that are under 160 are encoded in a single octet
and codepoints above 159 are encoded in multiple octets, all with the
high bit set. Only the C1 control characters are native high-bit-set EBCDIC.
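
For illustration, a rough sketch of the UTF-8-Mod ("I8") intermediate
step described in Unicode Technical Report #16, limited here to code
points below U+4000 and omitting the final translation of I8 byte values
into EBCDIC byte values:

def utf8_mod_encode(cp: int) -> bytes:
    """Sketch of the UTF-8-Mod (I8) step of UTF-EBCDIC (UTR #16).
    Trailing bytes carry five data bits each (101xxxxx)."""
    if cp < 0xA0:                       # 0..159: one byte, value = code point
        return bytes([cp])
    if cp < 0x400:                      # two bytes: 110yyyyy 101xxxxx
        return bytes([0xC0 | (cp >> 5), 0xA0 | (cp & 0x1F)])
    if cp < 0x4000:                     # three bytes: 1110zzzz 101yyyyy 101xxxxx
        return bytes([0xE0 | (cp >> 10),
                      0xA0 | ((cp >> 5) & 0x1F),
                      0xA0 | (cp & 0x1F)])
    raise ValueError("sketch only handles code points below U+4000")

print(utf8_mod_encode(ord('z')))    # b'z'        -- below 160, single octet
print(utf8_mod_encode(0x00E9))      # b'\xc7\xa9' -- 'é', two octets, high bit set
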
On 11/19/25 19:09, Eli the Bearded wrote:
charset="utf-ebcdic" is an encoding using variable lengths for all of
the codepoints in the Unicode character set. In UTF-EBCDIC an
encoding very similar to UTF-8 encodes Unicode codepoints five bits
at a time into EBCDIC. Codepoints that are under 160 are encoded in a
single octet and codepoints above 159 are encoded in multiple octets
all with the high bit set. Only the C1 control characters are native
high-bit-set EBCDIC.
That sounds like a particularly bad choice. Above 159 includes
lowercase s-z, all uppercase, and all numerics. Under 160 are only
lowercase a-r and specials. Personally I'd have chosen 128 and above
as single bytes, possibly biased (i.e. covering all alphabetics and
numerics), and 0-127 as multiple bytes (special characters).
There are no good choices involving EBCDIC.
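
Those code values are easy to check; a quick sketch using Python's
cp037 codec as a representative EBCDIC code page:

# Where EBCDIC (code page 037) puts letters and digits, relative to 160.
for ch in 'a', 'r', 's', 'z', 'A', 'Z', '0', '9':
    value = ch.encode('cp037')[0]
    side = 'below 160' if value < 160 else '160 or above'
    print(f"{ch!r} -> 0x{value:02X} ({side})")
# 'a'..'r' land below 160; 's'..'z', all uppercase and the digits land above.
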
On 20/11/2025 08:47, Richard Kettlewell wrote:
There are no good choices involving EBCDIC.
ROFLMAO....
On 2025-11-21, Stéphane CARPENTIER <sc@fiat-linux.fr> wrote:
On 18-11-2025, Eli the Bearded <*@eli.users.panix.com> wrote:
In comp.os.linux.misc, Johnny Billquist <bqt@softjar.se> wrote:
On 2025-11-16 21:59, Lawrence D'Oliveiro wrote:
Without UTF-8, you could not have ??? or ??? or ?ñ? or those curly quotes.
Of course you could.
They exist just fine in Latin-1 (hmm, maybe not the quotes...).
The Latin-1 I know does not have a Euro symbol. It does have the
generic currency placeholder at 0xA4: ¤.
They created latin9 from latin1 to add the € symbol.
I thought that was Latin-15.
Niklas
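
A quick check of where the Euro ended up, using Python's iso-8859-15
codec (the example is mine, not from the posts above):

# ISO 8859-15 is officially "Latin-9"; the 15 is the ISO part number,
# hence the confusion. It replaced the 0xA4 currency sign with the Euro.
print(b'\xa4'.decode('iso-8859-1'))    # ¤  (generic currency sign)
print(b'\xa4'.decode('iso-8859-15'))   # €
print('€'.encode('iso-8859-15'))       # b'\xa4'
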
You don't have to go very far from there to find ones that were a little
harder to deal with ...
It amazes me that computers can handle Chinese. Not only display, but
keyboards.
According to Carlos E.R. <robin_listas@es.invalid>:
You don't have to go very far from there to find ones that were a little
harder to deal with ...
It amazes me that computers can handle Chinese. Not only display, but
keyboards.
Actually, there aren't Chinese keyboards. While there were some impressive attempts at electromechanical Chinese typewriters in the 20th c., these days the way one types Chinese is to type the pinyin transliteration and the
input software figures out the characters. When there are multiple characters
with the same pinyin it can usually tell from context which one makes sense, or if need be it'll pop up a question box and the user picks the correct one.
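
A toy sketch of that pinyin-to-candidate lookup, with an invented
three-entry table; real input methods use large dictionaries plus
context and frequency models:

# Tiny, made-up candidate table keyed by pinyin syllable.
CANDIDATES = {
    'ma':  ['妈', '马', '吗', '码'],   # several characters share the syllable "ma"
    'ni':  ['你', '尼', '泥'],
    'hao': ['好', '号', '毫'],
}

def suggest(pinyin):
    """Return the candidate characters for one pinyin syllable."""
    return CANDIDATES.get(pinyin, [])

# Typing "ni hao": the software offers candidates and the user (or the
# context model) picks one from each list.
for syllable in ('ni', 'hao'):
    print(syllable, suggest(syllable))
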
Japanese has two phonetic alphabets, hiragana and katakana, so that's
what people type, with a similar scheme turning them into kanji
characters.
Displaying Chinese and Japanese is relatively straightforward since
there are Unicode code points for all of the characters that are in
common use, known as the CJK Unified Ideographs. But Chinese has a lot
of obscure rarely used characters and there is a huge backlog of them
still proposed to be added to Unicode.
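
For instance, the everyday characters all sit in the CJK Unified
Ideographs block, where Unicode names them algorithmically by code
point; a small check with Python's unicodedata:

import unicodedata

for ch in '中', '漢', '字':
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+4E2D CJK UNIFIED IDEOGRAPH-4E2D
# U+6F22 CJK UNIFIED IDEOGRAPH-6F22
# U+5B57 CJK UNIFIED IDEOGRAPH-5B57
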
If you are interested in this topic, read this excellent book:
https://en.wikipedia.org/wiki/Kingdom_of_Characters
Originally Japanese was written with Chinese characters, but the
pronunciation changed.
Japanese has two phonetic alphabets, hiragana and katakana, so that's
what people type, with a similar scheme turning them into kanji
characters.
Yes, but the 2000 kanji are essential to be considered literate. To add
to the fun, the kanji may be used in various ways to indicate the
desired pronunciation, and whether a word is an adaptation of a word
not found in the Japanese language; these are shown as superscripts set
above the first character.
A big part of the problem is that Unicode doesn't even seem to have
known what problem it was supposed to solve. Was it about representing
different characters that have different meanings? Was it about
representing the same characters but with different visual effects? Was
it supposed to be some kind of generic system to modify characters
through some clever system design?
According to Johnny Billquist <bqt@softjar.se>:
A big part of the problem is that Unicode doesn't even seem to have
known what problem it was supposed to solve. Was it about
representing different characters that have different meanings? Was
it about representing the same characters but with different visual
effects? Was it supposed to be some kind of generic system to modify
characters through some clever system design?
Nope.
Unicode is a typesetting language. Its goal is to represent every
written language that people use, and it does that quite well. When
you're setting type, the goal is to make the result look correct, and
you do not care how you do that. For example, I am old enough to
remember manual typewriters that only had the digits 2 through 9,
because you used lowercase "l" and capital "O" for digits 1 and 0.
That was fine; on those typewriters they looked the same.
The problem is that we want to represent identifiers in a unique way,
which both means that there is only one way to represent a particular identifier, and that there aren't two representations that look the
same. It shouldn't be surprising that Unicode doesn't do either of
those, so we have been coming up with kludges for the past decade to
try and fake it.
The reason we use Unicode is that while it sucks for identifiers, all
of the alternatives are even worse.
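
A small Python illustration of those two failure modes, with made-up
identifier strings: one identifier that has two different encodings
(where normalization is the usual kludge), and two different
identifiers that render identically (where normalization does not help):

import unicodedata

# Failure mode 1: the same visible identifier, two different encodings.
composed   = 'caf\u00e9'    # é as a single code point
decomposed = 'cafe\u0301'   # e followed by a combining acute accent
print(composed == decomposed)                          # False
print(unicodedata.normalize('NFC', composed) ==
      unicodedata.normalize('NFC', decomposed))        # True -- the kludge

# Failure mode 2: two different identifiers that look the same.
latin    = 'payp\u0061l'    # Latin small letter a
cyrillic = 'payp\u0430l'    # Cyrillic small letter a, visually identical
print(latin == cyrillic)                               # False
print(unicodedata.normalize('NFKC', latin) ==
      unicodedata.normalize('NFKC', cyrillic))         # still False
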