• ISO 8859-1 ("Latin 1") (was: Recent history of vi)

    From Michael Bäuerle@3:633/10 to All on Wed Nov 19 14:58:00 2025
    Carlos E.R. wrote:
    On 2025-11-18 20:04, Johnny Billquist wrote:
    On 2025-11-16 21:59, Lawrence D'Oliveiro wrote:
    On 16 Nov 2025 20:19:12 GMT, Ted Nolan <tednolan> wrote:

    Lack of utf-8 would be an issue for some things, but mostly not.

    Without UTF-8, you could not have '€' or '©' or '±' or those curly quotes.

    Of course you could.
    They exist just fine in Latin-1 (hmm, maybe not the quotes...).

    As noted by others in this thread, '€' is not available with it.

    But with transmission you have to declare first what charset you are
    going to use, and then you are limited by it; the recipient must have
    the same map, and be able to use it. Perhaps he has to use his own
    map instead.

    ISO 8859-1 ("Latin 1") is a special case. No mapping table is required
    for conversion to Unicode, because all ISO 8859-1 codepoints have 1:1
    mappings to Unicode codepoints. This means any UTF can be directly
    applied to ISO 8859-1 codepoints.

    This means, for the characters from this thread, it is sufficient to
    look at their Unicode codepoints:

    +-----------+-------------------+------------------------------------+
    | Character | Unicode codepoint | ISO 8859-1 codepoint (hexadecimal) |
    +-----------+-------------------+------------------------------------+
    | €         | U+20AC            | [not available]                    |
    | ©         | U+00A9            | A9                                 |
    | ±         | U+00B1            | B1                                 |
    +-----------+-------------------+------------------------------------+

    Any Unicode codepoint up to U+00FF is also present in ISO 8859-1 [1]
    or among the C0 and C1 control characters [2], with the same value.
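
    (A minimal sketch in Python of why no table is needed: each Latin-1
    byte value is already the Unicode codepoint, which is also exactly
    what bytes.decode("latin-1") does.)

        # Latin-1 -> UTF-8 without any mapping table.
        def latin1_to_utf8(data: bytes) -> bytes:
            return "".join(chr(b) for b in data).encode("utf-8")

        assert latin1_to_utf8(b"\xa9 \xb1") == "\u00a9 \u00b1".encode("utf-8")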

    The MIME declaration "ISO-8859-1" includes the C0 and C1 control characters.


    ______________
    [1] <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>
    [2] <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>

    --- PyGate Linux v1.5
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Eli the Bearded@3:633/10 to All on Thu Nov 20 02:09:45 2025
    In comp.os.linux.misc, Michael Bäuerle <michael.baeuerle@gmx.net> wrote:
    ISO 8859-1 ("Latin 1") is a special case. No mapping table is required
    for conversion to Unicode, because all ISO 8859-1 codepoints have 1:1 mappings to Unicode codepoints. This means any UTF can be directly
    applied to ISO 8859-1 codepoints.
    ...
    The MIME declaration "ISO-8859-1" includes the C0 and C1 control characters.

    Be technical. The MIME charset ISO-8859-1 includes the C0 and C1
    control characters and has all of its characters at the same codepoints
    as Unicode, but the character encoding is different from all Unicode
    character encodings.
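
    (A quick illustration of that distinction in Python, using its
    standard codec names: the codepoints agree, but the bytes differ.)

        s = "\u00a9\u00b1"              # © and ±, both at codepoints <= U+00FF
        print(s.encode("iso-8859-1"))   # b'\xa9\xb1'          one byte per character
        print(s.encode("utf-8"))        # b'\xc2\xa9\xc2\xb1'  two bytes per character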

    "charset" is a very specific term from MIME and it conflates character
    set with character encoding. In a world were all characters fit in
    eight bits, that's a very easy mistake to make, but since the MIME
    designers were aware of (and specifically working to accomodate) worlds
    where 8-bit encodings might not be used, that's was a poor choice.

    charset="utf-8" is an encoding using variable lengths for all of the
    codepoints in the Unicode character set. In UTF-8, codepoints that
    are under 128 are encoded in a single octet with the highbit unset. All codepoints over 127 are encoded in multiple octets all with the highbit
    set.
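
    (A quick check of that property, sketched with Python's standard
    UTF-8 codec:)

        for cp in (0x41, 0xA9, 0x20AC):
            octets = chr(cp).encode("utf-8")
            print(f"U+{cp:04X} -> {[hex(b) for b in octets]}")
        # U+0041 -> ['0x41']                  single octet, high bit clear
        # U+00A9 -> ['0xc2', '0xa9']          every octet has the high bit set
        # U+20AC -> ['0xe2', '0x82', '0xac']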

    charset="utf-7" is an encoding using variable lengths for many of the codepoints in the Unicode character set. In UTF-7 some characters are
    left as is, some characters (those above codepoint 65535) cannot be represented, and many characters are multibyte sequences. But
    critically, none of the bytes have the highbit set.
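
    (The same check for UTF-7, sketched with CPython's "utf-7" codec;
    the exact bytes shown are what that codec happens to emit:)

        encoded = "\u00b1 \u00a9".encode("utf-7")
        print(encoded)                         # b'+ALE- +AKk-'
        assert all(b < 0x80 for b in encoded)  # no high bit anywhere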

    charset="utf-ebcdic" is an encoding using variable lengths for all of
    the codepoints in the Unicode character set. In UTF-EBCDIC an encoding
    very similar to UTF-8 encodes Unicode codepoints five bits at a time
    into EBCDIC. Codepoints that are under 160 are encoded in a single octet
    and codepoints above 159 are encoded in multiple octets all with the
    highbit set. Only the C1 control chacters are native highbit set EBCDIC.

    Elijah
    ------
    here is the map to the map you want

    --- PyGate Linux v1.5
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Peter Flass@3:633/10 to All on Wed Nov 19 20:16:42 2025
    On 11/19/25 19:09, Eli the Bearded wrote:

    charset="utf-ebcdic" is an encoding using variable lengths for all of
    the codepoints in the Unicode character set. In UTF-EBCDIC an encoding
    very similar to UTF-8 encodes Unicode codepoints five bits at a time
    into EBCDIC. Codepoints that are under 160 are encoded in a single octet
    and codepoints above 159 are encoded in multiple octets all with the
    highbit set. Only the C1 control chacters are native highbit set EBCDIC.


    That sounds like a particularly bad choice. Above 159 includes
    lowercase s-z, all uppercase, and all numerics. Under 160 are only
    lowercase a-r and specials. Personally I'd have chosen 128 and above
    as single bytes, possibly biased (i.e. all alphabetics and numerics),
    and 0-127 as multiple bytes (special characters).


    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Richard Kettlewell@3:633/10 to All on Thu Nov 20 08:47:21 2025
    Peter Flass <Peter@Iron-Spring.com> writes:
    On 11/19/25 19:09, Eli the Bearded wrote:
    charset="utf-ebcdic" is an encoding using variable lengths for all of
    the codepoints in the Unicode character set. In UTF-EBCDIC an
    encoding very similar to UTF-8 encodes Unicode codepoints five bits
    at a time into EBCDIC. Codepoints that are under 160 are encoded in a
    single octet and codepoints above 159 are encoded in multiple octets
    all with the highbit set. Only the C1 control chacters are native
    highbit set EBCDIC.

    That sounds like a particularly bad choice. Above 159 includes
    lowercase s-z, all uppercase, and all numerics. Under 160 are only
    lowercase a-r and specials. Personally I'd have chosen 128 and above
    as single bytes, possibly biased (i.e. all alphabetics and numerics),
    and 0-127 as multiple bytes (special characters).

    There are no good choices involving EBCDIC.

    --
    https://www.greenend.org.uk/rjk/

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From The Natural Philosopher@3:633/10 to All on Thu Nov 20 11:10:29 2025
    On 20/11/2025 08:47, Richard Kettlewell wrote:

    There are no good choices involving EBCDIC.


    ROFLMAO....
    --
    "An intellectual is a person knowledgeable in one field who speaks out
    only in others."

    Tom Wolfe


    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Charlie Gibbs@3:633/10 to All on Thu Nov 20 17:57:31 2025
    On 2025-11-20, The Natural Philosopher <tnp@invalid.invalid> wrote:

    On 20/11/2025 08:47, Richard Kettlewell wrote:

    There are no good choices involving EBCDIC.

    ROFLMAO....

    Taken from Ted Nelson's _Computer Lib_:

    ASCII and ye shall receive.
    -- the computer industry

    ASCII not, what your machine can do for you.
    -- IBM

    A TA in one of my computer science classes pronounced EBCDIC as "ee-biddy-dick".

    --
    /~\ Charlie Gibbs | Growth for the sake of
    \ / <cgibbs@kltpzyxm.invalid> | growth is the ideology
    X I'm really at ac.dekanfrus | of the cancer cell.
    / \ if you read it the right way. | -- Edward Abbey

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Ralf Fassel@3:633/10 to All on Fri Nov 21 12:24:21 2025
    * Charlie Gibbs <cgibbs@kltpzyxm.invalid>
    | Taken from Ted Nelson's _Computer Lib_:

    | ASCII and ye shall receive.
    | -- the computer industry

    | ASCII not, what your machine can do for you.
    | -- IBM

    ASCII stupid question, get a stupid ANSI
    -- [from someone's .sig]

    R'

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Nuno Silva@3:633/10 to All on Fri Nov 21 23:20:42 2025
    On 2025-11-21, Niklas Karlsson wrote:

    On 2025-11-21, Stéphane CARPENTIER <sc@fiat-linux.fr> wrote:
    On 18-11-2025, Eli the Bearded <*@eli.users.panix.com> wrote:
    In comp.os.linux.misc, Johnny Billquist <bqt@softjar.se> wrote:
    On 2025-11-16 21:59, Lawrence D'Oliveiro wrote:
    Without UTF-8, you could not have '€' or '©' or '±' or those curly quotes.
    Of course you could.
    They exist just fine in Latin-1 (hmm, maybe not the quotes...).

    The Latin-1 I know does not have a Euro symbol. It does have the generic currency placeholder at 0xA4: ¤

    They created latin9 from latin1 to add this € symbol.

    I thought that was Latin-15.

    Niklas

    It seems it's both latin9 and iso8859-15:

    https://jkorpela.fi/latin9.html

    I was wondering why "latin15" didn't bring it up in some context the
    other day; I guess this is why.

    (On this system, I apparently can also open the online manual page for iso_8859-15 using the name "latin9".)
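
    (The swap is easy to see from Python's standard codecs, which
    likewise accept "latin9" as an alias for ISO 8859-15:)

        print(b"\xa4".decode("iso-8859-1"))   # ¤  generic currency sign
        print(b"\xa4".decode("latin9"))       # €  euro sign, new in ISO 8859-15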

    --
    Nuno Silva

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From John Levine@3:633/10 to All on Mon Nov 24 01:45:52 2025
    According to Carlos E.R. <robin_listas@es.invalid>:
    You don't have to go very far from there to find ones that were a little
    harder to deal with ...

    It amazes me that computers can handle Chinese. Not only display, but keyboards.

    Actually, there aren't Chinese keyboards. While there were some impressive attempts at electromechanical Chinese typewriters in the 20th c., these days the way one types Chinese is to type the pinyin transliteration and the
    input software figures out the characters. When there are multiple characters with the same pinyin it can usually tell from context which one makes sense,
    or if need be it'll pop up a question box and the user picks the correct one.

    Japanese has two phonetic alphabets, hiragana and katakana, so that's
    what people type, with a similar scheme turning them into kanji
    characters.

    Displaying Chinese and Japanese is relatively straightforward since
    there are Unicode code points for all of the characters that are in
    common use, known as the CJK Unified Ideographs. But Chinese has a lot
    of obscure rarely used characters and there is a huge backlog of them
    still proposed to be added to Unicode.

    If you are interested in this topic, read this excellent book:

    https://en.wikipedia.org/wiki/Kingdom_of_Characters




    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Bobbie Sellers@3:633/10 to All on Sun Nov 23 18:06:20 2025


    On 11/23/25 17:45, John Levine wrote:
    According to Carlos E.R. <robin_listas@es.invalid>:
    You don't have to go very far from there to find ones that were a little harder to deal with ...

    It amazes me that computers can handle Chinese. Not only display, but
    keyboards.

    Actually, there aren't Chinese keyboards. While there were some impressive attempts at electromechanical Chinese typewriters in the 20th c., these days the way one types Chinese is to type the pinyin transliteration and the
    input software figures out the characters. When there are multiple characters
    with the same pinyin it can usually tell from context which one makes sense, or if need be it'll pop up a question box and the user picks the correct one.

    Japanese has two phonetic alphabets, hiragana and katakana, so that's
    what people type, with a similar scheme turning them into kanji
    characters.

    Yes, but some 2000 kanji are essential to be considered literate. To
    add to the fun, the kanji may be used in various ways to indicate the
    desired pronunciation, and whether a word is an adaptation of a word
    not found in the Japanese language; these are shown as superscripts
    set above the first letter. Originally Japanese was written in
    Chinese, but the pronunciation changed. Then hiragana was invented,
    and it became an item of artistic interest, with some very
    difficult-to-read scripts being used in succeeding centuries by the
    schools of calligraphy.

    Displaying Chinese and Japanese is relatively straightforward since
    there are Unicode code points for all of the characters that are in
    common use, known as the CJK Unified Ideographs. But Chinese has a lot
    of obscure rarely used characters and there is a huge backlog of them
    still proposed to be added to Unicode.

    If you are interested in this topic, read this excellent book:

    https://en.wikipedia.org/wiki/Kingdom_of_Characters


    bliss

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Lawrence D'Oliveiro@3:633/10 to All on Mon Nov 24 02:13:59 2025
    On Sun, 23 Nov 2025 18:06:20 -0800, Bobbie Sellers wrote:

    Originally Japanese was written in Chinese, but the pronunciation
    changed.

    Japanese was an entirely different language, which adopted Chinese
    writing in lieu of having its own script. The Koreans and Vietnamese
    started out doing the same thing, but the Koreans invented their own
    syllabic-based script in the 15th century, and switched wholesale to
    that. The Vietnamese were colonized (for a while) by the French, who
    introduced a Roman-based rendition of the language, complete with
    funny squiggles here and there to denote tones of the tonal language,
    plus some other sound distinctions (e.g. 'đ' versus 'd').

    I guess the only Koreans and Vietnamese who need to understand the old Chinese-based script for their respective languages would be those dealing with old historical documents.

    Meanwhile, the Japanese stuck with the Chinese script, only adding a few complications (like two different syllabic-based character sets, as well
    as the Roman alphabet) on top of that.

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From John Levine@3:633/10 to All on Mon Nov 24 02:23:11 2025
    According to Bobbie Sellers <blissInSanFrancisco@mouse-potato.com>:
    Japanese has two phonetic alphabets, hiragana amd katakana, so that's
    what people type, with a similar scheme turning them into kanji
    characters.

    Yes, but some 2000 kanji are essential to be considered literate.

    Indeed, but the question was about how you type Japanese, not how
    you read it.

    To add to the fun, the kanji may be used in various ways to indicate
    the desired pronunciation, and whether a word is an adaptation of a
    word not found in the Japanese language; these are shown as
    superscripts set above the first letter.

    I don't know Japanese well enough to say how, if at all, one would type the superscripts.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From John Levine@3:633/10 to All on Fri Dec 5 01:59:06 2025
    According to Johnny Billquist <bqt@softjar.se>:
    A big part of the problem is that Unicode doesn't even seem to have
    known what problem it was supposed to solve. Was it about representing
    different characters that have different meanings? Was it about
    representing the same characters but with different visual effects?
    Was it supposed to be some kind of generic system to modify characters
    through some clever system design?

    Nope.

    Unicode is a typesetting language. Its goal is to represent every
    written language that people use, and it does that quite well. When
    you're setting type, the goal is to make the result look correct, and
    you do not care how you do that. For example, I am old enough to
    remember manual typewriters that only had the digits 2 through 9,
    because you used lowercase "l" and capital "O" for digits 1 and 0.
    That was fine; on those typewriters they looked the same.

    The problem is that we want to represent identifiers in a unique way,
    which means both that there is only one way to represent a particular
    identifier, and that there aren't two representations that look the
    same. It shouldn't be surprising that Unicode doesn't do either of
    those, so we have been coming up with kludges for the past decade to
    try and fake it.
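
    (Both failure modes are easy to demonstrate; a sketch with Python's
    unicodedata module, using arbitrary example characters:)

        import unicodedata

        # Two distinct codepoints that most fonts render identically:
        print("A" == "\u0391")   # False: LATIN CAPITAL A vs GREEK CAPITAL ALPHA

        # Two Unicode representations of the same visible character:
        composed   = "\u00f1"    # n-with-tilde as a single codepoint
        decomposed = "n\u0303"   # n followed by COMBINING TILDE
        print(composed == decomposed)                                # False
        print(unicodedata.normalize("NFC", decomposed) == composed)  # True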

    The reason we use Unicode is that while it sucks for identifiers, all of the alternatives are even worse.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Nuno Silva@3:633/10 to All on Fri Dec 5 10:14:24 2025
    On 2025-12-05, John Levine wrote:

    According to Johnny Billquist <bqt@softjar.se>:
    A big part of the problem is that Unicode doesn't even seem to have
    known what problem it was supposed to solve. Was it about representing
    different characters that have different meanings? Was it about
    representing the same characters but with different visual effects?
    Was it supposed to be some kind of generic system to modify characters
    through some clever system design?


    Nope.

    Unicode is a typesetting language. Its goal is to represent every
    written language that people use, and it does that quite well. When
    you're setting type, the goal is to make the result look correct, and
    you do not care how you do that. For example, I am old enough to
    remember manual typewriters that only had the digits 2 through 9,
    because you used lowercase "l" and capital "O" for digits 1 and 0.
    That was fine; on those typewriters they looked the same.

    The problem is that we want to represent identifiers in a unique way,
    which means both that there is only one way to represent a particular
    identifier, and that there aren't two representations that look the
    same. It shouldn't be surprising that Unicode doesn't do either of
    those, so we have been coming up with kludges for the past decade to
    try and fake it.

    The reason we use Unicode is that while it sucks for identifiers, all of the alternatives are even worse.

    It also at least provides a way to get the meaning of glyphs even when
    the font can't show them or the display encoding can't carry them,
    meaning one could in theory have software that replaces unreadable
    emojis with their text representation.

    Especially good ol' "no 18" :-P

    I mean, I could e.g. do M-x describe-char on '±' and it tells me
    PLUS-MINUS SIGN / PLUS-OR-MINUS SIGN. Not a big problem here as it's
    readable and can be easily translated/transliterated as "+-", but it's
    probably a much bigger advantage for other glyphs.

    I do prefer more legible displays like :-) and :-( and so on, but I can
    see value in having a glyph that specifically says it's that.
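
    (Such a fallback could look roughly like this, using the character
    names Python's unicodedata module exposes:)

        import unicodedata

        for ch in "\u00b1\u20ac\U0001f642":
            print(ch, unicodedata.name(ch, "<unnamed>"))
        # ± PLUS-MINUS SIGN
        # € EURO SIGN
        # 🙂 SLIGHTLY SMILING FACE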

    --
    Nuno Silva

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Richard Kettlewell@3:633/10 to All on Fri Dec 5 10:35:08 2025
    John Levine <johnl@taugh.com> writes:
    According to Johnny Billquist <bqt@softjar.se>:
    A big part of the problem is that Unicode doesn't even seem to have
    known what problem it was supposed to solve. Was it about
    representing different characters that have different meanings? Was
    it about representing the same characters but with different visual
    effects? Was it supposed to be some kind of generic system to modify
    characters through some clever system design?

    The introduction to the standard covers the design goals.

    Nope.

    Unicode is a typesetting language. Its goal is to represent every
    written language that people use, and it does that quite well. When
    you're setting type, the goal is to make the result look correct, and
    you do not care how you do that. For example, I am old enough to
    remember manual typewriters that only had the digits 2 through 9,
    because you used lowercase "l" and capital "O" for digits 1 and 0.
    That was fine; on those typewriters they looked the same.

    I don't fully agree with the interpretation of it as a typesetting
    standard; it explicitly disclaims any concern over visual representation
    and confines itself to interpretation of characters. The typewriter
    analogy is certainly relevant though: Unicode distinguishes capital O
    from digit 0 very clearly, whether or not your chosen font does the
    same.

    The problem is that we want to represent identifiers in a unique way,
    which means both that there is only one way to represent a particular
    identifier, and that there aren't two representations that look the
    same. It shouldn't be surprising that Unicode doesn't do either of
    those, so we have been coming up with kludges for the past decade to
    try and fake it.

    The reason we use Unicode is that while it sucks for identifiers, all
    of the alternatives are even worse.

    Agreed.

    --
    https://www.greenend.org.uk/rjk/

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Carlos E.R.@3:633/10 to All on Fri Dec 5 12:05:20 2025
    On 2025-12-05 02:59, John Levine wrote:
    Unicode is a typesetting language. Its goal is to represent every
    written language that people use, and it does that quite well. When
    you're setting type, the goal is to make the result look correct, and
    you do not care how you do that. For example, I am old enough to
    remember manual typewriters that only had the digits 2 through 9,
    because you used lowercase "l" and capital "O" for digits 1 and 0.
    That was fine; on those typewriters they looked the same.

    You could type 0[backspace]/ to get a slashed zero. It did not occur
    to me at the time, but most of what I typed was text, not math work
    for school. The numbers were distinguished by context, except when
    they weren't.

    --
    Cheers, Carlos.
    ES??, EU??;

    --- PyGate Linux v1.5.1
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)