• Chardet oddity

    From Albert-Jan Roskam@3:633/280.2 to All on Thu Oct 24 04:07:14 2024
    Today I used chardet.detect in the REPL and it returned Windows-1252
    (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
    chardet as a script (which uses UniversalDetector), it returned
    MacRoman. Isn't chardet.detect the correct way? I've used this method many
    times.
    # Interpreter
    >>> contents = open(FILENAME, "rb").read()
    >>> chardet.detect(content)
    {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
    # Terminal
    $ python -m chardet FILENAME
    FILENAME: MacRoman with confidence 0.7167379080370483
    Thanks!
    Albert-Jan
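
    One way to cross-check guesses like the two above is to attempt a strict
    decode with each candidate encoding. The following is a minimal sketch
    using only the standard library; the helper name decodable_with and the
    sample bytes are made up for illustration. Note that Python's cp1252
    codec leaves five byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D),
    whereas mac_roman maps every byte, so a successful MacRoman decode proves
    very little on its own.

```python
def decodable_with(data, candidates=("utf-8", "cp1252", "mac_roman")):
    """Return the candidate encodings that decode `data` without error."""
    usable = []
    for encoding in candidates:
        try:
            # Strict decoding raises on any byte the codec cannot map.
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        usable.append(encoding)
    return usable


# Hypothetical sample: valid UTF-8, which also happens to decode
# (as mojibake) under both single-byte codecs.
clean = b"caf\xc3\xa9"

# Hypothetical sample containing 0x81, which cp1252 leaves undefined and
# which also makes the data invalid as UTF-8.
messy = b"caf\xe9\x81"

print(decodable_with(clean))
print(decodable_with(messy))
```

    A decode that succeeds is only a necessary condition: the result can
    still be the wrong text, so it is worth eyeballing the decoded output.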

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Stefan Ram@3:633/280.2 to All on Thu Oct 24 04:43:51 2024
    Albert-Jan Roskam <sjeik_appie@hotmail.com> wrote or quoted:
    > Today I used chardet.detect in the REPL and it returned Windows-1252
    > (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
    > chardet as a script (which uses UniversalDetector), it returned
    > MacRoman. Isn't chardet.detect the correct way? I've used this method many
    > times.

    Oof, that's a head-scratcher! Looks like chardet's throwing
    you a curveball. Usually, chardet.detect() is the go-to method,
    but it seems to be off its game here.

    The script version's using UniversalDetector under the hood
    (as you wrote), which might be giving it an edge in this case.

    It's weird that the confidence levels are so close, though.
    Maybe the file's got some quirks that are tripping up the
    simpler detect() method.

    I'd say stick with the script version for now if it's giving
    you better results.

    Here's how you can use it in your code:

    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    with open(FILENAME, 'rb') as file:
        for line in file:
            # Feed the detector incrementally, one line at a time
            detector.feed(line)
            # Stop reading once the detector reports it is done
            if detector.done:
                break
    detector.close()
    print(detector.result)



    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: Stefan Ram (3:633/280.2@fidonet)
  • From Mark Bourne@3:633/280.2 to All on Thu Oct 24 06:42:00 2024
    Albert-Jan Roskam wrote:
    > Today I used chardet.detect in the REPL and it returned Windows-1252
    > (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
    > chardet as a script (which uses UniversalDetector), it returned
    > MacRoman. Isn't chardet.detect the correct way? I've used this method many
    > times.
    > # Interpreter
    > >>> contents = open(FILENAME, "rb").read()
    > >>> chardet.detect(content)

    Is that copied and pasted from the terminal, or retyped with possible
    transcription errors? As written, you've assigned the file's bytes
    to `contents`, but passed `content` (with no "s") to `chardet.detect`,
    so the result would depend on whatever was previously assigned to `content`.

    > {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
    > # Terminal
    > $ python -m chardet FILENAME
    > FILENAME: MacRoman with confidence 0.7167379080370483
    > Thanks!
    > Albert-Jan

    --
    Mark.

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: A noiseless patient Spider (3:633/280.2@fidonet)
  • From Roland Mueller@3:633/280.2 to All on Fri Oct 25 02:51:47 2024
    On Wed, Oct 23, 2024 at 20:11, Albert-Jan Roskam via Python-list (python-list@python.org) wrote:

    > Today I used chardet.detect in the REPL and it returned Windows-1252
    > (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
    > chardet as a script (which uses UniversalDetector), it returned
    > MacRoman. Isn't chardet.detect the correct way? I've used this method many
    > times.
    > # Interpreter
    > >>> contents = open(FILENAME, "rb").read()
    > >>> chardet.detect(content)
    > {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
    > # Terminal
    > $ python -m chardet FILENAME
    > FILENAME: MacRoman with confidence 0.7167379080370483
    > Thanks!
    > Albert-Jan


    The entry point for the chardet module is chardet.cli.chardetect:main, and
    main() calls the function description_of(lines, name). 'lines' is a file
    opened in 'rb' mode and 'name' holds the filename.

    I tried this in interactive mode, as shown below. I think the crucial
    difference is that description_of(lines, name) reads the opened file line
    by line and stops once something has been detected in some line.

    Reading the whole file into the variable contents may give a different
    result, depending on the input. I was not able to reproduce this
    behaviour, though. I am assuming that you used the same Python for both
    tests.

    >>> from chardet.cli import chardetect
    >>> chardetect.description_of(open('/tmp/DATE', 'rb'), 'some file')
    'some file: ascii with confidence 1.0'

    Your approach:

    >>> from chardet import detect
    >>> detect(open('/tmp/DATE','rb').read())
    {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}


    From /usr/lib/python3/dist-packages/chardet/cli/chardetect.py:

    def description_of(lines, name='stdin'):
        u = UniversalDetector()
        for line in lines:
            line = bytearray(line)
            u.feed(line)
            # shortcut out of the loop to save reading further - particularly useful if we read a BOM.
            if u.done:
                break
        u.close()
        result = u.result
        ...


  • From Albert-Jan Roskam@3:633/280.2 to All on Fri Oct 25 21:31:25 2024
    On Oct 24, 2024 17:51, Roland Mueller via Python-list
    <python-list@python.org> wrote:

    On Wed, Oct 23, 2024 at 20:11, Albert-Jan Roskam via Python-list
    (python-list@python.org) wrote:

    > Today I used chardet.detect in the REPL and it returned Windows-1252
    > (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
    > chardet as a script (which uses UniversalDetector), it returned
    > MacRoman. Isn't chardet.detect the correct way? I've used this method many
    > times.
    > # Interpreter
    > >>> contents = open(FILENAME, "rb").read()
    > >>> chardet.detect(content)
    > {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
    > # Terminal
    > $ python -m chardet FILENAME
    > FILENAME: MacRoman with confidence 0.7167379080370483
    > Thanks!
    > Albert-Jan

    The entry point for the chardet module is chardet.cli.chardetect:main, and
    main() calls the function description_of(lines, name). 'lines' is a file
    opened in 'rb' mode and 'name' holds the filename.

    I tried this in interactive mode, as shown below. I think the crucial
    difference is that description_of(lines, name) reads the opened file line
    by line and stops once something has been detected in some line.

    Reading the whole file into the variable contents may give a different
    result, depending on the input. I was not able to reproduce this
    behaviour, though. I am assuming that you used the same Python for both
    tests.

    >>> from chardet.cli import chardetect
    >>> chardetect.description_of(open('/tmp/DATE', 'rb'), 'some file')
    'some file: ascii with confidence 1.0'
    >>>

    Your approach
    >>> from chardet import detect
    >>> detect(open('/tmp/DATE','rb').read())
    {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

    From /usr/lib/python3/dist-packages/chardet/cli/chardetect.py:

    def description_of(lines, name='stdin'):
        u = UniversalDetector()
        for line in lines:
            line = bytearray(line)
            u.feed(line)
            # shortcut out of the loop to save reading further - particularly useful if we read a BOM.
            if u.done:
                break
        u.close()
        result = u.result

    =============
    Hi Mark, Roland,
    Thanks for your replies. I experimented a bit with both methods and the
    derived encoding still differed, even after I removed the "if u.done:
    break" (I removed that because I've seen cp1252 files with a UTF-8 BOM in
    the past. I kid you not!). BUT the next day, on closer inspection, I saw
    that the file was quite a mess. It contained mojibake. So I don't blame
    chardet for not being able to figure out the encoding.
    Albert-Jan
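
    The two pitfalls mentioned above (a UTF-8 BOM in front of cp1252 data, and
    mojibake) can be reproduced with the standard library alone. This is only
    an illustrative sketch: the helper name strip_verified_utf8_bom and the
    sample strings are hypothetical, and it does not reflect how chardet
    itself treats a BOM.

```python
import codecs


def strip_verified_utf8_bom(data):
    """Drop a leading UTF-8 BOM, if any, and report whether the remaining
    bytes really decode as UTF-8. Returns (payload, is_utf8)."""
    if data.startswith(codecs.BOM_UTF8):
        payload = data[len(codecs.BOM_UTF8):]
    else:
        payload = data
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return payload, False
    return payload, True


# A cp1252 payload hiding behind a UTF-8 BOM: trusting the BOM alone
# would give the wrong answer here.
mislabeled = codecs.BOM_UTF8 + "café".encode("cp1252")

# A file whose BOM and payload agree.
honest = codecs.BOM_UTF8 + "café".encode("utf-8")

# Mojibake: UTF-8 bytes that were wrongly decoded as cp1252 at some point.
# Once such text is saved again, the file is "valid" in several encodings
# and statistical detection has little to go on.
mojibake = "café".encode("utf-8").decode("cp1252")

print(strip_verified_utf8_bom(mislabeled))
print(strip_verified_utf8_bom(honest))
print(mojibake)
```

    Verifying the payload after a BOM is cheap, which is one argument for not
    stopping at the first confident signal when a file's history is unknown.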
