• Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

    From 2QdxY4RzWzUUiLuE@potatochowder.com@3:633/280.2 to All on Wed Oct 2 10:20:59 2024
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
    GB) from Kenna API

    On 2024-10-01 at 23:03:01 +0200,
    Left Right <olegsivokon@gmail.com> wrote:

    > > If I recognize the first digit, then I *can* hand that over to an
    > > external function to accumulate the digits that follow.

    > And what is that external function going to do with this information?
    > The point is you didn't parse anything if you just sent the digit.
    > You just delegated the parsing further. Parsing is only meaningful if
    > you extracted some information, but your idea is, essentially "what if
    > I do nothing?".

    If the parser detects the first digit of a number, then the parser can
    read digits one at a time (i.e., "streaming"), assimilate and accumulate
    the value of the number being parsed, and successfully finish parsing
    the number when it reads a non-digit. Whether the function that
    accumulates the value during the process is internal or external isn't
    relevant; the point is that it is possible to parse integers from most
    significant digit to least significant digit under a streaming model
    (and if you're sufficiently clever, you can even write partial results
    to external storage and/or another transmission protocol, thus allowing
    for numbers bigger (as measured by JSON or your internal
    representation) than your RAM).

    At most, the parser has to remember the non-digit character it read so
    that it (the parser) can begin to parse whatever comes after the number.
    Does that break your notion of "streaming"?
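
    For illustration, a minimal sketch of such a parser in Python (the
    function name and the (value, lookahead) convention are invented here,
    not anyone's library API):

        def parse_int_stream(chars):
            """Accumulate an integer most-significant-digit first from an
            iterator of characters. Returns (value, lookahead), where
            lookahead is the first non-digit read (or None at end of input)."""
            value = 0
            for ch in chars:
                if ch.isdigit():
                    value = value * 10 + int(ch)  # assimilate one more digit
                else:
                    return value, ch              # remember the terminator
            return value, None

        # The digits arrive one at a time, as from a stream:
        print(parse_int_stream(iter("12345,")))   # -> (12345, ',')

    Aside from the accumulated value itself, the state carried between
    characters is a single integer and, at the end, one lookahead character.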

    > > Why do I have to start with the least significant digit?

    > > Under that constraint, I'm not sure I can parse anything. How can I
    > > parse a string (and hand it over to an external function) until I've
    > > found the closing quote?

    > Nobody says that parsing a number is the only pathological case. You,
    > however, exaggerate by saying you cannot parse _anything_. You can
    > parse booleans or null, for example. There's no problem there.

    My intent was only to repeat what you implied: that any parser that
    reads its input until it has parsed a value is not streaming.

    So how much information can the parser keep before you consider it not
    to be "streaming"?

    [...]

    > In principle, any language that has infinite words will have the same
    > problem with streaming [...]

    So what magic allows anyone to stream any JSON file over SCSI or IP?
    Let alone some kind of "live stream" that by definition is indefinite,
    even if it only lasts a few tenths of a second?

    > [...] If you ever pondered h/w or low-level
    > protocols s.a. SCSI or IP [...]

    I spent a good deal of my career designing and implementing all manner
    of communications protocols, from transmitting and receiving single
    bits over a wire all the way up to what are now known as session and
    presentation layers. Some imposed maximum lengths in certain places;
    some allowed for indefinite amounts of data to be transferred from one
    end to the other without stopping, resetting, or overflowing. And yet
    somehow, the universe never collapsed.

    If you believe that some implementation of fsync fails to meet a
    specification, or fails to work correctly on files containing JSON, then
    file a bug report.

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Greg Ewing@3:633/280.2 to All on Wed Oct 2 15:27:54 2024
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
    GB) from Kenna API

    On 2/10/24 12:26 pm, avi.e.gross@gmail.com wrote:
    > The real problem is how the JSON is set up. If you take umpteen data
    > structures and wrap them all in something like a list, then it may be
    > a tad hard to stream as you may not necessarily be examining the
    > contents till the list finishes gigabytes later.

    Yes, if you want to process the items as they come in, you might
    be better off sending a series of separate JSON strings, rather than
    one JSON string containing a list.

    Or, use a specialised JSON parser that processes each item of the
    list as soon as it's finished parsing it, instead of collecting the
    whole list first.
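
    Both variants are easy to sketch in Python (the file names and the
    process() helper are placeholders; ijson is a third-party streaming
    parser, not part of the standard library):

        import json
        import ijson  # third-party streaming parser: pip install ijson

        def process(item):            # stand-in for real per-item work
            print(item)

        # (a) A series of separate JSON documents, one per line ("JSON Lines"):
        with open("items.jsonl") as f:
            for line in f:
                process(json.loads(line))  # each line is a complete document

        # (b) One big top-level list, parsed incrementally: ijson yields each
        # element of the array as soon as that element has been fully parsed.
        with open("items.json", "rb") as f:
            for item in ijson.items(f, "item"):
                process(item)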

    --
    Greg


    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Left Right@3:633/280.2 to All on Wed Oct 2 16:05:02 2024
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
    GB) from Kenna API

    > By that definition of "streaming", no parser can ever be streaming,
    > because there will be some constructs that must be read in their
    > entirety before a suitably-structured piece of output can be
    > emitted.

    In the same email you replied to, I gave examples of languages for
    which parsers can be streaming (in general): SCSI or IP. For some
    languages (e.g. everything in the context-free family) streaming
    parsers are _in general_ impossible, because there are pathological
    cases like the one with parsing numbers. But this doesn't mean that
    you cannot come up with a parser that is useful _sometimes_.
    And, in practice, languages like XML or JSON do well with streaming,
    even though in general it's impossible.

    I'm sorry if this comes as a surprise. On one hand I don't want to
    sound condescending, on the other hand, this is something that you'd
    typically study in automata theory class. Well, not exactly in the
    very same words, but you should be able to figure this stuff out if
    you had that class.

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Chris Angelico@3:633/280.2 to All on Wed Oct 2 23:59:41 2024
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
    GB) from Kenna API

    On Wed, 2 Oct 2024 at 23:53, Left Right via Python-list <python-list@python.org> wrote:
    > In the same email you replied to, I gave examples of languages for
    > which parsers can be streaming (in general): SCSI or IP.

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    ChrisA

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Chris Angelico@3:633/280.2 to All on Thu Oct 3 08:51:01 2024
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
    GB) from Kenna API

    On Thu, 3 Oct 2024 at 08:48, Left Right <olegsivokon@gmail.com> wrote:

    > > You can't validate an IP packet without having all of it. Your notion
    > > of "streaming" is nonsensical.

    > Whoa, whoa, hold your horses! "nonsensical" needs a little bit of
    > justification :)

    > It seems you don't understand the difference between words and
    > languages! In my examples, IP _protocol_ is the language, sequences of
    > IP packets are the words in the language. A language is amenable to
    > streaming if the words of the language are repetitions of fixed-length
    > sequences of symbols of the alphabet. This is, essentially, like
    > saying that the words themselves are regular.

    One single IP packet is all you can parse. You're playing shenanigans
    with words the way Humpty Dumpty does. IP packets are not sequences,
    they are individuals.

    ChrisA

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Left Right@3:633/280.2 to All on Thu Oct 3 08:48:10 2024
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
    GB) from Kenna API

    > You can't validate an IP packet without having all of it. Your notion
    > of "streaming" is nonsensical.

    Whoa, whoa, hold your horses! "nonsensical" needs a little bit of
    justification :)

    It seems you don't understand the difference between words and
    languages! In my examples, IP _protocol_ is the language, sequences of
    IP packets are the words in the language. A language is amenable to
    streaming if the words of the language are repetitions of fixed-length
    sequences of symbols of the alphabet. This is, essentially, like
    saying that the words themselves are regular.
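
    To make that concrete, a toy sketch (an invented fixed-frame protocol,
    not real SCSI or IP): because every word is a fixed-length frame, the
    parser always knows exactly how many bytes to read next, so it runs in
    constant memory no matter how long the stream is.

        import struct
        from io import BytesIO

        # Toy protocol: each frame is exactly 8 bytes, a 4-byte big-endian
        # id followed by a 4-byte big-endian length field.
        FRAME = struct.Struct(">II")

        def frames(stream):
            """Yield (ident, length) pairs, one fixed-size frame at a time."""
            while True:
                raw = stream.read(FRAME.size)
                if len(raw) < FRAME.size:   # end of stream
                    return
                yield FRAME.unpack(raw)

        wire = BytesIO(struct.pack(">IIII", 1, 20, 2, 40))
        print(list(frames(wire)))           # -> [(1, 20), (2, 40)]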

    So, the follow-up question from you to me should be: how come strictly
    context-free languages can still be parsed with streaming parsers? --
    And the answer to that is that it's possible to approximate
    context-free languages with regular languages. In fact, this is a very
    interesting subject, which unfortunately is usually overlooked in
    automata classes. It's interesting in the sense that it's very
    accessible to students who have already mastered the regular and
    context-free formalisms.

    So, streaming parsers (e.g. SAX) are written for a regular language
    that approximates XML. This works because in practice we will almost
    never encounter more than N nesting levels in an XML document, more
    than N characters in an element name etc. (for some large enough N),
    which is what allows us to carve a regular language out of a
    context-free one.
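
    For example (a sketch: the file name feed.xml and the element name
    "item" are made up), Python's xml.sax delivers events while the
    document is still being read, without ever building the whole tree:

        import xml.sax

        class ItemHandler(xml.sax.ContentHandler):
            """React to each <item> element as soon as its tag arrives."""
            def __init__(self):
                super().__init__()
                self.depth = 0

            def startElement(self, name, attrs):
                self.depth += 1           # in practice bounded by some N
                if name == "item":
                    print("item at depth", self.depth)

            def endElement(self, name):
                self.depth -= 1

        xml.sax.parse("feed.xml", ItemHandler())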

    NB. "Nonsensical" has a very precise meaning, when it comes to
    discussing the truth value of a proposition, which I think you also
    somehow didn't know about. You seem to use "nonsensical" as a synonym
    to "wrong". But, unbeknownst to you, you said something else. You
    actually implied that there's no way to tell if my notion of streaming
    is correct or not.

    But, for future reference: my notion of streaming is correct, and you
    would do better learning some materials about it before jumping to
    conclusions.

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Left Right@3:633/280.2 to All on Thu Oct 3 08:56:36 2024
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
    GB) from Kenna API

    > One single IP packet is all you can parse.

    I worked for an undisclosed company which manufactures h/w for ISPs
    (4- and 8-unit boxes you mount on a rack in a datacenter).
    Essentially, big-big routers. So, I had the pleasure of writing
    software that parses IP _protocol_, and let me tell you: you have no
    idea what you just wrote.

    But, like I wrote earlier: you don't understand the distinction
    between languages and words. And, in general, you are just being
    stubborn and rude because you are trying to prove a point to someone
    you don't like, while in reality you just look more and more
    ridiculous.

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Ethan Furman@3:633/280.2 to All on Thu Oct 3 11:57:51 2024
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
    GB) from Kenna API

    This thread is derailing.

    Please consider it closed.

    --
    ~Ethan~
    Moderator

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)