Whether and to what degree you can stream JSON depends on JSON
structure. In general, however, JSON cannot be streamed (but commonly
it can be).
Imagine a pathological case of this shape: 1... <60GB of digits>. This
is still a valid JSON (it doesn't have any limits on how many digits a
number can have). And you cannot parse this number in a streaming way
because in order to do that, you need to start with the least
significant digit.
On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
import polars as pl
pl.read_json("file.json")
This is not going to work unless the computer has a lot more than 60 GiB of RAM.
As later suggested, a streaming parser is required.
On 9/30/2024 11:30 AM, Barry via Python-list wrote:
On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
import polars as pl
pl.read_json("file.json")
This is not going to work unless the computer has a lot more than 60 GiB of RAM.
As later suggested, a streaming parser is required.
Streaming won't work because the file is gzipped. You have to receive
the whole thing before you can unzip it. Once unzipped it will be even larger, and all in memory.
On 2024-09-30, Left Right via Python-list <python-list@python.org> wrote:
Whether and to what degree you can stream JSON depends on JSON
structure. In general, however, JSON cannot be streamed (but commonly
it can be).
Imagine a pathological case of this shape: 1... <60GB of digits>. This
is still a valid JSON (it doesn't have any limits on how many digits a number can have). And you cannot parse this number in a streaming way because in order to do that, you need to start with the least
significant digit.
Which is how Arabic numbers were originally parsed, but when
westerners adopted them from a R->L written language, they didn't flip
them around to match the L->R written language into which they were
being adopted.
So now long numbers can't be parsed as a stream in software. They
should have anticipated this problem back in the 13th century and
flipped the numbers around.
On 2024-09-30 at 11:44:50 -0400,
Grant Edwards via Python-list <python-list@python.org> wrote:
On 2024-09-30, Left Right via Python-list <python-list@python.org> wrote:
[...]
Imagine a pathological case of this shape: 1... <60GB of digits>. This
is still a valid JSON (it doesn't have any limits on how many digits a
number can have). And you cannot parse this number in a streaming way
because in order to do that, you need to start with the least
significant digit.
Which is how Arabic numbers were originally parsed, but when
westerners adopted them from a R->L written language, they didn't
flip them around to match the L->R written language into which they
were being adopted.
Interesting.
So now long numbers can't be parsed as a stream in software. They
should have anticipated this problem back in the 13th century and
flipped the numbers around.
What am I missing? Handwavingly, start with the first digit, and as
long as the next character is a digit, multiply the accumulated
result by 10 (or the appropriate base) and add the next value.
[...] But why do I need to start with the least significant digit?
On Tue, 1 Oct 2024 at 02:20, Thomas Passin via Python-list <python-list@python.org> wrote:
On 9/30/2024 11:30 AM, Barry via Python-list wrote:
On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
import polars as pl
pl.read_json("file.json")
This is not going to work unless the computer has a lot more than 60 GiB of RAM.
As later suggested, a streaming parser is required.
Streaming won't work because the file is gzipped. You have to receive
the whole thing before you can unzip it. Once unzipped it will be even
larger, and all in memory.
Streaming gzip is perfectly possible. You may be thinking of PKZip
which has its EOCD at the end of the file (although it may still be
possible to stream-decompress if you work at it).
ChrisA
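As a minimal sketch of what streaming decompression can look like in Python: the standard library's zlib module can consume a gzip stream piece by piece, so only one chunk needs to be in memory at a time. The helper name iter_gunzip and the chunk size are illustrative assumptions, and the sketch handles a single gzip member only.

```python
import gzip
import io
import zlib
from typing import BinaryIO, Iterator


def iter_gunzip(src: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield decompressed pieces of a single-member gzip stream as they arrive."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        piece = decomp.decompress(chunk)
        if piece:
            yield piece
    # Emit whatever is still buffered inside the decompressor.
    tail = decomp.flush()
    if tail:
        yield tail


# Small self-check with in-memory data standing in for a network response.
payload = b"[" + b",".join(str(i).encode() for i in range(1000)) + b"]"
compressed = io.BytesIO(gzip.compress(payload))
restored = b"".join(iter_gunzip(compressed, chunk_size=100))
print(restored == payload)
```

In real use the pieces would be handed to the next stage (for example an incremental JSON parser) instead of being joined back together.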
--
https://mail.python.org/mailman/listinfo/python-list
On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list <python-list@python.org> wrote:
But why do I need to start with the least
significant digit?
If you start from the most significant, you don't know anything about
the number until you finish parsing it. There's almost nothing you can
say about a number given that it starts with a particular sequence
(since you don't know how MANY digits there are). However, if you know
the LAST digits, you can make certain statements about it (trivial
examples being whether it's odd or even).
It's not very, well, significant. But there's something to it. And it
extends nicely to p-adic numbers, which can have an infinite number of nonzero digits to the left of the decimal:
https://en.wikipedia.org/wiki/P-adic_number
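To make the point about trailing digits concrete, here is a small sketch. It assumes decimal digits arrive least-significant first; after k digits the value modulo 10**k is already settled, whatever comes later. The function name residues_lsd_first is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def residues_lsd_first(digits: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """For decimal digits fed least-significant first, yield (k, n mod 10**k)."""
    value = 0
    place = 1
    for k, ch in enumerate(digits, start=1):
        value += int(ch) * place
        place *= 10
        # Digits that arrive later only add multiples of 10**k,
        # so this residue can no longer change.
        yield k, value


# The number 9042, fed as "2", "4", "0", "9".
for k, residue in residues_lsd_first(reversed("9042")):
    print(k, residue, "even" if residue % 2 == 0 else "odd")
```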
In Common Lisp, integers can be written in any integer base from two
to thirty six, inclusive. So knowing the last digit doesn't tell
you whether an integer is even or odd until you know the base
anyway.
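A quick way to check this, sketched in Python rather than Common Lisp: the built-in int() also accepts bases from 2 to 36, so the same digit string can be read in several bases. The helper name parity is made up for the illustration.

```python
def parity(digits: str, base: int) -> str:
    """Report whether the integer written as `digits` in `base` is even or odd."""
    return "even" if int(digits, base) % 2 == 0 else "odd"


# Same digits, same last digit, different bases.
for base in (10, 3, 7):
    print(base, int("12", base), parity("12", base))
```

In an even base the last digit settles the parity; in an odd base every digit contributes to it.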
On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:
In Common Lisp, integers can be written in any integer base from two
to thirty six, inclusive. So knowing the last digit doesn't tell
you whether an integer is even or odd until you know the base
anyway.
I had to think about that for an embarrassingly long time before it
clicked.
On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list <python-list@python.org> wrote:
On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:
In Common Lisp, integers can be written in any integer base from two
to thirty six, inclusive. So knowing the last digit doesn't tell
you whether an integer is even or odd until you know the base
anyway.
I had to think about that for an embarrassingly long time before it
clicked.
The only part I'm not clear on is what identifies the base. If you're
going to write numbers little-endian, it's not that hard to also write
them with a base indicator before the digits [...]
What am I missing? Handwavingly, start with the first digit, and as
long as the next character is a digit, multiply the accumulated result
by 10 (or the appropriate base) and add the next value. Oh, and handle
scientific notation as a special case, and perhaps fail spectacularly
instead of recovering gracefully in certain edge cases. And in the
pathological case of a single number with 60 billion digits, run out of
memory (and complain loudly to the person who claimed that the file
contained a "dataset"). But why do I need to start with the least
significant digit?
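The accumulate-as-you-go idea can be sketched as follows. This is only an illustration under simplifying assumptions: non-negative decimal integers, no signs, fractions or exponents, and a made-up function name iter_ints. The accumulator is the state that must be held until the number ends, so it still grows with the number of digits.

```python
from typing import Iterable, Iterator, Optional


def iter_ints(chunks: Iterable[str]) -> Iterator[int]:
    """Yield non-negative decimal integers found in chunked text.

    Digits are consumed most significant first; a number is only
    handed on once a non-digit (or the end of input) terminates it.
    """
    acc: Optional[int] = None  # None means "not inside a number"
    for chunk in chunks:
        for ch in chunk:
            if "0" <= ch <= "9":
                acc = (acc or 0) * 10 + (ord(ch) - ord("0"))
            elif acc is not None:
                yield acc  # the first non-digit ends the number
                acc = None
    if acc is not None:
        yield acc  # number terminated by end of input


# The first number straddles a chunk boundary.
print(list(iter_ints(["12", "34, 5", "6 7"])))
```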
Streaming won't work because the file is gzipped. You have to receive
the whole thing before you can unzip it. Once unzipped it will be even larger, and all in memory.
GZip is specifically designed to be streamed. So, that's not a
problem (in principle), but you would need to have a streaming GZip
parser. A quick search on PyPI revealed this package:
https://pypi.org/project/gzip-stream/ .
2QdxY4RzWzUUiLuE@potatochowder.com writes:
[...]
In Common Lisp, you can write integers as #nnR[digits], where nn is the decimal representation of the base (possibly without a leading zero),
the # and the R are literal characters, and the digits are written in
the intended base. So the input #16fFFFF is read as the integer 65535.
Typo: You meant #16RFFFF, not #16fFFFF.
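For readers without a Lisp at hand, the notation can be mimicked in Python. This is a rough sketch, not the Common Lisp reader: the function name parse_radix_literal is hypothetical, and signs and other reader details are deliberately not handled.

```python
import re

# "#", a decimal base, "R" (either case), then the digits in that base.
_RADIX = re.compile(r"#([0-9]+)[rR]([0-9A-Za-z]+)")


def parse_radix_literal(text: str) -> int:
    """Parse a literal of the form #nnR[digits] into an int."""
    m = _RADIX.fullmatch(text)
    if m is None:
        raise ValueError(f"not a #nnR literal: {text!r}")
    base = int(m.group(1))
    if not 2 <= base <= 36:
        raise ValueError(f"base out of range: {base}")
    # int() raises ValueError if a digit is not valid in the given base.
    return int(m.group(2), base)


print(parse_radix_literal("#16RFFFF"))
```

Because the base comes before the digits, a reader knows how to interpret each digit as soon as it arrives.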
What am I missing? Handwavingly, start with the first digit, and as
long as the next character is a digit, multiply the accumulated result
by 10 (or the appropriate base) and add the next value. Oh, and handle
scientific notation as a special case, and perhaps fail spectacularly
instead of recovering gracefully in certain edge cases. And in the
pathological case of a single number with 60 billion digits, run out of
memory (and complain loudly to the person who claimed that the file
contained a "dataset"). But why do I need to start with the least
significant digit?
You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the magnitude yet. What about two digits? -- Same thing. You cannot
leave the parser code until you know the magnitude (otherwise the
information is useless to the external code).
So, even if you have enough memory and don't care about special cases
like scientific notation: yes, you will be able to parse it, but it
won't be a streaming parser.
On 2024-09-30 at 21:34:07 +0200,
Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
Left Right via Python-list <python-list@python.org> wrote:
What am I missing? Handwavingly, start with the first digit, and as
long as the next character is a digit, multiply the accumulated result
by 10 (or the appropriate base) and add the next value. Oh, and handle
scientific notation as a special case, and perhaps fail spectacularly
instead of recovering gracefully in certain edge cases. And in the
pathological case of a single number with 60 billion digits, run out of
memory (and complain loudly to the person who claimed that the file
contained a "dataset"). But why do I need to start with the least
significant digit?
You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the magnitude yet. What about two digits? -- Same thing. You cannot
leave the parser code until you know the magnitude (otherwise the information is useless to the external code).
If I recognize the first digit, then I *can* hand that over to an
external function to accumulate the digits that follow.
So, even if you have enough memory and don't care about special cases
like scientific notation: yes, you will be able to parse it, but it
won't be a streaming parser.
Under that constraint, I'm not sure I can parse anything. How can I
parse a string (and hand it over to an external function) until I've
found the closing quote?
How much state can a parser maintain (before it invokes an external
function) and still be considered streaming? I fear that we may be
getting hung up on terminology rather than solving the problem at hand.
Consider also an interesting
consequence of SCSI not being able to have infinite words: this means,
among other things, that fsync() is nonsense! :) If you aren't
familiar with the concept: the UNIX filesystem API suggests that it's
possible to destage an arbitrarily large file (or a chunk of a file) to
disk. But SCSI is built of finite "words", and to describe an
arbitrarily large file you'd need to list all the blocks that
constitute the file!
the only way to implement fsync() in compliance with the
standard is to sync _everything_
You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the magnitude yet.
If I recognize the first digit, then I *can* hand that over to an
external function to accumulate the digits that follow.
And what is that external function going to do with this information?
The point is you didn't parse anything if you just sent the digit.
You just delegated the parsing further. Parsing is only meaningful if
you extracted some information, but your idea is, essentially "what if
I do nothing?".
Under that constraint, I'm not sure I can parse anything. How can I
parse a string (and hand it over to an external function) until I've
found the closing quote?
Nobody says that parsing a number is the only pathological case. You, however, exaggerate by saying you cannot parse _anything_. You can
parse booleans or null, for example. There's no problem there.
In principle, any language that has infinite words will have the same
problem with streaming [...]
[...] If you ever pondered h/w or low-level
protocols such as SCSI or IP [...]
The real problem is how the JSON is set up. If you take umpteen data structures and wrap them all in something like a list, then it may be a tad hard to stream as you may not necessarily be examining the contents till the list finishes gigabytes later.
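When the wrapper is a single top-level list, one possible workaround is to hand out each element as soon as it is complete instead of waiting for the closing bracket. The following is a sketch using only the standard json module; the name iter_array_items is an assumption, it does little validation, and it re-scans a partially received element on every new chunk, so a dedicated incremental parser would be the better choice for real data.

```python
import json
from typing import Any, Iterable, Iterator


def iter_array_items(chunks: Iterable[str]) -> Iterator[Any]:
    """Yield the elements of one top-level JSON array as each completes."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break  # need more input
            if not started:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf = buf[1:]
                started = True
                continue
            if buf[0] == "]":
                return  # end of the array
            if buf[0] == ",":
                buf = buf[1:]
                continue
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # element still incomplete: wait for more input
            rest = buf[end:].lstrip()
            if not rest:
                # A number like 12 might continue as 123 in the next chunk,
                # so wait until a delimiter follows the element.
                break
            yield item
            buf = rest


parts = ['[{"a": 1}, {"b"', ': [2, 3]}, 4', '5, "x"]']
for item in iter_array_items(parts):
    print(item)
```

Only the element currently being assembled is held in memory, which is the property that matters for a 60 GB list of modest records. It does not help with the pathological single huge value discussed earlier in the thread.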
By that definition of "streaming", no parser can ever be streaming,
because there will be some constructs that must be read in their
entirety before a suitably-structured piece of output can be
emitted.
In the same email you replied to, I gave examples of languages for
which parsers can be streaming (in general): SCSI or IP.
You can't validate an IP packet without having all of it. Your notion
of "streaming" is nonsensical.
Whoa, whoa, hold your horses! "nonsensical" needs a little bit of justification :)
It seems you don't understand the difference between words and
languages! In my examples, IP _protocol_ is the language, sequences of
IP packets are the words in the language. A language is amenable to
streaming if the words of the language are repetition of sequences of
symbols of the alphabet of fixed length. This is, essentially, like
saying that the words themselves are regular.
On Thu, 3 Oct 2024 at 08:48, Left Right <olegsivokon@gmail.com> wrote:
You can't validate an IP packet without having all of it. Your notion
of "streaming" is nonsensical.
Whoa, whoa, hold your horses! "nonsensical" needs a little bit of justification :)
It seems you don't understand the difference between words and
languages! In my examples, IP _protocol_ is the language, sequences of
IP packets are the words in the language. A language is amenable to streaming if the words of the language are repetition of sequences of symbols of the alphabet of fixed length. This is, essentially, like
saying that the words themselves are regular.
One single IP packet is all you can parse. You're playing shenanigans
with words the way Humpty Dumpty does. IP packets are not sequences,
they are individuals.
ChrisA