On Fri, 12/26/2025 10:13 PM, Lawrence D'Oliveiro wrote:
On Sun, 7 Dec 2025 19:01:02 +0000, Richard Harnden wrote:
A text file is supposed to end with a '\n'
PDF files end with that. The object index comes at the end, and each
index entry is fixed in length and ends with \015\012.
But the spec makes it very clear that PDF files are not supposed to be treated as text files.
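The best you can do is for the PDF to be entirely text except for
a few bytes near the top (second line). As far as I can tell from the spec,
that line is a comment ('%' followed by at least four bytes of value 128 or
more) whose job is to make file-transfer tools treat the file as binary.
It is not a hash over the document, and I've seen at least one document
that omits the binary line.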
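For what it's worth, the second-line bytes can be checked mechanically. This is a minimal Python sketch, assuming the marker is a '%' comment line carrying at least four bytes with values of 128 or more (my reading of the PDF spec). The function name is made up for illustration.

```python
import re
from typing import Optional

def binary_marker(data: bytes) -> Optional[bytes]:
    """Return the second line if it looks like a binary-marker comment, else None."""
    # PDF allows CR, LF or CRLF as end-of-line, so split on any of them.
    parts = re.split(rb"\r\n|\r|\n", data, maxsplit=2)
    if len(parts) < 2 or not parts[0].startswith(b"%PDF-"):
        return None
    second = parts[1]
    # Count the bytes after the '%' that have the high bit set.
    high = sum(1 for b in second[1:] if b >= 0x80)
    if second.startswith(b"%") and high >= 4:
        return second
    return None

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        print(binary_marker(fh.read(1024)))
```

Fed the bytes "25 B8 9A 92 9D 0A" right after a %PDF-1.4 header, it should return the five bytes before the line feed. A file with no such line gives None.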
At least in the PDF shown below, the document is 99% text. And mutool can be
used to convert a "mostly binary" PDF into a "mostly text" PDF.
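A rough sketch of that conversion, driven from Python. It assumes mutool is on the PATH and that your build's "mutool clean" accepts -d (decompress streams) and -a (ASCII-hex encode whatever binary streams remain). Check the usage text of your version first. The helper names are my own.

```python
import shutil
import subprocess

def mutool_clean_cmd(src: str, dst: str) -> list:
    """Build the command line; -d and -a are assumed to be supported by your mutool."""
    return ["mutool", "clean", "-d", "-a", src, dst]

def expand_pdf(src: str, dst: str) -> None:
    """Run mutool to write a more text-like copy of src to dst."""
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool not found on PATH")
    subprocess.run(mutool_clean_cmd(src, dst), check=True)

if __name__ == "__main__":
    import sys
    expand_pdf(sys.argv[1], sys.argv[2])
```

Images and embedded fonts may still come out as large hex blobs, so "mostly text" here means text-safe rather than human-readable.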
If a PDF is encrypted, its strings and streams are unlikely to be readable
as text when you naively open the file in an editor.
PDFs can be "anywhere from 99% binary to 99% text". It all depends.
Generally, the ones that are mostly text are the simplest of documents.
Rich media documents will have a lot more binary that cannot be
reduced by simple transformations. To fix that, you would have to start
with source materials that have a closer-to-textual representation in the
first place.
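To put a number on "mostly binary" versus "mostly text", here is a small Python sketch. It simply counts printable ASCII plus tab, LF and CR as "text". That threshold is my own assumption, not anything the PDF spec defines.

```python
def text_fraction(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, LF or CR."""
    if not data:
        return 1.0  # an empty file has nothing binary in it
    texty = sum(1 for b in data if b in (9, 10, 13) or 32 <= b <= 126)
    return texty / len(data)

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        print(f"{text_fraction(fh.read()):.1%} text")
```

Run it on a file before and after decompressing the streams to see how far the ratio moves.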
***********************************************************************************************************
%PDF-1.4
<=== these can "look like binary" "25 B8 9A 92 9D 0A"
1 0 obj<</Type/Catalog/Pages 3 0 R>>
endobj
2 0 obj<</Producer(GemBox GemBox.Pdf 1.7 (17.0.35.1042; .NET Framework))/CreationDate(D:20211028151721+02'00')>>
endobj
3 0 obj<</Type/Pages/Kids[4 0 R]/Count 1/MediaBox[0 0 595.32 841.92]>>
endobj
4 0 obj<</Type/Page/Parent 3 0 R/Resources<</Font<</F0 6 0 R>>>>/Contents 5 0 R>>
endobj
5 0 obj<</Length 59>>stream
BT
/F0 12 Tf
1 0 0 1 100 702.7366667 Tm
(Hello World!)Tj
ET
endstream
endobj
6 0 obj<</Type/Font/Subtype/Type1/BaseFont/Helvetica/FirstChar 32/LastChar 114/Widths 7 0 R/FontDescriptor 8 0 R>>
endobj
7 0 obj[278 278 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 722 0 0 0 0 0 0 0 0 0 0 0 0 0 0 944 0 0 0 0 0 0 0 0 0 0 0 0 556 556 0 0 0 0 0 0 222 0 0 556 0 0 333]
endobj
8 0 obj<</Type/FontDescriptor/Flags 32/FontName/Helvetica/FontFamily(Helvetica)/FontWeight 500/ItalicAngle 0/FontBBox[-166 -225 1000 931]/CapHeight 718/XHeight 523/Ascent 718/Descent -207/StemH 76/StemV 88>>
endobj
xref
0 9
0000000000 65535 f
0000000015 00000 n
0000000059 00000 n
0000000179 00000 n
0000000257 00000 n
0000000346 00000 n
0000000451 00000 n
0000000573 00000 n
0000000773 00000 n
trailer
<</Root 1 0 R/ID[<9392A59F3BE7B840805D62746E8A4F29><9392A59F3BE7B840805D62746E8A4F29>]/Info 2 0 R/Size 9>>
startxref
988
%%EOF
***********************************************************************************************************
If "there has to be binary in it", it's on the second line.
The other lines can be text... if the tools and print drivers
wanted to do it that way.
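Since the listing above is nearly all text, the xref table at the end can be read with a few regular expressions. Here is a minimal Python sketch, assuming a classic (uncompressed) xref table with a single subsection, like the one in the listing. The spec calls for fixed 20-byte entries, but the sketch tolerates other line endings. The function names are hypothetical.

```python
import re

START = re.compile(rb"startxref\s+(\d+)")
HEAD = re.compile(rb"xref\s+(\d+) (\d+)\s+")
ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s*")

def read_xref(data: bytes) -> dict:
    """Map object number -> (offset, generation, in_use) for one xref subsection."""
    idx = data.rfind(b"startxref")  # the last one wins
    m = START.match(data, idx) if idx >= 0 else None
    if m is None:
        raise ValueError("no startxref found")
    head = HEAD.match(data, int(m.group(1)))
    if head is None:
        raise ValueError("startxref does not point at a classic xref table")
    first, count = int(head.group(1)), int(head.group(2))
    table, pos = {}, head.end()
    for num in range(first, first + count):
        e = ENTRY.match(data, pos)
        if e is None:
            raise ValueError("malformed xref entry")
        table[num] = (int(e.group(1)), int(e.group(2)), e.group(3) == b"n")
        pos = e.end()
    return table

def offsets_ok(data: bytes) -> bool:
    """Check that every in-use entry points at the matching 'N G obj' header."""
    for num, (off, gen, used) in read_xref(data).items():
        if used and not data.startswith(b"%d %d obj" % (num, gen), off):
            return False
    return True
```

Note that a listing pasted into a message will usually fail offsets_ok, because line endings and the binary line get altered on the way and shift every byte offset.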
Paul