• Re: Get to know your files and folders!

    From Lawrence D'Oliveiro@3:633/10 to All on Tue Jan 27 21:02:55 2026
    On Mon, 26 Jan 2026 13:28:24 -0500, DFS wrote:

    Here's some Python code I wrote to capture file metadata (name,
    location, date created, date modified, and size) in a SQLite
    database.

    I would consider this a waste of time. There are already standard *nix
    commands (e.g. du(1) <https://manpages.debian.org/du(1)>) for
    obtaining this information directly from the filesystem, without the
    extra steps of collecting the info in a database and having to keep
    that up to date.

    Tested on Windows and Linux/WSL.

    But not on native Linux? Because WSL forces the Linux kernel to go
    through the filesystem-handling bottleneck that is the Windows kernel.

    Just some thoughts:

    cSQL = " CREATE TABLE Files "
    cSQL += " ( "
    cSQL += " FileID INTEGER NOT NULL PRIMARY KEY, "
    cSQL += " FolderID INTEGER REFERENCES Folders (FolderID), "
    cSQL += " Folder TEXT NOT NULL, "
    cSQL += " FileName TEXT NOT NULL, "
    cSQL += " FileCreated NUMBER NOT NULL, "
    cSQL += " FileModified NUMBER NOT NULL, "
    cSQL += " FileSizeKB NUMBER NOT NULL "
    cSQL += " );"

    Did you know Python does implicit string concatenation, like C and
    C++?

    Also, I notice you are assuming each file has only one parent folder.
    You do know *nix systems are not restricted like this, right?
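
    For example, a hard link gives one file a second directory entry, so it
    sits in two folders at once. A quick sketch (the paths are made up):

    import os

    # second directory entry for the same file (same inode)
    os.link("/srv/docs/report.txt", "/srv/archive/report.txt")

    st = os.stat("/srv/docs/report.txt")
    print(st.st_nlink)   # 2 -- the file now has two parent folders
    print(st.st_ino)     # both names report the same inode number

    # listing every name for that inode takes a search, e.g.
    #     find /srv -inum <that inode> -xdev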

    filesize = round(os.path.getsize(root + '/' + file)/1000,1)
    filecreate = os.path.getctime(root + '/' + file)
    filecreate = str(datetime.datetime.fromtimestamp(filecreate))[0:19]
    filemod = os.path.getmtime(root + '/' + file)

    How many different file-info lookups do you need to do on each file?
    How do you handle symlinks? (Yes, even Windows has those now.)

    The usual way to get this info is with os.lstat() <https://docs.python.org/3/library/os.html#os.lstat>, which returns it
    all with a single OS call.

    The major slowdown is one cartesian/update query - used to summarize
    data in all subdirectories - for which I haven't been able to figure
    out a decent workaround.

    As I said, your problem is using a DBMS in the first place. You are
    doing a cross-join of *all* files against *all* folders. But in the
    real filesystem, it would be unheard of for *all* files to be present
    in *all* folders -- or indeed, for many files to be present in more
    than one folder.

    Also, I notice your database structure does not reflect the folder
    hierarchy -- where do you record parent-child relationships between
    folders?

    In short, take more account of the actual filesystem hierarchy in your
    database structure.

    --- PyGate Linux v1.5.6
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From DFS@3:633/10 to All on Tue Jan 27 23:22:15 2026
    On 1/27/2026 4:02 PM, Lawrence D'Oliveiro wrote:
    On Mon, 26 Jan 2026 13:28:24 -0500, DFS wrote:

    Here's some Python code I wrote to capture file metadata (name,
    location, date created, date modified, and size) in a SQLite
    database.

    I would consider this a waste of time. There are already standard *nix
    commands (e.g. du(1) <https://manpages.debian.org/du(1)>) for
    obtaining this information directly from the filesystem, without the
    extra steps of collecting the info in a database and having to keep
    that up to date.

    Yeah, I know there are various commands and switches on Linux (and
    Windows) for examining directories and files, but none that would
    provide what my program does, as far as I know.

    So unless you can write scripts to summarize your files/folders all the
    ways I do (as shown in the VIEWs), it's not a waste of time. And even
    then, unless the data is in a db, it's nearly useless. All you can do
    is scroll it up and down on the screen, and you have to write a new
    program every time you want a different view.

    Plus once the data is in tables, you can query and sort it and
    manipulate and scroll it in all kinds of ways.

    What kind of script/commands will show the 50 largest .zip files, in
    descending order by size, across a large set of subdirectories?

    Once the data is loaded, in about 2 seconds I can summarize 400K files
    by extension, with one simple query.
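
    For instance, against the Files table created above (the database
    filename here is just a placeholder), both of those reports come out of
    plain queries -- a sketch, not the exact VIEWs shipped with the program:

    import sqlite3

    conn = sqlite3.connect("filedata.db")   # placeholder name

    # 50 largest .zip files, descending by size, across everything scanned
    for folder, name, kb in conn.execute("""
            SELECT Folder, FileName, FileSizeKB
            FROM Files
            WHERE FileName LIKE '%.zip'
            ORDER BY FileSizeKB DESC
            LIMIT 50"""):
        print(f"{kb:12,.1f} KB  {folder}/{name}")

    # files and total size grouped by extension (the text after the last '.';
    # names with no dot come back whole -- a known limit of this rtrim trick)
    for ext, nfiles, totalkb in conn.execute("""
            SELECT SUBSTR(FileName,
                     LENGTH(RTRIM(FileName, REPLACE(FileName, '.', ''))) + 1) AS Ext,
                   COUNT(*)        AS Files,
                   SUM(FileSizeKB) AS TotalKB
            FROM Files
            GROUP BY Ext
            ORDER BY TotalKB DESC"""):
        print(f"{ext:<10} {nfiles:8} {totalkb:14,.1f}")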

    And I haven't done it yet but I might: pop an Access or PyQt interface
    onto the tables and find/display/compare/open files far far faster than
    can be done with a typical file manager.

    Did you run it?


    Tested on Windows and Linux/WSL.

    But not on native Linux?

    I don't have a native Linux install.


    Because WSL forces the Linux kernel to go
    through the filesystem-handling bottleneck that is the Windows kernel.

    WSL1 might have, but WSL2 doesn't. Today the WSL2 distro lives in and
    runs from an ext4-formatted .vhdx file located at:

    C:\Users\DFS\AppData\Local\Packages\KaliLinux.54290C8133FEE_ey8k8hqnwqnmg\LocalState\ext4.vhdx

    I'd say every command-line program I've ever run (mostly C and Python)
    runs 15% to 150% faster on WSL than on Windows. This python/db-api
    program runs at least twice as fast on WSL as on Windows.



    Just some thoughts:

    cSQL = " CREATE TABLE Files "
    cSQL += " ( "
    cSQL += " FileID INTEGER NOT NULL PRIMARY KEY, "
    cSQL += " FolderID INTEGER REFERENCES Folders (FolderID), "
    cSQL += " Folder TEXT NOT NULL, "
    cSQL += " FileName TEXT NOT NULL, "
    cSQL += " FileCreated NUMBER NOT NULL, "
    cSQL += " FileModified NUMBER NOT NULL, "
    cSQL += " FileSizeKB NUMBER NOT NULL "
    cSQL += " );"

    Did you know Python does implicit string concatenation, like C and
    C++?

    I've used other ways here and there, but that style of string-building
    is my habit.

    other ways that might be a little more efficient:

    cSQL = (
    " CREATE TABLE Files "
    " ( "
    " FileID INTEGER NOT NULL PRIMARY KEY, "
    " FolderID INTEGER REFERENCES Folders (FolderID), "
    " Folder TEXT NOT NULL, "
    " FileName TEXT NOT NULL, "
    " FileCreated NUMBER NOT NULL, "
    " FileModified NUMBER NOT NULL, "
    " FileSizeKB NUMBER NOT NULL "
    " );"
    )


    cSQL = " CREATE TABLE Files \
    ( \
    FileID INTEGER NOT NULL PRIMARY KEY, \
    FolderID INTEGER REFERENCES Folders (FolderID), \
    Folder TEXT NOT NULL, \
    FileName TEXT NOT NULL, \
    FileCreated NUMBER NOT NULL, \
    FileModified NUMBER NOT NULL, \
    FileSizeKB NUMBER NOT NULL \
    );"

    Definitely don't like this trailing-backslash format.
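
    Another option the post doesn't show, and probably the most common one
    for multi-line SQL in Python, is a triple-quoted string: no repeated
    quotes, no trailing backslashes:

    cSQL = """
        CREATE TABLE Files
        (
            FileID       INTEGER NOT NULL PRIMARY KEY,
            FolderID     INTEGER REFERENCES Folders (FolderID),
            Folder       TEXT NOT NULL,
            FileName     TEXT NOT NULL,
            FileCreated  NUMBER NOT NULL,
            FileModified NUMBER NOT NULL,
            FileSizeKB   NUMBER NOT NULL
        );"""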


    Also, I notice you are assuming each file has only one parent folder.
    You do know *nix systems are not restricted like this, right?

    I didn't know that, nor have I ever seen it in use (that I knew of).

    How do you assign one file to multiple folders, and then see the
    multiple parent folders associated with the file?



    filesize = round(os.path.getsize(root + '/' + file)/1000,1)
    filecreate = os.path.getctime(root + '/' + file)
    filecreate = str(datetime.datetime.fromtimestamp(filecreate))[0:19]
    filemod = os.path.getmtime(root + '/' + file)

    How many different file-info lookups do you need to do on each file?

    1. file size (os.path.getsize)
    2. create date (os.path.getctime)
    3. last mod date (os.path.getmtime)



    How do you handle symlinks? (Yes, even Windows has those now.)

    now? You mean for 17 years.

    But I personally don't use symlinks on Windows (I do use shortcuts
    sometimes), and didn't consider them for this exercise.


    The usual way to get this info is with os.lstat() <https://docs.python.org/3/library/os.html#os.lstat>, which returns it
    all with a single OS call.


    I compared them:

    original os.path code:
    filesize = round(os.path.getsize(root + '/' + file)/1000,1)
    filecreate = os.path.getctime(root + '/' + file)
    filecreate = str(datetime.datetime.fromtimestamp(filecreate))[0:19]
    filemod = os.path.getmtime(root + '/' + file)
    filemod = str(datetime.datetime.fromtimestamp(filemod))[0:19]

    os.lstat():
    fileStat = os.lstat(root + '/' + file)
    filesize = round(fileStat.st_size/1000,1)
    filecreate = fileStat.st_ctime
    filecreate = str(datetime.datetime.fromtimestamp(filecreate))[0:19]
    filemod = fileStat.st_mtime
    filemod = str(datetime.datetime.fromtimestamp(filemod))[0:19]


                 27K files   91K files
    os.path :     2.3 sec     8.0 sec
    os.lstat:     1.3 sec     4.7 sec
    os.stat :     1.2 sec     4.5 sec

    Nearly 2x faster. Cool. I'll go with the lstat or stat calls from now on.
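
    If the lookups ever become the bottleneck again, os.scandir() is worth a
    try: it returns DirEntry objects whose stat() results come straight from
    the directory read on Windows (and are one cheap call per file on Linux).
    A rough sketch of the walk, not the original program:

    import datetime
    import os

    def scan(top):
        """Yield (folder, name, sizeKB, created, modified) for files under top."""
        with os.scandir(top) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    yield from scan(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    # note: on *nix st_ctime is the last metadata change, not creation
                    yield (top,
                           entry.name,
                           round(st.st_size / 1000, 1),
                           str(datetime.datetime.fromtimestamp(st.st_ctime))[0:19],
                           str(datetime.datetime.fromtimestamp(st.st_mtime))[0:19])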


    The major slowdown is one cartesian/update query - used to summarize
    data in all subdirectories - for which I haven't been able to figure
    out a decent workaround.

    As I said, your problem is using a DBMS in the first place.

    It's not a problem. It's the BEST way.


    You are doing a cross-join of *all* files against *all* folders.

    No, it's all folders against all folders, which works fine for smaller
    counts (with 91K files and 6800 folders the entire program runs in
    < 10 seconds). But I have a directory containing nearly 57,000
    subdirectories, and such a cartesian query would probably never finish
    on my PC.

    v_Temp
    ------------------------------------
    SELECT
    F1.FOLDERID,
    F1.FOLDER,
    Count(F2.SUBFOLDERS) as TotalSF,
    Sum(F2.FILES) as TotalFiles,
    Sum(F2.TotalSizeKB) as TotalSize
    FROM
    FOLDERS F1 CROSS JOIN FOLDERS F2
    WHERE
    INSTR(F2.FOLDER, F1.FOLDER) > 0
    AND F2.FOLDER != F1.FOLDER
    GROUP BY
    F1.FOLDER
    ------------------------------------

    You see what that does, right? The combination of the cross join and
    the where clause ensures it finds and rolls up data for EVERY folder
    that has subfolders. All with one simple query. Nice! But it's too
    slow for large folder counts.

    I tried various iterations of instr(), substr() and where clauses, but
    nothing I tried made it speedy when there are lots of folders. I even
    wrote brute-force code to do a query per folder, but it wasn't nearly
    fast enough.

    Now put that compsci degree and those GuhNoo skillz to work and figure
    out how to fix this chokepoint!
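
    One workaround that skips the cross join entirely: pull the per-folder
    rows out once, then walk each path up through its ancestors in Python
    and add its numbers to every one of them. That's roughly folders x
    path-depth instead of folders x folders. A sketch, using the Folders
    column names from the view above and a placeholder database name:

    import os
    import sqlite3

    conn = sqlite3.connect("filedata.db")      # placeholder name
    totals = {}                                # folder -> [subfolders, files, sizeKB]

    for folder, files, sizekb in conn.execute(
            "SELECT Folder, FILES, TotalSizeKB FROM Folders"):
        node, parent = folder, os.path.dirname(folder)
        while parent and parent != node:       # climb to the root
            t = totals.setdefault(parent, [0, 0, 0.0])
            t[0] += 1                          # one more descendant folder
            t[1] += files or 0
            t[2] += sizekb or 0
            node, parent = parent, os.path.dirname(parent)

    # totals[f] is now roughly what v_Temp reports for folder f; it can be
    # written back to a temp table with one executemany() so the rest of the
    # views keep working in SQL.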



    But in the
    real filesystem, it would be unheard of for *all* files to be present
    in *all* folders -- or indeed, for many files to be present in more
    than one folder.

    Also, I notice your database structure does not reflect the folder
    hierarchy -- where do you record parent-child relationships between
    folders?

    I didn't consider assigning parent-child, but I might try it and see if
    I can do anything useful with it.
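
    For what it's worth, the parent of each Folder path is just its dirname,
    so one pass can fill in a ParentFolderID column. A sketch -- the column
    and the database name here are made up, not part of the posted schema:

    import os
    import sqlite3

    conn = sqlite3.connect("filedata.db")      # placeholder name
    conn.execute("ALTER TABLE Folders ADD COLUMN ParentFolderID INTEGER")

    ids = {folder: fid for fid, folder in
           conn.execute("SELECT FolderID, Folder FROM Folders")}
    conn.executemany(
        "UPDATE Folders SET ParentFolderID = ? WHERE FolderID = ?",
        [(ids.get(os.path.dirname(folder)), fid) for folder, fid in ids.items()])
    conn.commit()

    With that column in place, SQLite's WITH RECURSIVE can walk the tree for
    rollups instead of the INSTR() cross join.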

    Meanwhile, run the program against one of your mid-size directories,
    then open the VIEW 'v_AllFolders' and you'll see it lists every
    directory in alpha order, so you can see the tree/ownership structure.

    I use and recommend the minimalist SQLiteStudio https://sqlitestudio.pl/


    In short, take more account of the actual filesystem hierarchy in your
    database structure.

    Not needed. The program and tables do exactly what I wanted: provide
    summary and detail info about your files and folders that's otherwise
    hard to get at.

    Thanks for looking at it.


    --- PyGate Linux v1.5.6
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)