Forum: d0p3 BBS

Correct syntax for pathological re.search()

From Michael F. Stemper@3:633/280.2 to All on Tue Oct 8 00:35:32 2024

I'm trying to discard lines that include the string "\sout{" (which is TeX, for those who are curious). I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

But the lines with that string keep coming through. What is the right syntax to properly escape the backslash and the left curly bracket?

--
Michael F. Stemper
No animals were harmed in the composition of this message.

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: A noiseless patient Spider (3:633/280.2@fidonet)

From Stefan Ram@3:633/280.2 to All on Tue Oct 8 00:56:51 2024

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

if not re.search("\\sout\{", line):

So, if you're not down to slap an "r" before your string literals,
you're going to end up doubling down on every backslash.

Long story short, those double backslashes in your regex?
They'll be quadrupling up in your Python string literal!

main.py

import re

lines = r'''
abcdef
\sout{abcdef
abcdef
abc\sout{def
abcdef
abcdef\sout{
abcdef
'''.strip().split( '\n' )

for line in lines:
    product = re.search( "\\\\sout\\{", line )
    if not product:
        print( line )

stdout

abcdef
abcdef
abcdef
abcdef

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: Stefan Ram (3:633/280.2@fidonet)

From Michael F. Stemper@3:633/280.2 to All on Tue Oct 8 01:14:53 2024

On 07/10/2024 08.56, Stefan Ram wrote:

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

if not re.search("\\sout\{", line):

So, if you're not down to slap an "r" before your string literals,
you're going to end up doubling down on every backslash.

Never heard of that before, but it did the trick.

Long story short, those double backslashes in your regex?
They'll be quadrupling up in your Python string literal!

for line in lines:
    product = re.search( "\\\\sout\\{", line )

This also worked.

For now, I'll use the "r" in a cargo-cult fashion, until I decide which
syntax I prefer. (Is there any reason that one or the other is preferable?)

Thanks for your help,
Mike
--
Michael F. Stemper
Economists have correctly predicted seven of the last three recessions.

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: A noiseless patient Spider (3:633/280.2@fidonet)

From Stefan Ram@3:633/280.2 to All on Tue Oct 8 01:32:06 2024

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

For now, I'll use the "r" in a cargo-cult fashion, until I decide which
syntax I prefer. (Is there any reason that one or the other is preferable?)

I'd totally go with the r-style notation!

It's got one bummer though - you can't end such a string literal with
a backslash. But hey, no biggie, you could use one of those notations:

main.py

path = r'C:\Windows\example' + '\\'

print( path )

path = r'''
C:\Windows\example\
'''.strip()

print( path )

stdout

C:\Windows\example\
C:\Windows\example\

.

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: Stefan Ram (3:633/280.2@fidonet)

From Jon Ribbens@3:633/280.2 to All on Tue Oct 8 02:43:59 2024

On 2024-10-07, Stefan Ram <ram@zedat.fu-berlin.de> wrote:

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

For now, I'll use the "r" in a cargo-cult fashion, until I decide which
syntax I prefer. (Is there any reason that one or the other is preferable?)

I'd totally go with the r-style notation!

It's got one bummer though - you can't end such a string literal with
a backslash. But hey, no biggie, you could use one of those notations:

main.py

path = r'C:\Windows\example' + '\\'

print( path )

path = r'''
C:\Windows\example\
'''.strip()

print( path )

stdout

C:\Windows\example\
C:\Windows\example\

.

... although of course in this example you should probably do neither of
those things, and instead do:

from pathlib import Path
path = Path(r'C:\Windows\example')

since in a Path the trailing '\' or '/' is unnecessary. Which leaves
very few remaining uses for a raw-string with a trailing '\'...
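
A minimal sketch of the pathlib approach, for anyone who wants to try it. It assumes PureWindowsPath rather than Path so that it behaves the same on a non-Windows machine, and the file name notes.txt is made up for illustration:

```python
from pathlib import PureWindowsPath

# No trailing backslash is needed on the directory part.
base = PureWindowsPath(r'C:\Windows\example')

# The / operator inserts the separator; 'notes.txt' is a hypothetical name.
child = base / 'notes.txt'

print(child)
print(child.parent == base)
```

Because the separator is supplied by the join, the raw string never has to end in a backslash.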

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: A noiseless patient Spider (3:633/280.2@fidonet)

From Pieter van Oostrum@3:633/280.2 to All on Wed Oct 9 04:50:14 2024

ram@zedat.fu-berlin.de (Stefan Ram) writes:

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

path = r'C:\Windows\example' + '\\'

You could even omit the '+'. Then the concatenation is done at parsing time instead of run time.
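
A small sketch of both spellings, to show that they yield the same string. The variable names are only for illustration:

```python
# Explicit concatenation with the + operator.
explicit = r'C:\Windows\example' + '\\'

# Adjacent string literals form a single string literal, no operator needed.
implicit = r'C:\Windows\example' '\\'

print(implicit)
print(explicit == implicit)
```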
--
Pieter van Oostrum <pieter@vanoostrum.org>
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Karsten Hilbert@3:633/280.2 to All on Wed Oct 9 05:30:34 2024

On Mon, Oct 07, 2024 at 08:35:32AM -0500, Michael F. Stemper via Python-list wrote:

I'm trying to discard lines that include the string "\sout{" (which is TeX, for
those who are curious. I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

unwanted_tex = '\sout{'
if unwanted_tex not in line: do_something_with_libreoffice()

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From MRAB@3:633/280.2 to All on Wed Oct 9 06:07:04 2024

On 2024-10-08 19:30, Karsten Hilbert via Python-list wrote:

On Mon, Oct 07, 2024 at 08:35:32AM -0500, Michael F. Stemper via Python-list wrote:

I'm trying to discard lines that include the string "\sout{" (which is TeX, for
those who are curious. I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

unwanted_tex = '\sout{'
if unwanted_tex not in line: do_something_with_libreoffice()

That should be:

unwanted_tex = r'\sout{'

or:

unwanted_tex = '\\sout{'
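
Put together, the regex-free filter might look like the sketch below. The sample lines are made up for illustration, and the list comprehension stands in for whatever processing the real script does:

```python
unwanted_tex = r'\sout{'

# Sample lines for illustration only.
lines = ['abcdef', r'\sout{abcdef', r'abc\sout{def', 'abcdef']

# A plain substring test needs no regex escaping at all.
kept = [line for line in lines if unwanted_tex not in line]

print(kept)
```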

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From MRAB@3:633/280.2 to All on Wed Oct 9 06:11:40 2024

On 2024-10-07 14:35, Michael F. Stemper via Python-list wrote:

I'm trying to discard lines that include the string "\sout{" (which is TeX, for
those who are curious. I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

But the lines with that string keep coming through. What is the right syntax to
properly escape the backslash and the left curly bracket?

String literals use backslash as an escape character, so it needs to be escaped, or you need to use a "raw" string.

However, regex also uses backslash as an escape character.

That means that a literal backslash in a regex that's in a plain string literal needs to be doubly-escaped, once for the string literal and
again for the regex.
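
A minimal sketch of the two layers, assuming the sample line below (it is made up for illustration). The plain literal and the raw literal spell the same regex, and print shows what the regex engine actually receives:

```python
import re

# Sample text for illustration only; it contains a single backslash.
line = r"abc\sout{def"

plain = "\\\\sout\\{"   # plain literal: each regex backslash is doubled again
raw = r"\\sout\{"       # raw literal: typed exactly as the regex should read

print(plain == raw)     # compares the two resulting strings
print(raw)              # the text handed to the regex engine

found = re.search(raw, line)
print(found is not None)
```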

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Stefan Ram@3:633/280.2 to All on Wed Oct 9 06:32:04 2024

MRAB <python@mrabarnett.plus.com> wrote or quoted:

However, regex also uses backslash as an escape character.

TeX also uses the backslash as an escape character:

\chardef \\ = '\\

, the regular expression to search exactly this:

\\chardef \\\\ = '\\\\

, and the Python string literal for that regular expression:

"\\\\chardef \\\\\\\\ = '\\\\\\\\".

. Must be a reason Markdown started to use the backtick!

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: Stefan Ram (3:633/280.2@fidonet)

From Stefan Ram@3:633/280.2 to All on Wed Oct 9 06:57:45 2024

ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:

"\\\\chardef \\\\\\\\ = '\\\\\\\\".

However, one can rewrite this as follows:

"`chardef `` = '``".replace( "`", "\\"*4 )

. One can also use "repr" to find how to represent something:

main.py

text = input( "What do you want me to represent as a literal? " )
print( repr( text ))

transcript

What do you want me to represent as a literal? \\sout\{
'\\\\sout\\{'

. We can use "escape" and "repr" to find how to represent
a regular expression for a literal text:

main.py

import re

text = input( "Want the literal of an re for what text? " )
print( repr( re.escape( text )))

transcript

Want the literal of an re for what text? \sout{
'\\\\sout\\{'

.
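
The same idea also works without going through a literal at all: re.escape can build the pattern at run time. A minimal sketch, with sample strings made up for illustration:

```python
import re

unwanted = r'\sout{'             # the literal text to look for
pattern = re.escape(unwanted)    # lets the re module add the regex escapes

print(pattern)
print(bool(re.search(pattern, r'abc\sout{def')))
print(bool(re.search(pattern, 'abcdef')))
```

This sidesteps the question of how many backslashes to type, since only the string-literal layer remains.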

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: Stefan Ram (3:633/280.2@fidonet)

From Karsten Hilbert@3:633/280.2 to All on Wed Oct 9 07:17:49 2024

On Tue, Oct 08, 2024 at 08:07:04PM +0100, MRAB via Python-list wrote:

unwanted_tex = '\sout{'
if unwanted_tex not in line: do_something_with_libreoffice()

That should be:

unwanted_tex = r'\sout{'

Hm.

Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> tex = '\sout{'
>>> tex
'\\sout{'

Am I missing something ?

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Alan Bawden@3:633/280.2 to All on Wed Oct 9 07:59:48 2024

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
>>> tex
'\\sout{'
>>>

Am I missing something ?

You're missing the warning it generates:

> python -E -Wonce
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
<stdin>:1: DeprecationWarning: invalid escape sequence '\s'
>>>

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ITS Preservation Society (3:633/280.2@fidonet)

From MRAB@3:633/280.2 to All on Wed Oct 9 09:10:03 2024

On 2024-10-08 21:59, Alan Bawden via Python-list wrote:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
>>> tex
'\\sout{'
>>>

Am I missing something ?

You're missing the warning it generates:

> python -E -Wonce
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
<stdin>:1: DeprecationWarning: invalid escape sequence '\s'
>>>

You got lucky that \s is invalid. If it had been \t you would've got a
tab character.

Historically, Python treated invalid escape sequences as literals, but
it's deprecated now and will become an outright error in the future
(probably) because it often hides a mistake, such as the aforementioned
\t being treated as a tab character when the user expected it to be a
literal backslash followed by letter t. (This can occur within Windows
file paths written in plain string literals.)
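
A short sketch of both halves of that point. It assumes nothing beyond the standard library. The snippet being compiled is the assignment from earlier in the thread, held in a raw string so that this example itself triggers no warning:

```python
import warnings

print(len("\t"))    # a recognised escape: one tab character
print(len(r"\t"))   # raw string: a backslash followed by the letter t

# Compile a snippet that contains the unrecognised escape \s, with warnings
# promoted to errors so the problem cannot slip through silently.
source = r"tex = '\sout{'"
caught = None
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        compile(source, "<example>", "exec")
    except (SyntaxError, Warning) as exc:
        caught = exc

print(type(caught).__name__, caught)
```

Which exception class shows up depends on the Python version, so the sketch catches both possibilities.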

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Karsten Hilbert@3:633/280.2 to All on Thu Oct 10 05:06:10 2024

On Tue, Oct 08, 2024 at 04:59:48PM -0400, Alan Bawden via Python-list wrote:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
>>> tex
'\\sout{'
>>>

Am I missing something ?

You're missing the warning it generates:

<stdin>:1: DeprecationWarning: invalid escape sequence '\s'

I knew it'd be good to ask :-D

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Gilmeh Serda@3:633/280.2 to All on Sat Oct 12 01:43:56 2024

On Mon, 7 Oct 2024 08:35:32 -0500, Michael F. Stemper wrote:

I'm trying to discard lines that include the string "\sout{" (which is
TeX, for those who are curious. I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

But the lines with that string keep coming through. What is the right
syntax to properly escape the backslash and the left curly bracket?

$ python
Python 3.12.6 (main, Sep 8 2024, 13:18:56) [GCC 14.2.1 20240805] on linux
Type "help", "copyright", "credits" or "license" for more information.

import re
s = r"testing \sout{WHADDEVVA}"
re.search(r"\\sout{", s)

<re.Match object; span=(8, 14), match='\\sout{'>

You want a literal backslash, hence, you need to escape everything.

It is not enough to escape the "\s" as "\\s", because that only takes care
of Python's demands for escaping "\". You also need to escape the "\" for
the RegEx as well, or it will read it like it means "\s", which is the
RegEx for a whitespace character and therefore your search doesn't match,
because it reads it like you want to search for " out{".

Therefore, you need to escape it either as per my example, or by using
four "\" and no "r" in front of the first quote, which also works:

re.search("\\\\sout{", s)

<re.Match object; span=(8, 14), match='\\sout{'>

You don't need to escape the curly braces. We call them "seagull wings"
where I live.
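
The point about "\s" can be checked directly. A minimal sketch, with test strings made up for illustration:

```python
import re

pattern = "\\sout{"     # the regex engine receives \sout{

# \s matches a whitespace character, so this finds " out{".
print(bool(re.search(pattern, "strike out{")))

# The TeX text has a backslash there, not whitespace, so nothing is found.
print(bool(re.search(pattern, r"\sout{text}")))
```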

--
Gilmeh

Sometimes I simply feel that the whole world is a cigarette and I'm the
only ashtray.

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: Easynews - www.easynews.com (3:633/280.2@fidonet)

From avi.e.gross@gmail.com@3:633/280.2 to All on Sat Oct 12 08:13:07 2024

Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

Obviously, life is not that simple as it can go through multiple layers with each dealing with a layer of backslashes.

But for simple cases, ...

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From MRAB@3:633/280.2 to All on Sat Oct 12 11:37:55 2024

On 2024-10-11 22:13, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

Obviously, life is not that simple as it can go through multiple layers with each dealing with a layer of backslashes.

But for simple cases, ...

Yes. It's called 'print'. :-)
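
A quick sketch of what that looks like in practice, using the pattern from earlier in the thread. print shows the characters after Python has processed the literal, while repr shows how the same string would be typed as a literal:

```python
pattern = "\\\\sout\\{"

print(pattern)         # the characters the regex engine receives
print(repr(pattern))   # the same string spelled as a Python literal
print(len(pattern))    # number of characters after literal processing
```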

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Peter J. Holzer@3:633/280.2 to All on Sat Oct 12 21:59:58 2024

On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the
regular expression you typed in will look like by the time it is ready to be
used?

I assume that by "ready to be used" you mean the compiled form?

No, there doesn't seem to be a way to dump that. You can

p = re.compile("\\\\sout{")
print(p.pattern)

but that just prints the input string, which you could do without
compiling it first.

But - without having looked at the implementation - it's far from clear
that the compiled form would be useful to the user. It's probably some
kind of state machine, and a large table of state transitions isn't very readable.

There are a number of websites which visualize regular expressions.
Those are probably better for debugging a regular expression than
anything the re module could reasonably produce (although with the
caveat that such a web site would use a different implementation and
therefore might produce different results).
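
For what it's worth, the re module does document an re.DEBUG flag that displays debug information about a compiled expression. The sketch below captures that output in a string so it can be inspected. The exact format of the dump is an implementation detail and may differ between Python versions, so treat this as a debugging aid rather than a stable interface:

```python
import io
import re
from contextlib import redirect_stdout

buf = io.StringIO()
with redirect_stdout(buf):
    # re.DEBUG makes the module print how it parsed the pattern.
    compiled = re.compile(r"\\sout\{", re.DEBUG)

dump = buf.getvalue()
print(dump)
```

In the dump, a backslash or brace that was parsed as an ordinary character should appear as a literal entry rather than as an operator.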

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Stefan Ram@3:633/280.2 to All on Sat Oct 12 22:59:52 2024

"Peter J. Holzer" <hjp-python@hjp.at> wrote or quoted:

But - without having looked at the implementation - it's far from clear
that the compiled form would be useful to the user.

So, what he might be getting at with "compiled form" is a
representation that's easy on the eyes for us mere mortals.

You could, for instance, use colors to show the difference between
object and meta characters. In that case, the regex "\**" would
come out as "**", but the first "*" might be navy blue (on a white
background), so just your run-of-the-mill object character, while
the second one would be burgundy, flagging it as a meta character.

So, simplified, that would be something like:

import re
import tkinter as tk
import time

def tokenize_regex( pattern ):
    tokens = []
    i = 0
    while i < len( pattern ):
        if pattern[ i ] == '\\':
            if i + 1 < len( pattern ):
                tokens.append( ( 'escaped', pattern[ i+1: i+2 ]))
                i += 2
            else:
                tokens.append( ('error', 'Incomplete escape sequence' ))
                i += 1
        elif pattern[i] == '*':
            tokens.append( ( 'repetition', '*' ))
            i += 1
        else:
            tokens.append( ( 'plain', pattern[ i ]))
            i += 1

    return tokens

root = tk.Tk()
root.configure( bg='white' )

regex = r'\**'
result = tokenize_regex( regex )

for token_type, token_value in result:
    if token_type == 'plain' or token_type == 'escaped':
        tk.Label( root, text=token_value, font=( 'Arial', 40 ), fg='#4070FF', bg='white' ).pack( side='left' )
    elif token_type == 'repetition':
        tk.Label( root, text=token_value, font=( 'Arial', 40 ), fg='#C02000', bg='white' ).pack( side='left' )

root.mainloop()

.

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: Stefan Ram (3:633/280.2@fidonet)

From Thomas Passin@3:633/280.2 to All on Sat Oct 12 23:51:57 2024

On 10/12/2024 6:59 AM, Peter J. Holzer via Python-list wrote:

On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the
regular expression you typed in will look like by the time it is ready to be
used?

I assume that by "ready to be used" you mean the compiled form?

No, there doesn't seem to be a way to dump that. You can

p = re.compile("\\\\sout{")
print(p.pattern)

but that just prints the input string, which you could do without
compiling it first.

It prints the escaped version, so you can see if you escaped the string
as you intended. In this case, the print will display '\\sout{'. That's
worth something.

But - without having looked at the implementation - it's far from clear
that the compiled form would be useful to the user. It's probably some
kind of state machine, and a large table of state transitions isn't very readable.

There are a number of websites which visualize regular expressions.
Those are probably better for debugging a regular expression than
anything the re module could reasonably produce (although with the
caveat that such a web site would use a different implementation and therefore might produce different results).

hp

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From avi.e.gross@gmail.com@3:633/280.2 to All on Sun Oct 13 01:10:41 2024

Peter,

Matthew understood what I was hinting at in one way and you in another.

The question asked how to add some power of two backslashes or make other changes, so the RE functionality sees what you want. The goal is to see what happens when one or more intermediate evaluations may change the string.

So, a simple print may suffice as a parallel way to force the same
evaluations.

Thomas made his point. And, I am starting to feel like I need to change my
name to something like Luke since this discussion must be gospel.

FYI, I was not planning on posting at all. Time to detach.

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Thomas Passin@3:633/280.2 to All on Sun Oct 13 00:06:54 2024

On 10/11/2024 8:37 PM, MRAB via Python-list wrote:

On 2024-10-11 22:13, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show
what the
regular expression you typed in will look like by the time it is ready
to be
used?

Obviously, life is not that simple as it can go through multiple
layers with
each dealing with a layer of backslashes.

But for simple cases, ...

Yes. It's called 'print'. :-)

There is a section in the Python docs about this backslash subject. It's
titled "The Backslash Plague" in

https://docs.python.org/3/howto/regex.html

You can also inspect the compiled expression to see what string it
received after all the escaping:

import re

re_string = '\\w+\\\\sub'
re_pattern = re.compile(re_string)

# Should look as if we had used r'\w+\\sub'
print(re_pattern.pattern)

\w+\\sub

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Stefan Ram@3:633/280.2 to All on Sun Oct 13 21:45:44 2024

Gilmeh Serda <gilmeh.serda@nothing.here.invalid> wrote or quoted:

You don't need to escape the curly braces.

Here's the 411 on some gnarly regex characters:

. matches any single character, except when it hits a new line
^ kicks things off at the start of the sequence
$ wraps it up at the end
* goes zero to infinity
+ one or more times
? maybe once, maybe not
{ starts a specific count, like {2} or {2,3}
} ends such a count
| either this or that
\ flips the script on the next character's meaning
( drops in on a group
) bails out of the group
[ paddles out to a character class
] rides the character class to shore
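
The braces are the odd ones out in that list, which is why the unescaped { in this thread works. A minimal sketch of how Python's re treats them, with patterns made up for illustration:

```python
import re

# {2} is a well-formed count, so it acts as a repetition.
count = re.fullmatch(r"a{2}", "aa")

# A lone { that does not start a well-formed count is taken literally.
lone = re.fullmatch(r"a{", "a{")

# Escaped braces are always literal.
escaped = re.fullmatch(r"a\{2\}", "a{2}")

print(count is not None, lone is not None, escaped is not None)
```

Escaping the brace anyway does no harm and makes the intent explicit.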

.

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: Stefan Ram (3:633/280.2@fidonet)

From Peter J. Holzer@3:633/280.2 to All on Sat Oct 19 08:09:41 2024

On 2024-10-12 08:51:57 -0400, Thomas Passin via Python-list wrote:

On 10/12/2024 6:59 AM, Peter J. Holzer via Python-list wrote:

On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the
regular expression you typed in will look like by the time it is ready to be
used?

I assume that by "ready to be used" you mean the compiled form?

No, there doesn't seem to be a way to dump that. You can

p = re.compile("\\\\sout{")
print(p.pattern)

but that just prints the input string, which you could do without
compiling it first.

It prints the escaped version,

Did you mean the *un*escaped version? Well, yeah, that's what print
does.

so you can see if you escaped the string as you intended. In this
case, the print will display '\\sout{'.

print("\\\\sout{")
will do the same.

It seems to me that for any string s which is a valid regular expression
(i.e. re.compile doesn't throw an exception)

assert re.compile(s).pattern == s

holds.

So it doesn't give you anything you didn't already know.

As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{"
are equivalent (the \ before the { is redundant). Yet
re.compile(s).pattern preserves the difference between the two strings.
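
Both observations can be checked with a few lines. A minimal sketch, where the test string is made up for illustration:

```python
import re

a = re.compile(r"\\sout{")
b = re.compile(r"\\sout\{")

# .pattern simply hands back the string that was passed in.
print(a.pattern == r"\\sout{")
print(a.pattern == b.pattern)

# The two regexes nevertheless find the same match.
text = r"x \sout{y}"
print(a.search(text).span() == b.search(text).span())
```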

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From jak@3:633/280.2 to All on Sat Oct 19 09:15:23 2024

Peter J. Holzer wrote:

As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{"
are equivalent (the \ before the { is redundant). Yet
re.compile(s).pattern preserves the difference between the two strings.

Hi,
Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not
equivalent. If you omit the backslash, the parser will have to determine
whether the brace is part of a {n,m} repetition, which takes more
time. An online regex tester gives these results:

r"\\sout{" : 1 match ( 7 steps, 620 μs )

r"\\sout\{" : 1 match ( 7 steps, 360 μs )
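
The figures above come from an online tester, which may not use Python's engine. A rough way to measure the parsing cost locally is sketched below. It assumes that clearing the module cache with re.purge() before each re.compile() is enough to force a fresh parse; compile_time is a hypothetical helper name. Results depend on the machine and the Python version, so no expected numbers are given.

```python
import re
import timeit


def compile_time(pattern, number=2000):
    """Return total seconds for `number` uncached compilations of pattern."""
    def run():
        re.purge()           # drop re's pattern cache so the parser runs again
        re.compile(pattern)
    return timeit.timeit(run, number=number)


for pattern in (r"\\sout{", r"\\sout\{"):
    print(f"{pattern!r}: {compile_time(pattern):.4f} s")
```

This times compilation only. Matching time can be measured the same way by calling search() on a precompiled pattern inside run().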

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: A noiseless patient Spider (3:633/280.2@fidonet)

From Peter J. Holzer@3:633/280.2 to All on Tue Oct 22 06:10:49 2024


On 2024-10-19 00:15:23 +0200, jak via Python-list wrote:

Peter J. Holzer wrote:

As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{" are equivalent (the \ before the { is redundant). Yet
re.compile(s).pattern preserves the difference between the two strings.

Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent.

They are. Both will match the 6-character string
0005c \ REVERSE SOLIDUS
00073 s LATIN SMALL LETTER S
0006f o LATIN SMALL LETTER O
00075 u LATIN SMALL LETTER U
00074 t LATIN SMALL LETTER T
0007b { LEFT CURLY BRACKET
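
The listing above can be reproduced with the standard unicodedata module. This is a sketch; the variable name target is chosen here for illustration. It also checks that both patterns match exactly that string.

```python
import re
import unicodedata

target = "\\sout{"   # backslash, s, o, u, t, left curly bracket
assert len(target) == 6

# Print code point, character and Unicode name for each character.
for ch in target:
    print(f"{ord(ch):05x} {ch} {unicodedata.name(ch)}")

# Both regular expressions match the whole 6-character string.
for pattern in (r"\\sout{", r"\\sout\{"):
    assert re.fullmatch(pattern, target) is not None
```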

If you omit the backslash, the parser will have to determine whether the
brace is part of a {n,m} repetition, which takes more time.

Yes, that's the parser. But the result of parsing will be the same:
The pattern will end in a literal left curly bracket.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"


--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)

From Stefan Ram@3:633/280.2 to All on Tue Oct 22 07:24:49 2024

"Peter J. Holzer" <hjp-python@hjp.at> wrote or quoted:

On 2024-10-19 00:15:23 +0200, jak via Python-list wrote:

Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent.

...

Yes, that's the parser. But the result of parsing will be the same:
The pattern will end in a literal left curly bracket.

Functional reqs lay out what your system's got to do, while
non-functional reqs are all about time and other resource
constraints.

When you're crunching through parsing, what pops out is
your functional bread and butter.

But the time it takes to chew through that data?
That's non-functional and implementation-dependent territory.

So, we can say they're functionally equivalent.

--- MBSE BBS v1.0.8.4 (Linux-x86_64)
* Origin: Stefan Ram (3:633/280.2@fidonet)
