Python Regex To Match Text In Single Quotes, Ignoring Escaped Quotes (and Tabs/newlines)
Solution 1:
This tested script should do the trick:
import re
re_sq_long = r"""
# Match single quoted string with escaped stuff.
' # Opening literal quote
( # $1: Capture string contents
[^'\\]* # Zero or more non-', non-backslash
(?: # "unroll-the-loop"!
\\. # Allow escaped anything.
[^'\\]* # Zero or more non-', non-backslash
)* # Finish {(special normal*)*} construct.
) # End $1: String contents.
' # Closing literal quote
"""
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
data = r'''
menu_item = 'casserole';
menu_item = 'meat
loaf';
menu_item = 'Tony\'s magic pizza';
menu_item = 'hamburger';
menu_item = 'Dave\'s famous pizza';
menu_item = 'Dave\'s lesser-known
gyro';'''
matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
menu_items = []
for match in matches:
    match = re.sub(r'\s+', ' ', match)      # Collapse whitespace runs to one space
    match = re.sub(r'\\(.)', r'\1', match)  # Remove escapes, keeping the escaped character
    menu_items.append(match)                # Add to menu list
print(menu_items)
Here is the short version of the regex:
'([^'\\]*(?:\\.[^'\\]*)*)'
This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. See Mastering Regular Expressions (3rd Edition) for details.
Note that the above regex is equivalent to the following one (which is more commonly seen but is much slower on most NFA regex implementations):
'((?:[^'\\]|\\.)*)'
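As a quick sanity check of that equivalence, here is a minimal sketch that runs both patterns over a small sample and confirms they capture the same contents. The sample text and the helper name `extract` are made up for illustration, and the sketch does not measure speed.

```python
import re

# Unrolled form and the more common alternation form from the text above.
RE_UNROLLED = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
RE_ALTERNATION = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

# Hypothetical sample input (an assumption, not the full data set above).
SAMPLE = r"""
menu_item = 'casserole';
menu_item = 'Tony\'s magic pizza';
menu_item = 'meat
loaf';
"""

def extract(pattern, text):
    """Return the captured contents of every single-quoted string."""
    return pattern.findall(text)

unrolled = extract(RE_UNROLLED, SAMPLE)
alternation = extract(RE_ALTERNATION, SAMPLE)

# Both forms should capture exactly the same strings.
print(unrolled == alternation)
```

Both lists still contain the raw backslash escapes and the embedded newline, so the cleanup step from the script above is still needed afterwards.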
Solution 2:
This should do it:
menu_item = '((?:[^'\\]|\\')*)'
Here the (?:[^'\\]|\\')* part matches any sequence of characters other than ' and \, or a literal \'. The former expression, [^'\\], also allows line breaks and tabs, which you then need to replace with a single space.
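A minimal sketch of how this pattern might be used together with the whitespace replacement mentioned above. The sample string and the helper name `clean` are assumptions for illustration only.

```python
import re

# Solution 2 pattern: only \' is treated as an escape sequence.
PATTERN = re.compile(r"menu_item = '((?:[^'\\]|\\')*)'")

# Hypothetical sample: an escaped quote plus a newline and a tab inside the string.
SAMPLE = "menu_item = 'Dave\\'s lesser-known\n\tgyro';"

def clean(raw):
    """Collapse whitespace runs to one space, then unescape the quotes."""
    collapsed = re.sub(r"\s+", " ", raw)
    return collapsed.replace("\\'", "'")

items = [clean(m) for m in PATTERN.findall(SAMPLE)]
print(items)
```

No re.DOTALL flag is needed here, because a negated character class such as [^'\\] matches newlines by default.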
Solution 3:
You could try it like this:
pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)
The captured group starts right after the single quote that follows menu_item = and ends before the first single quote that is not preceded by a backslash. Because of re.DOTALL, it also captures any newlines and tabs found between the two single quotes.
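The sketch below exercises this pattern on two made-up inputs (both strings are assumptions for illustration). The second shows a limitation of the lookbehind approach: when the quoted text ends with an escaped backslash, the real closing quote is preceded by a backslash, so the match runs on to a later quote.

```python
import re

pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

# Hypothetical inputs, written as raw strings so the backslashes are literal.
ok = r"menu_item = 'Tony\'s magic pizza';"
tricky = r"menu_item = 'C:\\'; other = 'x';"

ok_matches = pattern.findall(ok)
tricky_matches = pattern.findall(tricky)

print(ok_matches)      # the escaped quote is skipped as intended
print(tricky_matches)  # the capture overshoots the intended closing quote
```

If inputs can contain escaped backslashes, a pattern that consumes escape pairs explicitly, such as '([^'\\]*(?:\\.[^'\\]*)*)', avoids this problem.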