Python Regex To Match Text In Single Quotes, Ignoring Escaped Quotes (and Tabs/newlines)

Given a file of text, where the characters I want to match are delimited by single quotes, but might contain zero or one escaped single quote, as well as zero or more tabs and newlines.

Solution 1:

This tested script should do the trick:

import re
re_sq_long = r"""
    # Match single quoted string with escaped stuff.
    '            # Opening literal quote
    (            # $1: Capture string contents
      [^'\\]*    # Zero or more non-', non-backslash
      (?:        # "unroll-the-loop"!
        \\.      # Allow escaped anything.
        [^'\\]*  # Zero or more non-', non-backslash
      )*         # Finish {(special normal*)*} construct.
    )            # End $1: String contents.
    '            # Closing literal quote
    """
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"

data = r'''
        menu_item = 'casserole';
        menu_item = 'meat 
                    loaf';
        menu_item = 'Tony\'s magic pizza';
        menu_item = 'hamburger';
        menu_item = 'Dave\'s famous pizza';
        menu_item = 'Dave\'s lesser-known
            gyro';'''
matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
menu_items = []
for match in matches:
    match = re.sub(r'\s+', ' ', match) # Collapse whitespace runs (raw string avoids an invalid-escape warning)
    match = re.sub(r'\\', '', match)   # Remove the escaping backslashes
    menu_items.append(match)           # Add to menu list
print(menu_items)

Here is the short version of the regex:

'([^'\\]*(?:\\.[^'\\]*)*)'

This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. See Mastering Regular Expressions (3rd Edition) for details.

Note that the above regex is equivalent to the following one (which is more commonly seen but is much slower on most NFA regex implementations):

'((?:[^'\\]|\\.)*)'
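As a quick check of that equivalence, here is a minimal sketch that runs both patterns over the same text and compares what they capture. The sample string and the variable names are illustrative assumptions, not part of the original answer, and the sketch does not measure speed.

```python
import re

# Unrolled-loop form and the more commonly seen alternation form.
unrolled = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
alternation = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

# Illustrative sample (an assumption): a plain string, one with an
# escaped quote, and one that spans two lines.
sample = r"""
a = 'plain';
b = 'Tony\'s magic pizza';
c = 'two
lines';
"""

unrolled_hits = unrolled.findall(sample)
alternation_hits = alternation.findall(sample)

print(unrolled_hits)
print(unrolled_hits == alternation_hits)
```

Both patterns should capture the same three strings here; they differ in how much backtracking the engine has to do, not in what they match.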

Solution 2:

This should do it:

menu_item = '((?:[^'\\]|\\')*)'

Here the (?:[^'\\]|\\')* part matches any sequence made up of characters other than ' and \, or the literal two-character sequence \'. The first alternative, [^'\\], also matches line breaks and tabs, which you then need to replace with a single space.
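A minimal sketch of how this pattern might be used, including the whitespace cleanup mentioned above. The input string and the names pattern, text and items are assumptions chosen for illustration.

```python
import re

# Pattern from this answer, written as a raw string.
pattern = re.compile(r"menu_item = '((?:[^'\\]|\\')*)'")

# Illustrative input (an assumption): one escaped quote, plus a newline
# and a tab inside the quoted value.
text = "menu_item = 'Dave\\'s lesser-known\n\tgyro';"

items = []
for raw in pattern.findall(text):
    item = re.sub(r"\s+", " ", raw)   # collapse tabs/newlines into one space
    item = item.replace("\\'", "'")   # turn \' back into a plain quote
    items.append(item)

print(items)
```

No re.DOTALL flag is needed here because the pattern never uses the dot; the negated character class already matches newlines.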

Solution 3:

You could try it like this:

pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

It starts matching at the single quote that follows menu_item = and ends at the first single quote not preceded by a backslash. Because of re.DOTALL, the captured group also includes any newlines and tabs found between the two single quotes.
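A short sketch of this approach in use, with the same kind of cleanup as in the other answers. The input text and the names text and items are assumptions for illustration only.

```python
import re

pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

# Illustrative input (an assumption): an escaped quote, a newline and a tab
# in the first value, and a plain second value.
text = "menu_item = 'Tony\\'s magic\n\tpizza';\nmenu_item = 'hamburger';"

# Collapse whitespace, then turn \' back into a plain quote.
items = [
    re.sub(r"\s+", " ", raw).replace("\\'", "'")
    for raw in pattern.findall(text)
]

print(items)
```

One caveat: the lookbehind inspects only the single character before the quote, so a value that ends in an escaped backslash (for example 'C:\\') is not closed at the right quote. The patterns in Solution 1 consume each backslash together with the character it escapes, so they do not have this limitation.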
