
Regex: Match Word With Intrusive Symbol

I'm trying to match all of the 'words' with an intrusive asterisk in them, including at the beginning and the end (but no other punctuation). For example, I'm expecting seven matche

Solution 1:

Thanks for editing in the expected output.

So, in addition to the excellent solution by @benvc, this one also handles repeated asterisks: if a word contains several *'s, the entire word is captured as one match and the other *'s are not ignored.

# Acting on your original text string
>>> import re
>>> text = "star *tar s*ar st*r sta* (*tar) (sta*) sta*."
>>> re.findall(r'((?:[a-z\*]*(?:\*)(?:[a-z\*]*)))+', text)
['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']



# Acting on a slightly MORE COMPLEX string and returning it accurately
>>> text = "*tar *tar* star s*a**r *st*r* sta* (*tar) st*r** (sta**) s*ta*."
>>> re.findall(r'((?:[a-z\*]*(?:\*)(?:[a-z\*]*)))+', text)
['*tar', '*tar*', 's*a**r', '*st*r*', 'sta*', '*tar', 'st*r**', 'sta**', 's*ta*']


Let me know if you'd like me to explain how this works, in case you need it for future reference.
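For reference, here is a minimal sketch of what the pattern does. It assumes only lowercase ASCII letters need to be handled, since that is all [a-z] covers. On these inputs the nested groups and the trailing + are not strictly needed: a run of letters or asterisks, one required literal asterisk, then another run gives the same matches. The names original and simplified are just illustrative.

```python
import re

# Pattern from the answer above, written as a raw string.
original = r'((?:[a-z\*]*(?:\*)(?:[a-z\*]*)))+'

# Illustrative simplification: [a-z*]* is greedy and backtracks just far
# enough for the required \* to match, so every match has at least one asterisk.
simplified = r'[a-z*]*\*[a-z*]*'

text = "*tar *tar* star s*a**r *st*r* sta* (*tar) st*r** (sta**) s*ta*."

print(re.findall(original, text))
print(re.findall(simplified, text))
# Both print:
# ['*tar', '*tar*', 's*a**r', '*st*r*', 'sta*', '*tar', 'st*r**', 'sta**', 's*ta*']
```

Note that both versions would also match a lone * standing on its own, because the letters on either side of the asterisk are optional.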

Solution 2:

You don't need the word boundaries with re.findall since it will find all the matches in a string for your specified regex. You also need to ensure that the match includes at least one word character so you don't match a single asterisk. For example:

import re

text = 'star *tar s*ar st*r sta* (*tar) (sta*) sta*.'

matches = re.findall(r'\w+\*\w*|\w*\*\w+', text)
print(matches)
# ['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
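To illustrate the point about requiring at least one word character, here is a small sketch on a few made-up strings (the sample inputs are assumptions, not from the question). It also shows two things worth knowing about this pattern: \w includes digits and underscores, and a word with several asterisks comes back in pieces rather than whole.

```python
import re

pattern = r'\w+\*\w*|\w*\*\w+'

# A lone asterisk, or a run of only asterisks, has no word character
# next to it in the required position, so it is skipped.
print(re.findall(pattern, 'a * b *c d* **'))
# ['*c', 'd*']

# \w also matches digits and underscores.
print(re.findall(pattern, 'x_1* 2*y'))
# ['x_1*', '2*y']

# A word with more than one asterisk is returned in pieces.
print(re.findall(pattern, 's*a**r'))
# ['s*a', '*r']
```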

Solution 3:

Try using this regex:

(\w*\*+\w*)+

First off, I suggest testing your regexes with an online tool such as regexr.com.

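Second, \b matches a word boundary (the start or end of a word). Since * is not a word character, \b will not match right before a leading asterisk or right after a trailing one. What you want is the word character class \w. The regex shown above matches optional word characters, then one or more literal asterisks (\*+), then optional word characters again; the outer + repeats that unit so that a word with several separated asterisks is matched as a whole. Note that the quantifier after \* must be + and not the asterisk quantifier, as each match must contain at least one asterisk (this is what excludes 'star'). Finally, the expression is wrapped in a capturing group for later use.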

Python code:

import re

pattern = r"(\w*\*+\w*)+"
text = "star *tar s*ar st*r sta* (*tar) (sta*) sta*"
p = re.findall(pattern, text)

Edit: thanks to @benvc, I was able to update my expression to exclude ‘star’.
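One thing to watch with the capturing group: when a pattern contains a single group, re.findall returns the text of that group rather than the whole match, and for a repeated group that is only its last repetition. The sketch below uses a made-up input with several asterisks to contrast this with a non-capturing group, which is one possible way to get the whole word back.

```python
import re

text = 's*a**r'

# Capturing group repeated by +: findall reports only the last repetition.
print(re.findall(r'(\w*\*+\w*)+', text))
# ['**r']

# Non-capturing group: findall reports the full match.
print(re.findall(r'(?:\w*\*+\w*)+', text))
# ['s*a**r']

# Every part except \*+ is optional, so a lone asterisk also matches.
print(re.findall(r'(?:\w*\*+\w*)+', 'a * b'))
# ['*']
```

On the string from the question each word has only one asterisk, so the group repeats once and both forms return the same seven matches.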

Solution 4:

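You can try this one. It is even simpler, but note that it matches every run of word characters and asterisks, so words without an asterisk (such as 'star') are returned too.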

import re

text = 'star *tar s*ar st*r sta* (*tar) (sta*) sta*.'

p = re.findall(r'[\w*]+', text)
print(p)

Output:

['star', '*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
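Since [\w*]+ also returns words without an asterisk (note 'star' in the output above), one possible follow-up is to filter the list afterwards. This is only a sketch, assuming the goal is the seven asterisk-containing words from the question; the names candidates and matches are illustrative.

```python
import re

text = 'star *tar s*ar st*r sta* (*tar) (sta*) sta*.'

candidates = re.findall(r'[\w*]+', text)

# Keep tokens that contain an asterisk and at least one other character
# (w.strip('*') is empty for tokens made only of asterisks).
matches = [w for w in candidates if '*' in w and w.strip('*')]
print(matches)
# ['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
```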
