Regex: Match Word With Intrusive Symbol
Solution 1:
Thanks for editing in the expected output.
So, in addition to the excellent solution by @benvc, this one takes repeated asterisks into account: if you are looking to capture text that has multiple *'s, the entire word will be captured and none of its other *'s will be ignored.
# Acting on your original text string
>>> import re
>>> text = "star *tar s*ar st*r sta* (*tar) (sta*) sta*."
>>> re.findall(r'((?:[a-z\*]*(?:\*)(?:[a-z\*]*)))+', text)
['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
# Acting on a slightly **MORE COMPLEX** string and returning it accurately
>>> text = "*tar *tar* star s*a**r *st*r* sta* (*tar) st*r** (sta**) s*ta*."
>>> re.findall(r'((?:[a-z\*]*(?:\*)(?:[a-z\*]*)))+', text)
['*tar', '*tar*', 's*a**r', '*st*r*', 'sta*', '*tar', 'st*r**', 'sta**', 's*ta*']
Let me know if you would like me to explain how this works, in case you need it for future reference.
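As a rough reading of the pattern above: each match is a run of lowercase letters and asterisks that contains at least one literal asterisk. The sketch below compares it with a shorter pattern that appears to be equivalent on this input. The names ORIGINAL and SIMPLIFIED are only illustrative, and the simplification is an assumption that was not part of the original answer.

```python
import re

# Pattern from the answer above, written as a raw string.
ORIGINAL = r'((?:[a-z\*]*(?:\*)(?:[a-z\*]*)))+'

# Assumed simplification: letters/asterisks, one required asterisk,
# then more letters/asterisks. No groups, so findall returns whole matches.
SIMPLIFIED = r'[a-z*]*\*[a-z*]*'

text = "*tar *tar* star s*a**r *st*r* sta* (*tar) st*r** (sta**) s*ta*."

original_matches = re.findall(ORIGINAL, text)
simplified_matches = re.findall(SIMPLIFIED, text)

print(original_matches)
print(simplified_matches)
```

Both calls should print the same list for this text. Note that either pattern only covers lowercase ASCII letters, so uppercase letters or digits would need a wider character class.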
Solution 2:
You don't need the word boundaries with re.findall since it will find all the matches in a string for your specified regex. You also need to ensure that the match includes at least one word character so you don't match a single asterisk. For example:
import re
text = 'star *tar s*ar st*r sta* (*tar) (sta*) sta*.'
matches = re.findall(r'\w+\*\w*|\w*\*\w+', text)
print(matches)
# ['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
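To illustrate the point about not matching a single asterisk, here is a small sketch. The helper name starred_words and the sample string are made up for this example and are not from the original answer.

```python
import re

# Either word characters before the asterisk, or word characters after it;
# at least one word character is required in both alternatives.
PATTERN = re.compile(r'\w+\*\w*|\w*\*\w+')

def starred_words(text):
    """Hypothetical helper: return the words in text that contain an asterisk."""
    return PATTERN.findall(text)

result = starred_words('a * b *c d*')
print(result)
# ['*c', 'd*']
```

The standalone * between a and b is skipped because neither alternative can match without a word character next to it.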
Solution 3:
Try using this regex:
(\w*\*+\w*)+
First off, I suggest using an online tool such as regexr.com to test your regexes.
Second, \b matches a word boundary (the start or end of a word), not the characters of the word itself. What you want is the word character \w. The regex shown above matches optional word characters, one or more asterisks, then optional word characters, and the outer + lets that unit repeat so entire words are matched instead of just fragments. Because at least one asterisk is required, plain words are skipped. Note that the outer quantifier cannot be the asterisk quantifier, since that would also allow empty matches. Finally, the expression is wrapped in a capturing group for later use.
Python code:
import re
pattern = r"(\w*\*+\w*)+"
text = "star *tar s*ar st*r sta* (*tar) (sta*) sta*"
p = re.findall(pattern, text)
print(p)
# ['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
Edit: thanks to @benvc, I was able to update my expression to exclude ‘star’.
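One thing to watch for with this pattern: when a capturing group is repeated by a quantifier, re.findall returns only the last repetition of the group, not the whole match. The sketch below shows the difference on a made-up input with two asterisks in one word; switching to a non-capturing group is one possible adjustment, not something the original answer proposed.

```python
import re

text = "s*a*r st*r"

# Capturing group repeated by +: findall returns only the last repetition.
captured = re.findall(r'(\w*\*+\w*)+', text)
print(captured)
# ['*r', 'st*r']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'(?:\w*\*+\w*)+', text)
print(whole)
# ['s*a*r', 'st*r']
```

For the sample text in the question, every word contains a single asterisk, so both forms give the same result there.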
Solution 4:
You can try this one. It is even simpler, although it also matches plain words that contain no asterisk, such as 'star'.
import re
text = 'star *tar s*ar st*r sta* (*tar) (sta*) sta*.'
p = re.findall(r'[\w*]+', text)
print(p)
Output:
['star', '*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
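If only the words containing an asterisk are wanted, one option is to filter the result afterwards. This is a minimal sketch; the variable names candidates and starred are illustrative only.

```python
import re

text = 'star *tar s*ar st*r sta* (*tar) (sta*) sta*.'

# Collect every run of word characters and asterisks...
candidates = re.findall(r'[\w*]+', text)

# ...then keep only the runs that actually contain an asterisk.
starred = [word for word in candidates if '*' in word]
print(starred)
# ['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
```

This keeps the regex simple at the cost of a second pass over the matches. A lone * in the text would still pass the filter.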