The Output Of Re.split In Python Doesn't Make Sense To Me
Solution 1:
"The patterns that split the string are turned into '' in the output list sometimes, but disappear at other times."
No, the pattern (or what it matched) is never included in your output there. Those '' are what's between the matches, because that's what re.split does: it returns the pieces of the string that lie between the matches. Your example:
>>> re.split(r'\d','sdfsfdsfds123212fdsf2')
['sdfsfdsfds', '', '', '', '', '', 'fdsf', '']
You're splitting on single digits, and the substring '123212' has six adjacent digits, so there are five empty strings between them. That's where the five consecutive empty strings in the output come from. The final '' is what follows the trailing '2'.
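To make the counting concrete, here is a small sketch (the variable names are illustrative, not from the question) that compares splitting on single digits with splitting on runs of digits, and shows one way to drop the empty strings if they are not wanted:

```python
import re

text = 'sdfsfdsfds123212fdsf2'

# One split per digit: the five gaps between the six adjacent digits
# are empty, and so is the piece after the trailing '2'.
per_digit = re.split(r'\d', text)

# One split per run of digits: '123212' now acts as a single separator.
per_run = re.split(r'\d+', text)

# Drop the empty pieces afterwards if only non-empty parts are needed.
non_empty = [piece for piece in per_digit if piece]

print(per_digit)  # ['sdfsfdsfds', '', '', '', '', '', 'fdsf', '']
print(per_run)    # ['sdfsfdsfds', 'fdsf', '']
print(non_empty)  # ['sdfsfdsfds', 'fdsf']
```

Note that r'\d+' still leaves a trailing '' because the string ends with a separator.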
Solution 2:
The output isn't weird, it's intentional. From the docs:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
That way, separator components are always found at the same relative indices within the result list.
The last sentence of that quote is the reason this is done. The same applies to "empty" sequences inside the string and to non-capturing separators. Basically, there is content before and after every separator, even if the separator is not captured and either piece of content is empty. The similar method str.split actually does the same when given an explicit separator.
This allows you to always reconstruct the initial string if you know the separator. Capturing the separator and joining with '' is equivalent to dropping the separator and re-inserting it when joining. For example, with sep = ':', the following comparison is true:
''.join(re.split('(%s)' % sep, ':::words::words:::')) == sep.join(re.split('%s' % sep, ':::words::words:::'))
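The round trip can be checked directly. This is a minimal sketch, assuming a literal separator; the call to re.escape is my addition, so that a separator containing regex metacharacters (such as '.' or '|') would still be matched literally:

```python
import re

sep = ':'                      # illustrative separator
text = ':::words::words:::'
pattern = re.escape(sep)       # match the separator literally

with_sep = re.split('(%s)' % pattern, text)   # separators kept in the list
without_sep = re.split(pattern, text)         # separators dropped

# Both forms rebuild the original string exactly.
print(''.join(with_sep) == text)      # True
print(sep.join(without_sep) == text)  # True
```

The reconstruction only works because the empty strings at the start, at the end, and between adjacent separators are preserved.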
Solution 3:
First of all, the third positional parameter of re.split is maxsplit, not flags, so you're essentially providing the maxsplit=10 argument instead of flags=re.I|re.
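The question's original call is not shown here, so the following is only a sketch with a made-up string and pattern. It illustrates how a flag passed as the third positional argument is read as maxsplit (re.I has the integer value 2), and how passing it by keyword fixes that:

```python
import re

text = 'axbXcxdxe'   # hypothetical input, not from the question

# Third positional argument is maxsplit, so re.I means "at most 2 splits",
# and the match itself stays case-sensitive.
positional = re.split(r'x', text, re.I)

# Passing the flag by keyword gives the intended case-insensitive split.
keyword = re.split(r'x', text, flags=re.I)

print(int(re.I))   # 2
print(positional)  # ['a', 'bXc', 'dxe']
print(keyword)     # ['a', 'b', 'c', 'd', 'e']
```

Recent Python versions may also emit a DeprecationWarning when maxsplit is passed positionally, which is one more reason to use keywords for both maxsplit and flags.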
Secondly, the separators are not turned into ''; instead, each '' is the string between two separators:
>>> re.split(r':', 'foo:bar::baz:')
['foo', 'bar', '', 'baz', '']
Notice the '' between the 2 adjacent separators, and also at the end.
The separators themselves are not in the result, unless your regular expression contains capturing groups ((...)):
>>> re.split(r'(:)', 'foo:bar::baz:')
['foo', ':', 'bar', ':', '', ':', 'baz', ':', '']
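As a small follow-up sketch (the variable names are mine): if grouping is needed without putting the separators into the result, a non-capturing group (?:...) can be used. With a single capturing group, the fields and separators alternate, so slicing separates them again:

```python
import re

text = 'foo:bar::baz:'

# Capturing group: the matched separators are included in the result.
captured = re.split(r'(:)', text)

# Non-capturing group: grouping only, separators are left out.
grouped = re.split(r'(?::)', text)

# With one capturing group, even indices are fields and odd indices
# are the separators that were matched.
fields = captured[0::2]
separators = captured[1::2]

print(grouped)     # ['foo', 'bar', '', 'baz', '']
print(fields)      # ['foo', 'bar', '', 'baz', '']
print(separators)  # [':', ':', ':', ':']
```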
Third: even though r'\d*' would ordinarily match at the beginning of a string, at the end of the string, and between each character, currently only non-zero-length matches are considered by re.split, so that pattern behaves like r'\d+'. However, such behaviour is subject to change in Python 3.6 and later, and Python 3.5 emits the warning "FutureWarning: split() requires a non-empty pattern match." (As of Python 3.7, re.split does split on empty matches.)
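The difference is easy to observe. This is a hedged sketch with an illustrative input: the r'\d+' result is the same on every version, while the r'\d*' result depends on the interpreter, so the expected output is only stated for Python 3.7 and later:

```python
import re
import sys

text = 'a1b'   # illustrative input

# Only non-empty matches are possible: same result on every version.
plus = re.split(r'\d+', text)

# The pattern can match the empty string: result depends on the version.
star = re.split(r'\d*', text)

print(plus)  # ['a', 'b']
if sys.version_info >= (3, 7):
    # Empty matches also split here, so extra '' entries appear.
    print(star)  # ['', 'a', '', 'b', '']
```

Using r'\d+' whenever at least one digit is required avoids the version dependence entirely.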