Python Exception Thrown By Libtidy Is Amusingly Impossible To Catch
Solution 1:
I managed to reproduce the problem on Windows (I saved the HTML snippet in a file). Below is the last code variant.
code00.py:
#!/usr/bin/env python

import sys
import os
import threading

# Built tidy.dll in the cwd, this is needed for it to be found
os.environ["PATH"] += os.pathsep + os.path.abspath(os.path.dirname(__file__))

from tidylib import tidy_document


def main(*argv):
    print("main - TID: {0:d}".format(threading.get_ident()))
    mode = "rb"
    raw_content = open("content.html", mode=mode).read()
    enc = "utf-8" if len(sys.argv) < 2 else sys.argv[1]
    html_content = raw_content.decode(enc)
    print(html_content.encode(enc) == raw_content)
    with open("content_utf8.html", "w", encoding=enc) as fout:
        fout.write(html_content)
    try:
        xhtml_doc, errors = tidy_document(html_content)
    except UnicodeDecodeError as ude:
        print("Caught the exception:", ude)
    except UnicodeError as ue:
        print("Caught the exception:", ue)
    except Exception as ex:
        print("Caught the exception:", ex)
    except:
        print("Caught an exception")


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q059054833]> "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\Scripts\python.exe" code00.py
Python 3.8.7 (tags/v3.8.7:6503f05, Dec 21 2020, 17:59:51) [MSC v.1928 64 bit (AMD64)] 64bit on win32

main - TID: 9528
True
Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
Traceback (most recent call last):
  File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data
Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
Traceback (most recent call last):
  File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 0: invalid start byte
Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
Traceback (most recent call last):
  File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte

Done.
I tested (by temporarily modifying sink.py), and the callback and the calling code are indeed in the same thread. Then I looked more closely at the stack trace and figured it out:
1. PyTidyLib calls some C code from the backend Tidy library (tidy.dll), via CTypes
2. The (above) C code calls some Python code (Sink.put_byte), as a callback that was passed to it together with the arguments
3. The (Python) code from the previous step raises an exception, but the underlying C code (that calls it) doesn't "know" how to pass it back to #1., as it has no Python "knowledge" whatsoever (so the exception "dies" there)
That's why you couldn't catch it in Python.
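The steps above can be reproduced without Tidy at all. The following is only a minimal sketch: put_byte here is a hypothetical stand-in for Sink.put_byte (not PyTidyLib's actual code), and the callback is invoked through its CTypes-generated C thunk instead of through tidy.dll. call_with_capture is one possible workaround under those assumptions, not something PyTidyLib provides.

```python
import ctypes

# Hypothetical stand-in for Sink.put_byte: decodes one byte at a time.
def put_byte(byte_value):
    bytes([byte_value]).decode("utf-8")

# C signature assumed for illustration: void (*)(unsigned char)
CALLBACK_TYPE = ctypes.CFUNCTYPE(None, ctypes.c_ubyte)
c_put_byte = CALLBACK_TYPE(put_byte)


def call_through_c(byte_value):
    """Call the callback via its C thunk; return True if the exception reached us."""
    try:
        c_put_byte(byte_value)
    except UnicodeDecodeError:
        return True
    # The exception was reported on stderr ("Exception ignored ...") and dropped.
    return False


def call_with_capture(byte_value):
    """Workaround sketch: catch inside the callback, re-raise once back in Python."""
    captured = []

    def safe_put_byte(value):
        try:
            put_byte(value)
        except Exception as exc:  # nothing may escape into the C layer
            captured.append(exc)

    callback = CALLBACK_TYPE(safe_put_byte)
    callback(byte_value)
    if captured:
        raise captured[0]


if __name__ == "__main__":
    print("Exception reached the caller:", call_through_c(0xE7))
    try:
        call_with_capture(0xE7)
    except UnicodeDecodeError as ude:
        print("Caught after capture:", ude)
```

The point of the second function is that the exception has to be stored on the Python side of the boundary and raised again after the C call returns, since the C layer has no way of carrying it.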
I tried reading the file with several other encodings, but no luck. Then I did some more debugging, and it seems like there are 3 invalid UTF-8 characters (\xE7, \xAA, \xB6 - when combined with other ones) in your file. Of course, trying to decode a UTF-8 character out of a single byte seems strange to me, but that might be a PyTidyLib bug.
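To illustrate why decoding one byte at a time is suspicious, here is a small sketch using only the standard library. It assumes the three bytes from the traceback were adjacent in the file, which I didn't verify; the function names are made up for this example.

```python
import codecs

# The three bytes reported in the traceback, assumed here to be adjacent.
SAMPLE = b"\xe7\xaa\xb6"


def decode_per_byte(data):
    """Decode each byte on its own and collect the failure reasons."""
    reasons = []
    for index in range(len(data)):
        try:
            data[index:index + 1].decode("utf-8")
        except UnicodeDecodeError as ude:
            reasons.append(ude.reason)
    return reasons


def decode_incrementally(data):
    """Feed the same bytes one by one to a decoder that keeps state between calls."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(data[index:index + 1]) for index in range(len(data))]
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


if __name__ == "__main__":
    print(decode_per_byte(SAMPLE))
    print(ascii(decode_incrementally(SAMPLE)))
```

Decoded byte by byte, the sample gives the same three error reasons as the output above, while an incremental decoder turns the same bytes into a single character. If the bytes really were adjacent, that would point at the per-byte decoding rather than at the file.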
Update #0
Since I had to build tidy.dll to do all the tests (as I didn't want to start Linux VMs or install the .whl under Cygwin), I also uploaded it (and other artifacts) to [GitHub]: CristiFati/Prebuilt-Binaries - Prebuilt-Binaries/HTML-Tidy/v5.7.28.