We can no longer rely on the confidence value from chardet, since it is always 1 after the move to the C-based chardet library

So for files where we assume utf-8, use utf-8 if no explicit encoding is
found. Fixes #1993029 [Apostrophe in book title turns into "à€™" upon import](https://bugs.launchpad.net/calibre/+bug/1993029)
Kovid Goyal 2022-10-15 18:02:11 +05:30
parent 74208b5330
commit ad34b0ea3b

@@ -154,6 +154,11 @@ def detect_xml_encoding(raw, verbose=False, assume_utf8=False):
             encoding = encoding.decode('ascii', 'replace')
             break
     if encoding is None:
+        if assume_utf8:
+            try:
+                return raw.decode('utf-8'), 'utf-8'
+            except UnicodeDecodeError:
+                pass
         encoding = force_encoding(raw, verbose, assume_utf8=assume_utf8)
     if encoding.lower().strip() == 'macintosh':
         encoding = 'mac-roman'
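
The fallback order introduced by this change can be illustrated with a minimal standalone sketch. It is not calibre's code: `decode_with_utf8_preference` and its `fallback_encoding` parameter are hypothetical names, and the fixed `cp1252` fallback merely stands in for whatever a detector such as `force_encoding` would guess. The idea shown is only that a strict UTF-8 decode is attempted first when UTF-8 is assumed, and a guess is used only if that decode fails.

```python
def decode_with_utf8_preference(raw, assume_utf8=False, fallback_encoding='cp1252'):
    """Hypothetical helper (not part of calibre): decode bytes, trying UTF-8 first.

    fallback_encoding is an assumption standing in for a detector's guess.
    Returns a (text, encoding_used) tuple.
    """
    if assume_utf8:
        try:
            # Strict decoding: invalid UTF-8 raises instead of being silently accepted.
            return raw.decode('utf-8'), 'utf-8'
        except UnicodeDecodeError:
            pass  # Not valid UTF-8, so fall through to the guessed encoding.
    return raw.decode(fallback_encoding, 'replace'), fallback_encoding


if __name__ == '__main__':
    title = 'It\u2019s'.encode('utf-8')  # contains a typographic apostrophe
    print(decode_with_utf8_preference(title, assume_utf8=True))
    print(decode_with_utf8_preference(title, assume_utf8=False))
```

With `assume_utf8=True` the apostrophe survives intact. With a wrong single-byte guess, the same bytes decode to mojibake of the kind described in the bug report, although which encoding was actually guessed in that report is not stated here.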