Converting between str or bytes data and Unicode characters

Created: November-22, 2018

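The contents of files and network messages may represent encoded characters. They usually need to be decoded to Unicode before they can be displayed properly.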
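As a minimal sketch of that situation in Python 3, the snippet below writes the UTF-8 bytes for "© abc" to a temporary file and reads them back two ways: as raw bytes that are decoded afterwards, and in text mode with the encoding named up front. The temporary file and the variable names are illustrative assumptions, not part of the original text.

```python
import os
import tempfile

# Hypothetical sample data: the text "© abc" stored as UTF-8 bytes.
raw = b'\xc2\xa9 abc'

# Write the bytes to a temporary file so there is something to read back.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(raw)
    path = f.name

try:
    # Binary mode returns the bytes exactly as they are stored.
    with open(path, 'rb') as f:
        data = f.read()

    # Text mode decodes while reading, using the encoding we name.
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()
finally:
    os.remove(path)

# Decoding the raw bytes by hand gives the same Unicode string.
decoded = data.decode('utf-8')
print(decoded == text)
```

Both routes end with the same Unicode string; the only difference is whether the decoding step is written out or left to `open`.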

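In Python 2, you may need to convert str data to Unicode characters. The default string literal ('', "", etc.) is a byte string, usually treated as ASCII, in which any value outside the ASCII range is displayed as an escaped value. Unicode strings are written as u'' (or u"", etc.).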

Python 2.x >= 2.3

# You get "© abc" encoded in UTF-8 from a file, network, or other data source

s = '\xc2\xa9 abc'  # s is a byte array, not a string of characters
                    # Doesn't know the original was UTF-8
                    # Default form of string literals in Python 2
s[0]                # '\xc2' - meaningless byte (without context such as an encoding)
type(s)             # str - even though it's not a useful one w/o having a known encoding

u = s.decode('utf-8')  # u'\xa9 abc'
                       # Now we have a Unicode string, which can be encoded back to UTF-8 and printed properly
                       # In Python 2, Unicode string literals need a leading u
                       # str.decode converts a string which may contain escaped bytes to a Unicode string
u[0]                # u'\xa9' - Unicode Character 'COPYRIGHT SIGN' (U+00A9) '©'
type(u)             # unicode

u.encode('utf-8')   # '\xc2\xa9 abc'
                    # unicode.encode produces a string with escaped bytes for non-ASCII characters

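In Python 3, you may need to convert arrays of bytes (written as "bytes literals") to strings of Unicode characters. The default string literal is now a Unicode string, and byte string literals must be entered as b'', b"", etc. A bytes value returns True for isinstance(some_val, bytes), where some_val is a value that might be text encoded as bytes.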
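The `isinstance` check can be used to accept either kind of value and always hand back text. The following is only a sketch for Python 3: `to_text` is a hypothetical helper name, and UTF-8 as the default encoding is an assumption about the data source.

```python
def to_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(isinstance(b'abc', bytes))   # True
print(isinstance('abc', bytes))    # False

# Bytes and text inputs end up as the same Unicode string.
print(to_text(b'\xc2\xa9 abc') == to_text('\u00a9 abc'))  # True
```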

Python 3.x >= 3.0

# You get from file or network "© abc" encoded in UTF-8

s = b'\xc2\xa9 abc' # s is a byte array, not characters
                    # In Python 3, the default string literal is Unicode; byte array literals need a leading b
s[0]                # 194 (0xc2) - indexing bytes in Python 3 yields an int; s[0:1] is b'\xc2'
                    # Either way, a meaningless byte (without context such as an encoding)
type(s)             # bytes - now that byte arrays are explicit, Python can show that.

u = s.decode('utf-8')  # '© abc' on a Unicode terminal
                       # bytes.decode converts a byte array to a string (which will, in Python 3, be Unicode)
u[0]                # '©' (equal to '\u00a9') - Unicode Character 'COPYRIGHT SIGN' (U+00A9)
type(u)             # str
                    # A Python 3 str holds Unicode code points; UTF-8 is just one way to encode them as bytes

u.encode('utf-8')   # b'\xc2\xa9 abc'
                    # str.encode produces a byte array, showing ASCII-range bytes as unescaped characters.
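
The comments above call a byte "meaningless" without a known encoding. A short Python 3 sketch of why: the same bytes decode to different text under different encodings, and bytes that are invalid in the chosen encoding raise an error unless an error handler is given. The choice of Latin-1 as the "wrong" encoding is purely illustrative.

```python
raw = b'\xc2\xa9 abc'           # UTF-8 encoding of "© abc"

right = raw.decode('utf-8')     # the bytes \xc2\xa9 become one character
wrong = raw.decode('latin-1')   # each byte becomes its own character

print(len(right))               # 5
print(len(wrong))               # 6
print(wrong == '\xc2\xa9 abc')  # True: an extra 'Â' appears before '©'

# A lone \xa9 is not valid UTF-8, so decoding raises an error ...
try:
    b'\xa9 abc'.decode('utf-8')
except UnicodeDecodeError as exc:
    print('cannot decode:', exc.reason)

# ... unless an error handler is given; 'replace' substitutes U+FFFD.
print(b'\xa9 abc'.decode('utf-8', errors='replace') == '\ufffd abc')  # True
```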