Conversion between str or bytes data and unicode characters
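The contents of files and network messages may represent encoded characters. They usually need to be decoded to Unicode before they can be displayed correctly.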
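As a minimal sketch of that conversion in Python 3, the snippet below uses `io.BytesIO` as a stand-in for a file or socket. The stand-in, the payload, and the choice of UTF-8 are assumptions made for illustration; raw bytes do not record their own encoding, so the reader has to know it from context.

```python
import io

# Hypothetical stand-in for a file or network stream that yields raw bytes
stream = io.BytesIO(b'\xc2\xa9 abc')

raw = stream.read()          # bytes, exactly as stored or transmitted
text = raw.decode('utf-8')   # decode with the encoding we assume the sender used

print(type(raw).__name__, len(raw))    # bytes 6
print(type(text).__name__, len(text))  # str 5
```

The two lengths differ because the copyright sign occupies two bytes in UTF-8 but is a single character once decoded.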
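In Python 2, you may need to convert str data to Unicode characters. The default string literal ('', "", etc.) is a byte string of type str; any byte outside the ASCII range is displayed as an escaped value. Unicode string literals are written u'' (or u"", etc.).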
Python 2.x >= 2.3
# You get "© abc" encoded in UTF-8 from a file, network, or other data source
s = '\xc2\xa9 abc' # s is a byte array, not a string of characters
# Doesn't know the original was UTF-8
# Default form of string literals in Python 2
s[0] # '\xc2' - meaningless byte (without context such as an encoding)
type(s) # str - even though it's not a useful one w/o having a known encoding
u = s.decode('utf-8') # u'\xa9 abc'
# Now we have a Unicode string, which can be printed properly
# In Python 2, Unicode string literals need a leading u
# str.decode converts a byte string (shown with escapes for non-ASCII bytes) to a Unicode string
u[0] # u'\xa9' - Unicode Character 'COPYRIGHT SIGN' (U+00A9) '©'
type(u) # unicode
u.encode('utf-8') # '\xc2\xa9 abc'
# unicode.encode produces a byte string (str); non-ASCII bytes are shown as escapes
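In Python 3, you may need to convert arrays of bytes (written as byte literals) to strings of Unicode characters. The default string literal is now a Unicode string, and byte string literals must be written as b'', b"", etc. A bytes value returns True for isinstance(some_val, bytes), where some_val is a value that may hold encoded text as bytes.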
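The isinstance check can be used to normalize a value that may arrive as either type. The following is only a sketch: `to_text` is a hypothetical helper name, and the UTF-8 default is an assumption that must match however the bytes were actually encoded.

```python
def to_text(some_val, encoding='utf-8'):
    """Return some_val as str, decoding it first if it is bytes.

    Hypothetical helper for illustration; the utf-8 default is an assumption.
    """
    if isinstance(some_val, bytes):
        return some_val.decode(encoding)
    return some_val


# Both inputs end up as the same Unicode string
print(to_text(b'\xc2\xa9 abc') == to_text('\u00a9 abc'))  # True
```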
Python 3.x >= 3.0
# You get from file or network "© abc" encoded in UTF-8
s = b'\xc2\xa9 abc' # s is a byte array, not characters
# In Python 3, the default string literal is Unicode; byte array literals need a leading b
s[0] # 194 (0xc2) - indexing bytes yields an int in Python 3; s[0:1] gives b'\xc2', a meaningless byte (without context such as an encoding)
type(s) # bytes - now that byte arrays are explicit, Python can show that.
u = s.decode('utf-8') # '© abc' on a Unicode terminal
# bytes.decode converts a byte array to a string (which will, in Python 3, be Unicode)
u[0] # '©' (same as '\u00a9') - Unicode Character 'COPYRIGHT SIGN' (U+00A9)
type(u) # str
# A Python 3 str is a sequence of Unicode code points; UTF-8 is only one way to encode it as bytes
u.encode('utf-8') # b'\xc2\xa9 abc'
# str.encode produces a byte array, showing ASCII-range bytes as unescaped characters.
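To illustrate why a byte is meaningless without a known encoding, here is a small Python 3 sketch that decodes the same bytes two ways. The use of Latin-1 as the "wrong" guess is an assumption chosen for illustration; it accepts every byte value, so it never raises and instead returns different text.

```python
s = b'\xc2\xa9 abc'

as_utf8 = s.decode('utf-8')      # '© abc': two bytes form one character
as_latin1 = s.decode('latin-1')  # 'Â© abc': every byte becomes one character

print(len(as_utf8), len(as_latin1))       # 5 6
print(as_utf8.encode('utf-8') == s)       # True: round trip with the matching encoding
print(as_latin1.encode('latin-1') == s)   # True: same bytes, but different text
```

Both round trips reproduce the original bytes, yet only one decoding yields the intended text, which is why the encoding has to be known from the data source rather than inferred from the bytes alone.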