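Conversion between str or bytes data and unicode characters
The contents of files and network messages may represent encoded characters. They usually need to be decoded to Unicode before they can be displayed correctly.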
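Why the encoding matters can be seen in a minimal Python 3 sketch. The variable names (`raw`, `as_utf8`, `as_latin1`) are illustrative only, and the choice of Latin-1 as the "wrong guess" is an assumption made for the demonstration:

```python
# Python 3 sketch: the same bytes yield different text under different encodings.
raw = b'\xc2\xa9 abc'              # bytes as they might arrive from a file or socket

as_utf8 = raw.decode('utf-8')      # correct guess: the 2-byte sequence C2 A9 becomes '©'
as_latin1 = raw.decode('latin-1')  # wrong guess: every byte becomes its own character

print(len(raw), len(as_utf8), len(as_latin1))  # 6 5 6
```

With the correct encoding the six bytes decode to the five characters of "© abc". With the wrong one an extra "Â" appears in front of the copyright sign, which is the typical symptom of UTF-8 data read as Latin-1.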
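In Python 2, you may need to convert str data to Unicode characters. The default string literal ('', "", etc.) is a byte string that is treated as ASCII, and any value outside the ASCII range is displayed as an escaped value. Unicode strings are written u'' (or u"", etc.).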
Python 2.x >= 2.3
# You get "© abc" encoded in UTF-8 from a file, network, or other data source
s = '\xc2\xa9 abc' # s is a byte array, not a string of characters
# Doesn't know the original was UTF-8
# Default form of string literals in Python 2
s[0] # '\xc2' - meaningless byte (without context such as an encoding)
type(s) # str - even though it's not a useful one w/o having a known encoding
u = s.decode('utf-8') # u'\xa9 abc'
# Now we have a Unicode string (decoded from UTF-8), which can be printed properly
# In Python 2, Unicode string literals need a leading u
# str.decode converts a string which may contain escaped bytes to a Unicode string
u[0] # u'\xa9' - Unicode Character 'COPYRIGHT SIGN' (U+00A9) '©'
type(u) # unicode
u.encode('utf-8') # '\xc2\xa9 abc'
# unicode.encode produces a string with escaped bytes for non-ASCII characters
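In Python 3, you may need to convert arrays of bytes (written as "bytes literals") to strings of Unicode characters. The default string literal is now a Unicode string, and byte string literals must be entered as b'', b"", etc. A bytes value returns True for isinstance(some_val, bytes), where some_val is a value that may hold encoded bytes rather than text.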
Python 3.x >= 3.0
# You get from file or network "© abc" encoded in UTF-8
s = b'\xc2\xa9 abc' # s is a byte array, not characters
# In Python 3, the default string literal is Unicode; byte array literals need a leading b
s[0] # 194 - indexing bytes gives an int in Python 3; s[0:1] is b'\xc2', a meaningless byte without an encoding
type(s) # bytes - now that byte arrays are explicit, Python can show that.
u = s.decode('utf-8') # '© abc' on a Unicode terminal
# bytes.decode converts a byte array to a string (which will, in Python 3, be Unicode)
u[0] # '©' (equal to '\u00a9') - Unicode Character 'COPYRIGHT SIGN' (U+00A9)
type(u) # str
# In Python 3, str is a sequence of Unicode code points; UTF-8 is only the default source and encode/decode encoding
u.encode('utf-8') # b'\xc2\xa9 abc'
# str.encode produces a byte array, showing ASCII-range bytes as unescaped characters.
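The isinstance check and the decode call above can be combined into a small normalizing function. The following is a sketch for Python 3 only; `to_text` is a hypothetical helper name, not part of the standard library, and the choice of UTF-8 as the default encoding is an assumption:

```python
def to_text(value, encoding='utf-8', errors='strict'):
    """Return value as str, decoding it first if it holds bytes (hypothetical helper)."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding, errors)
    return value  # already text, nothing to decode

decoded = to_text(b'\xc2\xa9 abc')       # bytes are decoded to '© abc'
unchanged = to_text('\u00a9 abc')        # str passes through untouched

# b'\xff' can never appear in valid UTF-8. With errors='strict' (the default)
# decoding raises UnicodeDecodeError; errors='replace' substitutes U+FFFD instead.
lossy = to_text(b'\xff abc', errors='replace')
```

Using errors='replace' keeps the program running on malformed input at the cost of losing the original byte, so it suits display code better than code that must round-trip the data.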