str, bytes, unicode

Python2 / Python3

str does not mean the same thing in Python2 and Python3.

A str literal is enclosed in '' or "".

A unicode literal is '' or "" with a u prefix.

A bytes literal is '' or "" with a b prefix.

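In Python2, an ordinary string literal has type str. It is usually described as a sequence of characters enclosed in quotes, but each element is really a single byte.

Those "characters" are what Python3 calls bytes. In other words, Python3's bytes can be regarded as Python2's str, and Python3's str corresponds to Python2's unicode.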

# s = 'abc' # str
# u = u'abc' # unicode
# b = b'abc' # bytes

# Python2
>>> s = 'abc'
>>> u = u'abc'
>>> b = b'abc'
>>> s, u, b
('abc', u'abc', 'abc') # s and b are the same

# Python3
>>> s = 'abc'
>>> u = u'abc'
>>> b = b'abc'
>>> s, u, b
('abc', 'abc', b'abc') # s and u are the same
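
The same relationship can be checked with type and equality comparisons. The following is a minimal sketch that assumes a Python3 interpreter; the variable names are only illustrative.

```python
# Python3 only: str is text, bytes is raw bytes.
s = 'abc'
u = u'abc'   # the u prefix is accepted (Python 3.3+) but changes nothing
b = b'abc'

same_type = type(s) is type(u)             # both are str
text_equals_bytes = (s == b)               # str and bytes never compare equal
decoded_equals = (s == b.decode('ascii'))  # equal only after decoding

print(same_type, text_equals_bytes, decoded_equals)  # True False True
```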

str <-> unicode

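In Python2, str and unicode are two string data types, and both are subclasses of basestring:

A unicode object stores an abstract sequence of code points.
A str object stores a sequence of bytes, which the computer reads directly but humans find hard to interpret. It can be mapped to a sequence of code points, and different encodings (such as UTF-8) define different mappings between byte sequences and code points.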

# Python2 (on Linux)
>>> s = '中文'

>>> s
'\xe4\xb8\xad\xe6\x96\x87'

>>> len(s)
6

>>> s = 'abc'

>>> len(s)
3

>>> u'中文' # a unicode object
u'\u4e2d\u6587'

>>> unicode('abcdef') # unicode() takes a str as its first argument; if it is pure ASCII, no other argument is needed
u'abcdef'

>>> unicode('中文') # non-ASCII bytes raise an error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

>>> unicode('中文', 'utf-8') # the encoding must be specified
u'\u4e2d\u6587'

>>> unicode('中文', 'gb18030') # a different encoding maps the same bytes to different code points, so the length differs too
u'\u6d93\ue15f\u6783'

>>> len(unicode('中文', 'gb18030'))
3

>>> len(unicode('中文', 'utf-8'))
2
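
The distinction between code points and bytes can be reproduced in Python3, where str plays the role of Python2's unicode. This is a small sketch, not part of the original transcript; it assumes the source file is saved as UTF-8 (the Python3 default).

```python
# Python3: a str holds code points; encoding turns them into bytes.
text = '中文'

code_points = [hex(ord(ch)) for ch in text]  # one entry per character
utf8_bytes = text.encode('utf-8')            # 3 bytes per character here
gb_bytes = text.encode('gb18030')            # 2 bytes per character here

print(code_points)      # ['0x4e2d', '0x6587']
print(len(text))        # 2
print(len(utf8_bytes))  # 6
print(len(gb_bytes))    # 4
```

The text has the same two code points in both cases; only the byte representation depends on the encoding.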

Converting between unicode and str in Python2

# Python2
>>> s = '中文'

>>> s.encode('utf-8') # encoding a str raises UnicodeDecodeError, because Python2 first decodes it implicitly as ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

>>> u = s.decode('utf-8') # str -> decode('the_coding_of_str') -> unicode

>>> u
u'\u4e2d\u6587'

>>> u.decode('utf-8') # decoding a unicode raises UnicodeEncodeError, because Python2 first encodes it implicitly as ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

>>> u.encode('utf-8') # unicode -> encode('the_coding_you_want') -> str
'\xe4\xb8\xad\xe6\x96\x87'

>>> print u.encode('utf-8')
中文

>>> print u.encode('gbk') # bytes in an encoding the terminal does not expect are displayed as garbage
����
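
The garbled output comes from GBK bytes being interpreted as UTF-8. A rough Python3 sketch of the same effect follows; it uses errors='replace' to imitate a lenient terminal, which is an assumption about how the terminal behaves rather than something the transcript shows.

```python
# Python3: decode GBK bytes with the wrong codec to reproduce the garbage.
text = '中文'
gbk_bytes = text.encode('gbk')  # b'\xd6\xd0\xce\xc4'

# Strict decoding with the wrong codec fails outright.
try:
    gbk_bytes.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding substitutes U+FFFD for every undecodable byte.
garbled = gbk_bytes.decode('utf-8', errors='replace')

# Decoding with the codec that produced the bytes restores the text.
round_trip = gbk_bytes.decode('gbk')

print(strict_failed)        # True
print('\ufffd' in garbled)  # True
print(round_trip == text)   # True
```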

Encoding issues between str and unicode are troublesome in Python2. Python3 improves on this: str holds Unicode text directly, and its output is more readable.

# Python3
>>> s = '中文'

>>> s # the repr of a str is more readable
'中文'

>>> len(s) # same as the length of the unicode object in Python2
2

>>> bytes('中文')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding

>>> bytes('中文', 'utf-8') # an encoding must be specified
b'\xe4\xb8\xad\xe6\x96\x87'

>>> len(bytes('中文', 'utf-8')) # this length of 6 is the length of the str in Python2
6

>>> b'中文' # non-ASCII characters cannot be used with the b prefix
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

>>> b'abc'
b'abc'
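
Since a bytes literal cannot contain non-ASCII characters, such bytes have to be written with escapes or produced by encoding. The sketch below shows three equivalent spellings in Python3; the variable names are illustrative.

```python
# Python3: three ways to obtain the UTF-8 bytes of '中文'.
escaped = b'\xe4\xb8\xad\xe6\x96\x87'        # literal with \x escapes
encoded = '中文'.encode('utf-8')              # str.encode()
from_ctor = bytes('中文', encoding='utf-8')   # bytes() with an encoding

print(escaped == encoded == from_ctor)  # True
print(list(escaped[:3]))                # [228, 184, 173]
```

Indexing or iterating over bytes in Python3 yields integers, which is why the last line prints numbers rather than one-character strings.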

Conversion in Python3

# Python3
>>> s = '中文'

>>> s.decode('utf-8') # the error is clearer than the UnicodeEncodeError in Python2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

>>> b = s.encode('utf-8')

>>> type(b)
<class 'bytes'>

>>> b
b'\xe4\xb8\xad\xe6\x96\x87'

>>> b.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'

>>> b.decode('utf-8')
'中文'

>>> b.decode('gbk') # only decoding with the wrong codec raises UnicodeDecodeError
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 2: illegal multibyte sequence
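
Because Python3 only allows str.encode() and bytes.decode(), conversion code can be kept in two small helpers. The functions to_bytes and to_text below are hypothetical names introduced for this sketch, and UTF-8 as the default encoding is an assumption.

```python
def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is a str."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError('expected str or bytes, got ' + type(value).__name__)


def to_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    raise TypeError('expected str or bytes, got ' + type(value).__name__)


raw = to_bytes('中文')
print(raw)                        # b'\xe4\xb8\xad\xe6\x96\x87'
print(to_text(raw))               # 中文
print(to_text(to_bytes('abc')))   # abc
```

Passing values through unchanged when they already have the target type avoids the AttributeError shown above.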

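In Python3, do not use the str class for conversion

In Python3, when a variable holds a bytes object, calling str() on it does not decode it. It returns the printable representation of the bytes object as a string.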

>>> b
b'\xe4\xb8\xad\xe6\x96\x87'

>>> str(b)
"b'\\xe4\\xb8\\xad\\xe6\\x96\\x87'"

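To get the text back, decode the bytes explicitly. A minimal Python3 sketch, assuming the bytes are UTF-8 encoded:

```python
b = b'\xe4\xb8\xad\xe6\x96\x87'

as_repr = str(b)             # the repr of the bytes object, not decoded text
decoded = b.decode('utf-8')  # explicit decoding
via_str = str(b, 'utf-8')    # str() with an encoding argument also decodes

print(as_repr == repr(b))    # True
print(decoded)               # 中文
print(decoded == via_str)    # True
```

str() only decodes when an encoding is passed; b.decode() makes the intent harder to miss.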