Generating a list from strings with proper encoding (UTF-8)

Problem description:

I'm having a hard time generating a list from a string with proper UTF-8 encoding. I'm using Python (I'm just learning to program, so bear with my silly question and terrible coding).

The source file is a tweet feed (JSON format). After parsing it successfully and extracting the tweet message from everything else, I only get the text with the right encoding after a print (as a string). If I try to put it back into a list, it goes back to the unencoded u\000000 form.

My code is:

import json

with open("file_name.txt") as tweets_file:
    tweets_list = [] 
    for a in tweets_file:
        b = json.loads(a)
        tweets_list.append(b)

    tweet = []
    for i in tweets_list:
        key = "text"
        if key in i:
            t = i["text"]
            tweet.append(t)

    for k in tweet:
        print k.encode("utf-8")

As an alternative, I tried setting the encoding at the beginning (when reading the file):

import json
import codecs

tweets_file = codecs.open("file_name.txt", "r", "utf-8")
tweets_list = [] 
for a in tweets_file:
    b = json.loads(a)
    tweets_list.append(b)
tweets_file.close()

tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)

for k in tweet:
    print k

My question is: how can I put the resulting k strings into a list, with each k string as an item?

You are getting confused by the Python string representation.

When you print a Python list (or any other standard Python container), the contents are shown in a special representation to make debugging easier; each value shown is the result of calling the repr() function on that value. For string values, that means the result is a unicode string representation, which is not the same thing as what you see when the string is printed directly.

Unicode and byte strings, when shown like that, are presented as string literals: quoted values that you can copy and paste straight back into Python code without having to worry about encoding. Anything that is not a printable ASCII character is shown in escaped form. Unicode code points beyond the Latin-1 range are shown as '\u....' escape sequences. Characters in the Latin-1 range use the '\x..' escape sequence. Many control characters are shown in their one-letter escape form, such as \n and \t.
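As a rough illustration of those categories, the sketch below uses the standard unicode_escape codec, which applies escaping rules very similar to those of the Python 2 unicode repr() (without the u prefix and the quotes) and behaves the same on Python 2 and 3. The escaped() helper and the sample strings are made up for this example.

```python
def escaped(text):
    """Return the backslash-escaped ASCII form of a unicode string, as bytes."""
    return text.encode('unicode_escape')

# Beyond Latin-1: shown as a \u.... escape
beyond_latin1 = escaped(u'\u2036')
assert beyond_latin1 == b'\\u2036'

# Inside Latin-1 but outside ASCII: shown as a \x.. escape
latin1 = escaped(u'\xe9')
assert latin1 == b'\\xe9'

# Common control characters: one-letter escapes
control = escaped(u'a\nb\tc')
assert control == b'a\\nb\\tc'

# Printable ASCII is left untouched
plain = escaped(u'Hello')
assert plain == b'Hello'
```

In Python 2, repr() applies this kind of escaping to every string it shows inside a container.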

The Python interactive prompt does the same thing; when you echo a value on the prompt without using print, the value is 'represented', that is, shown in its repr() form:

>>> print u'\u2036Hello World!\u2033'
‶Hello World!″
>>> u'\u2036Hello World!\u2033'
u'\u2036Hello World!\u2033'
>>> [u'\u2036Hello World!\u2033', u'Another\nstring']
[u'\u2036Hello World!\u2033', u'Another\nstring']
>>> print _[1]
Another
string

This is entirely normal behaviour. In other words, your code works; nothing is broken.
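One way to convince yourself of this is to compare what the list displays with what it actually holds. The following is a minimal sketch with a made-up JSON line standing in for one line of the tweet file; it should run unchanged on Python 2 and 3.

```python
import json

# Hypothetical sample line; real tweet objects carry many more keys.
line = '{"text": "\\u2036Hello World!\\u2033"}'

text = json.loads(line)['text']
tweets = [text]

# The list display is assembled from repr() of each item...
assert str(tweets) == '[' + repr(text) + ']'

# ...while the stored item is still the 14-character string itself,
# which encodes to 18 bytes of UTF-8 (the two quote marks take 3 bytes each).
assert tweets[0] == u'\u2036Hello World!\u2033'
assert len(tweets[0]) == 14
assert len(tweets[0].encode('utf-8')) == 18
```

The escapes only exist in the display of the list; indexing into it gives back the same string that print showed correctly.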

To come back to your code: if you want to extract just the 'text' key from the tweet JSON structures, filter while reading the file rather than looping twice:

import json

with open("file_name.txt") as tweets_file:
    tweets = [] 
    for line in tweets_file:
        data = json.loads(line)
        if 'text' in data:
            tweets.append(data['text'])
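
If the goal is to see or save the contents of the list rather than its debugging display, act on the items, not on the list object. The sketch below wraps the loop above in a hypothetical helper, extract_texts, and joins the items into a single UTF-8 byte string. The sample lines are invented, and the code is written to run on both Python 2 and 3.

```python
import json

def extract_texts(lines):
    """Collect the 'text' value of every JSON line that has one."""
    texts = []
    for line in lines:
        data = json.loads(line)
        if 'text' in data:
            texts.append(data['text'])
    return texts

# Invented stand-ins for lines of the tweet file; the second has no 'text' key.
sample_lines = [
    '{"text": "caf\\u00e9"}',
    '{"delete": {"status": {"id": 1}}}',
    '{"text": "\\u2036quoted\\u2033"}',
]

tweets = extract_texts(sample_lines)

# One item per line, encoded only at the point of output.
output = u'\n'.join(tweets).encode('utf-8')
```

Here tweets is the list asked for, with each tweet text as one item. Encoding is left until the very end: output can be written to a file opened in binary mode, or the unicode items can be written directly to a file opened with io.open(path, 'w', encoding='utf-8').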