python - How to get string objects instead of Unicode from JSON?

Question

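I'm using Python 2 to parse JSON from ASCII-encoded text files.

When I load these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is that I have to use the data with some libraries that only accept string objects. I can't change the libraries or update them.

Is it possible to get string objects instead of Unicode ones?

Example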

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update

This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution today is to use a recent version of Python (that is, Python 3 or later).
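As a quick check of the update above, here is a minimal sketch that repeats the round trip from the example. It assumes a Python 3 interpreter, where JSON strings decode to plain str objects, so no conversion step is needed. The variable all_str is only there for illustration.

```python
import json

original_list = ['a', 'b']
json_list = json.dumps(original_list)

# On Python 3, json.loads returns str objects for JSON strings.
new_list = json.loads(json_list)
all_str = all(isinstance(item, str) for item in new_list)
print(new_list, all_str)
```

On Python 2 the same code would report False for all_str, which is exactly the situation described in the question.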

score 187 · Accepted Answer

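There are some good answers here, but I ended up using PyYAML to parse my JSON files, because it returns the keys and values as str strings instead of unicode. Since JSON is a subset of YAML, this works nicely: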

>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']

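Notes

Some things to note, though:

I get string objects because all my entries are ASCII encoded. If I used Unicode-encoded entries, I would get them back as unicode objects: there is no conversion!
You should (probably always) use PyYAML's safe_load function. If you use it to load JSON files, you don't need the "additional power" of the load function anyway.
If you want a YAML parser with more support for version 1.2 of the spec (and one that correctly parses very low numbers), try Ruamel YAML: pip install ruamel.yaml, then import ruamel.yaml as yaml.

Conversion

As stated, there is no conversion! If you can't be sure you only deal with ASCII values (and most of the time you can't), better use a conversion function.

I have used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.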

score 145 · Accepted Answer

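There is no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function converts any decoded JSON object from using unicode strings to UTF-8-encoded byte strings: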

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

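Just call this on the output you get from a json.load or json.loads call.

A couple of notes:

To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence it is significantly more complicated than my approach.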
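For readers who need the same idea to run unchanged on both Python 2 and Python 3, here is a minimal sketch. The names byteify_compat and text_type are my own, not part of the answer above. Note that on Python 3 the result contains bytes objects, which is rarely what you want there; the sketch only illustrates the recursion in a version-agnostic way.

```python
import json

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")

def byteify_compat(value, encoding="utf-8"):
    """Recursively encode every text string in a decoded JSON value."""
    if isinstance(value, dict):
        # dict([...]) instead of a dict comprehension also runs on Python 2.6.
        return dict([(byteify_compat(k, encoding), byteify_compat(v, encoding))
                     for k, v in value.items()])
    if isinstance(value, list):
        return [byteify_compat(item, encoding) for item in value]
    if isinstance(value, text_type):
        return value.encode(encoding)
    # Numbers, booleans and None pass through untouched.
    return value

decoded = json.loads('{"name": "\\u00d6sterreich", "tags": ["a", 1, null]}')
result = byteify_compat(decoded)
print(result)
```

Like the original, this walks the whole structure after decoding, so the performance caveats in the second note still apply.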

score 115 · Accepted Answer

Solution with `object_hook`

[edit]: Updated for Python 2.7 and 3.x compatibility.

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts = False):
    if isinstance(data, str):
        return data

    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.items() # changed to .items() for python 2.7/3
        }

    # python 3 compatible duck-typing
    # if this is a unicode string, return its string representation
    if str(type(data)) == "<type 'unicode'>":
        return data.encode('utf-8')

    # if it's anything else, return it in its original form
    return data

Example usage:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

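How does this work and why would I use it?

Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?

Purely for performance. Mark's answer first decodes the JSON text fully with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:

A copy of the entire decoded structure gets created in memory
If your JSON object is really deeply nested (500 levels or more), you'll hit Python's maximum recursion depth

This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders.

Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they are decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.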
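The order in which object_hook sees nested dictionaries can be observed with a small sketch. The hook name recording_hook and the calls list are made up for this illustration; the hook only records the keys of each dict it receives and returns the dict unchanged.

```python
import json

calls = []

def recording_hook(decoded_dict):
    # Record which dict was just finished, then return it unchanged.
    calls.append(sorted(decoded_dict.keys()))
    return decoded_dict

json.loads('{"outer": {"inner": {"leaf": 1}}}', object_hook=recording_hook)
print(calls)
```

The innermost dict is reported first and the outermost last, so by the time a dict reaches the hook, any dicts nested inside it have already been through the hook.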

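Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts, since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads, to handle the case where the JSON text being decoded doesn't have a dict at the top level.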
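The reason for that final call can be seen in a short sketch: when the JSON text has no dict anywhere, object_hook is never invoked, so nothing would get converted by the hook alone. The names hook and seen are hypothetical and exist only for this demonstration.

```python
import json

seen = []

def hook(decoded_dict):
    # Would be called once per decoded JSON object (dict).
    seen.append(decoded_dict)
    return decoded_dict

top_level_list = json.loads('["a", "b"]', object_hook=hook)
top_level_string = json.loads('"just a string"', object_hook=hook)
print(seen)
```

Since seen stays empty for both inputs, a wrapper has to post-process the returned value itself, which is what the extra _byteify call does.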

score 75 · Accepted Answer

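You can use the object_hook parameter of json.loads to pass in a converter. You don't have to do the conversion after the fact. The json module only ever passes dicts to the object_hook, and it passes nested dicts recursively, so you don't have to recurse into nested dicts yourself. I don't think I would convert unicode strings to numbers the way Wells shows. If a value is a unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).

Also, I'd try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external library expects.

So, for example: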

import json

def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv

obj = json.loads(s, object_hook=_decode_dict)
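On the point about preferring value.encode(encoding) over str(val): on Python 2, calling str() on a unicode object encodes it with the interpreter's default encoding, which is normally ASCII, so it fails as soon as a non-ASCII character shows up. The sketch below makes the encoding explicit; it runs on Python 2 and 3, and the variable names are only illustrative.

```python
text = u"\u00d6sterreich"  # "Österreich", as it would come out of json.loads

# Pick the encoding the external library expects.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Encoding to ASCII is roughly what an implicit Python 2 str() call attempts.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(repr(utf8_bytes), repr(latin1_bytes), ascii_failed)
```

The same text produces different bytes under UTF-8 and Latin-1, which is why the choice has to come from what the receiving library expects.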

score 39 · Accepted Answer

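That's because json makes no distinction between string objects and unicode objects. They are all strings in JavaScript.

I think JSON is right to return unicode objects. In fact, since JavaScript strings are really unicode objects (that is, JSON (JavaScript) strings can store any kind of Unicode character), it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit, because the library would have to guess the encoding you want.

It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

But if you really want byte strings, just encode the results to the encoding of your choice: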

>>> import json
>>> js = '["a", "b"]'
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']

score 16 · Accepted Answer

There is an easy workaround.

TL;DR - Use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.

While not a "perfect" answer, it gets you pretty far if your plan is to ignore Unicode altogether. In Python 2.7:

import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))

gives:

JSON Fail:  {u'field': u'value'}
AST Win: {'field': 'value'}

This gets more complicated when some objects really are Unicode strings. The full answer gets hairy quickly.
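One more limit of this workaround is worth showing: ast.literal_eval parses Python literals, not JSON, so JSON text containing true, false or null is rejected. A minimal sketch, with made-up sample data:

```python
import ast
import json

json_text = json.dumps({"field": "value", "alive": True, "spouse": None})
# json_text now contains the JSON literals `true` and `null`.

try:
    ast.literal_eval(json_text)
    literal_eval_ok = True
except ValueError:
    # `true` and `null` are not Python literals.
    literal_eval_ok = False

decoded = json.loads(json_text)
print(literal_eval_ok, decoded["alive"], decoded["spouse"])
```

So the trick only works for JSON that happens to contain nothing but strings, numbers, lists and objects.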

score 9 · Accepted Answer

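Unfortunately, there is no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You would have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

The reason that simplejson outputs unicode, however, is that the JSON spec specifically states that "A string is a collection of zero or more Unicode characters". Support for Unicode is assumed as part of the format itself. Simplejson's scanstring implementation goes so far as to scan and interpret Unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.

If you have an aged library that needs a str, I recommend that you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.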
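The facade idea could look something like the sketch below: a decorator that encodes text arguments just before they reach the old library. Everything here is hypothetical, including the names encode_text_args and legacy_send; legacy_send merely stands in for a library function that only accepts byte strings.

```python
import functools

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")

def _encode(value, encoding):
    # Encode text, descending into lists and dicts; leave the rest alone.
    if isinstance(value, text_type):
        return value.encode(encoding)
    if isinstance(value, list):
        return [_encode(v, encoding) for v in value]
    if isinstance(value, dict):
        return dict((_encode(k, encoding), _encode(v, encoding))
                    for k, v in value.items())
    return value

def encode_text_args(encoding="utf-8"):
    """Decorator factory: encode text arguments before calling func."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            new_args = [_encode(a, encoding) for a in args]
            new_kwargs = dict((k, _encode(v, encoding))
                              for k, v in kwargs.items())
            return func(*new_args, **new_kwargs)
        return wrapper
    return decorator

def legacy_send(message, tags=None):
    # Hypothetical stand-in for a library call that needs byte strings.
    return message, tags

send = encode_text_args("utf-8")(legacy_send)
result = send(u"h\u00e9llo", tags=[u"a", u"b"])
print(result)
```

The conversion then happens only at the boundary with the old library, and the rest of the program can keep working with unicode.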

score 3 · Accepted Answer

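The gotcha is that simplejson and json are two different modules, at least in the way they deal with unicode. You have json in py2.6+, and it gives you unicode values, whereas simplejson returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.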

score 2 · Accepted Answer

Use pickle instead of json for dump and load, like so:

    import json
    import pickle

    d = { 'field1': 'value1', 'field2': 2, }

    json.dump(d,open("testjson.txt","w"))

    print json.load(open("testjson.txt","r"))

    pickle.dump(d,open("testpickle.txt","w"))

    print pickle.load(open("testpickle.txt","r"))

The output it produces is (strings and integers are handled correctly):

    {u'field2': 2, u'field1': u'value1'}
    {'field2': 2, 'field1': 'value1'}

score 1 · Accepted Answer

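So, I've run into the same problem. Guess what the first Google result was.

Because I need to pass all data to PyGTK, unicode strings aren't very useful to me either. So I have another recursive conversion method. It's actually also needed for type-safe JSON conversion: json.dump() would bail on any non-literals, like Python objects. It doesn't convert dict indexes, though.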

# removes any objects, turns unicode back into str
def filter_data(obj):
    if type(obj) in (int, float, str, bool):
        return obj
    elif type(obj) == unicode:
        return str(obj)
    elif type(obj) in (list, tuple, set):
        obj = list(obj)
        for i, v in enumerate(obj):
            obj[i] = filter_data(v)
    elif type(obj) == dict:
        for i, v in obj.iteritems():
            obj[i] = filter_data(v)
    else:
        print "invalid object in data, converting to string"
        obj = str(obj)
    return obj

score 0 · Accepted Answer

Here is a recursive encoder written in C: https://github.com/axiros/nested_encode

The performance overhead for "average" structures is around 10% compared to json.loads.

python speed.py
  json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
  json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
  time overhead in percent: 9%

using this test structure:

import json, nested_encode, time

s = """
{
  "firstName": "Jos\\u0301",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "\\u00d6sterreich",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null,
  "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""


t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1

t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1

print "json loads            [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])

print "time overhead in percent: %i%%"  % (100 * (dt_json_enc - dt_json)/dt_json)

score 0 · Accepted Answer

This is late to the game, but I built this recursive caster. It works for my needs and I think it's relatively complete. It may help you.

def _parseJSON(self, obj):
    newobj = {}

    for key, value in obj.iteritems():
        key = str(key)

        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val

    return newobj

Just pass it a JSON object like so:

obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)

I have it as a private member of a class, but you can repurpose the method as you see fit.

score 0 · Accepted Answer

I rewrote Wells's _parse_json() to handle cases where the json object itself is an array (my use case).

def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj

score -1 · Accepted Answer

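I've adapted the code from Mark Amery's answer, particularly to get rid of isinstance in favor of duck typing.

The encoding is done manually and ensure_ascii is disabled. The Python docs for json.dump say that

If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences

Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852, used for example in DOS (sometimes referred to as ascii, although I think that depends on the code page setting); cp1250, used for example in Windows (sometimes referred to as ansi, depending on the locale settings); and iso-8859-2, sometimes used on http servers. The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from wikipedia.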

# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>> 
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>> 
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

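I'd also like to highlight the answer of Jarret Hardie, which references the JSON spec:

A string is a collection of zero or more Unicode characters

In my use case I had files containing json. They are utf-8 encoded files. ensure_ascii results in properly escaped but not very readable json files, which is why I adapted Mark Amery's answer to fit my needs.

The doctest is not particularly thoughtful, but I share the code in the hope that it will be useful for someone.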
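The readability trade-off of ensure_ascii mentioned above can be seen with the standard json module alone. This is a small sketch with a made-up sample value; the printed forms in the comments are what Python 3 shows.

```python
import json

data = {"a": u"T\u00fcsk\u00e9s"}

# Default: every non-ASCII character is escaped as \uXXXX.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(data, ensure_ascii=False)

print(escaped)   # {"a": "T\u00fcsk\u00e9s"}
print(readable)  # {"a": "Tüskés"}
```

Both strings decode back to the same data with json.loads; only the on-disk form differs. When writing the readable form to a file, the file has to be opened with a suitable encoding such as utf-8.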

score -1 · Accepted Answer

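I ran into this problem too, and since I had to deal with JSON, I came up with a small loop that converts the unicode keys to strings. (simplejson on GAE does not return string keys.)

obj is the object decoded from JSON: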

if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

kwargs is what I pass to the constructor of the GAE application (which does not like unicode keys in **kwargs).

Not as robust as the solution from Wells, but much smaller.
