python - Merge two jsonl (JSON Lines) files and write to a new jsonl file in Python 3.6

Question

Hello, I have two jsonl files like the following.

one.jsonl

{"name": "one", "description": "testDescription...", "comment": "1"}
{"name": "two", "description": "testDescription2...", "comment": "2"}

second.jsonl

{"name": "eleven", "description": "testDescription11...", "comment": "11"}
{"name": "twelve", "description": "testDescription12...", "comment": "12"}
{"name": "thirteen", "description": "testDescription13...", "comment": "13"}

My goal is to create a new jsonl file named merged_file.jsonl (with the encoding preserved) that looks like this:

{"name": "one", "description": "testDescription...", "comment": "1"}
{"name": "two", "description": "testDescription2...", "comment": "2"}
{"name": "eleven", "description": "testDescription11...", "comment": "11"}
{"name": "twelve", "description": "testDescription12...", "comment": "12"}
{"name": "thirteen", "description": "testDescription13...", "comment": "13"}

My approach looks like this:

import json
import glob

result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        try:
            result.append(extract_json(infile)) #tried json.loads(infile) too
        except ValueError:
            print(f)

#write the file in BOM TO preserve the emojis and special characters
with open('merged_file.jsonl','w', encoding= 'utf-8-sig') as outfile:
    json.dump(result, outfile)

But I ran into this error: TypeError: Object of type generator is not JSON serializable. I have looked at other SO posts, but they all write regular json files. That should work in my case too, yet it keeps failing. I would appreciate any hints or help. Thank you!

Reading a single file like this works:

import io

data_json = io.open('one.jsonl', mode='r', encoding='utf-8-sig') # Opens the JSONL file
data_python = extract_json(data_json)
for line in data_python:
    print(line)

####outputs####
#{'name': 'one', 'description': 'testDescription...', 'comment': '1'}
#{'name': 'two', 'description': 'testDescription2...', 'comment': '2'}

score 3 · Accepted Answer

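extract_json probably returns a generator rather than a JSON-serializable list/dict, which is why json.dump fails. Because the input is jsonl, every line is a valid JSON document on its own, so you only need to adjust your existing code slightly.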

import json
import glob

result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile.readlines():
            try:
                result.append(json.loads(line)) # read each line of the file
            except ValueError:
                print(f)

# Write one JSON document per line so the output is valid jsonl
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    for record in result:
        # ensure_ascii=False keeps emojis and other non-ASCII characters unescaped
        outfile.write(json.dumps(record, ensure_ascii=False) + "\n")
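The error from the question can be reproduced with a small sketch. The extract_json below is a hypothetical stand-in (the asker's real helper is not shown) and is assumed to yield one dict per line. Appending its return value stores the generator object itself, which json cannot serialize, whereas extending the list consumes the generator and stores the dicts.

```python
import io
import json

def extract_json(lines):
    # Hypothetical stand-in for the asker's helper: yields one dict per line
    for line in lines:
        if line.strip():
            yield json.loads(line)

source = io.StringIO('{"name": "one"}\n{"name": "two"}\n')

result = []
result.append(extract_json(source))  # stores the generator object, not the records
try:
    json.dumps(result)
    error = None
except TypeError as exc:
    error = str(exc)  # mentions that a generator is not JSON serializable

source.seek(0)
records = []
records.extend(extract_json(source))  # consumes the generator, storing the dicts
merged = "\n".join(json.dumps(r) for r in records) + "\n"
```

So if you prefer to keep a generator-based helper, swapping append for extend may be enough, as long as the records are then written one per line.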

Come to think of it, you don't even need to load the lines with json, although doing so does help sanitize badly formatted JSON lines.
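As a rough illustration of that sanitizing step, the sketch below parses each line, skips blank or malformed ones, and reports where they came from. The function name valid_json_lines and the sample data are made up for this example.

```python
import json

def valid_json_lines(lines, source_name="<input>"):
    """Yield re-serialized records, skipping blank and malformed lines."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not records
        try:
            record = json.loads(line)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            print(f"{source_name}:{number}: skipped malformed line ({exc})")
            continue
        # ensure_ascii=False keeps emojis and other non-ASCII text unescaped
        yield json.dumps(record, ensure_ascii=False)

sample = ['{"name": "one"}\n', '\n', '{"name": broken}\n', '{"name": "two 😀"}\n']
cleaned = list(valid_json_lines(sample, "sample.jsonl"))
```

Reporting the line number as well as the file name makes it easier to find the offending record than printing the file name alone.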

If you skip the JSON parsing, you can copy all the lines across in one go like this:

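import glob

# Copy every line through unchanged; no JSON parsing needed
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
        with open(f, 'r', encoding='utf-8-sig') as infile:
            for line in infile:
                # add a newline if a file's last line lacks one, so records don't run together
                outfile.write(line if line.endswith('\n') else line + '\n')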
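The line-copying approach can also be wrapped in a reusable function. This is only a sketch: merge_jsonl and the temporary files are invented for the example. It sorts the paths because glob.glob does not guarantee any particular order, and it normalizes line endings so that a file without a trailing newline cannot glue two records together.

```python
import glob
import os
import tempfile

def merge_jsonl(folder, out_path, encoding="utf-8-sig"):
    """Concatenate every .jsonl file in `folder` into `out_path`, in name order."""
    paths = sorted(glob.glob(os.path.join(folder, "*.jsonl")))
    with open(out_path, "w", encoding=encoding) as outfile:
        for path in paths:
            with open(path, "r", encoding=encoding) as infile:
                for line in infile:
                    if line.strip():  # drop blank lines
                        outfile.write(line.rstrip("\n") + "\n")
    return paths

# Small self-contained demonstration using temporary files
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in")
    os.mkdir(src)
    with open(os.path.join(src, "one.jsonl"), "w", encoding="utf-8-sig") as fh:
        fh.write('{"name": "one"}\n{"name": "two"}')  # no trailing newline
    with open(os.path.join(src, "second.jsonl"), "w", encoding="utf-8-sig") as fh:
        fh.write('{"name": "eleven"}\n')
    out = os.path.join(tmp, "merged_file.jsonl")
    merge_jsonl(src, out)
    with open(out, "r", encoding="utf-8-sig") as fh:
        merged_lines = fh.read().splitlines()
```

Reading every input with utf-8-sig strips a leading BOM if one is present, so only the single BOM written at the start of the output file remains.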
