python - Splitting a string with a Python regex while keeping the delimiter with its value

Question

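I'm trying to parse a text file containing name:value elements into a list of "name:value" strings. Here's the twist: a value can span multiple words or even multiple lines, and the delimiters are not a fixed set of words. Here's an example of what I'm working with...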

listing = "price:44.55 name:John Doe title:Super Widget description:This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!"

What I want to get back is...

["price:44.55", "name:John Doe", "title:Super Widget", "description:This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!"]

Here's what I've tried so far...

details = re.findall(r'[\w]+:.*', post, re.DOTALL)
["price:44.55 name:John Doe title:Super Widget description:This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!"]

Not what I want. Or...

details = re.findall(r'[\w]+:.*?', post, re.DOTALL)
["price:", "name:", "title:", "description:"]

Not what I want. Or...

details = re.split(r'([\w]+:)', post)
["", "price:", "44.55 ", "name:", "John Doe ", "title:", "Super Widget ", "description:", "This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!"]

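This is close, but still no dice. (The empty first list item is something I can deal with.) So basically, my question is: how do I keep the delimiter together with its value when using re.split(), or how do I keep re.findall() from being either too greedy or too stingy?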

Thanks for reading!

score 5 · Accepted Answer

Use a lookahead assertion:

>>> re.split(r'\s(?=\w+:)', post)
['price:44.55',
 'name:John Doe',
 'title:Super Widget',
 'description:This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!']

Of course, this will still fail if a value itself contains a word immediately followed by a colon.
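To make that caveat concrete, here is a minimal sketch of the lookahead split, the case where it goes wrong, and one possible mitigation. The `tricky` string and the `KNOWN_KEYS` tuple are made up for illustration. The mitigation only applies if the field names happen to be known in advance, which the question says is not generally the case.

```python
import re

post = ("price:44.55 name:John Doe title:Super Widget "
        "description:This widget slices, dices, and drives your kids "
        "to soccer practice\r\nIt even comes with Super Widget Mini!")

# Split on a whitespace character only when a "word:" token follows it.
# The lookahead consumes nothing, so the key stays attached to its value.
fields = re.split(r'\s(?=\w+:)', post)

# Failure case (hypothetical input): "note:" inside a value looks like a key.
tricky = "name:John Doe description:Please note: batteries not included"
loose = re.split(r'\s(?=\w+:)', tricky)
# loose == ['name:John Doe', 'description:Please', 'note: batteries not included']

# Possible mitigation, assuming the keys are known: split only before those keys.
KNOWN_KEYS = ("price", "name", "title", "description")  # assumed, not from the question
key_pattern = "|".join(map(re.escape, KNOWN_KEYS))
strict = re.split(r'\s+(?=(?:%s):)' % key_pattern, tricky)
# strict == ['name:John Doe', 'description:Please note: batteries not included']
```

With an open-ended set of keys there is no purely syntactic way to tell a new field from a colon inside a value, so some convention in the data (such as a line break before each key) would be needed instead.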

score 2 · Accepted Answer

@Pavel's answer is nicer, but you could also just merge the results of your last attempt:

# kill the first empty bit
if not details[0]:
    details.pop(0)

# pair each "name:" delimiter with the value that follows it
details = [a + b for a, b in zip(details[::2], details[1::2])]
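For completeness, the merge step can be wrapped in a self-contained function. This is only a sketch: the name `merge_split` is made up, the `.strip()` call is an addition to drop the separator whitespace that `re.split` leaves on each value, and it assumes the text begins with a `name:` token (otherwise the pairing would be misaligned).

```python
import re

def merge_split(post):
    """Split on 'word:' delimiters and glue each delimiter back onto its value."""
    # The capturing group makes re.split keep the delimiters in the result.
    details = re.split(r'(\w+:)', post)
    # re.split yields a leading '' when the text starts with a delimiter.
    if details and not details[0]:
        details.pop(0)
    # details is now [key, value, key, value, ...]; pair them up.
    return [key + value.strip()
            for key, value in zip(details[::2], details[1::2])]

sample = "price:44.55 name:John Doe title:Super Widget"
merged = merge_split(sample)
print(merged)  # ['price:44.55', 'name:John Doe', 'title:Super Widget']
```

Like the lookahead version, this treats any word followed by a colon as a new key, so it has the same limitation for values that contain colons.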
