python - BeautifulSoup in Python does not parse correctly

Question

I'm running Python 2.7.5 and using the built-in html parser for what I'm about to describe.

The task I'm trying to accomplish is to take a chunk of html that is essentially a recipe. Here's an example.

html_chunk = "<h1>Miniature Potato Knishes</h1>Posted by bettyboop50 at recipegoldmine.com May 10, 2001Makes about 42 miniature knishesThese are just yummy for your tummy!3 cups mashed potatoes (about     2 very large potatoes) 2 eggs, slightly beaten 1 large onion, diced 2 tablespoons margarine 1 teaspoon salt (or to taste) 1/8 teaspoon black pepper 3/8 cup Matzoh meal 1 egg yolk, beaten with 1 tablespoon waterPreheat oven to 400 degrees F.Sauté diced onion in a small amount of butter or margarine until golden brown.In medium bowl, combine mashed potatoes, sautéed onion, eggs, margarine, salt, pepper, and Matzoh meal.Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture and place on a well-greased baking sheet and bake for 20 minutes or until well browned."

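The goal is to separate out the header, the junk, the ingredients, the instructions, the serving, and the number of ingredients.

Here's my code that accomplishes that: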

from bs4 import BeautifulSoup

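def list_to_string(items):
   # Concatenate the string form of each item (e.g. bs4 tags) into one string.
   # The parameter is named "items" to avoid shadowing the built-in list type.
   return "".join(str(item) for item in items)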

def get_ingredients(soup):
   for p in soup.find_all('p'):
      if p.find('br'):
         return p

def get_instructions(p_list, ingredient_index):
   instructions = []
   instructions += p_list[ingredient_index+1:]
   return instructions

def get_junk(p_list, ingredient_index):
   junk = []
   junk += p_list[:ingredient_index]
   return junk

def get_serving(p_list):
   for item in p_list:
      item_str = str(item).lower()
      # ("yield" or "make" or ...) evaluates to just "yield", so test each keyword.
      if any(word in item_str for word in ("yield", "make", "serve", "serving")):
         yield_index = p_list.index(item)
         del p_list[yield_index]
         return item

def ingredients_count(ingredients):
   ingredients_list = ingredients.find_all(text=True)
   return len(ingredients_list)

def get_header(soup):
   return soup.find('h1')

def html_chunk_splitter(soup):
   ingredients = get_ingredients(soup)
   if ingredients is None:
      error = 1
      header = ""
      junk_string = ""
      instructions_string = ""
      serving = ""
      count = ""
   else:
      p_list = soup.find_all('p')
      serving = get_serving(p_list)
      ingredient_index = p_list.index(ingredients)
      junk_list = get_junk(p_list, ingredient_index)
      instructions_list = get_instructions(p_list, ingredient_index)
      junk_string = list_to_string(junk_list)
      instructions_string = list_to_string(instructions_list)
      header = get_header(soup)
      error = ""
      count = ingredients_count(ingredients)
   return (header, junk_string, ingredients, instructions_string, 
   serving, count, error)

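It works great, except when a chunk contains a string like "Sauté": soup = BeautifulSoup(html_chunk) turns "Sauté" into garbled text. This is a problem because I have a huge csv file of recipes like html_chunk, and I'm trying to structure all of them nicely and then write the output back into a database. I tried checking whether "Sauté" comes out right using an html previewer, and it still comes out garbled. I don't know what to do about this.

What's weird is that when I do what BeautifulSoup's documentation shows: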

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

I get

# Sacr├⌐ bleu!

But my colleague tried it on his Mac, running it from the terminal, and got exactly what the documentation shows.

I'd really appreciate your help. Thank you.

score 0 · Accepted Answer

This isn't a parsing problem; it's an encoding problem.
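A small sketch of why the output looks wrong even though the parse is fine: the string is encoded to UTF-8 bytes, and a console that interprets those bytes with a different code page displays garbage. The choice of cp437 here is an assumption about the asker's console, picked only because it reproduces the "├⌐" shown in the question; latin-1 is included as another common mismatch.

```python
# The correctly parsed text, written with an escape so the example
# does not depend on the encoding of this source file.
text = u"Sacr\u00e9 bleu!"

# The bytes that actually get written out when the text is encoded as UTF-8.
utf8_bytes = text.encode("utf-8")

# The same bytes, misread by a consumer that assumes a different encoding.
as_cp437 = utf8_bytes.decode("cp437")     # assumed console code page
as_latin1 = utf8_bytes.decode("latin-1")  # another common wrong guess

# Decoding with the encoding that was really used restores the text.
round_trip = utf8_bytes.decode("utf-8")
```

If that is what is happening, the data itself is intact, and checking sys.stdout.encoding on both machines would likely show the difference between the two terminals.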

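Whenever you work with text that might contain non-ASCII characters (or with a Python program that contains such characters, e.g. in comments or docstrings), you should put a coding cookie on the first line or, after the shebang line, on the second line: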

#!/usr/bin/env python
# -*- coding: utf-8 -*-

...and make sure this matches the file's actual encoding (in vim, use :set fenc=utf-8).
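As a minimal illustration, assuming the file really is saved as UTF-8, the cookie lets a non-ASCII literal typed directly in the source be read as the intended characters. The variable names are hypothetical.

```python
# -*- coding: utf-8 -*-
# The cookie above tells Python 2 how to decode the bytes of this source file
# (Python 3 already assumes UTF-8 by default).

step = u"Sauté diced onion"  # non-ASCII literal typed directly in the source

char_count = len(step)                  # number of characters
byte_count = len(step.encode("utf-8"))  # "é" occupies two bytes in UTF-8
```

If the cookie and the real file encoding disagree, the literal is decoded with the wrong codec and the same kind of garbling appears.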

score 0 · Accepted Answer

BeautifulSoup tries to guess the encoding and sometimes gets it wrong, but you can specify the encoding by adding the from_encoding parameter. For example:

soup = BeautifulSoup(html_text, from_encoding="UTF-8")

The encoding is usually available in the web page's headers.
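Note that from_encoding only matters when the markup is passed as bytes; for text that is already Unicode, BeautifulSoup ignores it. An alternative sketch, assuming the CSV data is known to be UTF-8, is to decode the bytes yourself so the parser never has to guess. The to_text helper and the sample markup are hypothetical, and the bs4 import is guarded in case the package is not installed.

```python
# Stand-in for bytes read from the recipe CSV (assumed to be UTF-8).
raw = u"<p>Saut\u00e9 diced onion</p>".encode("utf-8")


def to_text(data, encoding="utf-8"):
    """Return Unicode text, decoding only if `data` is still bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


html_text = to_text(raw)

try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None  # bs4 missing; the decoding step above still applies

if BeautifulSoup is not None:
    # Unicode in, so no encoding detection takes place.
    soup = BeautifulSoup(html_text, "html.parser")
    parsed = soup.p.get_text()
else:
    parsed = None
```

Decoding once at the input boundary and encoding once when writing to the database keeps every intermediate step in Unicode.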
