I download this page and make a minor edit, changing the first 65 in this paragraph to 68.
I then parse both sources with BeautifulSoup and compare them with difflib.
import difflib
import urllib2  # Python 2; the Python 3 equivalent is urllib.request
import bs4

url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read()  # raw HTML of the original page, as a single string
url2 = 'file:///Users/Pyderman/projects/temp/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read()  # raw HTML of the locally modified copy, as a single string

d = difflib.Differ()
# Line-level diff of the raw HTML (not used further below)
diffed = d.compare(content.splitlines(), content2.splitlines())

soup = bs4.BeautifulSoup(content, "lxml")
soup2 = bs4.BeautifulSoup(content2, "lxml")
# Diff the text nodes of the two parsed documents
diff = d.compare(list(soup.stripped_strings), list(soup2.stripped_strings))
# Keep only the removed ('-') and added ('+') entries
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print change
Printing the changes gives:
- The Achieving a Better Life Experience (ABLE) Act, H.R. 5771, legislation passed on December 19, 2014. It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA). This provision will apply to any individual who attains age 65 on or after December 19, 2015 (the one year anniversary of enactment of this bill). Two new Universal Text Identifiers (UTIs), UTI WCP060 and WCP061 were created to comply with this change.
+ The Achieving a Better Life Experience (ABLE) Act, H.R. 5771, legislation passed on December 19, 2014. It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA). This provision will apply to any individual who attains age 65 on or after December 19, 2015 (the one year anniversary of enactment of this bill). Two new Universal Text Identifiers (UTIs), UTI WCP060 and WCP061 were created to comply with this change.
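So despite the very small change, the whole paragraph is printed. I suppose it is good that the diff works paragraph by paragraph rather than sentence by sentence, but is there some way to make the output more granular? As it stands, if I want to highlight only the text that changed, it looks like I would have to run an additional delta comparison on these two nearly identical strings.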
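For reference, the additional comparison I have in mind would look roughly like the sketch below. It is only an illustration, not necessarily the best approach: it splits each paragraph into words and runs difflib.SequenceMatcher over the two word lists. The helper name word_level_changes is made up for this example, and the two strings are a shortened excerpt of the paragraph above.

```python
import difflib


def word_level_changes(old, new):
    """Return (tag, old_text, new_text) for each span of words that differs.

    Hypothetical helper for illustration; splits on whitespace, so
    punctuation stays attached to the neighbouring word.
    """
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':  # tag is 'replace', 'delete' or 'insert'
            changes.append((tag,
                            ' '.join(old_words[i1:i2]),
                            ' '.join(new_words[j1:j2])))
    return changes


# Shortened excerpt of the paragraph from the diff output above
old = "offset ends for disability beneficiaries from age 65 to full retirement age (FRA)."
new = "offset ends for disability beneficiaries from age 68 to full retirement age (FRA)."

for tag, before, after in word_level_changes(old, new):
    print("%s: %r -> %r" % (tag, before, after))
# prints: replace: '65' -> '68'
```

This would have to be run on every '-' / '+' pair that the paragraph-level diff reports, which is the extra pass I was hoping to avoid.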