python-2.7 - How to use Xapian to return URLs when indexing web pages

Question

I am using Ubuntu 12.04 and Python 2.7.

My code for fetching the content of a given URL:

import urllib

def get_page(url):
    '''Gets the contents of a page from a given URL'''
    try:
        f = urllib.urlopen(url)
        page = f.read()
        f.close()
        return page
    except Exception:
        # Any failure while fetching yields an empty page
        return ""

To filter the content of the page returned by get_page(url):

import re

def filterContents(content):
    '''Filters the content from a page'''
    filteredContent = ''
    # Match text between '>' and '<', skipping what follows a script tag
    regex = re.compile(r'(?<!script)[>](?![\s\#\'-<]).+?[<]')
    for words in regex.findall(content):
        word_list = split_string(words, """ ,"!-.()<>[]{};:?!-=/_`&""")
        for word in word_list:
            filteredContent = filteredContent + word
    return filteredContent

def split_string(source, splitlist):
    return ''.join([ w if w not in splitlist else ' ' for w in source])

How do I index filteredContent with Xapian so that, when I run a query, I get back the URLs in which the query terms are present?

score 1 · Accepted Answer

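It isn't entirely clear what your filterContents() and split_string() are actually trying to do (discard some HTML tag content and then split the rest into words), so let me talk about a similar problem that doesn't have that complexity built in.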

Let's assume we have a function strip_tags() that returns just the text content of an HTML document, along with your get_page() function. We want to build a Xapian database where:

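each document represents a resource representation pulled from a particular URL
the "words" of that representation (after passing through strip_tags()) become the searchable terms that index those documents
each document contains, as its document data, the URL it was pulled from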
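
strip_tags() is only assumed here, not defined. As a minimal sketch of what it might look like, the following uses the standard library's HTMLParser to collect text nodes while skipping script and style bodies. The class name _TextExtractor and the whitespace normalisation are illustrative choices, not part of the original answer.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, ignoring <script>/<style> bodies."""

    def __init__(self):
        # HTMLParser is an old-style class on Python 2, so avoid super()
        HTMLParser.__init__(self)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace to single spaces
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

For example, strip_tags("<h1>Hello</h1><p>world</p>") yields the two words separated by a space. Real pages may need more care (character encodings, entity handling on Python 2), so treat this as a starting point only.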

So you can index like this:

import xapian
def index_url(database, url):
    text = strip_tags(get_page(url))
    doc = xapian.Document()

    # TermGenerator will split text into words
    # and then (because we set a stemmer) stem them
    # into terms and add them to the document
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))
    termgenerator.set_document(doc)
    termgenerator.index_text(text)

    # We want to be able to get at the URL easily
    doc.set_data(url)
    # And we want to ensure each URL only ends up in
    # the database once. Note that if your URLs are long
    # then this won't work; consult the FAQ on unique IDs
    # for more: http://trac.xapian.org/wiki/FAQ/UniqueIds
    idterm = 'Q' + url
    doc.add_boolean_term(idterm)
    database.replace_document(idterm, doc)

# then index an example URL
db = xapian.WritableDatabase("exampledb", xapian.DB_CREATE_OR_OPEN)

index_url(db, "https://stackoverflow.com/")
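
The comment in index_url() warns that 'Q' + url breaks down for long URLs. One possible workaround, sketched here under assumptions, is to fall back to a hash of the URL when the term would be too long. The helper name make_id_term and the 245-byte limit are assumptions for illustration (check the limit for your Xapian backend and version), and this is not necessarily the scheme the FAQ recommends.

```python
import hashlib

# Assumed limit: check the maximum term length for your Xapian backend.
MAX_TERM_BYTES = 245


def _to_bytes(value):
    """Return value as bytes (UTF-8 encoded if it is a text string)."""
    return value if isinstance(value, bytes) else value.encode("utf-8")


def make_id_term(url, prefix="Q", max_bytes=MAX_TERM_BYTES):
    """Hypothetical helper: build a unique-ID term that stays within max_bytes.

    Short URLs are used verbatim; long ones are replaced by their SHA-1 hex
    digest, which has a fixed length of 40 characters.
    """
    raw = _to_bytes(url)
    if len(prefix) + len(raw) <= max_bytes:
        return prefix + url
    return prefix + hashlib.sha1(raw).hexdigest()
```

Inside index_url() the line idterm = 'Q' + url could then become idterm = make_id_term(url). The trade-off is that hashed terms are no longer human-readable, which is why the sketch only hashes when it has to.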

Searching is straightforward, although it can be made more sophisticated as needed:

qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem("en"))
qp.set_stemming_strategy(qp.STEM_SOME)
# A single-word query...
query = qp.parse_query('question')
# ...or a multi-word one (this assignment replaces the query above)
query = qp.parse_query('question and answer')
enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print match.document.get_data()

Because "question and answer" appears on the home page when you are not logged in, this prints "https://stackoverflow.com/".

I recommend checking out the Xapian getting started guide for both the concepts and the code.
