python - Integrating extracted PDF content with django-haystack

Question

I have managed to extract PDF/DOCX content with Solr and to run some search queries against it using the following dedicated Solr URL:

http://localhost:8983/solr/select?q=Lycee

I would like to run such queries through django-haystack. I found this link, which discusses the issue:

https://github.com/toastdriven/django-haystack/blob/master/docs/rich_content_extraction.rst

However, django-haystack (2.0.0-beta) has no "FileIndex" class. How can I integrate this kind of search into django-haystack?

score 1 · Accepted Answer

The "FileIndex" referenced in the documentation is a hypothetical subclass of haystack.indexes.SearchIndex. Here is an example:

from django.template import Context, loader  # needed by prepare() below
from haystack import indexes
from myapp.models import MyFile

class FileIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')
    owner = indexes.CharField(model_attr='owner__name')


    def get_model(self):
        return MyFile

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

    def prepare(self, obj):
        data = super(FileIndex, self).prepare(obj)

        # This could also be a regular Python open() call, a StringIO instance
        # or the result of opening a URL. Note that due to a library limitation
        # file_obj must have a .name attribute even if you need to set one
        # manually before calling extract_file_contents:
        file_obj = obj.the_file.open()

        extracted_data = self.backend.extract_file_contents(file_obj)

        # Now we'll finally perform the template processing to render the
        # text field with *all* of our metadata visible for templating:
        t = loader.select_template(('search/indexes/myapp/myfile_text.txt', ))
        data['text'] = t.render(Context({'object': obj,
                                        'extracted': extracted_data}))

        return data

So extracted_data would be replaced by whatever process you have come up with to extract the PDF/DOCX content. You would then update your template to include that data.
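
The last step, merging the extracted data into the indexed text, can be sketched without Django as follows. This is only an illustration: build_text, TEXT_TEMPLATE, and the sample data are hypothetical names, string.Template stands in for the real Django template, and the code assumes the extraction step yields a dict with a 'contents' key (or None when extraction fails).

```python
from string import Template

# Hypothetical stand-in for search/indexes/myapp/myfile_text.txt.
# A real project would render a Django template here instead.
TEXT_TEMPLATE = Template("$title\n$owner\n$contents")


def build_text(title, owner, extracted):
    """Merge model fields and extracted file contents into one text blob."""
    # Assumption: `extracted` is a dict with a 'contents' key,
    # or None when extraction failed.
    contents = (extracted or {}).get("contents", "")
    return TEXT_TEMPLATE.substitute(
        title=title, owner=owner, contents=contents.strip()
    )


# Made-up sample data shaped like the assumed extraction result.
sample = {"contents": "  Lycee timetable for 2013\n", "metadata": {}}
text = build_text("Timetable", "alice", sample)
```

In the actual template, the equivalent would be referencing the extracted variable passed in the Context, for example {{ extracted.contents }}, alongside the object's own fields.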
