python - How to make Django slugify work properly with Unicode strings?

Question

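What can I do to prevent the slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2)

cnprog.com has Chinese characters in its question URLs, so I looked at their code. They are not using slugify in templates. Instead, they call this method in the Question model to get permalinks: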

def get_absolute_url(self):
    return '%s%s' % (reverse('question', args=[self.id]), self.title)

Are they slugifying the URLs or not?

score 100 · Accepted Answer

There is a python package called unidecode that I've adopted for the askbot Q&A forum. It works well for Latin-based alphabets and even looks reasonable for Greek:

>>> import unidecode
>>> from unidecode import unidecode
>>> unidecode(u'διακριτικός')
'diakritikos'

It does something weird with Asian languages:

>>> unidecode(u'影師嗎')
'Ying Shi Ma '
>>>

Does this make sense?

In askbot we compute slugs like so:

from unidecode import unidecode
from django.template import defaultfilters
slug = defaultfilters.slugify(unidecode(input_text))

score 24 · Accepted Answer

The Mozilla website team has been working on an implementation: https://github.com/mozilla/unicode-slugify. Sample code is at http://davedash.com/2011/03/24/how-we-slug-at-mozilla

score 23 · Accepted Answer

With Django >= 1.9, django.utils.text.slugify has an allow_unicode parameter:

>>> from django.utils.text import slugify
>>> slugify("你好 World", allow_unicode=True)
'你好-world'

If you use Django <= 1.8 (which you should not, since April 2018), you can pick up the code from Django 1.9.
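For anyone who wants to see roughly what that code does without installing Django, here is a minimal standalone sketch for Python 3. It is modeled on the behaviour described above, not copied from Django; the name unicode_slugify is hypothetical, and the real function does more (for example, it marks the result as safe for templates).

```python
import re
import unicodedata


def unicode_slugify(value, allow_unicode=False):
    """Rough stand-in for a slugify with an allow_unicode switch (assumption, not Django's code)."""
    value = str(value)
    if allow_unicode:
        # Keep non-ASCII characters, but normalize compatibility forms.
        value = unicodedata.normalize("NFKC", value)
    else:
        # Decompose accents, then drop every byte that is not ASCII.
        value = unicodedata.normalize("NFKD", value)
        value = value.encode("ascii", "ignore").decode("ascii")
    # In Python 3, \w on a str pattern is Unicode-aware by default.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


print(unicode_slugify("你好 World", allow_unicode=True))
print(unicode_slugify("你好 World"))
```

With allow_unicode=True the Chinese characters survive; without it they are dropped and only the ASCII part of the title remains.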

score 15 · Accepted Answer

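Also, the Django version of slugify doesn't use the re.UNICODE flag, so it wouldn't even attempt to understand the meaning of \w and \s as they pertain to non-ASCII characters.

This custom version is working well for me: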

import re

def u_slugify(txt):
    """A custom version of slugify that retains non-ascii characters. The purpose of this
    function in the application is to make URLs more readable in a browser, so there are
    some added heuristics to retain as much of the title meaning as possible while
    excluding characters that are troublesome to read in URLs. For example, question marks
    will be seen in the browser URL as %3F and are therefore unreadable. Although non-ascii
    characters will also be hex-encoded in the raw URL, most browsers will display them
    as human-readable glyphs in the address bar -- those should be kept in the slug."""
    # Flags are passed by keyword: the fourth positional argument of re.sub is count.
    txt = txt.strip()  # remove leading and trailing whitespace
    txt = re.sub(r'\s*-\s*', '-', txt, flags=re.UNICODE)  # remove spaces before and after dashes
    txt = re.sub(r'[\s/]', '_', txt, flags=re.UNICODE)  # replace remaining spaces and slashes with underscores
    txt = re.sub(r'(\d):(\d)', r'\1-\2', txt, flags=re.UNICODE)  # replace colons between numbers with dashes
    txt = re.sub('"', "'", txt, flags=re.UNICODE)  # replace double quotes with single quotes
    txt = re.sub(r'[?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', txt, flags=re.UNICODE)  # remove some characters altogether
    return txt

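Note the last regex substitution. It is a workaround for a problem with the more robust expression r'\W', which seems to either strip out some non-ASCII characters or incorrectly re-encode them, as illustrated in the following Python interpreter session: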

Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> # Paste in a non-ascii string (simplified Chinese), taken from http://globallives.org/wiki/152/
>>> str = '您認識對全球社區感興趣的中國攝影師嗎'
>>> str
'\xe6\x82\xa8\xe8\xaa\x8d\xe8\xad\x98\xe5\xb0\x8d\xe5\x85\xa8\xe7\x90\x83\xe7\xa4\xbe\xe5\x8d\x80\xe6\x84\x9f\xe8\x88\x88\xe8\xb6\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print str
您認識對全球社區感興趣的中國攝影師嗎
>>> # Substitute all non-word characters with X
>>> re_str = re.sub('\W', 'X', str, re.UNICODE)
>>> re_str
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print re_str
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX?的中國攝影師嗎
>>> # Notice above that it retained the last 7 glyphs, ostensibly because they are word characters
>>> # And where did that question mark come from?
>>> 
>>> 
>>> # Now do the same with only the last three glyphs of the string
>>> str = '影師嗎'
>>> print str
影師嗎
>>> str
'\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> re.sub('\W','X',str,re.U)
'XXXXXXXXX'
>>> re.sub('\W','X',str)
'XXXXXXXXX'
>>> # Huh, now it seems to think those same characters are NOT word characters

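I am unsure what the problem is above, but I'm guessing that it stems from "whatever is classified as alphanumeric in the Unicode character properties database" and how that is implemented. I have heard that Python 3.x puts a high priority on better Unicode handling, so this may already be fixed. Or maybe it is correct Python behavior, and I am misusing Unicode and/or the Chinese language.

For now, a workaround is to avoid character classes and make substitutions based on explicitly defined character sets.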
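One likely explanation for the session above, offered as a reading of the output rather than a confirmed diagnosis: the fourth positional argument of re.sub is count, not flags, and re.UNICODE has the integer value 32. The first result contains exactly 32 X's, and the byte string was never decoded, so \W was matching individual UTF-8 bytes. The sketch below reproduces that effect in Python 3 by passing 32 as count on the UTF-8 bytes, then shows the same substitution on a decoded string.

```python
import re

text = "您認識對全球社區感興趣的中國攝影師嗎"
raw = text.encode("utf-8")  # 18 characters, 3 bytes each

# re.UNICODE is the integer 32, so passing it positionally as the
# fourth argument of re.sub sets count=32 instead of setting a flag.
count_by_accident = int(re.UNICODE)

# On bytes, every byte >= 0x80 is a non-word byte, so the first
# 32 bytes become X and the remaining bytes are left untouched.
limited = re.sub(rb"\W", b"X", raw, count=count_by_accident)
print(limited[:32], len(raw) - 32)

# On a decoded str, \w is Unicode-aware and CJK ideographs count
# as word characters, so nothing is replaced.
unchanged = re.sub(r"\W", "X", text)
print(unchanged == text)
```

The shorter three-glyph string is only 9 bytes, which is under the accidental limit of 32, so every byte was replaced in that case.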

score 9 · Accepted Answer

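Unfortunately, Django's definition of a slug means ASCII, though the Django docs don't explicitly state this. This is the source of slugify in defaultfilters. You can see that the value is converted to ASCII, with the 'ignore' option in case of errors: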

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
return mark_safe(re.sub('[-\s]+', '-', value))

Based on that, I'd guess that cnprog.com is not using the official slugify function. You may wish to adapt the Django snippet above if you want different behaviour.
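To make the effect of that ASCII conversion concrete, here is a small Python 3 sketch of just the normalize-and-encode step. The helper name ascii_fold is hypothetical; the snippet quoted above is Python 2 and does the same thing inline.

```python
import unicodedata


def ascii_fold(value):
    """Apply NFKD normalization, then silently drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Accented letters decompose into a base letter plus a combining mark,
# so the base letter survives and the mark is dropped.
print(repr(ascii_fold("Ąé")))

# Characters with no ASCII decomposition disappear completely.
print(repr(ascii_fold("Ł")))
print(repr(ascii_fold("你好 World")))
```

This is why a title written entirely in Chinese ends up with an empty slug under the ASCII-only definition.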

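Having said that, the RFC for URLs does state that non-US-ASCII characters (or, more specifically, anything other than the alphanumerics and $-_.+!*'()) should be encoded using the %hex notation. If you look at the actual raw GET request that your browser sends (say, using Firebug), you'll see that the Chinese characters are in fact encoded before they're sent. The browser just makes it look pretty in the display. I suspect this is why slugify insists on ASCII only.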
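The encoding described above can be reproduced with the standard library. This is only an illustration of percent-encoding on a made-up path (the path itself is an assumption, not a real cnprog.com URL):

```python
from urllib.parse import quote, unquote

# Hypothetical path with a Chinese title in it.
path = "/question/1/你好"

# quote() encodes each non-ASCII character as its UTF-8 bytes in %hex form
# and leaves "/" alone by default.
encoded = quote(path)
print(encoded)

# unquote() reverses it, which is roughly what the address bar displays.
decoded = unquote(encoded)
print(decoded)
```

So a Unicode slug still travels over the wire as plain ASCII; the readable form is a display convenience.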

score 8 · Accepted Answer

You might want to look at: https://github.com/un33k/django-uuslug

It takes care of both "U"s for you: U for unique and U for Unicode.

It does the job for you hassle-free.

score 4 · Accepted Answer

This is what I use:

http://trac.django-fr.org/browser/site/trunk/djangofr/links/slughifi.py

SlugHiFi is a wrapper for the regular slugify, with the difference that it replaces national characters with their English alphabet counterparts.

So instead of "Ą" you get "A", instead of "Ł" you get "L", and so on.
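The general idea can be sketched with an explicit character map applied before slugifying. This is not SlugHiFi's actual code; the mapping below is a deliberately tiny, hypothetical subset and the function name is made up for the example.

```python
import re

# Hypothetical mapping covering only a few Polish letters.
CHAR_MAP = str.maketrans({
    "Ą": "A", "ą": "a",
    "Ł": "L", "ł": "l",
    "Ó": "O", "ó": "o",
    "Ź": "Z", "ź": "z",
})


def transliterate_slug(text):
    """Replace mapped national characters, then build a plain ASCII-style slug."""
    text = text.translate(CHAR_MAP)
    # Drop punctuation, trim, lowercase, then join words with hyphens.
    text = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[-\s]+", "-", text)


print(transliterate_slug("Łódź Ąbc"))
```

An explicit map handles letters such as "Ł" that Unicode normalization alone cannot reduce to ASCII.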
