python - Fuzzy smart number parsing with Python

Question

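I want to parse decimal numbers regardless of their format, which is unknown. The language of the original text is also unknown and may vary. In addition, the source string can contain extra text before or after the number, such as a currency or a unit.

I'm using the following: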

# NOTE: Do not use, this algorithm is buggy. See below.
def extractnumber(value):

    if (isinstance(value, int)): return value
    if (isinstance(value, float)): return value

    result = re.sub(r'&#\d+', '', value)
    result = re.sub(r'[^0-9\,\.]', '', result)

    if (len(result) == 0): return None

    numPoints = result.count('.')
    numCommas = result.count(',')

    result = result.replace(",", ".")

    if ((numPoints > 0 and numCommas > 0) or (numPoints == 1) or (numCommas == 1)):
        decimalPart = result.split(".")[-1]
        integerPart = "".join ( result.split(".")[0:-1] )
    else:
        integerPart = result.replace(".", "")

    result = int(integerPart) + (float(decimalPart) / pow(10, len(decimalPart) ))

    return result

This sort of works...

>>> extractnumber("2")
2
>>> extractnumber("2.3")
2.3
>>> extractnumber("2,35")
2.35
>>> extractnumber("-2 000,5")
-2000.5
>>> extractnumber("EUR 1.000,74 €")
1000.74

>>> extractnumber("20,5 20,8") # Testing failure...
ValueError: invalid literal for int() with base 10: '205 208'

>>> extractnumber("20.345.32.231,50") # Returns false positive
2034532231.5

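So my method is very fragile and returns lots of false positives.

Is there a library or smart function that can handle this? Ideally 20.345.32.231,50 should not pass, but numbers in other languages such as 1.200,50 or 1 200'50 would be extracted, regardless of the amount of other text and characters (including newlines) around them.

(Updated implementation according to the accepted answer: https://github.com/jjmontesl/cubetl/blob/master/cubetl/text/functions.py#L91 )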

score 2 · Accepted Answer

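I refactored your code a bit. This, together with the valid_number function below, should do the trick.

The main reason I took the time to write this awful piece of code is to show future readers how awful parsing with regular expressions can get if you don't know how to use them (like me, for instance).

Hopefully, someone who knows regular expressions better than I do can show us how it should be done :)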

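Constraints

The characters . , and ' are accepted both as thousands separator and as decimal separator
No more than two different separators
At most one separator may occur more than once
A separator is treated as the decimal separator if it is the only separator present and occurs only once (i.e. 123,456 is interpreted as 123.456, not 123456)
The string is split into a list of numbers on double spaces ('  ')
All parts of a thousands-separated number except the first must be 3 digits long (123,456.00 and 1,345.00 are both considered valid, but 2345,11.00 is not)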
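To make the single-separator rule concrete, here is a minimal sketch of the normalization step, assuming the token has already been validated and contains only digits, an optional minus sign and separators. The name to_float is hypothetical and is not part of the answer's code; it only illustrates that the last separator is read as the decimal mark and all earlier ones are dropped.

```python
def to_float(token):
    """Read the last separator of an already validated token as the decimal mark.

    Hypothetical helper for illustration only.
    """
    # Map every accepted separator to a single character first.
    unified = token.replace(",", ".").replace("'", ".")
    if "." not in unified:
        return float(unified)
    # Everything before the last separator is the integer part.
    integer, decimal = unified.rsplit(".", 1)
    return float(integer.replace(".", "") + "." + decimal)


print(to_float("123,456"))    # 123.456 (single separator, so it is the decimal mark)
print(to_float("-1.000,22"))  # -1000.22
print(to_float("1,345.00"))   # 1345.0
```

Note that under this rule a lone grouping separator is never recognized as such, which is why 123,456 does not come out as 123456.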

Code

import re

from itertools import combinations

def extract_number(value):
    if (isinstance(value, int)) or (isinstance(value, float)):
        yield float(value)
    else:
        #Strip the string for leading and trailing whitespace
        value = value.strip()
        if len(value) == 0:
            return  # ends the generator; "raise StopIteration" here becomes a RuntimeError on Python 3.7+ (PEP 479)
        for s in value.split('  '):
            s = re.sub(r'&#\d+', '', s)
            s = re.sub(r'[^\-\s0-9\,\.]', ' ', s)
            s = s.replace(' ', '')
            if len(s) == 0:
                continue
            if not valid_number(s):
                continue
            if not sum(s.count(sep) for sep in [',', '.', '\'']):
                yield float(s)
            else:
                s = s.replace('.', '@').replace('\'', '@').replace(',', '@')
                integer, decimal = s.rsplit('@', 1)
                integer = integer.replace('@', '')
                s = '.'.join([integer, decimal])
                yield float(s)

Now, here is the code that could probably be replaced by a few regex statements.

def valid_number(s):
    def _correct_integer(integer):
        # First number should have length of 1-3
        if not (0 < len(integer[0].replace('-', '')) < 4):
            return False
        # All the rest of the integers should be of length 3
        for num in integer[1:]:
            if len(num) != 3:
                return False
        return True
    seps = ['.', ',', '\'']
    n_seps = [s.count(k) for k in seps]

    # If no separator is present
    if sum(n_seps) == 0:
        return True

    # If all separators are present
    elif all(n_seps):
        return False

    # If two separators are present
    elif any(all(c) for c in combinations(n_seps, 2)):
        # Find thousand separator
        for c in s:
            if c in seps:
                tho_sep = c
                break

        # Find decimal separator:
        for c in reversed(s):
            if c in seps:
                dec_sep = c
                break

        s = s.split(dec_sep)

        # If it is more than one decimal separator
        if len(s) != 2:
            return False

        integer = s[0].split(tho_sep)

        return _correct_integer(integer)

    # If one separator is present, and it is more than one of it
    elif sum(n_seps) > 1:
        for sep in seps:
            if sep in s:
                s = s.split(sep)
                break
        return _correct_integer(s)

    # Otherwise, this is a regular decimal number
    else:
        return True
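For comparison, a regex-based check along the lines hoped for above might look roughly like the following. This is only a sketch: looks_valid and _PATTERNS are hypothetical names, the patterns assume the token has already been stripped down to digits, a minus sign and separators, and agreement with valid_number has only been checked against the examples in this answer, not in general.

```python
import re

# Hypothetical regex version of the validity check (illustration only).
_PATTERNS = [
    # Plain number with at most one separator, read as the decimal mark:
    # "200", "-11.2", "20,5", ".2"
    re.compile(r"-?(?:\d+(?:[.,']\d*)?|[.,']\d+)"),
    # Grouped integer part followed by a *different* separator as decimal mark:
    # "1.000,22", "123,456.00"
    re.compile(r"-?\d{1,3}(?P<tho>[.,'])\d{3}(?:(?P=tho)\d{3})*(?!(?P=tho))[.,']\d*"),
    # One separator kind occurring several times, groups of three digits:
    # "1.000.000"
    re.compile(r"-?\d{1,3}(?P<tho>[.,'])\d{3}(?:(?P=tho)\d{3})+"),
]


def looks_valid(s):
    """Return True if the whole token matches one of the accepted shapes."""
    return any(p.fullmatch(s) for p in _PATTERNS)


print(looks_valid("-1.000,22"))         # True
print(looks_valid("2345,11.00"))        # False
print(looks_valid("20.345.32.231,50"))  # False
```

The named group tho captures the first separator so that later groups must reuse it, and the negative lookahead forces the decimal mark to be a different character.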

Output

extract_number('2'                  ):  [2.0]
extract_number('.2'                 ):  [0.2]
extract_number(2                    ):  [2.0]
extract_number(0.2                  ):  [0.2]
extract_number('EUR 200'            ):  [200.0]
extract_number('EUR 200.00  -11.2'  ):  [200.0, -11.2]
extract_number('EUR 200  EUR 300'   ):  [200.0, 300.0]
extract_number('$ -1.000,22'        ):  [-1000.22]
extract_number('EUR 100.2345,3443'  ):  []
extract_number('111,145,234.345.345'):  []
extract_number('20,5  20,8'         ):  [20.5, 20.8]
extract_number('20.345.32.231,50'   ):  []
