python - Analyzing trends and identifying abnormal behavior

Question

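I am building a system that logs data from sensors. (Just a series of numbers.)

I would like to put the system into a "learning" mode for a few days so that it can work out what its "normal" operating values are, and then have it flag readings that deviate from them. All the data is stored in a MySQL database.

Any suggestions on how to do this are welcome, as are pointers to further reading on the topic.

I would prefer to use python for this task.

The data is temperature and humidity readings taken every 5 minutes in a temperature-controlled area that is accessed and used during the day. This means there will be fluctuations and temperature changes while the area is in use. However, anything that differs from this, such as a cooling or heating system failure, needs to be detected.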

score 1 · Accepted Answer

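Essentially, what you should be looking at is density estimation: the task of determining a model of how some variables behave, so that you can look for deviations from it.

Below is a very simple code example. I assumed that temperature and humidity have independent normal distributions on their untransformed scales.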

import numpy as np
# matplotlib.mlab.normpdf was removed in matplotlib 3.1;
# scipy.stats.norm.pdf computes the same normal density.
from scipy.stats import norm

class TempAndHumidityModel(object):
    def __init__(self):
        self.tempMu = 0
        self.tempSigma = 1
        self.humidityMu = 0
        self.humiditySigma = 1

    def setParams(self, tempMeasurements, humidityMeasurements, quantile):
        # Validate the input before touching any state
        if not 0 < quantile <= 1:
            raise ValueError("Quantile for threshold must be in the interval (0, 1]")

        self.tempMu = np.mean(tempMeasurements)
        self.tempSigma = np.std(tempMeasurements)
        self.humidityMu = np.mean(humidityMeasurements)
        self.humiditySigma = np.std(humidityMeasurements)

        self._thresholdDensity(quantile, tempMeasurements, humidityMeasurements)

    def _thresholdDensity(self, quantile, tempMeasurements, humidityMeasurements):
        # norm.pdf is vectorized, so it can be applied to whole arrays at once
        tempDensities = norm.pdf(tempMeasurements, self.tempMu, self.tempSigma)
        humidityDensities = norm.pdf(humidityMeasurements, self.humidityMu, self.humiditySigma)

        densities = sorted(tempDensities * humidityDensities, reverse=True)
        # Here comes the massive oversimplification: just choose the
        # density value at the quantile*length position, and use this as the threshold.
        # The index is clamped so that quantile=1 does not run past the end of the list.
        index = min(int(np.round(quantile * len(densities))), len(densities) - 1)
        self.threshold = densities[index]

    def probOfObservation(self, temp, humidity):
        # Joint density under the independence assumption
        return (norm.pdf(temp, self.tempMu, self.tempSigma) *
                norm.pdf(humidity, self.humidityMu, self.humiditySigma))

    def isNormalMeasurement(self, temp, humidity):
        return self.probOfObservation(temp, humidity) > self.threshold

if __name__ == '__main__':
    # Create some simulated data
    temps = np.random.randn(100)*10 + 50
    humidities = np.random.randn(100)*2 + 10

    thm = TempAndHumidityModel()
    # going to hard code in the 95% threshold
    thm.setParams(temps, humidities, 0.95)

    # Create some new data from same dist and see how many false positives
    newTemps = np.random.randn(100)*10 + 50
    newHumidities = np.random.randn(100)*2 + 10

    numFalseAlarms = sum(not thm.isNormalMeasurement(t, h) for t, h in zip(newTemps, newHumidities))
    print('{} false alarms!'.format(numFalseAlarms))

    # Now create some abnormal data: mean temp drops to 20
    lowTemps = np.random.randn(100)*10 + 20
    normalHumidities = np.random.randn(100)*2 + 10

    numDetections = sum(not thm.isNormalMeasurement(t, h) for t, h in zip(lowTemps, normalHumidities))
    print('{} abnormal measurements flagged'.format(numDetections))

Example output:

>> 3 false alarms!
>> 77 abnormal measurements flagged

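Now, I don't know whether the normality assumption is appropriate for your data (you may want to transform the data to a different scale). Assuming that temperature and humidity are independent is probably very inaccurate. And the trick I used to find the density value corresponding to the requested quantile of the distribution should be replaced with something that uses the inverse CDF of the distribution. Still, this should give you a flavor of what to do.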
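
One possible way to address the last two caveats together, offered only as a sketch: model temperature and humidity as a joint Gaussian with a full covariance matrix, and use the fact that the squared Mahalanobis distance of a d-dimensional Gaussian follows a chi-square distribution with d degrees of freedom. The cutoff then comes straight from the inverse CDF (`chi2.ppf`). The class name `JointGaussianModel`, its method names, and the simulated numbers are all made up for illustration, and the Gaussian assumption itself still has to be checked against real data.

```python
import numpy as np
from scipy.stats import chi2


class JointGaussianModel:
    """Joint Gaussian over (temperature, humidity) with a chi-square cutoff."""

    def fit(self, temps, humidities, quantile=0.95):
        data = np.column_stack([temps, humidities])
        self.mean = data.mean(axis=0)
        # The full covariance captures the temperature/humidity correlation.
        self.cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
        # Inverse CDF of the chi-square distribution with d degrees of freedom.
        self.cutoff = chi2.ppf(quantile, df=data.shape[1])
        return self

    def squared_distance(self, temp, humidity):
        diff = np.array([temp, humidity]) - self.mean
        return float(diff @ self.cov_inv @ diff)

    def is_normal(self, temp, humidity):
        return self.squared_distance(temp, humidity) <= self.cutoff


rng = np.random.default_rng(0)
# Simulated, negatively correlated training data (illustrative values only).
train = rng.multivariate_normal([21.0, 45.0], [[1.0, -1.5], [-1.5, 9.0]], size=2000)
model = JointGaussianModel().fit(train[:, 0], train[:, 1], quantile=0.95)

print(model.is_normal(21.0, 45.0))  # a reading at the centre of the training data
print(model.is_normal(30.0, 45.0))  # a temperature far outside the training range
```

With a 0.95 quantile, roughly 5% of genuinely normal readings are expected to be flagged, so the quantile is really a knob for trading false alarms against missed failures.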

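Additionally, note that there are many good non-parametric density estimators. Kernel density estimators come to mind immediately. These may be more appropriate if your data does not look like a standard distribution.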
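
As a rough illustration of the kernel density idea, the sketch below uses `scipy.stats.gaussian_kde` on simulated two-mode data (say, an occupied and an unoccupied room) that a single Gaussian would fit poorly. The data, the 5% cutoff, and the helper name `is_normal` are assumptions chosen for the example, not part of the original answer.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Simulated bimodal "normal" behaviour (illustrative values only).
temps = np.concatenate([rng.normal(19.0, 0.5, 500), rng.normal(23.0, 0.7, 500)])
humidities = np.concatenate([rng.normal(50.0, 2.0, 500), rng.normal(40.0, 2.0, 500)])

# gaussian_kde expects an array of shape (n_dimensions, n_samples).
train = np.vstack([temps, humidities])
kde = gaussian_kde(train)

# Threshold: the estimated density below which 5% of the training points fall.
train_density = kde(train)
threshold = np.quantile(train_density, 0.05)


def is_normal(temp, humidity):
    """True if the estimated density at this reading is at or above the threshold."""
    return kde([[temp], [humidity]])[0] >= threshold


print(is_normal(19.0, 50.0))  # centre of the first mode
print(is_normal(35.0, 45.0))  # far from both modes
```

Evaluating a KDE costs time proportional to the number of training points, so for months of 5-minute readings it may be worth fitting on a subsample.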

score 0 · Accepted Answer

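It sounds like you are trying to perform anomaly detection, but your description of the data is vague. In general, you should start by defining/constraining what it means for your data to be "normal".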

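Is there a different "normal" for each sensor?
Are the sensor measurements somehow dependent on their previous values?
Does "normal" change over the course of a day?
Can the "normal" measurements from the sensor be characterized by a statistical model (e.g., is the data Gaussian or log-normal)?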
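
If the answer to the time-of-day question is yes, one simple option is to learn a separate baseline per hour of the day. The following is only a sketch using the standard library; the function names `learn_hourly_baseline` and `is_anomalous`, the 4-sigma limit, and the simulated readings are hypothetical choices for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def learn_hourly_baseline(readings):
    """readings: iterable of (datetime, value). Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for timestamp, value in readings:
        by_hour[timestamp.hour].append(value)
    return {hour: (mean(vals), stdev(vals)) for hour, vals in by_hour.items()}


def is_anomalous(baseline, timestamp, value, max_sigmas=4.0):
    """Flag a reading more than max_sigmas away from that hour's mean."""
    mu, sigma = baseline[timestamp.hour]
    return abs(value - mu) > max_sigmas * sigma


# Simulated training data: 3 days at 5-minute intervals, warmer from 08:00 to 18:00.
start = datetime(2024, 1, 1)
training = []
for i in range(3 * 24 * 12):
    ts = start + timedelta(minutes=5 * i)
    base = 22.0 if 8 <= ts.hour < 18 else 18.0
    training.append((ts, base + 0.1 * ((i % 7) - 3)))  # small deterministic wiggle

baseline = learn_hourly_baseline(training)
print(is_anomalous(baseline, datetime(2024, 1, 5, 12, 0), 22.1))  # typical midday value
print(is_anomalous(baseline, datetime(2024, 1, 5, 12, 0), 18.0))  # night-time value at midday
```

The same grouping idea extends to one baseline per sensor, or per weekday and hour, at the cost of needing a longer learning period to fill every bucket.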

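Once you have answered these types of questions, you can train a classifier or anomaly detector using a batch of data from your database, and use the result to evaluate future log output. If machine learning algorithms are applicable to your data, consider using scikit-learn. For statistical models, you can use SciPy's stats subpackage. And of course, for any kind of numerical data manipulation in Python, NumPy is your friend.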
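
As one way to approach the Gaussian-versus-log-normal question with SciPy's stats subpackage, the sketch below fits both distributions to a batch of values by maximum likelihood and compares their log-likelihoods. The batch is simulated here as a stand-in for values pulled from the database, and the 0.1% and 99.9% alert limits are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Stand-in for a batch of sensor values from the database (simulated here).
batch = rng.lognormal(mean=3.0, sigma=0.4, size=2000)

# Maximum-likelihood fits of the two candidate models.
norm_params = stats.norm.fit(batch)                # (loc, scale)
lognorm_params = stats.lognorm.fit(batch, floc=0)  # (shape, loc, scale), loc fixed at 0

# Compare the total log-likelihood of the batch under each fitted model.
norm_ll = stats.norm.logpdf(batch, *norm_params).sum()
lognorm_ll = stats.lognorm.logpdf(batch, *lognorm_params).sum()
best = "lognormal" if lognorm_ll > norm_ll else "normal"
print(best)

# Use the inverse CDF (ppf) of the fitted log-normal model to set alert limits.
low, high = stats.lognorm.ppf([0.001, 0.999], *lognorm_params)
print(low, high)
```

A higher log-likelihood only says which of the two candidates fits better, not that either fits well, so a histogram or Q-Q plot of the batch is still worth a look.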
