Consider the following example:

from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=1, min_df=0,
                                max_features=None,
                                stop_words=None)

all_docs = ['ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 0 PortA Unknown 755 0 45300 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
            'ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 2 PortC Unknown 774 0 46440 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
            'ETH:0x0000 00:17:A4:77:9C:0A 09:00:2B:00:00:05 0 PortA Unknown 752 0 45120 ETH FirstHourDay_21 LastHourDay_23 duration_6913 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
            'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
            'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
            'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_127 ThreatCategory_23 True Anomaly_True']

tf_v = tf_vectorizer.fit(all_docs)

The resulting vocabulary is:

{'0a': 0,
 '185': 1,
 '239': 2,
 '45120': 3,
 '45300': 4,
 '46440': 5,
 '752': 6,
 '755': 7,
 '774': 8,
 '93': 9,
 'duration_6913': 10,
 'threatcategory_23': 11,
 'threatscore_127': 12}

Words such as ETH, FirstHourDay_22, and Anomaly_True are missing from the vocabulary.

Why is this happening? How can I get the full vocabulary?
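One thing that may be worth checking is max_df. According to the scikit-learn documentation, an integer max_df is an absolute document count while a float is a proportion of documents, so max_df=1 would keep only terms that occur in at most one document. The sketch below does not call CountVectorizer; it mimics that filter with the standard library, assuming the default token_pattern and lowercasing, on a shortened, hypothetical set of documents. The helper names document_frequency and vocabulary are made up for illustration.

```python
import re
from collections import Counter

# Assumed to match CountVectorizer's default token_pattern
# (tokens made of two or more word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Shortened, illustrative documents (not the full all_docs list).
docs = [
    "ETH PortA 755 45300 FirstHourDay_21 Anomaly_False",
    "ETH PortC 774 46440 FirstHourDay_21 Anomaly_False",
    "ICMP PortA 192 70 FirstHourDay_22 Anomaly_True",
    "ICMP PortC 192 70 FirstHourDay_22 Anomaly_True",
]


def document_frequency(documents):
    """Count in how many documents each lowercased token appears."""
    df = Counter()
    for doc in documents:
        df.update(set(TOKEN_PATTERN.findall(doc.lower())))
    return df


def vocabulary(documents, max_doc_count):
    """Keep tokens whose document frequency is at most max_doc_count."""
    df = document_frequency(documents)
    return sorted(tok for tok, n in df.items() if n <= max_doc_count)


# An integer max_df of 1 behaves like max_doc_count=1: shared tokens are dropped.
only_unique = vocabulary(docs, max_doc_count=1)
# A float max_df of 1.0 means 100% of documents, i.e. no upper limit here.
everything = vocabulary(docs, max_doc_count=len(docs))

print(only_unique)
print(everything)
```

With these sample documents, only the numbers that appear in a single document survive the first filter, which is the same shape as the vocabulary shown above. If that is what is happening here, passing max_df=1.0 (a float) or leaving max_df at its default should keep terms that occur in every document.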

Edit: the error is probably caused by the value of CountVectorizer's token_pattern.
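To see what token_pattern would and would not explain, the sketch below applies two regular expressions with the standard library's re module instead of CountVectorizer itself. The first is, as far as I can tell from the documentation, the default pattern; the second, a plain whitespace split, is only an assumption about what a more permissive pattern could look like. The sample string is a shortened, hypothetical document.

```python
import re

# Assumed default token_pattern of CountVectorizer: runs of 2+ word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: treat every whitespace-separated chunk as a token.
WHITESPACE_PATTERN = r"\S+"

# Shortened sample document, lowercased as CountVectorizer does by default.
sample = "ETH:0x0000 00:17:A4:77:9C:0A 0 PortA ThreatScore_122,127 Anomaly_False".lower()

default_tokens = re.findall(DEFAULT_PATTERN, sample)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, sample)

print(default_tokens)
print(whitespace_tokens)
```

The default pattern splits the MAC address into two-character pieces and drops the single-character field 0, but it still produces eth and anomaly_false. That suggests token_pattern alone may not account for those words being missing.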

Edit: I suggest reconsidering the problem with the following variable:

all_docs=['ETH0x0000 0017A4779C04 09002B000005 0 PortA Unknown 755 0 45300 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse', 'ETH0x0000 0017A4779C04 09002B000005 2 PortC Unknown 774 0 46440 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse', 'ETH0x0000 0017A4779C0A 09002B000005 0 PortA Unknown 752 0 45120 FirstHourDay21 LastHourDay23 duration6913 ThreatScorenan ThreatCategorynan False AnomalyFalse', 'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue', 'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue', 'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore127 ThreatCategory23 True AnomalyTrue']
