Consider the following example.
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=1, min_df=0,
                                max_features=None,
                                stop_words=None)
all_docs = ['ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 0 PortA Unknown 755 0 45300 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 2 PortC Unknown 774 0 46440 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ETH:0x0000 00:17:A4:77:9C:0A 09:00:2B:00:00:05 0 PortA Unknown 752 0 45120 ETH FirstHourDay_21 LastHourDay_23 duration_6913 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_127 ThreatCategory_23 True Anomaly_True']
tf_v = tf_vectorizer.fit(all_docs)
The resulting vocabulary is:
{'0a': 0,
'185': 1,
'239': 2,
'45120': 3,
'45300': 4,
'46440': 5,
'752': 6,
'755': 7,
'774': 8,
'93': 9,
'duration_6913': 10,
'threatcategory_23': 11,
'threatscore_127': 12}
Words such as ETH, FirstHourDay_22, and Anomaly_True are missing from the vocabulary.
Why is that? How can I get the complete vocabulary?
Edit: The error is probably caused by the token_pattern value of CountVectorizer.
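
One way to test the token_pattern hypothesis is a minimal plain-Python sketch that mimics CountVectorizer rather than calling it. It lowercases each document, tokenizes with the default pattern (?u)\b\w\w+\b, and then treats an integer max_df as an absolute cap on the number of documents a term may appear in. The shortened documents and the helper names tokenize, document_frequency, and vocabulary are made up for illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical, shortened stand-ins for the documents in the question.
docs = [
    "ETH 755 FirstHourDay_21 Anomaly_False",
    "ETH 774 FirstHourDay_21 Anomaly_False",
    "ICMP 192 FirstHourDay_22 Anomaly_True",
    "ICMP 192 FirstHourDay_22 Anomaly_True",
]


def tokenize(doc):
    """Lowercase the document, then extract tokens with the default pattern."""
    return TOKEN_PATTERN.findall(doc.lower())


def document_frequency(documents):
    """Count the number of documents in which each token appears."""
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return df


def vocabulary(documents, max_df):
    """Keep tokens whose document frequency is at most max_df (absolute count)."""
    df = document_frequency(documents)
    return sorted(tok for tok, n in df.items() if n <= max_df)


tokens = tokenize(docs[2])                      # what the pattern alone yields
capped = vocabulary(docs, max_df=1)             # integer cap, as in the question
uncapped = vocabulary(docs, max_df=len(docs))   # no effective cap

print(tokens)
print(capped)
print(uncapped)
```

In this sketch the pattern keeps eth, firsthourday_22, and anomaly_true, and they disappear only once the cap of 1 is applied. If the real vectorizer behaves the same way, the integer max_df=1 (an absolute document count) rather than token_pattern would be the thing to look at. Passing the float max_df=1.0, which is interpreted as a proportion of documents, may be worth trying.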
Edit: I suggest reconsidering the problem with the following variable:
all_docs=['ETH0x0000 0017A4779C04 09002B000005 0 PortA Unknown 755 0 45300 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ETH0x0000 0017A4779C04 09002B000005 2 PortC Unknown 774 0 46440 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ETH0x0000 0017A4779C0A 09002B000005 0 PortA Unknown 752 0 45120 FirstHourDay21 LastHourDay23 duration6913 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue',
'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue',
'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore127 ThreatCategory23 True AnomalyTrue']