scala - Spark の機能ベクトルに IndexToString を適用する

Question

コンテキスト: StringIndexer を使用してすべてのカテゴリ値にインデックスが付けられたデータフレームがあります。

val categoricalColumns = df.schema.collect { case StructField(name, StringType, nullable, meta) => name }    

val categoryIndexers = categoricalColumns.map {
  col => new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed") 
}

次に、VectorAssembler を使用して、すべての特徴列 (インデックス付きのカテゴリ列を含む) をベクトル化しました。

val assembler = new VectorAssembler()
    .setInputCols(dfIndexed.columns.diff(List("label") ++ categoricalColumns))
    .setOutputCol("features")

分類器といくつかの追加手順を適用すると、ラベル、特徴、および予測を含むデータフレームが完成します。インデックス付きの値を元の文字列形式に変換するために、特徴ベクトルを個別の列に拡張したいと考えています。

val categoryConverters = categoricalColumns.zip(categoryIndexers).map {
colAndIndexer => new IndexToString().setInputCol(s"${colAndIndexer._1}Indexed").setOutputCol(colAndIndexer._1).setLabels(colAndIndexer._2.fit(df).labels)
}

質問:これを行う簡単な方法はありますか、または何らかの方法で予測列をテストデータフレームにアタッチする最善の方法はありますか?

私が試したこと：

val featureSlicers = categoricalColumns.map {
  col => new VectorSlicer().setInputCol("features").setOutputCol(s"${col}Indexed").setNames(Array(s"${col}Indexed"))
}

これを適用すると、必要な列が得られますが、それらは Vector 形式 (意図されているとおり) であり、Double 型ではありません。

編集： 目的の出力は、元のデータフレーム（つまり、インデックスではなく文字列としてのカテゴリ機能）であり、予測されたラベル（私の場合は0または1）を示す追加の列があります。

たとえば、分類子の出力が次のようになったとします。

+-----+---------+----------+
|label| features|prediction|
+-----+---------+----------+
|  1.0|[0.0,3.0]|       1.0|
+-----+---------+----------+

各機能に VectorSlicer を適用すると、次のようになります。

+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|        [0.0]|        [3.0]|
+-----+---------+----------+-------------+-------------+

これは素晴らしいことですが、次のものが必要です。

+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|         0.0 |         3.0 |
+-----+---------+----------+-------------+-------------+

次に、IndexToString を使用して次のように変換できるようにします。

+-----+---------+----------+-------------+-------------+
|label| features|prediction|    status   |    artist   |
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|        good |  Pink Floyd |
+-----+---------+----------+-------------+-------------+

あるいは：

+-----+----------+-------------+-------------+
|label|prediction|    status   |    artist   |
+-----+----------+-------------+-------------+
|  1.0|       1.0|        good |  Pink Floyd |
+-----+----------+-------------+-------------+

score 4 · Accepted Answer

あまり便利な操作ではありませんが、必要な情報を列のメタデータと単純な UDF を使用して抽出できるはずです。あなたのデータは、次のようなパイプラインで作成されていると思います。

import org.apache.spark.ml.feature.{VectorSlicer, VectorAssembler, StringIndexer}
import org.apache.spark.ml.Pipeline

val df = sc.parallelize(Seq(
  (1L, "a", "foo", 1.0), (2L, "b", "bar", 2.0), (3L, "a", "bar", 3.0)
)).toDF("id", "x1", "x2", "x3")

val featureCols = Array("x1", "x2", "x3")
val featureColsIdx = featureCols.map(c => s"${c}_i")

val indexers = featureCols.map(
  c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_i")
)

val assembler = new VectorAssembler()
  .setInputCols(featureColsIdx)
  .setOutputCol("features")

val slicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("string_features")
  .setNames(featureColsIdx.init)


val transformed = new Pipeline()
  .setStages(indexers :+ assembler :+ slicer)
  .fit(df)
  .transform(df)

まず、機能から必要なメタデータを抽出できます。

val meta = transformed.select($"string_features")
  .schema.fields.head.metadata
  .getMetadata("ml_attr") 
  .getMetadata("attrs")
  .getMetadataArray("nominal")

使いやすいものに変換します

case class NominalMetadataWrapper(idx: Long, name: String, vals: Array[String])

// In general it could a good idea to make it a broadcast variable
val lookup = meta.map(m => NominalMetadataWrapper(
  m.getLong("idx"), m.getString("name"), m.getStringArray("vals")
))

最後に小さな UDF:

import scala.util.Try

val transFeatures = udf((v: Vector) => lookup.map{
  m => Try(m.vals(v(m.idx.toInt).toInt)).toOption
})

transformed.select(transFeatures($"string_features")).

scala - Spark の機能ベクトルに IndexToString を適用する

1 に答える 1

Related

Reference