scala - Create edges from vertices using Spark

Question

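Let's say I have an array of vertices and I want to create edges from them so that each vertex connects to the next x vertices, where x can be any integer value. Is there a way to do that with Spark?

This is what I have so far in Scala: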

//array that holds the edges
    var edges = Array.empty[Edge[Double]]
    for(j <- 0 to vertices.size - 2) {
      for(i <- 1 to x) {
        if((j+i) < vertices.size) {
          //add edge
          edges = edges ++ Array(Edge(vertices(j)._1, vertices(j+i)._1, 1.0))
          //add inverse edge, we want both directions
          edges = edges ++ Array(Edge(vertices(j+i)._1, vertices(j)._1, 1.0))
        }
      }
    }

Here the vertices variable is an array of (Long, String). The whole process is, of course, sequential.

Edit:

For example, suppose the vertices are Hello, World, and, Planet, cosmos. I need the following edges: Hello -> World, World -> Hello, Hello -> and, and -> Hello, Hello -> Planet, Planet -> Hello, World -> and, and -> World, World -> Planet, Planet -> World, World -> cosmos, cosmos -> World, etc.
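To pin down the expected result, here is a minimal sketch in plain Scala collections (no Spark) of the same "connect to the next x vertices, in both directions" rule. The object name ExpectedEdges and its local Edge case class are illustrative stand-ins and not the GraphX types; the vertex ids 1 to 5 are assumed for the example.

```scala
object ExpectedEdges {
  // Local stand-in for org.apache.spark.graphx.Edge so the sketch runs without Spark.
  case class Edge[A](srcId: Long, dstId: Long, attr: A)

  // Connect every vertex to the next x vertices, emitting both directions.
  def build(vertices: Seq[(Long, String)], x: Int): Seq[Edge[Double]] =
    for {
      j <- vertices.indices
      i <- 1 to x
      if j + i < vertices.size
      src = vertices(j)._1
      dst = vertices(j + i)._1
      e <- Seq(Edge(src, dst, 1.0), Edge(dst, src, 1.0))
    } yield e

  def main(args: Array[String]): Unit = {
    // Assumed ids for the five example vertices.
    val vertices = Seq((1L, "Hello"), (2L, "World"), (3L, "and"), (4L, "Planet"), (5L, "cosmos"))
    build(vertices, 3).foreach(println)
  }
}
```

With five vertices and x = 3 there are 9 vertex pairs within distance 3, so this yields 18 directed edges.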

score 3 · Accepted Answer

Do you mean something like this?

// Add dummy vertices at the end (assumes that you don't use negative ids)
(vertices ++ Array.fill(n)((-1L, null)))
  .sliding(n + 1) // Slide over n + 1 vertices at a time
  .flatMap(arr => {
     val (srcId, _) = arr.head // Take first
     // Generate 2n edges
     arr.tail.flatMap{case (dstId, _) =>
       Array(Edge(srcId, dstId, 1.0), Edge(dstId, srcId, 1.0))
     }}.filter(e => e.srcId != -1L && e.dstId != -1L)) // Drop dummies
  .toArray

If you want to run it on an RDD, just adjust the initial step like this:

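import org.apache.spark.mllib.rdd.RDDFunctions._ // adds sliding to RDDs

// Index of the last partition
val nPartitions = vertices.partitions.size - 1

// Append the dummy vertices to the last partition only
vertices.mapPartitionsWithIndex((i, iter) =>
  if (i == nPartitions) (iter ++ Array.fill(n)((-1L, null))).toIterator
  else iter)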

And, of course, drop toArray. If you want circular connections (tail connected to head), you can replace Array.fill(n)((-1L, null)) with vertices.take(n) and drop the filter.
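The circular variant can be sketched locally as follows. This is a minimal illustration on plain Scala collections rather than on an RDD; CircularEdges and its Edge case class are hypothetical stand-ins (not the GraphX types), and the sketch assumes n is smaller than the number of vertices.

```scala
object CircularEdges {
  // Local stand-in for org.apache.spark.graphx.Edge so the sketch runs without Spark.
  case class Edge[A](srcId: Long, dstId: Long, attr: A)

  // Connect each vertex to the next n vertices, wrapping from the tail back to the head.
  // Assumes n < vertices.size; otherwise a vertex would end up paired with itself.
  def build(vertices: Seq[(Long, String)], n: Int): Seq[Edge[Double]] =
    (vertices ++ vertices.take(n)) // pad with the first n real vertices instead of dummies
      .sliding(n + 1)              // each window: one source followed by its n successors
      .flatMap { window =>
        val (srcId, _) = window.head
        window.tail.flatMap { case (dstId, _) =>
          Seq(Edge(srcId, dstId, 1.0), Edge(dstId, srcId, 1.0))
        }
      }
      .toList // no filter needed: every padded element is a real vertex
}
```

Note that when 2n is at least the number of vertices, some pairs are reached from both sides of the wrap-around and appear twice, so a distinct call may be needed in that case.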

score 2 · Accepted Answer

I think this will get you what you want.

First, define a small helper function (note that I'm setting the edge data to the vertex names here so the result is easier to inspect visually):

def pairwiseEdges(list: List[(Long, String)]): List[Edge[String]] = {
  list match {
    case x :: xs => xs.flatMap(i => List(Edge(x._1, i._1, x._2 + "--" + i._2), Edge(i._1, x._1, i._2 + "--" + x._2))) ++ pairwiseEdges(xs)
    case Nil => List.empty
  }
}

Do a zipWithIndex on the array to get keys, then convert the array into an RDD:

val vertices = List((1L,"hello"), (2L,"world"), (3L,"and"), (4L, "planet"), (5L,"cosmos")).toArray
val indexedVertices = vertices.zipWithIndex
val rdd = sc.parallelize(indexedVertices)

Then, to generate the edges for x=3:

val edges = rdd
  .flatMap{case((vertexId, name), index) => for {i <- 0 to 3; if (index - i) >= 0} yield ((index - i, (vertexId, name)))}
  .groupByKey()
  .flatMap{case(index, iterable) => pairwiseEdges(iterable.toList)}
  .distinct()

Edit: rewrote the flatMap and removed the filter, as suggested by @zero323 in the comments.

This generates the following output:

Edge(1,2,hello--world)
Edge(1,3,hello--and)
Edge(1,4,hello--planet)

Edge(2,1,world--hello)
Edge(2,3,world--and)
Edge(2,4,world--planet)
Edge(2,5,world--cosmos)

Edge(3,1,and--hello)
Edge(3,2,and--world)
Edge(3,4,and--planet)
Edge(3,5,and--cosmos)

Edge(4,1,planet--hello)
Edge(4,2,planet--world)
Edge(4,3,planet--and)
Edge(4,5,planet--cosmos)

Edge(5,2,cosmos--world)
Edge(5,3,cosmos--and)
Edge(5,4,cosmos--planet)
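To see why the grouping step produces these edges, the keying can be traced with plain Scala collections. This is only a local sketch of the flatMap and groupByKey steps, not Spark code; the object name WindowKeys and its helpers are made up for illustration.

```scala
object WindowKeys {
  // Mirror of the flatMap step: a vertex at position `index` is emitted under
  // the keys index, index - 1, ..., index - x, skipping negative keys.
  def keysFor(index: Int, x: Int): Seq[Int] =
    for { i <- 0 to x; if index - i >= 0 } yield index - i

  // Group vertex names by key, the local analogue of groupByKey().
  def groups(names: Seq[String], x: Int): Map[Int, Seq[String]] =
    names.zipWithIndex
      .flatMap { case (name, index) => keysFor(index, x).map(key => key -> name) }
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val names = Seq("hello", "world", "and", "planet", "cosmos")
    groups(names, 3).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Each key k collects the vertices at positions k through k + x, so group 0 holds hello, world, and, planet. pairwiseEdges then links every pair inside a group, and because neighbouring groups overlap, the same edge is produced more than once, which is why the pipeline ends with distinct().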
