python - Spark Java heap error

Question

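I don't understand what is happening here, or why.

I have a data frame that is loaded as both a pandas data frame and a Spark data frame.

The data frame is sparse, mostly zeros. Its dimensions are 56K x 9K, so it is not that big.

I also put the following settings in my spark/conf/spark-defaults.conf file: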

spark.driver.memory              8g
spark.executor.memory            2g
spark.driver.maxResultSize       2g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value

spark.jars.packages    com.databricks:spark-csv_2.11:1.4.0

As you can see, I have already allocated 8 GB to the driver and 2 GB to the executor. I am using Spark installed locally on my MacBook Pro.

When I run

recommender_ct.show()

to see the first 5 rows, I get the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-7-8c71bfcdfd03> in <module>()
----> 1 recommender_ct.show()

/Users/i854319/spark/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    255         +---+-----+
    256         """
--> 257         print(self._jdf.showString(n, truncate))
    258 
    259     def __repr__(self):

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling o40.showString.
: java.lang.OutOfMemoryError: Java heap space

This data frame was created using the crosstab of a Spark data frame, as shown below:

recommender_ct=recommender_sdf.crosstab('TRANS','ITEM')

The Spark data frame above, recommender_sdf, works fine when .show() is called on it.

The same cross-tab method is used for the pandas data frame, and when I run the code below it works fine.

# Creating a new pandas dataframe for cross-tab
recommender_pct=pd.crosstab(recommender_pdf['TRANS'], recommender_pdf['ITEM'])

recommender_pct.head()

This works immediately.

So this shows that the file can easily be loaded into memory and used by pandas, but the same data frame in Spark throws the Java heap error when .show() or .head() is called on it. It also takes a long time before throwing the error.

I don't understand why this is happening. Isn't Spark supposed to be faster than pandas, and shouldn't it be free of this memory issue when the same data frame can easily be accessed and printed using pandas?

EDIT:

OK. The cross-tabbed Spark data frame looks like this when I fetch the first few rows and columns from the corresponding pandas data frame:

    TRANS   Bob Iger: How Do Companies Foster Innovation and Sustain Relevance  “I Will What I Want” - Misty Copeland   "On the Lot" with Disney Producers  "Your Brain is Good at Inclusion...Except When it's Not" with Dr. Steve Robbins (please do not use) WDW_ER-Leaders of Minors    1. EAS Project Lifecycle Process Flow   10 Life Lessons from Star Wars  10 Simple Steps to Exceptional Daily Productivity   10 Steps to Effective Listening
0   353 0   0   0   0   0   0   0   0   0
1   354 0   0   0   0   0   0   0   0   0
2   355 0   0   0   0   0   0   0   0   0
3   356 0   0   0   0   0   0   0   0   0
4   357 0   0   0   0   0   0   0   0   0

The column names are basically long text strings, and the column values are either 0 or 1.
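As a rough check on the "not that big" estimate, the sketch below computes how large a fully dense 56K x 9K table would be. It assumes "56K" and "9K" mean exactly 56,000 and 9,000, and the two cell widths (1 byte and 8 bytes) are illustrative assumptions, since the question does not say how either pandas or Spark stores the counts. The helper name dense_table_bytes is made up for this example.

```python
# Back-of-the-envelope size of a dense cross-tab table.
# ROWS and COLS are taken from the "56K X 9K" figure in the question;
# the bytes-per-cell values are assumptions, not measurements.

def dense_table_bytes(n_rows, n_cols, bytes_per_cell):
    """Return the raw size in bytes of a dense n_rows x n_cols table."""
    return n_rows * n_cols * bytes_per_cell

ROWS = 56000
COLS = 9000
cells = ROWS * COLS  # total number of cells in the table

print("cells: %d" % cells)
for label, width in (("1-byte cells", 1), ("8-byte cells", 8)):
    size = dense_table_bytes(ROWS, COLS, width)
    print("%s: %d bytes (%.2f GiB)" % (label, size, size / float(2 ** 30)))
```

With 8-byte cells the raw data alone comes to about 4 GB before any per-row or per-object overhead, so a dense result of this shape is not negligible next to an 8g driver heap, even though most of the values are zero.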

score 0 · Accepted Answer

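How I solved the same problem in Java: split the query you need to run into two (or more) parts. Run the first part and save the result to HDFS (as Parquet). Create a second SQLContext, read the result of the first part back from HDFS, and run the second part of the query.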
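A minimal PySpark sketch of that idea, offered as an illustration rather than the answerer's actual Java code. The function name run_in_two_stages, its parameters, and the path in the usage comment are all hypothetical. It assumes a Spark version that provides DataFrame.write and sqlContext.read (1.4 or later), and it reuses whichever context is passed in; to follow the answer literally, pass a newly created SQLContext.

```python
# Sketch: run a query in two stages, persisting the intermediate
# result as Parquet so the second stage starts from the saved files
# instead of the full original query plan.
# All names here are hypothetical and chosen for illustration only.

def run_in_two_stages(sql_context, source_df, first_half, second_half,
                      checkpoint_path):
    """Apply first_half, save to Parquet, reload, then apply second_half.

    sql_context     -- context used to read the saved result back
    source_df       -- the input Spark DataFrame
    first_half      -- function DataFrame -> DataFrame (stage 1)
    second_half     -- function DataFrame -> result (stage 2)
    checkpoint_path -- where the intermediate Parquet files are written
    """
    # Stage 1: compute the first part and write it out.
    intermediate = first_half(source_df)
    intermediate.write.mode("overwrite").parquet(checkpoint_path)

    # Stage 2: read the saved result back and finish the query on it.
    reloaded = sql_context.read.parquet(checkpoint_path)
    return second_half(reloaded)

# Hypothetical usage with the data frame from the question
# (the path is an assumption):
#
# pair_counts = run_in_two_stages(
#     sqlContext,
#     recommender_sdf,
#     lambda df: df.groupBy("TRANS", "ITEM").count(),
#     lambda df: df.filter(df["count"] > 0),
#     "hdfs:///tmp/recommender_stage1.parquet",
# )
```

Taking the two stages as plain functions keeps the sketch independent of any particular query. On a local install like the one in the question, a local file path should work in place of the HDFS path.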
