python - Exception when reading the tutorial CSV file in the Cloudera VM

Question

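I am trying to run the Spark tutorial that ships with the Cloudera Virtual Machine. Even though I use the correct line-ending encoding, I cannot run the script because I get a huge number of errors. The tutorial is part of the Coursera course Introduction to Big Data Analytics. The assignment can be found here.

Here is what I did. Install the IPython shell (if not already done):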

sudo easy_install ipython==1.2.1

Open/start the shell (with either version 1.2.0 or 1.4.0 of the package):

PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.2.0

Set the line endings to Windows style. This is because the file uses Windows line endings and the course says to do so. If you skip this step, you get other errors.

sc._jsc.hadoopConfiguration().set('textinputformat.record.delimiter','\r\n')

Try to load the CSV file:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',header = 'true',inferSchema = 'true',path = 'file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

But I get a very long list of errors, which starts like this:

Py4JJavaError: An error occurred while calling o23.load.: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)

The full error message can be seen here. This is /etc/hive/conf/hive-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
  <!-- that are implied by Hadoop setup variables.                                                -->
  <!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
  <!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
  <!-- resource).                                                                                 -->

  <!-- Hive Execution Parameters -->

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>cloudera</value>
  </property>

  <property>
    <name>hive.hwi.war.file</name>
    <value>/usr/lib/hive/lib/hive-hwi-0.8.1-cdh4.0.0.jar</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>

  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>

  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://127.0.0.1:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
</configuration>

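Any help or ideas on how to solve this? I guess it is a pretty common error, but I have not been able to find a solution yet.

One more thing: is there a way to dump such long error messages into a separate log file?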
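
On the log-file point, one possible approach is sketched below. A Py4JJavaError is an ordinary Python exception, so the failing call can be wrapped and the full traceback written to a file. The helper name call_and_log and the path /tmp/load_error.log are made up for illustration; they are not part of Spark or the course material.

```python
import traceback


def call_and_log(func, log_path, *args, **kwargs):
    """Call func(*args, **kwargs); on failure, write the full traceback to log_path.

    Returns (result, None) on success and (None, exception) on failure.
    """
    try:
        return func(*args, **kwargs), None
    except Exception as exc:
        # Send the complete message to the file instead of the console.
        with open(log_path, "w") as log_file:
            log_file.write(traceback.format_exc())
        # Print only a one-line hint so the shell stays readable.
        print("Call failed with %s; details in %s" % (type(exc).__name__, log_path))
        return None, exc


# Hypothetical usage in the pyspark shell (sqlCtx is provided by the shell):
# yelp_df, err = call_and_log(
#     sqlCtx.load, "/tmp/load_error.log",
#     source="com.databricks.spark.csv", header="true", inferSchema="true",
#     path="file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv")
```

This only captures what Python sees as the exception text. Log output that Spark itself prints to the console is controlled separately by its logging configuration.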

score 0 · Accepted Answer

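There seem to be two problems. First, the hive-metastore was offline on some occasions. Second, the schema could not be inferred. So I created the schema manually and passed it as an argument when loading the CSV file. Still, I would like to understand whether this can somehow work with inferSchema=true.

Here is my version with a manually defined schema. First, make sure Hive is started: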

sudo service hive-metastore restart
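
To confirm that something is actually listening after the restart, a small check like the following may help. It is only a sketch: it assumes the metastore address from hive.metastore.uris in the hive-site.xml shown in the question (127.0.0.1:9083), and the helper name is_port_open is invented here. An open port shows that a process accepts connections, not that the metastore is healthy.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Connection refused, unreachable host, or timeout.
        return False
    sock.close()
    return True


# Assumed address, taken from hive.metastore.uris in hive-site.xml:
# is_port_open("127.0.0.1", 9083)
```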

Then have a look at the first part of the CSV file to understand its structure. I used this command line:

head /usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv
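
The same inspection can be done from Python, which also splits quoted fields properly so the column names are easier to map to a schema. This is a rough sketch that assumes Python 3 and a UTF-8 encoded file; neither is stated in the course material, and peek_csv is a made-up helper name.

```python
import csv
import itertools


def peek_csv(path, n_rows=3):
    """Return (header, rows): the column names and the first n_rows records."""
    # newline='' lets the csv module handle \r\n endings and quoted newlines itself.
    with open(path, newline='', encoding='utf-8') as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(itertools.islice(reader, n_rows))
    return header, rows


# header, rows = peek_csv(
#     "/usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv")
```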

Now open the Python shell. See the original post for how to do that. Then define the schema:

from pyspark.sql.types import *
schema = StructType([
    StructField("business_id", StringType(), True),
    StructField("cool", IntegerType(), True),
    StructField("date", StringType(), True),
    StructField("funny", IntegerType(), True),
    StructField("id", StringType(), True),
    StructField("stars", IntegerType(), True),
    StructField("text", StringType(), True),
    StructField("type", StringType(), True),
    StructField("useful", IntegerType(), True),
    StructField("user_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("full_address", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
    StructField("neighborhood", StringType(), True),
    StructField("open", StringType(), True),
    StructField("review_count", IntegerType(), True),
    StructField("state", StringType(), True)])

Then load the CSV file, passing the schema. Note that there is no need to set Windows line endings:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
header = 'true',
schema = schema,
path = 'file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

Test the result with any method run on the dataset. I tried getting the count, which worked perfectly.

yelp_df.count()

Thanks to the help of @yaron, I was able to figure out how to load the CSV with inferSchema. First, you must set up the hive-metastore correctly:

sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

Then start the Python shell. Do NOT change the line endings to Windows encoding. Keep in mind that this change is persistent (session invariant), so if you changed it to Windows style before, you need to reset it to '\n'. Then load the CSV file with inferSchema set to true:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
header = 'true',
inferSchema = 'true',
path = 'file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')
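
If the delimiter was changed to '\r\n' earlier, the reset mentioned above has to happen before this load call. A minimal sketch, reusing the same sc._jsc.hadoopConfiguration() access path as in the question; the wrapper name reset_record_delimiter is made up for illustration:

```python
def reset_record_delimiter(spark_context, delimiter='\n'):
    """Set Hadoop's text record delimiter on the given SparkContext.

    Returns the value that is stored afterwards so it can be checked.
    """
    # _jsc is the underlying JavaSparkContext, accessed as in the question.
    conf = spark_context._jsc.hadoopConfiguration()
    conf.set('textinputformat.record.delimiter', delimiter)
    return conf.get('textinputformat.record.delimiter')


# In the pyspark shell, where sc already exists:
# reset_record_delimiter(sc)
```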

score 0 · Accepted Answer

Summary of the discussion: executing the following command solved the problem:

sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

For more information, see https://www.coursera.org/learn/bigdata-analytics/supplement/tyH3p/setup-pyspark-for-dataframes
