2

Spark を使用して saveAsTextFile を S3 に保存した後、Hadoop のようにフォーマットします。そのようなバケット内のファイルの形式。

ここに画像の説明を入力

形式は YEAR/MONTH/DATE/TIMESTAMP で、データは part-0000 ファイルは json 形式です。

ドリルを構成し、バケット名を指定します

{
  "type": "file",
  "enabled": true,
  "connection": "s3://com.giaosudau.win-bid",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}

ストレージは問題ありません。しかし、日付、時間、またはファイルのデータを選択して取得する方法がわかりません。

ここに画像の説明を入力

私はまた、ファイルを選択するようなことを試みますが、成功しません:

ここに画像の説明を入力

ファイルサンプルはこちら

{"auctionId":"xx","bidRequestString":{"id":"xx","timestamp":"2015-07-29T08:31:00.413Z","isTest":false,"url":"http://www.222.3232.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.34343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"332","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=34343\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"23848834:strategy","biddingMainAccount":"1212312","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"123232","xchg":"9849839"},"winPrice":"100USD/1M"}
{"auctionId":"343","bidRequestString":{"id":"344","timestamp":"2015-07-28T08:31:00.413Z","isTest":false,"url":"http://www.xx.xx.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"342342342","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=xxx\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"829838828928932:strategy","biddingMainAccount":"3434","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"cc","xchg":"cc"},"winPrice":"100USD/1M"}

そこからデータを取得する方法は?

4

1 に答える 1

2

s3 の代わりにデータとローカル ファイル システムを使用する例を次に示します。ファイルには拡張子がないと思いますが、そうでない場合、ファイルに .json 拡張子が付いている場合は、次のようにクエリを実行できます (あいまいさを解決するには、テーブル エイリアス t を使用します)。

use dfs.`root`;
+-------+---------------------------------------+
|  ok   |                summary                |
+-------+---------------------------------------+
| true  | Default schema changed to [dfs.root]  |
+-------+---------------------------------------+
1 row selected (0.079 seconds)
select t.auctionID from dfs.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.093 seconds)

ファイルに拡張子がない場合は、ストレージ プラグインの構成を変更して、defaultInputFormat を「json」に設定します。それ以外の場合、Drill はファイルがデフォルトで Parquet 形式であると想定します。たとえば、ルート ワークスペースで defaultInputFormat を設定します。

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
. . .

拡張子のないファイルを照会する場合、defaultInputFormat を定義したワークスペース (この例ではルート) を指定する必要があります。

select t.auctionID from dfs.`root`.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.095 seconds)

タイプの違いの処理に関するドキュメントを参照してください: https://drill.apache.org/docs/json-data-model/#handling-type-differences

于 2015-07-30T22:22:39.483 に答える