html - SAS を使用して Web ページからデータをスクレイピングする方法

Question

問題文: Web からデータを取得し、SAS Program を使用して SAS データセットに入れる必要があります。

うまくいきました: SAS でターゲット Web ページのコンテンツを取得できました。

機能していません (ヘルプが必要): ページのソースコンテンツ (以下に表示) を SAS で処理できません。ソースコンテンツで「カテゴリ」を見つける必要があり、見つかった場合は、その行のすべての値 (11 月、10 月、9 月、8 月、7 月) を取得します。同様に、ソースコンテンツで「Conference Board」を検索し、見つかった場合はその行のすべての値 (96.1,101.4,101.3,86.3,91.7) を取得する必要があります。ソースコンテンツの構造は常に同じであることが期待されます。期待される出力はpng画像として添付されています。SASプログラムを使用してこのシナリオに取り組む方法を誰かが知っていて、私を助けてくれれば、それは素晴らしい学習と助けになるでしょう.

私はこのようなことを試しました：

filename output "/desktop/Response.txt";

proc http           
url="http://hosting.briefing.com/cschwab/Calendars/EconomicReleases/conf.htm"       
method="get"        
proxyhost="&proxy_host."        
proxyport=&port         
out=output;     
run;

DATA CHECK;
LENGTH CATEGORY $ 5;
RETAIN CATEGORY;
INFILE output LENGTH = recLen LRECL = 32767;
INPUT line $VARYING32767. recLen;
IF FIND(line,'Category') GT 0 THEN DO;
CATEGORY = SCAN(STRIP(line),2,'<>');
OUTPUT;
END;
RUN;

Web ページのソースコンテンツ:

<table width="100%" cellpadding="2" cellspacing="0" border="0">
  <tr valign="top" align="right" class="sectionColor">
    <td class="rH" align="left">Category</td>
    <td class="rH">NOV</td>
    <td class="rH">OCT</td>
    <td class="rH">SEP</td>
    <td class="rH">AUG</td>
    <td class="rH">JUL</td>
  </tr>
  <tr valign="top" align="right" class="sectionColor">
    <td class="rD" align="left">Conference Board</td>
    <td class="rD">96.1</td>
    <td class="rD">101.4</td>
    <td class="rD">101.3</td>
    <td class="rD">86.3</td>
    <td class="rD">91.7</td>
  </tr>
 <tr valign="top" align="right" class="sectionColor">
    <td class="rL" align="left">&nbsp;&nbsp;Expectations</td>
    <td class="rL">89.5</td>
    <td class="rL">98.2</td>
    <td class="rL">102.9</td>
    <td class="rL">86.6</td>
    <td class="rL">88.9</td>
  </tr>
  <tr valign="top" align="right" class="sectionColor">
    <td class="rD" align="left">&nbsp;&nbsp;Present Situation</td>
    <td class="rD">105.9</td>
    <td class="rD">106.2</td>
    <td class="rD">98.9</td>
    <td class="rD">85.8</td>
    <td class="rD">95.9</td>
  </tr>
  <tr valign="top" align="right" class="sectionColor">
    <td class="rL" align="left">Employment  ('plentiful' less 'hard to get')</td>
    <td class="rL">7.2</td>
    <td class="rL">7.1</td>
    <td class="rL">3.3</td>
    <td class="rL">-2.2</td>
    <td class="rL">2.2</td>
  </tr>
  <tr valign="top" align="right" class="sectionColor">
    <td class="rD" align="left">1 yr inflation expectations</td>
    <td class="rD">5.7%</td>
    <td class="rD">5.6%</td>
    <td class="rD">5.7%</td>
    <td class="rD">5.8%</td>
    <td class="rD">6.1%</td>
  </tr>
</table>

SAS データセットの出力は次のようになります。

score 2 · Accepted Answer

ほとんどのスクレイピングでは、解析ライブラリまたはテーブルリッピングツールを使用します。

ただし、html が非常に規則的またはパターン化されていると信頼できる場合は、単純なテキスト処理を使用できます。

例：

filename output temp;

proc http           
url="http://hosting.briefing.com/cschwab/Calendars/EconomicReleases/conf.htm"       
method="get"        
/*proxyhost="&proxy_host."        */
/*proxyport=&port         */
out=output;     
run;

data have_cells (keep=row_num name value);
  length name value $32;

  infile output _infile_=line;
  input;

  retain landmark_found 0 table_found 0 naming 1 in_row 0 row_num -1;

  if not landmark_found then  
    landmark_found = prxmatch('/Highlights/',  line);

  if not landmark_found then
    delete;

  if not table_found then
    table_found = prxmatch('/<table /', line);

  if not table_found then
    delete;

  array names(20) $8 _temporary_;

  if not in_row then
    if prxmatch('/<tr /', line) then do;
      col_index = 0;
      in_row = 1;
      row_num + 1;
      return;
    end;

  if not in_row then
    delete;
td:
  rxtd = prxparse('/<td .*?>(.*)<\/td>/');
  if prxmatch(rxtd, line) then do;
    col_index + 1;
    if naming then do;
      names(col_index) = prxposn(rxtd,1,line);
    end;
    else do;
      name = names(col_index);
      value = prxposn(rxtd,1,line);
      OUTPUT;
    end;
    return;
  end;

  in_row = not prxmatch('/<\/tr/', line);

  if naming then if not in_row then naming = 0;

  if prxmatch('/<\/table>/', line) then stop;
run;

proc transpose data=have_cells out=have_raw;
  by row_num;
  id name;
  var value;
run;

より多くのコーディングを実行する必要があります

htmldecode()特定の列
正しいサイズの文字列
他人を改宗させる
- 文字から数値へ
- 今までのキャラ

score 1 · Accepted Answer

IF (および HTML ページの場合は BIG です) ファイルはその例と同じくらいきちんとしたままであり、非常に簡単に解析できます。<table>、<tr>および<td>タグを探してください。テーブル番号、行番号、列番号の独自のカウンターを作成します。

filename output url 
  "http://hosting.briefing.com/cschwab/Calendars/EconomicReleases/conf.htm"
;

data tall;
 length table row col 8 value $200;
 infile output truncover ;
 do until(left(_infile_)=:'<table'); input ; end;
 table+1;
 row=0;
 do until(left(_infile_)=:'</table');
   input;
   if left(_infile_)=:'<tr' then do; row+1; col=0; end;
   if left(_infile_)=:'<td' then do; 
      value = left(scan(_infile_,-2,'<>')); 
      col+1; 
      if value ne ' ' then output; 
   end;
 end;
run;

次に、PROC TRANSPOSE を使用して長方形の構造を作成します。おそらく最初のテーブルは必要ないようです。

proc transpose data=tall out=tables(drop=_name_) prefix=col;
  where table=2;
  by table row;
  id col;
  var value;
run;

しかし、おそらくそれを転置したいと思うでしょう。したがって、PROC TRANSPOSE ステップの前に再ソートします。

proc sort data=tall;
  by table col row;
run;
proc transpose data=tall out=tables(drop=_name_) prefix=col;
  where table=2;
  by table col;
  id row;
  var value;
run;

結果

html - SAS を使用して Web ページからデータをスクレイピングする方法

3 に答える 3

Related

Reference