
Extracting specific data from a text file using regular expressions in a shell script

I'm using a multi-line grep for this. The tool I'm using is pcregrep, so that I get compatibility with Perl regular expressions.

 [58]Walid Chamoun Architects WLL
     * [59]Map
     * [60]Website
     * [61]Email
     * [62]Profile
     * [63]Display Ad

   Walid Chamoun Architects WLL

   PO Box:
          55803, Doha, Qatar

   Location:
          D-Ring Road, New Salata Shamail 40, Villa 340, Doha, Qatar

   Tel:
          (00974) 44568833

   Fax:
          (00974) 44568811

   Mob:
          (00974) 44568822

     * Accurate Budget Costing
     * Eco-Friendly Structural Design
     * Exclusive & Unique Design
     * Quality Architecture & Design

Company Profile

   Walid Chamoun Architects (WCA) was founded in Beirut, Lebanon, in 1992,
   committed to the concept of fully integrated design-build delivery of
   projects. In late '90s, company established in-house architectural and
   engineering services. As a full service provider, WCA expanded from
   multi-family projects to industrial and office construction, which
   added development services, including site acquisition and financing.
   In 2001, WCA had opportunity and facilities to experience European
   market and establish office in Puerto Banus, Marbella, Spain. By 2005,
   WCA refined its structure to focus on specific market segments and new
   office was opened in Doha, state of Qatar. From a solid foundation and
   reputation built over eighteen years, WCA continually to provide
   leadership in design-build through promotion of benefits and education
   to its practitioners.
   Project Planning: Project planning and investigation occurs before
   design begins has greatest impact on cost, schedule and ultimately the
   success of project. Creativity in Design: You can rely on our in-house
   designers for design excellence in all aspects of the project. Our
   designs have received recommendations and appreciations on national and
   international levels. Creativity in Execution: Experienced in close
   collaboration with the designers as part of the integrated team, our
   construction managers, superintendents and field staff create value
   throughout the project. Post Completion Services: Your needs can be
   served through our skills and experience long after the last
   construction crew has left the site. Performance: Corporate and
   institutional clients, developers and public agencies repeatedly select
   WCA on the basis of its consistent record of performance excellence.
   Serving clients throughout the Middle East and GCC, WCA provides
   complete planning for architectural, interior design and construction
   on a single-responsibility basis. Our expertise spans industrial,
   commercial, institutional, public and residential projects. Benefits of
   Design-Build: Design-build is a system of contracting under which one
   entity performs both design and construction. Benefits of design-build
   project delivery include: Single point responsibility Early knowledge
   of cost Time and Cost savings

   Classification:
          Architects - [64]Architects

   [65]Al Ali Consulting & Engineering
     * [66]Map
     * Website
     * Email
     * Profile
     * Display Ad

   Is this your company?
   [67]Upgrade this free listing here

   PO Box:
          467, Doha, Qatar

   Tel:
          (00974) 44360011

Company Profile

   Classification:
          Architects - [68]Architects

   [69]Al Gazeerah Consulting Engineering
     * [70]Map
     * Website
     * Email
     * Profile
     * Display Ad

   Is this your company?
   [71]Upgrade this free listing here

   PO Box:
          22414, Doha, Qatar

   Tel:
          (00974) 44352126

Company Profile

   Classification:
          Architects - [72]Architects

   [73]Al Murgab Consulting Engineering
     * [74]Map
     * Website
     * Email
     * Profile
     * Display Ad

   Is this your company?
   [75]Upgrade this free listing here

   PO Box:
          2856, Doha, Qatar

   Tel:
          (00974) 44448623

Company Profile

   Classification:
          Architects - [76]Architects
References

   Visible links
   1. http://www.qatcom.com/useraccounts/login
   2. http://www.qatcom.com/useraccounts/register
   3. http://www.qatcom.com/
   4. http://www.qatcom.com/
   5. http://www.qatcom.com/qataryellowpages/map-of-doha
   6. http://www.qatcom.com/qataryellowpages/about-qatcom
   7. http://www.qatcom.com/qataryellowpages/advertise-with-qatcom
   8. http://www.qatcom.com/qataryellowpages/advertiser_testimonials
   9. http://www.qatcom.com/useraccounts/login
  10. http://www.qatcom.com/useraccounts/register
  11. http://www.qatcom.com/contact-qatcom
  12. http://www.qatcom.com/qataryellowpages/companies
  13. http://www.qatcom.com/classifications/index/A
  14. http://www.qatcom.com/classifications/index/B
  15. http://www.qatcom.com/classifications/index/C
  16. http://www.qatcom.com/classifications/index/D
  17. http://www.qatcom.com/classifications/index/E
  18. http://www.qatcom.com/classifications/index/F
  19. http://www.qatcom.com/classifications/index/G
  20. http://www.qatcom.com/classifications/index/H
  21. http://www.qatcom.com/classifications/index/I
  22. http://www.qatcom.com/classifications/index/J
  23. http://www.qatcom.com/classifications/index/K
  24. http://www.qatcom.com/classifications/index/L
  25. http://www.qatcom.com/classifications/index/M
  26. http://www.qatcom.com/classifications/index/N
  27. http://www.qatcom.com/classifications/index/O
  28. http://www.qatcom.com/classifications/index/P

Given sample data like this, I'm trying to extract the following details for each company:

company name
po box
Tel
fax
mobile
company profile 

and write them to a .csv file. I'm new to regular expressions and to Linux as well. All I've managed to come up with so far is something like this:

\[\d*\][^\.]*[\(\d*\)\s\d*)]

Can anyone help me with this?

Update:

I came up with something like this:

$ awk '/^\[/ && ! /Upgrade this free listing/ {print $0} /:$/ && ! /Classification/ {printf $0 ;  getline x ; print x}' file

but it still isn't what I want...


1 Answer


This can be done with awk, but you would be better off parsing the HTML instead. A good tool for that is Python with the Beautiful Soup module. That's not as exciting, though, so here's how to do it the awkward (hah!) way:

#!/usr/bin/awk -f

function trim(s) {
    gsub(/(^ +)|( +$)/, "", s)
    return s
}

BEGIN {
    count = 0
    fields[0] = "company"
    fields[1] = "pobox"
    fields[2] = "tel"
    fields[3] = "fax"
    fields[4] = "mob"
    fields[5] = "profile"
}

# company name
/^ +\[[0-9]+\].*$/ {
    sub(/^ +\[[0-9]+\]/, "") # get rid of the Lynx reference
    # this is a bit naughty: our regex also matches this other link, but there's only one of them, so we just filter it
    if ($0 != "Upgrade this free listing here") data[count,"company"]=$0
}

# two line fields, easy!
/ +PO Box:$/ { getline; data[count,"pobox"]=$0 }
/ +Tel:$/ { getline; data[count,"tel"]=$0 }
/ +Fax:$/ { getline; data[count,"fax"]=$0 }
/ +Mob:$/ { getline; data[count,"mob"]=$0 }

# multi-line field, tricky because it can be empty
/^Company Profile$/ {
    getline # skip empty line

    # collect lines until the Classification field (or the end of input,
    # so a truncated file cannot make this loop forever)
    s = ""
    do {
        s = s $0
    } while ((getline) > 0 && $0 !~ / +Classification:$/)
    data[count,"profile"]=s
    count++ # the Classification field denotes the end of the company record
}

END {
    # output CSV header row
    for ( key in fields ) {
        printf "\"%s\",", fields[key]
    }
    printf "\n"

    # output data
    for ( i=0; i<count; i++ ) {
        for ( key in fields ) {
            # explicit format string, so a "%" in the data is printed literally
            printf "\"%s\",", trim(data[i,fields[key]])
        }
        printf "\n"
    }
}

Save it as parse.awk, make it executable with chmod +x parse.awk, and invoke it with ./parse.awk < sample.txt. It outputs CSV like this:

"tel","fax","mob","profile","company","pobox",
"(00974) 44568833","(00974) 44568811","(00974) 44568822","Walid Chamoun Architects (WCA) was founded in Beirut, Lebanon, in 1992,   committed to the blablabla","Walid Chamoun Architects WLL","55803, Doha, Qatar",
"(00974) 44360011","","","","Al Ali Consulting & Engineering","467, Doha, Qatar",
"(00974) 44352126","","","","Al Gazeerah Consulting Engineering","22414, Doha, Qatar",
"(00974) 44448623","","","","Al Murgab Consulting Engineering","2856, Doha, Qatar",

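Since the header row names every column, a consumer can look fields up by name rather than by position. Here is a minimal Python sketch of that idea, using only the standard csv module. The helper name read_companies is made up for illustration, and the two sample rows are copied from the output above.

```python
import csv
import io

# Header and one data row copied from the awk output above.
sample = '''"tel","fax","mob","profile","company","pobox",
"(00974) 44360011","","","","Al Ali Consulting & Engineering","467, Doha, Qatar",
'''


def read_companies(stream):
    """Yield one dict per company, keyed by the header names."""
    for row in csv.DictReader(stream):
        # The trailing comma on every line creates an unnamed, empty column.
        row.pop("", None)
        yield row


companies = list(read_companies(io.StringIO(sample)))
print(companies[0]["company"], "|", companies[0]["pobox"])
```

To read a real file instead of the inline sample, pass an open file object to read_companies.
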
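Hopefully the comments in the awk script explain what's going on. It runs in plain old awk; no fancy gawk features are needed. Note that awk arrays are iterated in arbitrary order, which is why the columns above are not in the order the fields were declared. This approach also tends to break badly with varying input data, which is just one of the many reasons why you really should parse the HTML instead of resorting to shenanigans like lynx -dump.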
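If a fixed column order and proper CSV quoting matter, the same line-by-line approach can be ported to Python's standard library. What follows is a rough sketch, not a drop-in replacement: it still reads the lynx -dump text (it is not the HTML parsing recommended above), it joins profile lines with single spaces instead of concatenating them raw, and the names FIELDS, LABELS, SAMPLE, parse and write_csv are all made up for illustration. SAMPLE is a shortened excerpt of the question's data.

```python
import csv
import io
import re
import sys

FIELDS = ["company", "pobox", "tel", "fax", "mob", "profile"]  # fixed column order
LABELS = {"PO Box:": "pobox", "Tel:": "tel", "Fax:": "fax", "Mob:": "mob"}
LINK = re.compile(r"^ +\[\d+\](.*)$")  # indented line that starts with a Lynx link

SAMPLE = """ [58]Walid Chamoun Architects WLL
     * [59]Map

   PO Box:
          55803, Doha, Qatar

   Tel:
          (00974) 44568833

Company Profile

   Walid Chamoun Architects (WCA) was founded in Beirut, Lebanon, in 1992,
   committed to the concept of fully integrated design-build delivery.

   Classification:
          Architects - [64]Architects

   [65]Al Ali Consulting & Engineering
     * [66]Map

   Is this your company?
   [67]Upgrade this free listing here

   PO Box:
          467, Doha, Qatar

Company Profile

   Classification:
          Architects - [68]Architects
"""


def parse(lines):
    """Return one dict per company found in lynx -dump text."""
    records, rec = [], {}
    it = iter(lines)
    for line in it:
        raw = line.rstrip("\n")
        text = raw.strip()
        m = LINK.match(raw)
        if m and m.group(1) != "Upgrade this free listing here":
            rec["company"] = m.group(1).strip()
        elif text in LABELS:
            # two-line field: the value is on the following line
            rec[LABELS[text]] = next(it, "").strip()
        elif raw.rstrip() == "Company Profile":
            parts = []
            for inner in it:  # keep consuming the same iterator
                if inner.strip() == "Classification:":
                    break
                if inner.strip():
                    parts.append(inner.strip())
            rec["profile"] = " ".join(parts)
            records.append(rec)  # Classification ends the company record
            rec = {}
    return records


def write_csv(records, out):
    """Write records with every field quoted; missing fields become empty."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="",
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(records)


records = parse(io.StringIO(SAMPLE))
write_csv(records, sys.stdout)
```

To process a real dump, replace io.StringIO(SAMPLE) with an open file or sys.stdin. Because csv.DictWriter takes the column order from FIELDS, the header no longer depends on iteration order.
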

answered 2012-06-27T02:21:01.233