ruby-on-rails - Parallel CSV import in Ruby

Question

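I'm importing a huge CSV file and would like to split it up so that the import runs faster. (I'm not importing directly into the database; some calculations are involved.) The code looks like this: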

def import_shatem
    require 'csv'





    CSV.foreach("/#{Rails.public_path}/uploads/hshatem2.csv", {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ';', :row_sep => :auto, :headers => :first_row}) do | row |

      @eur_cur = Currency.find_by_currency_name("EUR")
      abrakadabra = row[0].to_s()
      (ename,esupp) = abrakadabra.split(/_/)
      eprice = row[6].to_f / @eur_cur.currency_value
      eqnt = /(\d+)/.match(row[1])[0].to_f


        if ename.present? && ename.size>3
        search_condition = "*" + ename.upcase + "*"     

        if esupp.present?
          #supplier = @suppliers.find{|item| item['SUP_BRAND'] =~ Regexp.new(".*#{esupp}.*") }
          supplier = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
          logger.warn("!!! *** supp !!!")

        end

        if supplier.present?

          @search = ArtLookup.find(:all, :conditions => ['MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE) and ARL_KIND = 1', search_condition.gsub(/[^0-9A-Za-z]/, '')])
          @articles = Article.find(:all, :conditions => { :ART_ID => @search.map(&:ARL_ART_ID)})
          #@art_concret = @articles.find_all{|item| item.ART_ARTICLE_NR.gsub(/[^0-9A-Za-z]/, '').include?(ename.gsub(/[^0-9A-Za-z]/, '')) }

          @aa = @articles.find{|item| item['ART_SUP_ID']==supplier.SUP_ID} #| @articles
          if @aa.present?
            @art = Article.find_by_ART_ID(@aa)
          end

          if @art.present?
            #require 'time_diff'
            #cur_time = Time.now.strftime('%Y-%m-%d %H:%M')
            #time_diff_components = Time.diff(@art.datetime_of_update, Time.parse(cur_time))
            limit_time = Time.now + 3.hours
            if  (@art.PRICEM.to_f >= eprice.to_f || @art.PRICEM.blank? ) #&& @art.datetime_of_update >= limit_time) 
              @art.PRICEM = eprice
              @art.QUANTITYM = eqnt
              @art.datetime_of_update = DateTime.now
              @art.save
            end
          end

        end     
      end
    end
  end

How can I parallelize this, or otherwise make the import faster?

score 0 · Accepted Answer

Looking at the code, the bottleneck is going to be the database queries. Running it in parallel won't fix that. Instead, let's see whether we can make it more efficient.
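To check where the time actually goes before restructuring anything, per-stage timing along these lines could help. This is only a sketch: StageTimer, the stage names, and the sample rows are illustrative and not part of the original code, and the sleep call is a hypothetical stand-in for a real database query.

```ruby
require 'benchmark'

# Accumulates wall-clock seconds per named stage.
class StageTimer
  attr_reader :totals

  def initialize
    @totals = Hash.new(0.0)
  end

  # Runs the block, adds its elapsed time to the stage total,
  # and returns the block's own value.
  def measure(stage)
    result = nil
    @totals[stage] += Benchmark.realtime { result = yield }
    result
  end
end

timer = StageTimer.new
rows = [%w[A1234_BOSCH 5], %w[B5678_VALEO 2]] # made-up stand-in for parsed CSV rows

rows.each do |row|
  _ename, esupp = timer.measure(:parse) { row[0].split('_') }
  # Hypothetical stand-in for a database lookup; replace with the real query.
  timer.measure(:lookup) { sleep(0.01); esupp }
end

# Print the stages, slowest first.
timer.totals.sort_by { |_, secs| -secs }.each do |stage, secs|
  puts format('%-8s %.3fs', stage, secs)
end
```

If the lookup stages dominate, as expected here, splitting the file across workers mostly moves the contention to the database.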

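The big problem is probably the article search. It runs multiple queries and then searches in memory. We'll get to that last.

Currency.find_by_currency_name always returns the same record, so move it out of the loop. It is unlikely to be the bottleneck, but it helps. And assuming currency_name and currency_value are columns on Currency, you can save a little time by fetching just the one value with pick instead of loading the whole record.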

  def currency_value
    @currency_value ||= Currency.where(currency_name: "EUR").pick(:currency_value)
  end

Similarly, if the CSV contains many repeated values, the Supplier.where lookup is worth caching. Memoist can cache the return value per argument.

  extend Memoist

  private def find_supplier_for_esupp(esupp)
    return if esupp.blank?
    Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  end
  memoize :find_supplier_for_esupp
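If adding the Memoist gem is not an option, the same idea can be sketched with a plain Hash. CachedLookup and the brands data below are hypothetical names used only for illustration, and the block passed to CachedLookup.new stands in for the real Supplier query.

```ruby
# Caches lookup results by key so each distinct key costs one lookup.
class CachedLookup
  attr_reader :misses

  def initialize(&lookup)
    @lookup = lookup
    @cache = {}
    @misses = 0
  end

  def fetch(key)
    # Hash#fetch with a block also keeps nil results cached,
    # which a plain `||=` would look up again every time.
    @cache.fetch(key) do
      @misses += 1
      @cache[key] = @lookup.call(key)
    end
  end
end

# Hypothetical stand-in for Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
brands = { 'BOSCH' => 1, 'VALEO' => 2 }
suppliers = CachedLookup.new do |esupp|
  brands.find { |name, _id| name.include?(esupp) }&.last
end

%w[BOSCH VALEO BOSCH UNKNOWN UNKNOWN].each { |esupp| suppliers.fetch(esupp) }
puts suppliers.misses # prints 3: one lookup per distinct key
```

Caching misses matters here because a supplier code that matches nothing would otherwise trigger the slow LIKE query on every row that carries it.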

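A LIKE '%term%' pattern cannot use an ordinary B-tree index, so depending on the size of the Supplier table the lookup may be slow. If you are using PostgreSQL, a trigram index (from the pg_trgm extension) can speed this query up.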

add_index :suppliers, :SUP_BRAND, using: 'gin', opclass: :gin_trgm_ops

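Finally, the article search is probably the biggest bottleneck. It queries ArtLookup, loads every matching record, and throws away everything except one column. It then queries Article, loads all of those into memory, filters them in memory, and finally queries Article once more.

Assuming the relationship between Article and ArtLookup is set up properly in the models, this can be reduced to a single query.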

  art = Article
    .joins(:art_lookups)
    .merge(
      ArtLookup
        .where(ARL_KIND: 1)
        .where(
          'MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)',
          search_condition
        )
    )
    .where(
      ART_SUP_ID: supplier.SUP_ID
    )
    .first

That should be considerably faster.

Overall, there are a few other improvements, such as early returns to avoid all those nested ifs.

require 'csv'

class ShatemImporter
  extend Memoist

  # Cache the possibly expensive query to find suppliers.
  private def find_supplier_for_esupp(esupp)
    Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  end
  memoize :find_supplier_for_esupp

  # Cache the currency value query outside the loop.
  private def currency_value
    @currency_value ||= Currency.where(currency_name: "EUR").pick(:currency_value)
  end

  def import_shatem(csv_file)
    # Pass the CSV options as keyword arguments rather than a positional hash.
    CSV.foreach(
      csv_file,
      encoding: 'ISO-8859-15:UTF-8',
      col_sep: ';',
      row_sep: :auto,
      headers: :first_row
    ) do |row|
      ename, esupp = row[0].to_s.split('_')
      eprice = row[6].to_f / currency_value
      # MatchData has no #first; String#[] returns nil when there are no digits,
      # and nil.to_f is 0.0.
      eqnt = row[1].to_s[/\d+/].to_f

      next if ename.blank? || ename.size < 4
      next if esupp.blank?
      
      supplier = find_supplier_for_esupp(esupp)      
      next if !supplier

      article = Article
        .joins(:art_lookups)
        .merge(
          ArtLookup
            .where(ARL_KIND: 1)
            .where(
              'MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)',
              "*#{ename.upcase}*"     
            )
        )
        .where(
          ART_SUP_ID: supplier.SUP_ID
        )
        .first
      next if !article

      if article.PRICEM.blank? || article.PRICEM.to_f >= eprice.to_f
        article.update!(
          PRICEM: eprice,
          QUANTITYM: eqnt,
          datetime_of_update: DateTime.now
        )
      end
    end
  end
end

This is written for Rails 6, whereas your code looks like Rails 2, and it is untested. Hopefully it gives you some avenues for optimization.
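The row-parsing and early-return part of the importer can be tried without Rails or a database. The sketch below assumes the same ';'-separated layout with the code in the first column and the quantity in the second. parse_shatem_row and the sample data are made up for illustration, and plain Ruby nil/empty checks replace ActiveSupport's blank?.

```ruby
require 'csv'

# Returns [ename, esupp, quantity] for a usable row,
# or nil when the row should be skipped.
def parse_shatem_row(row)
  ename, esupp = row[0].to_s.split('_')
  return if ename.nil? || ename.size < 4
  return if esupp.nil? || esupp.empty?

  # String#[] with a regex returns nil on no match; nil.to_f is 0.0.
  quantity = row[1].to_s[/\d+/].to_f
  [ename, esupp, quantity]
end

# Made-up sample data, not taken from the real file.
sample = <<~DATA
  code;qty
  A1234_BOSCH;>10
  AB_BOSCH;3
  C9999;5
  D4444_VALEO;n/a
DATA

parsed = CSV.parse(sample, col_sep: ';', headers: :first_row)
            .map { |row| parse_shatem_row(row) }
p parsed
```

Rows with a short name or no supplier part come back as nil, which corresponds to the next guards in the loop above.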
