ruby - Getting the line count of a file hosted on S3

Question

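We let users upload files to S3 and then show how many lines each file contains. To do this, we run a background process (DelayedJob) that fetches the file from S3 and counts the number of newlines in the document. In general, this works pretty well.

Here is the code that does the work: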

  def self.line_count_from_s3(options={})

    options = { :key => options } if options.is_a?(String)

    line_count = 0

    unless options[:key]
      raise ArgumentError, 'A valid S3 key is required.'
    end

    s3 = AWS::S3.new
    file = s3.buckets[ENV['S3_BUCKET']].objects[options[:key]]

    unless file.exists?
      raise IOError, 'Unable to load that import from S3. Key does not exist.'
    end

    # Stream download chunks of the file instead of loading it all into memory
    file.read do |chunk|
      # Normalize line endings
      chunk.gsub!(/\r\n?/, "\n")
      line_count += chunk.scan("\n").count
    end
    # Don't count the empty newline (assumes there is one)
    line_count -= 1 if line_count > 0

    line_count
  end

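For some reason, a handful of files come out with completely wrong line counts. For example, a file with 10,000 lines is reported as having 40,000. This is not consistent; most files work fine.

I'm trying to figure out whether this is caused by the way the S3 chunked reader works or whether something else is causing the problem. Why would the record counts be wrong? Is there a better way to do this that I'm not aware of?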
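
One thing that can be checked independently of S3: the code above normalizes each chunk on its own, so a "\r\n" pair that happens to be split across two chunks is counted twice (the trailing "\r" becomes "\n", and the leading "\n" of the next chunk is counted again). This is probably not enough by itself to explain a 4x overcount, so treat it as a sketch of a boundary-safe counter rather than a confirmed diagnosis. `LineEndingCounter` and its methods are hypothetical names introduced for illustration; they are not part of the AWS SDK.

```ruby
# Hypothetical helper (not part of any library): counts line endings across
# a stream of string chunks. "\r\n", a lone "\r", and a lone "\n" each count
# as one line ending, even when a "\r\n" pair straddles two chunks.
class LineEndingCounter
  attr_reader :count

  def initialize
    @count = 0
    @pending_cr = false # true when the previous non-empty chunk ended with "\r"
  end

  def feed(chunk)
    return self if chunk.empty?

    data = chunk
    # A leading "\n" completes a "\r\n" pair whose "\r" was already counted.
    data = data[1..-1] if @pending_cr && data.start_with?("\n")

    @count += data.scan(/\r\n|\r|\n/).length
    @pending_cr = chunk.end_with?("\r")
    self
  end
end

# Feeding chunks where a CRLF pair is split across a chunk boundary.
counter = LineEndingCounter.new
["a\r", "\nb\r\n", "c\n"].each { |chunk| counter.feed(chunk) }
puts counter.count
```

With the block form of `file.read` used in the question, this could be wired in as `file.read { |chunk| counter.feed(chunk) }`, assuming the chunks arrive as plain strings. The counter reports line terminators, so whether to subtract one for a trailing newline remains a separate decision.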

