
Our system is facing performance issues selecting rows from a table with 38 million rows.

This 38-million-row table stores information about clients, suppliers, etc. These appear across many other tables, such as Invoices.

The main problem is that our database is far from normalized. The Clients_Suppliers table has a composite key made of three columns: code (varchar2(16)), category (char(2)), and up_date (a date). Every change to a client's address is stored in that same table with a new date, so we can have records such as these:

code             ca   up_date
---------------- --   --------
1234567890123456 CL   01/01/09
1234567890123456 CL   01/01/10
1234567890123456 CL   01/01/11
1234567890123456 CL   01/01/12
6543210987654321 SU   01/01/10
6543210987654321 SU   08/03/11

Worse, every table that uses a client's information stores only the code and category instead of the full composite key. Invoices, for instance, has its own keys, including the emission date. So we can have something like this:

invoice_no serial_no emission code             ca
---------- --------- -------- ---------------- --
1234567890 12345     05/02/12 1234567890123456 CL

My specific problem is that I have to generate a list of clients for which invoices were created in a given period. Since I have to get the most recent info for each client, I have to use max(up_date).

So here's my query (in Oracle):

SELECT
  CL.CODE,
  CL.CATEGORY,
  -- other address fields
FROM
  CLIENTS_SUPPLIERS CL,
  INVOICES I
WHERE
  CL.CODE = I.CODE AND
  CL.CATEGORY = I.CATEGORY AND
  CL.UP_DATE = 
    (SELECT
       MAX(CL2.UP_DATE)
     FROM
       CLIENTS_SUPPLIERS CL2
     WHERE
       CL2.CODE = I.CODE AND
       CL2.CATEGORY = I.CATEGORY AND
       CL2.UP_DATE <= I.EMISSION
    ) AND
  I.EMISSION BETWEEN DATE1 AND DATE2

It takes up to seven hours to select 178,000 rows. Invoices has 300,000 rows between DATE1 and DATE2.

It's a (very, very, very) bad design, and I've raised the fact that we should improve it by normalizing the tables. That would involve creating a table for clients with a new integer primary key for each code/category pair, and another one for addresses (with the client primary key as a foreign key), then using the addresses' primary key in each table that relates to clients.
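A sketch of what that normalization could look like. All table, column, and constraint names here are illustrative, not from the existing system:

```sql
-- Hypothetical normalized schema: one row per client/supplier,
-- with the address history split out into its own table.
CREATE TABLE clients (
  client_id NUMBER(10)   PRIMARY KEY,   -- new surrogate key
  code      VARCHAR2(16) NOT NULL,
  category  CHAR(2)      NOT NULL,
  CONSTRAINT uq_clients_code_cat UNIQUE (code, category)
);

CREATE TABLE addresses (
  address_id NUMBER(10) PRIMARY KEY,
  client_id  NUMBER(10) NOT NULL REFERENCES clients (client_id),
  up_date    DATE       NOT NULL
  -- street, city, ... other address fields
);
```

Invoices would then carry the address (or client) surrogate key instead of repeating code/category, and "most recent address" becomes a simple indexed lookup.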

But it would mean changing the whole system, so my suggestion has been shunned. I need to find a different way of improving performance (apparently using only SQL).

I've tried indexes, views, and temporary tables, but none have given any significant performance improvement. I'm out of ideas; does anyone have a solution for this?

Thanks in advance!


5 Answers

SELECT
  CL2.CODE,
  CL2.CATEGORY,
  MAX(CL2.UP_DATE) AS UP_DATE
  -- other fields: MAX(col) KEEP (DENSE_RANK LAST ORDER BY CL2.UP_DATE)
FROM
  CLIENTS_SUPPLIERS CL2 INNER JOIN (
    SELECT DISTINCT
      CL.CODE,
      CL.CATEGORY,
      I.EMISSION
    FROM
      CLIENTS_SUPPLIERS CL INNER JOIN INVOICES I ON CL.CODE = I.CODE AND CL.CATEGORY = I.CATEGORY
    WHERE
      I.EMISSION BETWEEN DATE1 AND DATE2) CL3 ON CL2.CODE = CL3.CODE AND CL2.CATEGORY = CL3.CATEGORY
WHERE
  CL2.UP_DATE <= CL3.EMISSION
GROUP BY
  CL2.CODE,
  CL2.CATEGORY

The idea is to separate the process. First you tell Oracle to give you the list of clients that have invoices in the period you need, and then you fetch the most recent version of each of them. In your version, there is a check against MAX 38,000,000 times, which I believe is where most of the query time is being spent.

That assumes the indexes are set up correctly, though; I'm not asking you to add indexes...
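For reference, and assuming they don't already exist, composite indexes that would best support this shape of query might look like this (index names are illustrative):

```sql
-- Lets the per-client MAX(up_date) lookup be answered from the index alone:
-- equality on code and category, then a range scan on up_date.
CREATE INDEX ix_cs_code_cat_date ON clients_suppliers (code, category, up_date);

-- Lets the emission range filter drive access to the invoices.
CREATE INDEX ix_inv_emission ON invoices (emission, code, category);
```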

Answered 2012-07-11T15:10:27.997

What does your DBA have to say?

Has he/she tried:

  • Coalescing tablespaces
  • Increasing the number of parallel query slaves
  • Moving indexes to a separate tablespace on a separate physical disk
  • Gathering statistics on the relevant tables/indexes
  • Running an explain plan
  • Running the query through an index optimizer

I'm not saying the SQL is perfect, but if performance is degrading over time, the DBA really needs to take a look.
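The statistics and explain-plan items above can be sketched as follows. The schema owner 'APP' and the stand-in query inside EXPLAIN PLAN are placeholders; substitute the real slow query:

```sql
-- Refresh optimizer statistics on both tables (cascade covers their indexes).
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP', tabname => 'CLIENTS_SUPPLIERS', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP', tabname => 'INVOICES', cascade => TRUE);
END;
/

-- Capture the execution plan for the slow query, then display it.
EXPLAIN PLAN FOR
  SELECT COUNT(*) FROM invoices i WHERE i.emission BETWEEN :d1 AND :d2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```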

Answered 2012-07-11T15:22:42.080

The correlated subquery may be causing issues, but to me the real problem is in what seems to be your main client table: you cannot easily grab the most recent data without doing the max(up_date) mess. It's really a mix of history and current data and, as you describe, poorly designed.

Anyway, it will help you in this and other long-running joins to have a table/view with ONLY the most recent data for each client. So, first build a materialized view for this (untested):

create or replace materialized view recent_clients_view
tablespace my_tablespace
nologging
build deferred
refresh complete on demand
as
select * from 
(
  select c.*, row_number() over (partition by code, category order by up_date desc, rowid desc) rnum
  from clients_suppliers c
)
where rnum = 1;

Add a unique index on (code, category). The assumption is that this will be refreshed periodically on some off-hours schedule, and that your queries using it will be OK with showing data AS OF the date of the last refresh. In a DW environment or for reporting, this is usually the norm.

The snapshot table for this view should be MUCH smaller than the full clients table with all the history.
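Since the view is created with build deferred / refresh complete on demand, the off-hours refresh mentioned above could be wired up like this (job name and schedule are illustrative):

```sql
-- One-off complete refresh, e.g. from a nightly batch:
BEGIN
  DBMS_MVIEW.REFRESH('RECENT_CLIENTS_VIEW', method => 'C');
END;
/

-- Or let the database run it nightly at 2am:
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'REFRESH_RECENT_CLIENTS',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN DBMS_MVIEW.REFRESH(''RECENT_CLIENTS_VIEW'', ''C''); END;',
    repeat_interval => 'FREQ=DAILY; BYHOUR=2',
    enabled         => TRUE);
END;
/
```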

Now you can join invoices to this much smaller view, doing an equijoin on (code, category) where emission is between date1 and date2. Something like:

select cv.*
from 
recent_clients_view cv,
invoices i
where cv.code = i.code
and cv.category = i.category
and i.emission between :date1 and :date2;

Hope that helps.

Answered 2012-07-11T16:56:44.563

Try rewriting the query to use analytic functions rather than a correlated subquery:

select *
from (SELECT CL.CODE, CL.CATEGORY, CL.UP_DATE,   -- other address fields
             max(up_date) over (partition by cl.code, cl.category) as max_up_date
      FROM CLIENTS_SUPPLIERS CL join
           INVOICES I
           on CL.CODE = I.CODE AND
              CL.CATEGORY = I.CATEGORY and
              I.EMISSION BETWEEN DATE1 AND DATE2 and
              up_date <= i.emission
     ) t
where t.up_date = max_up_date

You might want to remove the max_up_date column in the outer select.

As you may have noticed, this query is subtly different from the original, because it uses the maximum up_date over all the dates. The original query has the condition:

CL2.UP_DATE <= I.EMISSION

By transitivity, however, this implies:

CL2.UP_DATE <= DATE2

So the only difference is the case where the maximum up_date falls before DATE1 in the original query. Those rows, however, are filtered out by the comparison with UP_DATE.

Although this query is expressed a bit differently, I believe it does the same thing. I have to admit I'm not 100% positive, since this is a subtle situation involving data I'm not familiar with.

Answered 2012-07-11T15:10:44.510

Assuming there are few rows per (code, ca), I would try to force an index scan per invoice, using an inline view such as:

SELECT i.invoice_no,
       (SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
          FROM clients_suppliers c
         WHERE c.code = i.code
           AND c.category = i.category
           AND c.up_date < i.emission) client_rowid
  FROM invoices i
 WHERE i.emission BETWEEN :p1 AND :p2

Then I would join this query back to CLIENTS_SUPPLIERS, triggering the join via rowid (reading 300k rowids is negligible).
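A sketch of that second step (untested), filtering CLIENTS_SUPPLIERS by the rowids the inline view produces:

```sql
-- Resolve each invoice to the rowid of the matching client row with the
-- latest up_date before emission, then fetch those rows by rowid.
SELECT c.*
  FROM clients_suppliers c
 WHERE c.rowid IN (
         SELECT (SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                   FROM clients_suppliers cs
                  WHERE cs.code = i.code
                    AND cs.category = i.category
                    AND cs.up_date < i.emission)
           FROM invoices i
          WHERE i.emission BETWEEN :p1 AND :p2);
```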

You could improve upon the above query with SQL objects:

CREATE TYPE client_obj AS OBJECT (
   name     VARCHAR2(50),
   add1     VARCHAR2(50)
   /*address2, city...*/
);

SELECT i.o.name, i.o.add1 /*...*/
  FROM (SELECT DISTINCT
               (SELECT client_obj(
                         max(name) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC),
                         max(add1) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                         /*city...*/
                       ) o
                  FROM clients_suppliers c
                 WHERE c.code = i.code
                   AND c.category = i.category
                   AND c.up_date < i.emission)
          FROM invoices i
         WHERE i.emission BETWEEN :p1 AND :p2) i
Answered 2012-07-11T16:08:41.197