I've got a complex SQL where clause that just got more complex because of a requirements change. There are four basic sets of cases, each with a different combination of other factors. It's more readable (in my opinion) to have the four cases as separate branches of the where clause, and to repeat the redundant criteria in each branch. But I don't know how well the database engine will optimize that.
Here's the expression in its redundant form. I've replaced the actual criteria with letters. A is the "branching" criterion, which comes in four forms. All of the expressions are in the form field='value'
, unless otherwise noted.
A1 AND B AND C AND D
OR A2 AND B AND C AND D AND E AND F1 AND G
OR A3 AND A3a AND B AND C AND D AND E AND F1 AND G
OR A4 AND B AND C AND D AND F2
All of the A's except A4 are in the form field in ('value1','value2')
. D is field > 'value'
. G is in the form field not in (subquery)
.
Here's the expression, factored to (I think) its least redundant form.
B AND C AND D AND (
A1
OR (
E AND F1 AND G AND (
A2
OR (A3 AND A3a)
)
)
OR (A4 AND F2)
My question is whether I should factor this expression into its simplest (least redundant) logical form, or whether it's OK to keep it in it's more redundant but also more readable form. The target database is Sybase, but I'd like to know the answer for RDMBSs generally.
-
I would refactor it. Eventually, duplicated logic will run you into problems. The second example may take a couple seconds longer to understand, but in the big scope of things it's easier to see what is going on as you can quickly look at the entire where clause and begin to determine what affects what.
-
If I were attacking this problem on M$ SQL Server, I would write it the way I wanted and look at the query execution plan. If it (a) ran slow and (b) had a bad execution plan, then I would refactor and document. Not sure what the mechanism is in Sybase for viewing HOW the optimizer executes the query.
John M Gant : Thanks for the post. In Sybase you can run SET SHOWPLAN ON before you run your query and it will show you details (although in a text-based format and not a graphical representation like SQL Server's) -
In an
RDBMS
world I wouldn't bother of redundancy much, efficiency is more important here.In your case, I would
UNION
all the four queries usingA
's as a top condition, like this:SELECT * FROM mytable WHERE A1 AND B AND C UNION SELECT * FROM mytable WHERE A2 AND B AND C AND D AND E AND F1 AND G …
I didn't look into
Sybase
for more than7
years, but in all majorRDBMS
'sUNION
's are more efficient thanOR
's.See this article in my blog for approach to a silimar problem in
Oracle
:- Selecting lowest value: efficient usage of
UNION
inOracle
and also this article for comparison of
UNION
versusOR
inMySQL
:- Selecting friends: efficiency of
UNION
compared to one ofOR
inMySQL
I think these approaches will work well for
Sybase
too.You also need to create indexes on columns used in your conditions to benefit from
UNION
'sUpdate:
Since condition
G
is a subquery, it may probably happen that it needs aHASH JOIN
to perform fast.HASH JOIN
requires a full scan on all unfiltered values, that's why it may probable be better to filter all values in a single fullscan and then perform aHASH JOIN
:SELECT * FROM ( SELECT * FROM foo WHERE condition_set_1 UNION SELECT * FROM foo WHERE condition_set_2_but_no_g … ) q WHERE G
To make any further judgements, it will be much better to see the query itself, really.
John M Gant : Good suggestion. Would the fact that criterion G uses a subquery make the UNION approach less efficient (since theoretically it would happen in two of the SELECTs), or do database engines optimize those types of things?Quassnoi : As for G, it's needed to see the query itself to make any judgements. If the subquery is complex, G could probably kill all benefits from the indexes used on separated UNION'ed queries. Maybe full table scan will be better.John M Gant : The subquery logic is simple, but the table is huge. The subquery is selecting on an indexed field, though, so it may not be too bad. I think this specific case is one where I'll have to try it a couple of different ways and see what works best. - Selecting lowest value: efficient usage of
0 comments:
Post a Comment