Thursday, March 31, 2011

SQL: Select records belonging to excluded category that ONLY belong to excluded category

I have a SELECT statement that works, and runs fast enough on my tables (<0.01sec on 50k+ products, 3k+ categories). But in my mind it's not very elegant and would like to hear any suggestions on making it better.

There are 3 tables of interest:

  • products - key productID
  • categories - key categoryID
  • products_tree - link table (categories contain many products, products can belong to many categories)

I have a list of excluded categoryIDs [e.g. 1040,1050,1168] I want to select all the productIDs that belong to one of these excluded categories only if the product doesn't belong to another NON-excluded category

My Query looks like this:

SELECT DISTINCT productID 
FROM products_tree 
WHERE 
  categoryID IN (1040,1050,1168) 
  AND productID NOT IN
    ( SELECT DISTINCT productID 
      FROM products_tree 
      WHERE 
      categoryID NOT IN (1040,1050,1168)
    );
From stackoverflow
  • I believe your query is quite good, but you could compare it with joins:

    SELECT DISTINCT pt1.productID 
    FROM products_tree pt1
    LEFT JOIN products_tree pt2 ON pt2.productID = pt1.productID 
        AND pt2.categoryID  pt1.categoryID
    
    WHERE pt1.categoryID IN (1040,1050,1168) 
      AND pt2.productID IS NULL
    

    Not sure if I thought correctly, but I think you understand my approach. I would however select productinfo directly if you want that, then the joins would make more sense (inner join the categories you want, left join the ones you don't want and check for null)

  • I can think of a few methods, each of which perform differently depending on indexes and your particular database implementation. Some that may look slow can be optimised in ways you may not have imagined and so it's worth trialling them all and comparing execution plans to see what is happening...

    Note1: I use GROUP BY rather than DISTINCT, this is because it allows the omptimiser to make use of indexes. I've seen implementations work out that they can turn the DISTINCT in to a GROUP BY, but it's highly worth using GROUP BY in the fist place to be sure. It also gets you thinking about indexes, which is never a bad thing.

    Note2: Some queries like this take a while to optimise, as there are many options for the optimiser to evaluate. It is therefore often worth compiling all the different options in to stored procedures and comparing the execution of those stored procedures. This ensures your compare actually Query Time and not different Compile Times.

    SELECT
       [tree].productID
    FROM
       products_tree AS [tree]
    WHERE
       [tree].productID IN (1040,1050,1168)
       AND NOT EXISTS (SELECT * FROM products_tree WHERE productID = [tree].productID AND categoryID NOT IN (1040,1050,1168)) 
    GROUP BY
       [tree].productID
    


    SELECT
       [tree].productID
    FROM
       products_tree AS [tree]
    LEFT OUTER JOIN
       (
          SELECT
             productID
          FROM
             product_tree
          WHERE
             productID NOT IN (1040,1050,1168)
          GROUP BY
             productID
        )
        AS [ok_products]
           ON [ok_products].productID = [tree].productID
    WHERE
       [tree].productID IN (1040,1050,1168)
       AND [ok_products].productID IS NULL 
    GROUP BY
       [tree].productID
    


    SELECT
       [tree].productID
    FROM
       products_tree AS [tree]
    GROUP BY
       [tree].productID
    HAVING
           MAX(CASE WHEN [tree].productID     IN (1040,1050,1168) THEN 1 ELSE 0 END) = 1
       AND MAX(CASE WHEN [tree].productID NOT IN (1040,1050,1168) THEN 1 ELSE 0 END) = 0
    

    There are others, and variations of each, but this should give you a very good start. But I really would stress the use of GROUP BY and the consideration to INDEXES :)

    rwired : ProductID and CategoryID are PRIMARY on their respective tables, and are INDEX on the link table. I changed DISTINCT to GROUP BY and got the exact same performance. I guess the optimizer was noticing that. The SQL you have suggested in academically interesting so I've accepted this answer.
    Dems : Did any of the suggestions out perform the example you gave? My expectation is that (in MSSQLServer) using "NOT EXISTS" in the where clause should be fastest. (Using 'segmentation' to improve the query efficiency.)
  • You could try a "NOT EXISTS" variant:

    SELECT 
      pt.productID 
    FROM 
      products_tree pt
    WHERE 
      pt.categoryID IN (1040,1050,1168)
      AND NOT EXISTS (
        SELECT 1 
        FROM products_tree 
        WHERE productID = pt.productID AND categoryID NOT IN (1040,1050,1168)
      )
    GROUP BY
      pt.productID;
    

0 comments:

Post a Comment