|
I want to ask Im sure a really simple question (at laest for most of you). Here is the situation:
I have 3 DB tables: products, tags and products_tags. products_tags has 2 columns product_id and tag_id (for many-to-many relationship). So the question is: how can I select products which have tag1 AND tag2 and tag3 and .... tagN? I need a best way in terms of performance. The only way I can think of now is to create a query with self joins. The query joins number would equal tags number - 1. Is there a better way? Maybe I miss here something fundamental (think). My project products need a way to be filtered by various tags... Thanks in advance. -- Aurimas L. |
|
If this is your only use case you could just have one product table
with an extra "tags" column, here you could list the tags space separated then just do a full text search on that column, this would only take one query. Thanks On Wednesday, June 30, 2010, Aurimas Likas <[hidden email]> wrote: > I want to ask Im sure a really simple question (at laest for most of you). Here is the situation: > > I have 3 DB tables: products, tags and products_tags. products_tags has 2 columns product_id and tag_id (for many-to-many relationship). So the question is: how can I select products which have tag1 AND tag2 and tag3 and .... tagN? I need a best way in terms of performance. The only way I can think of now is to create a query with self joins. The query joins number would equal tags number - 1. Is there a better way? Maybe I miss here something fundamental (think). My project products need a way to be filtered by various tags... > > Thanks in advance. > -- > Aurimas L. > |
|
In reply to this post by Aurimas Likas
On Jun 30, 2010, at 3:28 PM, Aurimas Likas wrote:
> I want to ask Im sure a really simple question (at laest for most of you). Here is the situation: > > I have 3 DB tables: products, tags and products_tags. products_tags has 2 columns product_id and tag_id (for many-to-many relationship). So the question is: how can I select products which have tag1 AND tag2 and tag3 and .... tagN? I need a best way in terms of performance. The only way I can think of now is to create a query with self joins. The query joins number would equal tags number - 1. Is there a better way? Maybe I miss here something fundamental (think). My project products need a way to be filtered by various tags... > > Thanks in advance. > > -- > Aurimas L. I had to do something similar a few years ago. Here is my code, except articles = products, and keywords = tags: $pKeywords = array("any", "keywords", "array"); //keyword array sent into function $kwSelect = $articles->select (); $kwSelect->setIntegrityCheck ( false ); $kwSelect->from ( array ('ak' => 'articles_keywords' ), array ('ak.article_id' ) ); $kwSelect->joinInner ( array ('k' => 'keywords' ), 'k.id = ak.keyword_id', array () ); foreach ( $pKeywords as $kw ) { $kwSelect->orWhere ( 'k.slug = ?', $kw ); } $kwSelect->group ( array ('ak.article_id' ) ); if($this->_kwIntersection == true) $kwSelect->having ( 'COUNT(ak.article_id)=' . count ( $pKeywords ) ); $select->where ( 'a.id IN (?)', new Zend_Db_Expr ( $kwSelect ) ); Not sure if this is the best way to handle your scenario, but it seemed to work for me. In this example, if _kwIntersection is true, then it's an OR operation, if it's false, then it's an AND operation. Also, there is another field in the database called "slug". The "slug" is the URL-safe version of the keyword, so for example, "Mechanical & Aerospace Engineering" slug would be "mechanical-aerospace-engineering", but your script might not need a slug. Hope this helps. -Henry |
|
In reply to this post by Aurimas Likas
On Wed, Jun 30, 2010 at 3:28 PM, Aurimas Likas <[hidden email]> wrote:
> I want to ask Im sure a really simple question (at laest for most of you). > Here is the situation: > > I have 3 DB tables: products, tags and products_tags. products_tags has 2 > columns product_id and tag_id (for many-to-many relationship). So the > question is: how can I select products which have tag1 AND tag2 and tag3 and > .... tagN? I need a best way in terms of performance. The only way I can > think of now is to create a query with self joins. The query joins number > would equal tags number - 1. Is there a better way? Maybe I miss here > something fundamental (think). My project products need a way to be filtered > by various tags... > > Thanks in advance. > > -- > Aurimas L. > Does this work? <?php $tags = array('foo', 'bar', 'widget'); $inner = $db->select() ->from('product_tags', 'product_id') ->join('tags', '`product_tags`.`tag_id` = `tags`.`tag_id`') ->where('`tags`.`tag` IN (?)', $tags) ->where('`products`.`product_id` = `product_tags`.`product_id`') ->group('`products`.`product_id`') ->having('COUNT(*) = ?', count($tags)); $select = $db->select() ->from('products', array('product_id', 'product_name', 'price')) ->where('EXISTS (?)', $inner); ?> It should generate a statement that looks something like this: SELECT `products`.`product_id`, `products`.`product_name`, `products`.`price` FROM `products` WHERE EXISTS ( SELECT `product_id` FROM `product_tags` INNER JOIN `tags` ON `tags`.`tag_id` = `product_tags`.`tag_id` WHERE `tags`.`tag` IN ( 'foo', 'bar', 'widget' ) AND `product_tags`.`product_id` = `products`.`product_id` GROUP BY `products`.`product_id` HAVING COUNT(*) = 3 ) It uses an aggregate sub-query to find rows for product IDs that match all the tags, and then applies the result set to the products table. Andrew |
|
In reply to this post by Aurimas Likas
On Jun 30, 2010, at 12:28 PM, Aurimas Likas wrote: > I have 3 DB tables: products, tags and products_tags. products_tags > has 2 columns product_id and tag_id (for many-to-many relationship). > So the question is: how can I select products which have tag1 AND > tag2 and tag3 and .... tagN? I need a best way in terms of > performance. The only way I can think of now is to create a query > with self joins. The query joins number would equal tags number - 1. Using N-1 self-joins is the best solution for performance, especially if you use MySQL. It appears to require a lot of joins, and common wisdom is that you should avoid joins. SELECT p.* FROM products p JOIN products_tags pt1 ON p.product_id = pt1.product_id JOIN tags t1 ON pt1.tag_id = t1.tag_id AND t1.tag = 'medical' JOIN products_tags pt2 ON p.product_id = pt2.product_id JOIN tags t2 ON pt1.tag_id = t2.tag_id AND t2.tag = 'performance' JOIN products_tags pt3 ON p.product_id = pt3.product_id JOIN tags t3 ON pt3.tag_id = t3.tag_id AND t3.tag = 'reporting'; If you use natural keys for the tags table instead of a pseudokey, then your products_tags table contains the value you need, and you an eliminate half of those joins. For example: CREATE TABLE tags ( tag VARCHAR(20) PRIMARY KEY -- natural key, not pseudokey ); CREATE TABLE products_tags ( product_id INT NOT NULL, tag VARCHAR(20) NOT NULL, -- natural key, not pseudokey PRIMARY KEY (product_id, tag), FOREIGN KEY (product_id) REFERENCES products(product_id), FOREIGN KEY (tag) REFERENCES tags(tag) ); SELECT p.* FROM products p JOIN products_tags pt1 ON p.product_id = pt1.product_id AND pt1.tag = 'medical' JOIN products_tags pt2 ON p.product_id = pt2.product_id AND pt2.tag = 'performance' JOIN products_tags pt3 ON p.product_id = pt3.product_id AND pt3.tag = 'reporting'; The primary key index in products_tags gives this good performance. Your EXPLAIN report should say "Using index" which means it can satisfy the query simply by reading the index, without reading the data for the products_tags table at all. If your index is cached in database memory, this runs very fast. A different solution that some people prefer is to use GROUP BY and include groups that have as many rows as the number of tags: SELECT p.* FROM products p JOIN products_tags pt ON p.product_id = pt.product_id WHERE pt.tag IN ('medical', 'performance', 'reporting') GROUP BY p.product_id -- EXPLAIN reports "Using temporary; Using filesort" HAVING COUNT(*) = 3; Common wisdom says that one join is always better than three or six, but this actually performs much *worse* (if you use MySQL), because virtually any query using GROUP BY uses a temporary table, which kills performance because disk I/O is expensive, and temporary tables usually write to disk. This might not be such a bad solution if you use some other brand of database that optimizes GROUP BY better than MySQL. So you should analyze queries with EXPLAIN (or equivalent) and profile carefully with sample data of realistic size. Regards, Bill Karwin |
|
On Jun 30, 2010, at 2:12 PM, Aurimas Likas wrote: > So, Bill, if my only choice is MySQL and if I dont need to join tags > table itself (I provide tag IDs to the function selecting products > and I dont need any tags data (their IDs are used only for > filtering)), I think I can leave pseudo keys (they should be faster > than natural ones) and get pretty good performance?? Tags number > should be relatively small (up to 10, usually up to 3-5). Yes, if you know the integer tag_id values, you can continue to use pseudokeys and still eliminate the extra three joins. There's a performance advantage to using integers instead of strings, because integer comparisons are a bit faster and also the total space required for your index is smaller, so it takes less memory to keep it in cache. But these advantages are actually pretty small compared to the advantage of avoiding the cost of using a temporary table. So you have already achieved a big win by using joins instead of group by. Any extra benefit from using integer pseudokeys is icing on the cake. :-) Regards, Bill Karwin |
|
In reply to this post by BillKarwin
On Wed, Jun 30, 2010 at 4:46 PM, Bill Karwin <[hidden email]> wrote:
> Common wisdom says that one join is always better than three or six, but > this actually performs much *worse* (if you use MySQL), because virtually > any query using GROUP BY uses a temporary table, which kills performance > because disk I/O is expensive, and temporary tables usually write to disk. > > This might not be such a bad solution if you use some other brand of > database that optimizes GROUP BY better than MySQL. So you should analyze > queries with EXPLAIN (or equivalent) and profile carefully with sample data > of realistic size. > > Regards, > Bill Karwin > Interesting. I'll defer to you on MySQL since you know it much better than I (and I don't have anything running on MySQL that I could test quickly right now anyway). Still, the idea left me somewhat curious, so I ran a test on one of our SQL Server databases to see how some different strategies compared. In the interest of disclosure, the actual tables I used (renamed in the examples below) are not very large as databases go. The "products" table has almost 54000 rows, the "tags" table has almost 350, and the "product_tags" table has about 132000 rows. (The nature of the real tables that I tested is that each row in the "products" table has at least one row in the "product_tags" table and only six distinct values from "tags" have been used so far in "product_tags".) I compared three different query strategies (shown below). Of these, METHOD 1 had the lowest overall cost in resources, followed closely by METHOD 3 (although the total elapsed time for METHOD3 was slightly faster than that for METHOD 1). I did notice that METHOD 3 grew more expensive the more tags I included in the search since each tag requires additional self-joins. Given that the code for METHOD 1 was the least complex (and therefore the easiest to define using Zend_Db_Select, just to keep this somewhat on topic) and had the lowest cost and second lowest overall elapsed time, I would go that route on SQL Server. As I understand, you're saying this is not necessarily the case for MySQL. -- METHOD 1 SELECT P.product_id, P.product_name, P.price FROM products AS P INNER JOIN product_tags AS PT ON P.product_id = PT.product_id INNER JOIN tags AS T ON T.tag_id = PT.tag_id WHERE T.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' ) GROUP BY P.product_id, P.product_name, P.price HAVING COUNT(DISTINCT T.tag_id) = 5 -- METHOD 2 SELECT products.product_id, products.product_name, products.price FROM products WHERE EXISTS ( SELECT product_tags.product_id FROM product_tags INNER JOIN tags ON tags.tag_id = product_tags.tag_id WHERE tags.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' ) AND product_tags.product_id = products.product_id GROUP BY product_tags.product_id HAVING COUNT(*) = 5 ) -- METHOD 3 SELECT P.product_id, P.product_name, P.price FROM products AS P INNER JOIN product_tags AS PT1 ON PT1.product_id = P.product_id INNER JOIN tags AS T1 ON T1.tag_id = PT1.tag_id AND T1.tag = 'foo' INNER JOIN product_tags AS PT2 ON PT2.product_id = P.product_id INNER JOIN tags AS T2 ON T2.tag_id = PT2.tag_id AND T2.tag = 'bar' INNER JOIN product_tags AS PT3 ON PT3.product_id = P.product_id INNER JOIN tags AS T3 ON T3.tag_id = PT3.tag_id AND T3.tag = 'baz' INNER JOIN product_tags AS PT4 ON PT4.product_id = P.product_id INNER JOIN tags AS T4 ON T4.tag_id = PT4.tag_id AND T4.tag = 'floob' INNER JOIN product_tags AS PT5 ON PT5.product_id = P.product_id INNER JOIN tags AS T5 ON T5.tag_id = PT5.tag_id AND T5.tag = 'widget' Results: METHOD 1: Query cost as calculated by Query Analyzer (relative to batch) 3% ---------------- Table 'products'. Scan count 1, logical reads 294, physical reads 0, read-ahead reads 0. Table 'product_tags'. Scan count 1, logical reads 344, physical reads 0, read-ahead reads 0. Table 'tags'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0. SQL Server Execution Times: CPU time = 156 ms, elapsed time = 510 ms. (7220 row(s) affected) METHOD 2: Query cost as calculated by Query Analyzer (relative to batch) 89% ---------------- Table 'tags'. Scan count 131715, logical reads 263430, physical reads 0, read-ahead reads 0. Table 'product_tags'. Scan count 53968, logical reads 108421, physical reads 0, read-ahead reads 0. Table 'products'. Scan count 1, logical reads 294, physical reads 0, read-ahead reads 0. SQL Server Execution Times: CPU time = 953 ms, elapsed time = 1021 ms. (7220 row(s) affected) METHOD 3: Query cost as calculated by Query Analyzer (relative to batch) 9% ---------------- Table 'tags'. Scan count 5, logical reads 25, physical reads 0, read-ahead reads 0. Table 'product_tags'. Scan count 5, logical reads 1720, physical reads 0, read-ahead reads 0. Table 'products'. Scan count 1, logical reads 294, physical reads 0, read-ahead reads 0. SQL Server Execution Times: CPU time = 266 ms, elapsed time = 488 ms. (7220 row(s) affected) Andrew |
|
Hi Andrew,
Just jumping in, here is another approach: SELECT P.product_id, P.product_name, P.price FROM products AS P INNER JOIN product_tags AS PT ON P.product_id = PT.product_id INNER JOIN tags AS T ON T.tag_id = PT.tag_id AND T.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' ) GROUP BY P.product_id HAVING COUNT(DISTINCT T.tag_id) = 5 Theoretically, this should result in a smaller temporary table, since moving the IN check to the join condition will limit the records stored in a temporary table. Also, assuming that product_id is a primary key, you should be save in grouping on just P.product_id. Related to this, I recently had a curious issue. While rewriting a subquery into a join, performance dropped dramatically. My hypothesis was that the amount of columns MySQL had to store in the temporary table was so large, that creating a reduced set of columns with a subquery was actually beneficial. Vincent de Lau [hidden email] > -----Oorspronkelijk bericht----- > Van: Andrew Ballard [mailto:[hidden email]] > Verzonden: donderdag 1 juli 2010 17:36 > Aan: Bill Karwin > CC: [hidden email] > Onderwerp: Re: [fw-general] Question about SQL (not ZF related, sorry) > > On Wed, Jun 30, 2010 at 4:46 PM, Bill Karwin <[hidden email]> wrote: > > Common wisdom says that one join is always better than three or six, > but > > this actually performs much *worse* (if you use MySQL), because > virtually > > any query using GROUP BY uses a temporary table, which kills > performance > > because disk I/O is expensive, and temporary tables usually write to > disk. > > > > This might not be such a bad solution if you use some other brand of > > database that optimizes GROUP BY better than MySQL. So you should > analyze > > queries with EXPLAIN (or equivalent) and profile carefully with > sample data > > of realistic size. > > > > Regards, > > Bill Karwin > > > > Interesting. I'll defer to you on MySQL since you know it much better > than I (and I don't have anything running on MySQL that I could test > quickly right now anyway). Still, the idea left me somewhat curious, > so I ran a test on one of our SQL Server databases to see how some > different strategies compared. > > In the interest of disclosure, the actual tables I used (renamed in > the examples below) are not very large as databases go. The "products" > table has almost 54000 rows, the "tags" table has almost 350, and the > "product_tags" table has about 132000 rows. (The nature of the real > tables that I tested is that each row in the "products" table has at > least one row in the "product_tags" table and only six distinct values > from "tags" have been used so far in "product_tags".) > > I compared three different query strategies (shown below). Of these, > METHOD 1 had the lowest overall cost in resources, followed closely by > METHOD 3 (although the total elapsed time for METHOD3 was slightly > faster than that for METHOD 1). I did notice that METHOD 3 grew more > expensive the more tags I included in the search since each tag > requires additional self-joins. > > Given that the code for METHOD 1 was the least complex (and therefore > the easiest to define using Zend_Db_Select, just to keep this somewhat > on topic) and had the lowest cost and second lowest overall elapsed > time, I would go that route on SQL Server. As I understand, you're > saying this is not necessarily the case for MySQL. > > -- METHOD 1 > SELECT P.product_id, P.product_name, P.price > FROM products AS P > INNER JOIN product_tags AS PT > ON P.product_id = PT.product_id > INNER JOIN tags AS T > ON T.tag_id = PT.tag_id > WHERE T.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' ) > GROUP BY P.product_id, P.product_name, P.price > HAVING COUNT(DISTINCT T.tag_id) = 5 > > > -- METHOD 2 > SELECT products.product_id, products.product_name, products.price > FROM products > WHERE EXISTS ( > SELECT product_tags.product_id > FROM product_tags > INNER JOIN tags > ON tags.tag_id = product_tags.tag_id > WHERE tags.tag IN ( 'foo', 'bar', 'baz', > 'floob', 'widget' ) > AND product_tags.product_id = > products.product_id > GROUP BY product_tags.product_id > HAVING COUNT(*) = 5 > ) > > > -- METHOD 3 > SELECT P.product_id, P.product_name, P.price > FROM products AS P > INNER JOIN product_tags AS PT1 > ON PT1.product_id = P.product_id > INNER JOIN tags AS T1 > ON T1.tag_id = PT1.tag_id > AND T1.tag = 'foo' > INNER JOIN product_tags AS PT2 > ON PT2.product_id = P.product_id > INNER JOIN tags AS T2 > ON T2.tag_id = PT2.tag_id > AND T2.tag = 'bar' > INNER JOIN product_tags AS PT3 > ON PT3.product_id = P.product_id > INNER JOIN tags AS T3 > ON T3.tag_id = PT3.tag_id > AND T3.tag = 'baz' > INNER JOIN product_tags AS PT4 > ON PT4.product_id = P.product_id > INNER JOIN tags AS T4 > ON T4.tag_id = PT4.tag_id > AND T4.tag = 'floob' > INNER JOIN product_tags AS PT5 > ON PT5.product_id = P.product_id > INNER JOIN tags AS T5 > ON T5.tag_id = PT5.tag_id > AND T5.tag = 'widget' > > > > Results: > METHOD 1: Query cost as calculated by Query Analyzer (relative to > batch) 3% > ---------------- > Table 'products'. Scan count 1, logical reads 294, physical reads 0, > read-ahead reads 0. > Table 'product_tags'. Scan count 1, logical reads 344, physical reads > 0, read-ahead reads 0. > Table 'tags'. Scan count 1, logical reads 5, physical reads 0, > read-ahead reads 0. > SQL Server Execution Times: > CPU time = 156 ms, elapsed time = 510 ms. > > (7220 row(s) affected) > > METHOD 2: Query cost as calculated by Query Analyzer (relative to > batch) 89% > ---------------- > Table 'tags'. Scan count 131715, logical reads 263430, physical reads > 0, read-ahead reads 0. > Table 'product_tags'. Scan count 53968, logical reads 108421, physical > reads 0, read-ahead reads 0. > Table 'products'. Scan count 1, logical reads 294, physical reads 0, > read-ahead reads 0. > SQL Server Execution Times: > CPU time = 953 ms, elapsed time = 1021 ms. > > (7220 row(s) affected) > > METHOD 3: Query cost as calculated by Query Analyzer (relative to > batch) 9% > ---------------- > Table 'tags'. Scan count 5, logical reads 25, physical reads 0, > read-ahead reads 0. > Table 'product_tags'. Scan count 5, logical reads 1720, physical reads > 0, read-ahead reads 0. > Table 'products'. Scan count 1, logical reads 294, physical reads 0, > read-ahead reads 0. > SQL Server Execution Times: > CPU time = 266 ms, elapsed time = 488 ms. > > (7220 row(s) affected) > > > Andrew |
|
On Thu, Jul 1, 2010 at 11:55 AM, Vincent de Lau <[hidden email]> wrote:
> Hi Andrew, > > Just jumping in, here is another approach: > > SELECT P.product_id, P.product_name, P.price > FROM > products AS P > INNER JOIN product_tags AS PT > ON P.product_id = PT.product_id > INNER JOIN tags AS T > ON T.tag_id = PT.tag_id AND T.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' ) > GROUP BY P.product_id > HAVING COUNT(DISTINCT T.tag_id) = 5 > > Theoretically, this should result in a smaller temporary table, since moving the IN check to the join condition will limit the records stored in a temporary table. Also, assuming that product_id is a primary key, you should be save in grouping on just P.product_id. > I don't see any noticeable difference between the two in SQL Server. I'm pretty sure that the optimizer knows that they are the same thing and uses the same plan for both. > Related to this, I recently had a curious issue. While rewriting a subquery into a join, performance dropped dramatically. My hypothesis was that the amount of columns MySQL had to store in the temporary table was so large, that creating a reduced set of columns with a subquery was actually beneficial. > I suppose it is possible. When faced with something like this, I frequently write a few variations to a query and compare them against approximately real data to see which approach performs best in a given situation. Often the results are what I expected, but many times they are not. That's why so many threads on a DBA board I follow include answers of "it depends." Andrew |
| Powered by Nabble | Edit this page |
