August 22, 2007

LINQ to SQL: Many-To-Many Tables and Joins

Still having fun with LINQ to SQL over here. One quick note: I've found that the best way to test LINQ queries is to have a unit test class ready to go. Makes things much easier.
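
To give a feel for what I mean, here is a minimal sketch of the kind of test class I keep around (the names are just illustrative, and LinqTestsDataContext is the data context you'll see in the queries below):

using System.Linq;
using NUnit.Framework;

[TestFixture]
public class LinqQueryTests
{
    [Test]
    public void CustomersWithProductOne_ReturnsRows()
    {
        // Fire the query against the database and make sure rows come back.
        LinqTestsDataContext cx = new LinqTestsDataContext();
        var customers = (from c in cx.Customers
                         from cp in c.CustomerProducts
                         where cp.ProductID == 1
                         select c).ToList();

        Assert.Greater(customers.Count, 0);
    }
}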

So what I've been playing with lately is using LINQ to SQL for querying against many-to-many tables. For an illustration, I'll use a table structure like this:

[Table diagram: a Customer table, a Product table, and a CustomerProduct junction table linking the two]

Once imported into a LINQ to SQL (.dbml) file in Visual Studio and viewed through a class diagram, you get this:

[Class diagram: Customer, Product, and CustomerProduct entity classes with their associations]

In my Customer table I have 5 customers, and in my Product table I have 5 products (Product ID = 1 is "Computer"). The CustomerProduct table has about 20 records.

Now we can start having fun with some queries. The question I wanted to answer was: "Which customers have a particular product?". Simple enough.

I found, as with many things, there are multiple ways of getting the same answer. My first query looked like this:


int Computer = 1;
LinqTestsDataContext cx = new LinqTestsDataContext();
var customersWithProduct = from c in cx.Customers
                           from cp in c.CustomerProducts
                           where cp.ProductID == Computer
                           select c;


OK, background note. This code uses the LinqTestsDataContext object to query the Customer table and the CustomerProduct table to find all of the customers with a Computer (Product ID = 1). What is returned is an IQueryable<Customer> object. If I want to walk through each Customer object individually, I can call customersWithProduct.ToList() and get a List<Customer> back.
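
For example, picking up right where that query leaves off (the Name property on the designer-generated Customer class matches the column you can see in the SQL below):

// ToList() executes the query and pulls the results into memory.
List<Customer> customers = customersWithProduct.ToList();

// Now each Customer can be handled individually.
foreach (Customer c in customers)
{
    Console.WriteLine(c.Name);
}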

This lovely piece of LINQ generates the following SQL:


{SELECT [t0].[ID], [t0].[Name]
FROM [dbo].[Customer] AS [t0], [dbo].[CustomerProduct] AS [t1]
WHERE ([t1].[ProductID] = @p0) AND ([t1].[CustomerID] = [t0].[ID])}


How do I know that is the SQL that gets generated? After I run that line I can mouse over the variable (customersWithProduct) and the tooltip displays the generated SQL. I can't change it (that I know of), but at least I can look at it.
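
Another way to get at the SQL (not the one I used above, but it is built into LINQ to SQL) is the DataContext.Log property, which writes every command the context issues to a TextWriter:

// Echo all generated SQL to the console as queries execute.
cx.Log = Console.Out;

// Anything executed through cx from here on prints its SQL first.
var check = customersWithProduct.ToList();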

Anyway, that is not what I would call the best SQL I have ever seen -- and it is slow.

Next came attempt two at LINQ to SQL. I wrote this:


int Computer = 1;
LinqTestsDataContext cx = new LinqTestsDataContext();
var customersWithProduct = from c in cx.Customers
                           join cp in cx.CustomerProducts on c.ID equals cp.CustomerID
                           where cp.ProductID == Computer
                           select c;


In the previous example, I used the Customer.CustomerProducts association to filter the products. This time I am explicitly joining the Customers and the CustomerProducts tables together in LINQ. The only odd part of the query is the "equals" keyword that you have to use in the join (you cannot use == there).
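
As a side note, query syntax is really just compiler sugar; the same join can be written in method (lambda) syntax. A sketch of the equivalent, using the same cx and Computer variables:

var customersWithProduct = cx.Customers
    .Join(cx.CustomerProducts,        // table to join against
          c => c.ID,                  // outer key: Customer.ID
          cp => cp.CustomerID,        // inner key: CustomerProduct.CustomerID
          (c, cp) => new { c, cp })   // carry both rows forward
    .Where(x => x.cp.ProductID == Computer)
    .Select(x => x.c);

Either way you write it, the SQL that comes out is the same.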

The SQL generated was much better:


{SELECT [t0].[ID], [t0].[Name]
FROM [dbo].[Customer] AS [t0]
INNER JOIN [dbo].[CustomerProduct] AS [t1] ON [t0].[ID] = [t1].[CustomerID]
WHERE [t1].[ProductID] = @p0}


Look: an actual join. Trust me, this runs much faster. The first query took 3.24 seconds, the second took 0.06 seconds. I call that significant, especially considering the amount of data I am querying (not much). Add some real data (thousands or millions of records) and you could be talking about a serious slowdown.
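
If you want to reproduce the comparison, here is a rough sketch of one way to time a query with System.Diagnostics.Stopwatch, again using the same cx and Computer variables (my numbers came from repeated NUnit runs, not this exact snippet):

Stopwatch sw = Stopwatch.StartNew();

// ToList() forces the query to execute right here.
var results = (from c in cx.Customers
               join cp in cx.CustomerProducts on c.ID equals cp.CustomerID
               where cp.ProductID == Computer
               select c).ToList();

sw.Stop();
Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);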

Lots more to discover here. All a matter of time.

7 comments:

Anonymous said...

I don't think whether you use the inner join syntax or an implicit inner join makes a difference in performance.

Chris Brandsma said...

You "think" it doesn't, or do you have PROOF that it doesn't?

I don't care what your gut feeling is telling you, I want actual numbers. That is the first rule of performance testing: don't trust your gut, measure instead.

Anonymous said...

Well, how did you get your numbers? Did you run your test multiple times with the same results? Are you sure all other variables are the same?

I agree with you...measure, but make sure you do it right.

The SQL Server optimizer is pretty good. I'd be willing to bet it can figure out a simple implicit inner join.

Anonymous said...

If you check the query execution plans, you will notice they are exactly the same for either syntax.

Your difference in numbers must be from extraneous factors.

You are breaking an important rule that says you shouldn't prematurely optimize your code. It leads to more complicated designs. And in this particular case you are gaining nothing from it.

Chris Brandsma said...

As I don't remember where that code is anymore...I'll have to recreate the solution.

But, I'll take your word for it on the execution plan for right now. If you are correct, then the most likely answer is that I had bad indexes in my tables.

First, a note on my testing techniques: I was doing all of my performance tests using NUnit and ReSharper, each test was executed 10 times.

Was the optimization premature? I don't think so. I was running similar code as part of a larger application. The code was running slowly, so I ran a profiler (the Team System profiler) to find the slow part. That code was the slow part -- so I changed it. There was nothing premature about it.

Now, would I have been better off optimizing my indexes instead of that query? Possibly. The part that gets hairy is when you start working with larger data sets (mine was admittedly small). At that point, if you see a slowdown it is most often an indexing issue, and best attacked there.

Anonymous said...

May just be a caching issue also. The first time you get the data is slow, but subsequent gets are much faster.
Macrel

Anonymous said...

Informative. Liked it. Thanks.