marko devcic

  • github:
    deva666
  • email:
    madevcic {at} gmail.com

Group by in LINQ to SQL

Posted on 3 August 2014

Up to now I haven't really bothered with how EF is translating my LINQ to SQL queries to raw SQL, because I really haven't had an issue with them. But recently I had one method that returns the data from the database execute for almost 2 minutes.

So here's the query:

var result = (from u in context.Users
              join a in context.Applications on u.UserID equals a.UserID
              join w in context.Windows on a.ApplicationID equals w.ApplicationID
              join l in context.Logs on w.WindowID equals l.WindowID
              where u.UserID == userID
              && l.DateCreated >= dateFrom
              && l.DateCreated <= dateTo
              group l by new { year = l.DateCreated.Year,
              month = l.DateCreated.Month,
              day = l.DateCreated.Day, name = a.Name } into g
              select g).ToList();

On the first look, there's nothing special about it, join 4 tables, filter the data and group by days. It's worth noting that Logs table had around 40k rows, other tables less than 10 each.

And here's the SQL query that EF produced :

SELECT 
    [Project2].[C5] AS [C1], 
    [Project2].[C1] AS [C2], 
    [Project2].[C2] AS [C3], 
    [Project2].[C3] AS [C4], 
    [Project2].[C4] AS [C5], 
    [Project2].[Name] AS [Name], 
    [Project2].[C6] AS [C6], 
    [Project2].[LogID] AS [LogID], 
    [Project2].[WindowID] AS [WindowID], 
    [Project2].[DateCreated] AS [DateCreated], 
    [Project2].[Keystrokes] AS [Keystrokes], 
    [Project2].[KeystrokesRaw] AS [KeystrokesRaw], 
    [Project2].[DateEnded] AS [DateEnded], 
    [Project2].[UsageID] AS [UsageID]
    FROM ( SELECT 
        [Distinct1].[Name] AS [Name], 
        [Distinct1].[C1] AS [C1], 
        [Distinct1].[C2] AS [C2], 
        [Distinct1].[C3] AS [C3], 
        [Distinct1].[C4] AS [C4], 
        1 AS [C5], 
        [Join4].[LogID] AS [LogID], 
        [Join4].[WindowID1] AS [WindowID], 
        [Join4].[DateCreated] AS [DateCreated], 
        [Join4].[Keystrokes] AS [Keystrokes], 
        [Join4].[KeystrokesRaw] AS [KeystrokesRaw], 
        [Join4].[DateEnded] AS [DateEnded], 
        [Join4].[UsageID] AS [UsageID], 
        CASE WHEN ([Join4].[UserID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C6]
        FROM   (SELECT DISTINCT 
            [Extent1].[Name] AS [Name], 
            1 AS [C1], 
            DATEPART (year, [Extent3].[DateCreated]) AS [C2], 
            DATEPART (month, [Extent3].[DateCreated]) AS [C3], 
            DATEPART (day, [Extent3].[DateCreated]) AS [C4]
            FROM   [Aplications] AS [Extent1]
            INNER JOIN [Windows] AS [Extent2] ON [Extent1].[ApplicationID] = [Extent2].[ApplicationID]
            INNER JOIN [Logs] AS [Extent3] ON [Extent2].[WindowID] = [Extent3].[WindowID]
            WHERE ([Extent1].[UserID] = @p__linq__0) AND ([Extent3].[DateCreated] >= @p__linq__1) AND ([Extent3].[DateCreated] <= @p__linq__2) ) AS [Distinct1]
        LEFT OUTER JOIN  (SELECT [Extent4].[ApplicationID] AS [ApplicationID1], [Extent4].[Name] AS [Name], [Extent4].[FileName] AS [FileName], [Extent4].[Version] AS [Version], [Extent4].[Description] AS [Description], [Extent4].[Company] AS [Company], [Extent4].[UserID] AS [UserID], [Extent4].[WinName] AS [WinName], [Extent5].[WindowID] AS [WindowID2], [Extent5].[ApplicationID] AS [ApplicationID2], [Extent5].[Title] AS [Title], [Extent6].[LogID] AS [LogID], [Extent6].[WindowID] AS [WindowID1], [Extent6].[DateCreated] AS [DateCreated], [Extent6].[Keystrokes] AS [Keystrokes], [Extent6].[KeystrokesRaw] AS [KeystrokesRaw], [Extent6].[DateEnded] AS [DateEnded], [Extent6].[UsageID] AS [UsageID]
            FROM   [Aplications] AS [Extent4]
            INNER JOIN [Windows] AS [Extent5] ON [Extent4].[ApplicationID] = [Extent5].[ApplicationID]
            INNER JOIN [Logs] AS [Extent6] ON [Extent5].[WindowID] = [Extent6].[WindowID] ) AS [Join4] ON ([Distinct1].[Name] = [Join4].[Name]) AND ([Join4].[UserID] = @p__linq__0) AND ([Join4].[DateCreated] >= @p__linq__1) AND ([Join4].[DateCreated] <= @p__linq__2) AND (([Distinct1].[C2] = (DATEPART (year, [Join4].[DateCreated]))) OR (([Distinct1].[C2] IS NULL) AND (DATEPART (year, [Join4].[DateCreated]) IS NULL))) AND (([Distinct1].[C3] = (DATEPART (month, [Join4].[DateCreated]))) OR (([Distinct1].[C3] IS NULL) AND (DATEPART (month, [Join4].[DateCreated]) IS NULL))) AND (([Distinct1].[C4] = (DATEPART (day, [Join4].[DateCreated]))) OR (([Distinct1].[C4] IS NULL) AND (DATEPART (day, [Join4].[DateCreated]) IS NULL)))
    )  AS [Project2]
    ORDER BY [Project2].[C1] ASC, [Project2].[C2] ASC, [Project2].[C3] ASC, [Project2].[C4] ASC, [Project2].[Name] ASC, [Project2].[C6] ASC

Whoa, no wonder it took almost 2 minutes to execute this , select distinct is not very efficient and when I'm writing raw SQL, I almost never use it.

So I fired up the LinqPad and tried to get the results I needed without the distincts:

select a.Name, l.DateCreated, l.DateEnded, filter.ayear, filter.amonth, filter.aday 
from Uzers u
join Aplications a on u.UserID = a.UserID
join Windows w on a.ApplicationID = w.ApplicationID
join Logs l on w.WindowID = l.WindowID
join
(select a.Name as AppName
        , DATEPART (year, l.DateCreated) as AYear
        , DATEPART (month, l.DateCreated) as AMonth
        , DATEPART (day, l.DateCreated) as ADay
from Uzers u
join Aplications a on u.UserID = a.UserID
join Windows w on a.ApplicationID = w.ApplicationID
join Logs l on w.WindowID = l.WindowID
where u.UserID = 1
and l.DateCreated > '01-01-2014'
and l.DateCreated < '01-01-2015'
group by a.Name
        , DATEPART (year, l.DateCreated)
        , DATEPART (month, l.DateCreated)
        , DATEPART (day, l.DateCreated)) as filter on filter.AppName = a.name and filter.Ayear = DATEPART (year, l.DateCreated) and filter.amonth =DATEPART (month, l.DateCreated) and filter.aday = DATEPART (day, l.DateCreated

        )
where u.UserID = 1
order by filter.ayear, filter.amonth, filter.aday

This took ~40secs, that is almost 3 times faster, but still not good enough.

I guess it could be optimized even further, but I decided to drop the grouping in the SQL and use it to only do initial filtering and group in code after.

var result = (from u in context.Users
       join a in context.Applications on u.UserID equals a.UserID
       join w in context.Windows on a.ApplicationID equals w.ApplicationID
       join l in context.Logs on w.WindowID equals l.WindowID
       where u.UserID ==  userID
       && l.DateCreated >= dateFrom
       && l.DateCreated <= dateTo
       orderby l.DateCreated
       select l).Include(l => l.Window.Application).ToList();

var grouped = (from l in result
      group l by new { year = l.DateCreated.Year, month = l.DateCreated.Month, day = l.DateCreated.Day, name = l.Window.Application.Name } into g
      select g).ToList();

And now this method executes for ~5 seconds. Not bad, from 2 minutes down to 5 seconds is good enough reason to drop the group by in the queries.

Also, this made me pay more attention to what the EF is doing, so now I have the following code in my Context constructor.

#if DEBUG
            Database.Log = FlushSql;
#endif
        private void FlushSql(string s)
        {
            Console.WriteLine(s);
        }

From EF 6.0 onwards the Database.Log property accepts an Action delegate that send all SQL queries generated to it.