Tuesday, 13 March 2012

Using PARTITION and RANK in Sql



The RANK and PARTITION features in 2005 are simply amazing.  They make so many "classic" SQL problems very easy to solve.  For example, consider the following table:

create table Batting
 (Player varchar(10), Year int, Team varchar(10), HomeRuns int, primary key(Player,Year))

insert into Batting
select 'A',2001,'Red Sox',13 union all
select 'A',2002,'Red Sox',23 union all
select 'A',2003,'Red Sox',19 union all
select 'A',2004,'Red Sox',14 union all
select 'A',2005,'Red Sox',11 union all
select 'B',2001,'Yankees',42 union all
select 'B',2002,'Yankees',39 union all
select 'B',2003,'Yankees',42 union all
select 'B',2004,'Yankees',29 union all
select 'C',2002,'Yankees',2 union all
select 'C',2003,'Yankees',3 union all
select 'C',2004,'Red Sox',6 union all
select 'C',2005,'Red Sox',9


Suppose we would like to find out which year each player hit their most home runs, and which team they played for.  As a tie-breaker, return the latest year.

In SQL 2000, we had to do it in two steps:  First, get the MAX(HomeRuns) per player, and then join back to the Batting table to return the rest of the data:

select b.*
from
batting b
inner join
(    select player, Max(HomeRuns) as MaxHR
    from Batting
    group by player
) m
on b.Player = m.player and b.HomeRuns = m.MaxHR

Player     Year        Team       HomeRuns  
---------- ----------- ---------- -----------
A          2002        Red Sox    23
B          2001        Yankees    42
B          2003        Yankees    42
C          2005        Red Sox    9

(4 row(s) affected)


Note that for player 'B', we get two rows back since he has two years that tie for the most home runs (2001 and 2003).  How do we return just the latest year?  Guess what -- it requires *another* pass through the table:

select b.*
from
batting b
inner join
(    select player, Max(HomeRuns) as MaxHR
    from Batting
    group by player
) m
    on b.Player = m.player and b.HomeRuns = m.MaxHR
inner join
(  select player, homeRuns, Max(Year) as MaxYear
   from Batting
   group by Player, HomeRuns
) y
   on m.player= y.player and m.maxHR = y.HomeRuns and b.Year = y.MaxYear

Player     Year        Team       HomeRuns  
---------- ----------- ---------- -----------
C          2005        Red Sox    9
A          2002        Red Sox    23
B          2003        Yankees    42

(3 row(s) affected)

An alternate way to do this is to calculate the "ranking" of each home run for each player, using a correlated subquery:

select b.*,
  (select count(*) from batting b2 where b.player = b2.player and b2.HomeRuns >= b.HomeRuns) as Rank
from batting b

Player     Year        Team       HomeRuns    Rank      
---------- ----------- ---------- ----------- -----------
A          2001        Red Sox    13          4
A          2002        Red Sox    23          1
A          2003        Red Sox    19          2
A          2004        Red Sox    14          3
A          2005        Red Sox    11          5
B          2001        Yankees    42          2
B          2002        Yankees    39          3
B          2003        Yankees    42          2
B          2004        Yankees    29          4
C          2002        Yankees    2           4
C          2003        Yankees    3           3
C          2004        Red Sox    6           2
C          2005        Red Sox    9           1

However, notice that we still have not handled ties! (notice that Player "B" has no #1 ranking, just two #2 rankings!) To do that, we must make things a little more complicated:

select b.*,
  (select count(*) from batting b2 where b.player = b2.player and (b2.HomeRuns > b.HomeRuns or (b2.HomeRuns = b.HomeRuns and b2.Year >= b.Year))) as Rank
from batting b

Player     Year        Team       HomeRuns    Rank      
---------- ----------- ---------- ----------- -----------
A          2001        Red Sox    13          4
A          2002        Red Sox    23          1
A          2003        Red Sox    19          2
A          2004        Red Sox    14          3
A          2005        Red Sox    11          5
B          2001        Yankees    42          2
B          2002        Yankees    39          3
B          2003        Yankees    42          1
B          2004        Yankees    29          4
C          2002        Yankees    2           4
C          2003        Yankees    3           3
C          2004        Red Sox    6           2
C          2005        Red Sox    9           1

(13 row(s) affected)

And, with that, we can use our "ranking" formula to return only the #1 rankings to get our results by moving the subquery to the WHERE clause:

select b.*
from batting b
where (select count(*) from batting b2 where b.player = b2.player and (b2.HomeRuns > b.HomeRuns or (b2.HomeRuns = b.HomeRuns and b2.Year >= b.Year))) =1

Player     Year        Team       HomeRuns  
---------- ----------- ---------- -----------
C          2005        Red Sox    9
A          2002        Red Sox    23
B          2003        Yankees    42

(3 row(s) affected)

And that's all you need to do this in SQL 2000 ! Easy, right?   Hmmm ... well, maybe not!

The new ranking functions in SQL Server 2005 let us quickly calculate each row's ranking within a set based on a partition and an ordering. Think of the partition as almost like a GROUP BY, where you the use of the word "per" in your specifications often indicate which columns you are grouping or partitioning on.  Since we want to return the top Home Run year per Player, we partition on Player.  This means that the rows for each Player will get a value of 1-x for their ranking.  The ranking itself is based on HomeRuns, so we want to return the rank for each partition ORDER'ed BY HomeRuns from high-to-low (DESC).

Using the new features, in SQL 2005 we can use the following SELECT to return each year's Home Run ranking for each Player:

select Player, Year, HomeRuns, Rank() over (Partition BY Player order by HomeRuns DESC) as Rank
from
Batting

Now, like before, we have to deal with ties.  But now, it is much easier -- we just add a secondary sort.  Since we want to the latest year to rank higher, we just add "Year DESC" to our ORDER BY:

select Player, Year, HomeRuns,Rank() over (Partition BY Player order by HomeRuns DESC, Year DESC) as Rank
from
Batting

Much easier than the old way!  However, we are still not done.  The above SQL returns all years with the Ranking beside it.  We are only interested in returning all rows with a Rank of 1.  However, if we try:

select Player, Year, HomeRuns
from
Batting
where
 Rank() over (Partition BY Player order by HomeRuns DESC, Year DESC) = 1

We get an error:

Windowed functions can only appear in the SELECT or ORDER BY clauses.

We can easily fix this by wrapping the SELECT in a derived table:

select * from
(
select Player, Year, HomeRuns,  Rank() over (Partition BY Player order by HomeRuns DESC, Year DESC) as Rank
from
Batting
) tmp
where Rank = 1

And there you go ... It's easy to read and edit, and only 1 reference to the table is needed!  Preliminary testing also indicates that this is much more efficient than doing things "SQL 2000-style".

Other challenging SQL statements to write in the pre-2005 world are also easily solved in SQL 2005.  For example, the classic SQL problem of returning the "top x per group" based on a sort, or returning a "row number" or sequence for an entire result set or groups within that set, are now very, very easy using the RANK() feature.

No comments:

Post a Comment