In the second part of the article we will talk about the functions themselves, which are used to form values. The window function calculates the value from a set of data associated with the current row, that is, data from the same group, if Partition by is used. Conventional aggregate functions for grouping require the grouping of rows, thus losing the necessary unique information from the sample. Therefore, it is necessary to use two instead of a single request in order to have all the necessary data and the sum by groups. Window aggregate functions allow to achieve the same result in one query.
Let me remind you that a window is a set of lines by which a function is calculated. The OVER instruction splits the entire set of rows into separate groups — windows according to the specified condition.
Let's talk about the types of window functions. There are three groups of destination:
• Aggregate functions: SUM (), MAX (), MIN (), AVG (). COUNT (). These functions return the value obtained by arithmetic calculations;
• Ranking functions: RANK (), DENSE_RANK (), ROW_NUMBER (), NTILE (). Allow to get the sequence number of entries in the window;
• Offset functions: LAG (), LEAD (), FIRST_VALUE (), LAST_VALUE (). Return a value from another window row.
To demonstrate how the functions work, I will use a simple table:
1 2 3 4 5 6 7 |
CREATE TABLE ForWindowFunc (ID INT, GroupId INT, Amount INT) GO INSERT INTO ForWindowFunc (ID, GroupId, Amount) VALUES(1, 1, 100), (1, 1, 200), (1, 2, 150), (2, 1, 100), (2, 1, 300), (2, 2, 200), (2, 2, 50), (3, 1, 150), (3, 2, 200), (3, 2, 10); |
Aggregate functions
SUM ()
The SUM () function works just like a regular aggregate function — it sums up all the values of a given column in a data set. However, thanks to the OVER () instruction, we break the data set into windows. The summation is performed inside the windows according to the order specified in the ORDER BY clause. Let's look at a simple example - the sum of the three groups.
1 2 3 |
SELECT ID, Amount, SUM(Amount) OVER (ORDER BY id) AS SUM FROM ForWindowFunc |
ID |
Amount |
Sum |
1 |
100 |
450 |
1 |
200 |
450 |
1 |
150 |
450 |
2 |
100 |
650 |
2 |
300 |
650 |
2 |
200 |
650 |
2 |
50 |
650 |
3 |
150 |
360 |
3 |
200 |
360 |
3 |
10 |
360 |
For convenience, the windows are highlighted in different colors. All values in the window have the same amount - the sum of all Amount in the window.
Let's add another column to the selection and change the OVER instruction:
1 2 3 4 5 |
SELECT ID, GroupId, Amount, SUM(Amount) OVER (Partition BY id ORDER BY id, GroupId) AS SUM FROM ForWindowFunc |
ID |
GroupId |
Amount |
Sum |
1 |
1 |
100 |
300 |
1 |
1 |
200 |
300 |
1 |
2 |
150 |
450 |
2 |
1 |
100 |
400 |
2 |
1 |
300 |
400 |
2 |
2 |
200 |
650 |
2 |
2 |
50 |
650 |
3 |
1 |
150 |
150 |
3 |
2 |
200 |
360 |
3 |
2 |
10 |
360 |
As you can see, now each window is divided into groups thanks to the GroupId field. Each group now has its own amount.
And now, let's make a cumulative result inside each window:
1 2 3 4 5 |
SELECT ID, GroupId, Amount, SUM(Amount) OVER (Partition BY id ORDER BY id, GroupId, Amount) AS SUM FROM ForWindowFunc |
ID |
Amount |
Sum |
1 |
100 |
100 |
1 |
200 |
300 |
1 |
150 |
450 |
2 |
100 |
100 |
2 |
300 |
400 |
2 |
50 |
450 |
2 |
200 |
650 |
3 |
150 |
150 |
3 |
10 |
160 |
3 |
200 |
360 |
We no longer need the GroupId field, so we removed it from the selection. Now, for each line in the window, a total is calculated, which is the sum of the current value of the Amount and all previous ones.
AVG()
This function calculates the average value. It can be used with the sentences Partition by and Order by.
1 2 3 4 |
SELECT ID, Amount, AVG(Amount) OVER (Partition BY id ORDER BY id) AS AVG FROM ForWindowFunc |
ID |
Amount |
AVG |
1 |
100 |
150 |
1 |
200 |
150 |
1 |
150 |
150 |
2 |
100 |
162 |
2 |
300 |
162 |
2 |
200 |
162 |
2 |
50 |
162 |
3 |
150 |
120 |
3 |
200 |
120 |
3 |
10 |
120 |
Each line in the window has an average value of Amount, which is calculated by the formula: the sum of all Amount / for the number of lines.
The behavior of this function is similar to SUM ().
MIN()
From the name of the function it is clear that it returns the minimum value in the window.
1 2 3 4 |
SELECT ID, Amount, MIN(Amount) OVER (Partition BY id ORDER BY id) AS MIN FROM ForWindowFunc |
ID |
Amount |
Min |
1 |
100 |
100 |
1 |
200 |
100 |
1 |
150 |
100 |
2 |
100 |
50 |
2 |
300 |
50 |
2 |
200 |
50 |
2 |
50 |
50 |
3 |
150 |
10 |
3 |
200 |
10 |
3 |
10 |
10 |
As you can see, in the Min column, the minimum Amount value is displayed in the window.
MAX()
The MAX function works in the same way as MIN, it only gives the maximum value of the field in the window:
1 2 3 4 |
SELECT ID, Amount, MAX(Amount) OVER (Partition BY id ORDER BY id) AS MAX FROM ForWindowFunc |
ID |
Amount |
Max |
1 |
100 |
200 |
1 |
200 |
200 |
1 |
150 |
200 |
2 |
100 |
300 |
2 |
300 |
300 |
2 |
200 |
300 |
2 |
50 |
300 |
3 |
150 |
200 |
3 |
200 |
200 |
3 |
10 |
200 |
Everything is very clear. In the first group, the maximum Amount is 200, in the second 300, and in the third - 200.
COUNT()
This function returns the number of lines in a window.
1 2 3 |
SELECT ID, Amount, COUNT(Amount) OVER (Partition BY id ORDER BY id) AS COUNT FROM ForWindowFunc |
ID |
Amount |
Count |
1 |
100 |
3 |
1 |
200 |
3 |
1 |
150 |
3 |
2 |
100 |
4 |
2 |
300 |
4 |
2 |
200 |
4 |
2 |
50 |
4 |
3 |
150 |
3 |
3 |
200 |
3 |
3 |
10 |
3 |
Let’s make the query a bit more complicated and add the GroupId field.
1 2 3 4 5 |
SELECT ID, GroupId, Amount, COUNT(Amount) OVER (Partition BY id ORDER BY id, GroupId) AS COUNT FROM ForWindowFunc |
ID |
GroupId |
Amount |
Count |
1 |
1 |
100 |
2 |
1 |
1 |
200 |
2 |
1 |
2 |
150 |
3 |
2 |
1 |
100 |
2 |
2 |
1 |
300 |
2 |
2 |
2 |
200 |
4 |
2 |
2 |
50 |
4 |
3 |
1 |
150 |
1 |
3 |
2 |
200 |
3 |
3 |
2 |
10 |
3 |
In this case, it is more interesting. Let's look at the first window. For the first and second lines, the number of records was 2. But for the third line, the value is already 3. We managed to accumulate the amount in groups like the cumulative amount.
If we still want the number in each group, then GroupId needs to be added to the Partition by clause.
1 2 3 4 5 |
SELECT ID, GroupId, Amount, COUNT(Amount) OVER (Partition BY id, GroupId) AS COUNT FROM ForWindowFunc |
ID |
GroupId |
Amount |
Count |
1 |
1 |
100 |
2 |
1 |
1 |
200 |
2 |
1 |
2 |
150 |
1 |
2 |
1 |
100 |
2 |
2 |
1 |
300 |
2 |
2 |
2 |
200 |
2 |
2 |
2 |
50 |
2 |
3 |
1 |
150 |
1 |
3 |
2 |
200 |
2 |
3 |
2 |
10 |
2 |
Ranking functions
RANK () / DENSE_RANK ()
The RANK () function returns the sequence number of the current row in the window. However, there is a feature. If the Order By clause contains several equivalent strings for a rule, then all of them will be considered the current string. Thus, the RANK () function should be used for ranking, not rowing. However, if you correctly set the Order by, then you can also number the physical strings. For example:
1 2 3 4 5 |
SELECT ID, GroupId, Amount, RANK() OVER (Partition BY id ORDER BY id, GroupId, Amount) AS RANK FROM ForWindowFunc |
ID |
GroupId |
Amount |
RANK |
1 |
1 |
100 |
1 |
1 |
1 |
200 |
2 |
1 |
2 |
150 |
3 |
2 |
1 |
100 |
1 |
2 |
1 |
300 |
2 |
2 |
2 |
50 |
3 |
2 |
2 |
200 |
4 |
3 |
1 |
150 |
1 |
3 |
2 |
10 |
2 |
3 |
2 |
200 |
3 |
Here is the case with the same lines in the context of Order by:
1 2 3 4 5 |
SELECT ID, GroupId, Amount, RANK() OVER (Partition BY id ORDER BY id, GroupId) AS RANK FROM ForWindowFunc |
ID |
GroupId |
Amount |
RANK |
1 |
1 |
100 |
1 |
1 |
1 |
200 |
1 |
1 |
2 |
150 |
3 |
2 |
1 |
100 |
1 |
2 |
1 |
300 |
1 |
2 |
2 |
200 |
3 |
2 |
2 |
50 |
3 |
3 |
1 |
150 |
1 |
3 |
2 |
200 |
2 |
3 |
2 |
10 |
2 |
It's interesting that the third line in the first window has a rank of 3, although the previous two lines are assigned to the first rank. Not the most understandable logic. In this case, it is better to use DENSE_RANK ().
1 2 3 4 5 |
SELECT ID, GroupId, Amount, DENSE_RANK() OVER (Partition BY id ORDER BY id, GroupId) AS DENSE_RANK FROM ForWindowFunc |
ID |
GroupId |
Amount |
DENSE_RANK |
1 |
1 |
100 |
1 |
1 |
1 |
200 |
1 |
1 |
2 |
150 |
2 |
2 |
1 |
100 |
1 |
2 |
1 |
300 |
1 |
2 |
2 |
200 |
2 |
2 |
2 |
50 |
2 |
3 |
1 |
150 |
1 |
3 |
2 |
200 |
2 |
3 |
2 |
10 |
2 |
Now everything is as it should be. DENSE_RANK () does not skip ranks if the previous rank contains several lines.
The functions RANK () and DENSE_RANK () do not require indication of the field in brackets.
ROW_NUMBER()
The ROW_NUMBER () function displays the current row number in the window. Like the previous two functions, ROW_NUMBER () does not require specifying a field in parentheses.
1 2 3 4 5 |
SELECT ID, GroupId, Amount, ROW_NUMBER() OVER (Partition BY id ORDER BY id, GroupId, Amount) AS ROW_NUMBER FROM ForWindowFunc |
ID |
GroupId |
Amount |
ROW_NUMBER |
1 |
1 |
100 |
1 |
1 |
1 |
200 |
2 |
1 |
2 |
150 |
3 |
2 |
1 |
100 |
1 |
2 |
1 |
300 |
2 |
2 |
2 |
50 |
3 |
2 |
2 |
200 |
4 |
3 |
1 |
150 |
1 |
3 |
2 |
10 |
2 |
3 |
2 |
200 |
3 |
In the query, we used Partition by to divide the data set into groups. Everything is clear here and should not cause questions.
If you do not use Partition by, then get through numbering throughout the data set:
1 2 3 4 5 |
SELECT ID, GroupId, Amount, ROW_NUMBER() OVER (ORDER BY id, GroupId, Amount) AS ROW_NUMBER FROM ForWindowFunc |
ID |
GroupId |
Amount |
ROW_NUMBER |
1 |
1 |
100 |
1 |
1 |
1 |
200 |
2 |
1 |
2 |
150 |
3 |
2 |
1 |
100 |
4 |
2 |
1 |
300 |
5 |
2 |
2 |
50 |
6 |
2 |
2 |
200 |
7 |
3 |
1 |
150 |
8 |
3 |
2 |
10 |
9 |
3 |
2 |
200 |
10 |
In fact, the absence of the Partition by clause says that the entire data set is a window.
NTILE()
The NTILE () function allows you to determine which group the current line belongs to. The number of groups is specified in brackets, and the ORDER BY clause determines which column is used to define the group.
For example, this means that if you have 100 lines and you want to create 4 quartiles based on the specified value field, you can easily do this and see how many lines fall into each quartile.
Let's have a look at the example. In the query below, we indicated that we want to create four quartiles based on the order amount. Then we want to see how many orders fall into each quartile.
NTILE creates tiles based on the following formula:
Number of lines in each group = number of lines in the set / number of specified groups
Here is our example: the request contains only 10 lines and 4 tiles, so the number of lines in each plate will be 2.5 (10/4). Because the number of lines must be an integer, not decimal. The SQL engine will assign 3 rows for the first two groups and 2 rows for the remaining two groups.
1 2 3 |
SELECT Amount, NTILE(4) OVER(ORDER BY amount) AS Ntile FROM ForWindowFunc |
Amount |
Ntile |
10 |
1 |
50 |
1 |
100 |
1 |
100 |
2 |
150 |
2 |
150 |
2 |
200 |
3 |
200 |
3 |
200 |
4 |
300 |
4 |
This is a very simple example, but it demonstrates the function well. All Amount values are sorted in ascending order and divided into 4 groups.
Offset functions
LAG() and LEAD()
These two functions allow you to get the previous and next value, respectively. Quite often it is necessary to compare the current value with the previous or next in the calculated columns.
As parameters, you can pass to the function the name of the field and the number of lines that you need to deviate from the current one and take the value. As in SUBSTRING (), we specify the position from which to take characters, and here we indicate the position from which to take the value. If you do not specify the number of values, then the default is one.
So, the LAG function allows you to access data from the previous line in one window.
1 2 3 |
SELECT id, Amount, LAG(Amount) OVER(ORDER BY id, amount) AS Lag FROM ForWindowFunc |
id |
Amount |
Lag |
1 |
100 |
NULL |
1 |
150 |
100 |
1 |
200 |
150 |
2 |
50 |
200 |
2 |
100 |
50 |
2 |
200 |
100 |
2 |
300 |
200 |
3 |
10 |
300 |
3 |
150 |
10 |
3 |
200 |
150 |
In the first line, the value of the Lag field is definitely Null because there is no previous Amount value for this line. For all subsequent lines, the value of Amount from the previous line is taken.
The LEAD function works in the same way, but in the other direction - it takes the value from the next line.
1 2 3 |
SELECT id, Amount, LEAD(Amount,2) OVER(ORDER BY id, amount) AS Lag FROM ForWindowFunc |
id |
Amount |
Lag |
1 |
100 |
200 |
1 |
150 |
50 |
1 |
200 |
100 |
2 |
50 |
200 |
2 |
100 |
300 |
2 |
200 |
10 |
2 |
300 |
150 |
3 |
10 |
200 |
3 |
150 |
NULL |
3 |
200 |
NULL |
As you can see, in the query we pass the parameter 2 to the LEAD function. This means that we get the second value from the current Amount. For the last two lines, the value is Null, since for them there are no following values.
FIRST_VALUE() и LAST_VALUE()
With these functions, we can get the first and last value in the window. If the Partition by clause is not specified, then the functions will return the first and last value of the entire data set.
1 2 3 |
SELECT id, Amount, FIRST_VALUE(Amount) OVER(Partition BY Id ORDER BY Id, amount) AS FIRST FROM ForWindowFunc |
id |
Amount |
First |
1 |
100 |
100 |
1 |
150 |
100 |
1 |
200 |
100 |
2 |
50 |
50 |
2 |
100 |
50 |
2 |
200 |
50 |
2 |
300 |
50 |
3 |
10 |
10 |
3 |
150 |
10 |
3 |
200 |
10 |
Here we got the first value for each window.
And now we get the first value across the entire data set:
1 2 3 |
SELECT id, Amount, FIRST_VALUE(Amount) OVER(ORDER BY Id, amount) AS FIRST FROM ForWindowFunc |
id |
Amount |
First |
1 |
100 |
100 |
1 |
150 |
100 |
1 |
200 |
100 |
2 |
50 |
100 |
2 |
100 |
100 |
2 |
200 |
100 |
2 |
300 |
100 |
3 |
10 |
100 |
3 |
150 |
100 |
3 |
200 |
100 |
We removed the Partition clause from the query, thus we defined the entire data set as a window.
And now let's look at the work of the LAST_VALUE function:
1 2 3 |
SELECT id, Amount, LAST_VALUE(Amount) OVER(ORDER BY id) AS LAST FROM ForWindowFunc |
id |
Amount |
Last |
1 |
100 |
150 |
1 |
200 |
150 |
1 |
150 |
150 |
2 |
100 |
50 |
2 |
300 |
50 |
2 |
200 |
50 |
2 |
50 |
50 |
3 |
150 |
10 |
3 |
200 |
10 |
The query is almost the same as the previous one, but the result is completely different. Since we do not have a unique identifier in the table, we cannot sort the dataset by it. Sorting by field Id actually divided the data into three groups. And the function returned the last value for each of them – this is a feature of the function.
With this we will complete consideration of window functions. The examples given are greatly simplified specifically for a better understanding of the way the function works. Practical tasks are often more difficult, so it is important to understand well the behavior of a function depending on the sentences in the OVER instruction.