Back-end Engineering Articles

I write and talk about backend stuff like Ruby, Ruby On Rails, Databases, Testing, Architecture / Infrastructure / System Design, Cloud, DevOps, Backgroud Jobs, some JS stuff and more...

Github:
/danielmoralesp
Twitter:
@danielmpbp

2025-04-01

Introduction to SQL - Aggregates And Grouping

We’ve been advancing a lot in SQL over the last blog posts. Now we’ll be studying to important concepts in relational databases like Aggregate Functions and Grouping

Note: to follow this tutorial we’ll continue with databases created in Introduction to SQL, Constraints and Queries. So I encourage you to follow these posts to have the right context. 

Aggregate Functions
We have learned how to write database queries, now let's learn how to do calculations using SQL. When we run calculations on multiple rows of a table we call them "Aggregates". In database management an aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning

Let’s first go into the PostgreSQL console

sudo su - postgres
[sudo] password for daniel: 
postgres@daniel-MS-7A15:~$ psql
psql (13.3 (Ubuntu 13.3-1.pgdg16.04+1), server 9.6.22)
Type "help" for help.

postgres=# \c marvel_tshirts
psql (13.3 (Ubuntu 13.3-1.pgdg16.04+1), server 9.6.22)
You are now connected to database "marvel_tshirts" as user "postgres".


Then let’s inspect what we have in users table. 

marvel_tshirts=# SELECT * FROM users;
 id |  name  | age 
----+--------+-----
  1 | Justin |  22
  2 | Ana    |  21
  4 | Viktor |  39
(3 rows)



Now that we know what we have, let’s go ahead with Aggregate Functions

COUNT()
  • COUNT() is the fastest way to calculate how many rows are in a table.
  • COUNT() is a function that takes the name of a column as an argument and counts the number of non-empty values in that column.


marvel_tshirts=# SELECT COUNT(*)
marvel_tshirts-# FROM users;
 count 
-------
     3
(1 row)

This means that we have 3 users



  • COUNT() include duplicate values of a column
  • To run a query with non-duplicate values


marvel_tshirts=# SELECT COUNT (DISTINCT name)
marvel_tshirts-# FROM users;
 count 
-------
     3
(1 row)


From now we don’t have any repeated names in the users table. However if you have any, that the COUNT combined with DISTINCT will avoid duplicate records. 

This means that we have 3 distinct users


SUM()
  • SUM() is a function that takes the name of a column as an argument and returns the sum of all the values in that column.

marvel_tshirts=# SELECT SUM (age)
marvel_tshirts-# FROM users;
 sum 
-----
  82
(1 row)


This means that we sum up of the ages of the 3 users are 82


MAX() and MIN()
  • The MAX() and MIN() functions return the highest or lowest value in the column respectively.
  • MAX() takes the name of the column as an argument and returns the highest value in that column.
  • MIN() works the same but in the opposite way, it returns the lower value


marvel_tshirts=# SELECT MAX (age)
marvel_tshirts-# FROM users;
 max 
-----
  39
(1 row)

marvel_tshirts=# SELECT MIN (age)
marvel_tshirts-# FROM users;
 min 
-----
  21
(1 row)



39 is the max age of the users, and 21 and the youngest user. 

AVG()
It stands for average. SQL uses the AVG() function to quickly calculate the average value of a particular column. The AVG() function takes the name of a column as an argument and returns the average value of the given column.

marvel_tshirts=# SELECT AVG(age)
marvel_tshirts-# FROM users;
         avg         
---------------------
 27.3333333333333333
(1 row)


The average age of our users are 27.33, however we see a large number… How can we round it?

ROUND()
  • By default SQL tries to be as accurate as possible without rounding. But we can make the results easier to read using the ROUND() function
  • ROUND() takes two arguments inside the parenthesis: the column name and an integer.
  • The integer functions as the number of decimal places we want in the result.

In the last exercise we can use it as follows:

marvel_tshirts=# SELECT ROUND(AVG(age), 0)
marvel_tshirts-# FROM users;
 round 
-------
    27
(1 row)

marvel_tshirts=# SELECT ROUND(AVG(age), 1)
FROM users;
 round 
-------
  27.3
(1 row)

marvel_tshirts=# SELECT ROUND(AVG(age), 2)
FROM users;
 round 
-------
 27.33
(1 row)

marvel_tshirts=# SELECT ROUND(AVG(age), 3)
FROM users;
 round  
--------
 27.333
(1 row)






As we can see the first argument is the result of the AVG() operation which is 27.3333333… and then we pass a second argument which is the number of decimal numbers we want. If we puto zero, we’ll have an integer, and if we pass 1 or more we’ll have a float given the number of decimals. 

That's the main aggregate function. There are more, but for now I think it is enough to have the right amount of tools at hand. Now we’ll see other kind of important operation: Group By

GROUP BY
At some point we will need to compute aggregate functions for data with certain characteristics. For example, "find the number of customers in each country". 
  • GROUP BY is a clause in SQL used with aggregate functions. It is used in collaboration with the SELECT clause to bring identical data into groups.
  • The GROUP BY clause goes after any WHERE, but before ORDER BY or LIMIT. 

Previously we’ve created a table called Friends with next columns and rows

marvel_tshirts=# SELECT * FROM friends;
 id |   name   |  birthday  |        email        
----+----------+------------+---------------------
  2 | Emiliano | 2006-09-26 | emilianitogmail.com
  3 | Maria    | 1986-09-02 | [email protected]
  4 | Maria    | 1986-09-02 | 
  5 | Maria    | 1986-09-02 | 
(4 rows)



Now we execute the following group by

marvel_tshirts=# SELECT name, MIN(birthday)
marvel_tshirts-# FROM friends
marvel_tshirts-# GROUP BY name
marvel_tshirts-# ORDER BY name;
   name   |    min     
----------+------------
 Emiliano | 2006-09-26
 Maria    | 1986-09-02
(2 rows)


It's a little weird our current data, but what it means is
  • We want to group by each name taking into account the MIN value of the birthday.

Let’s do another example. Let’s check this code


Let’s suppose we have a table called Movies. And inside it we have the year of release and imdb_rating of each one of them. Imdb_rating is a Rating given by users from the IMDB webpage 

What this code will do is to group the movies by year, taking into account the average of the rating

Year | Rating
2005 | 6.75
2006 | 6.71
2007 | 6.78
2008 | 7.00
2009 | 6.67
2010 | 6.92

This result is meaningful because we can see that the best year measured by the imdb_rating was 2008, which means the movies released that year were so good, because they were rated very well on average. At the same time 2009 was the worst in comparison with the others. Ok, I think you got it. Group By is very useful and later we’ll be using it a lot. 

I think we have enough for this time. I hope you learned a lot

Thanks for reading
Daniel Morales