SQL

From bibbleWiki

Introduction

This page is here to remind me of the various SQL options. I am having a test on SQL and find writing stuff down helps me not get frustrated or flustered. Ridiculous really

Order of Operations

So I guess the main thing is the general ordering of the operations, i.e. the order the clauses are written in a query (logically the database evaluates FROM and JOIN first, then WHERE, GROUP BY, HAVING, SELECT, ORDER BY and finally LIMIT/OFFSET), which is

  • SELECT
  • FROM
  • JOIN
  • WHERE
  • GROUP BY
  • HAVING
  • ORDER BY
  • LIMIT
  • OFFSET

Offset was new to me, although I had used it in my REST APIs. For Microsoft SQL Server you are required to keep to the standard and use ROWS. An example might be

SELECT *
FROM logs
ORDER BY start_date
OFFSET 20 ROWS
FETCH NEXT 10 ROWS ONLY
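PostgreSQL (and SQLite) take LIMIT/OFFSET instead of the standard OFFSET … FETCH. A minimal sketch of the same pagination using Python's sqlite3, with a made-up logs table:

```python
import sqlite3

# Invented table/data for illustration; SQLite does not support OFFSET ... FETCH,
# so we use the LIMIT/OFFSET form that PostgreSQL also accepts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER, start_date TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [(i, f"2023-06-{i:02d}") for i in range(1, 31)])

# Skip the first 20 rows, return the next 10.
rows = conn.execute("""
    SELECT *
    FROM logs
    ORDER BY start_date
    LIMIT 10 OFFSET 20
""").fetchall()
print(len(rows))   # 10
print(rows[0])     # (21, '2023-06-21')
```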

Join

There are 4 main types of joins

  • LEFT JOIN - All left records appear in the results, plus any matches on the right, which could be more than 1:1. Where there is no match on the right, NULLs are used
  • RIGHT JOIN - All right records appear in the results, plus any matches on the left, which could be more than 1:1. Where there is no match on the left, NULLs are used
  • INNER JOIN - Only records which match on both the left and the right are selected
  • FULL OUTER JOIN - All records from both sides appear in the results. Where there is no match, NULLs are used

A Venn diagram is the usual pretty picture to explain this. For most of my life I have only ever used LEFT JOIN
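To see the LEFT vs INNER difference concretely, here is a small sketch using Python's sqlite3 with invented users/logs tables; Bob has no log rows, so he only survives the LEFT JOIN:

```python
import sqlite3

# Invented tables and data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE logs  (user_id INTEGER, msg TEXT);
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO logs  VALUES (1, 'login'), (1, 'logout');
""")

# LEFT JOIN: every user appears; Bob has no logs, so msg is NULL (None).
left = conn.execute("""
    SELECT u.name, l.msg
    FROM users u
    LEFT JOIN logs l ON l.user_id = u.id
    ORDER BY u.id, l.msg
""").fetchall()
print(left)   # [('Alice', 'login'), ('Alice', 'logout'), ('Bob', None)]

# INNER JOIN: only matching rows; Bob disappears.
inner = conn.execute("""
    SELECT u.name, l.msg
    FROM users u
    INNER JOIN logs l ON l.user_id = u.id
    ORDER BY l.msg
""").fetchall()
print(inner)  # [('Alice', 'login'), ('Alice', 'logout')]
```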

Indexes: PK and FK (Microsoft)

Primary Keys (PK)

SQL Server automatically creates an index for every PRIMARY KEY.

  • If the table has no clustered index, the PK becomes the clustered index.
  • If a clustered index already exists, the PK becomes a non-clustered index.

So PKs are always indexed.

Foreign Keys (FK)

SQL Server does NOT automatically create indexes for FOREIGN KEY columns.

The FK constraint enforces the relationship but does not add an index. If you want fast joins and fast deletes/updates on the parent table, you must create the index manually:

    CREATE INDEX IX_child_fk ON child_table(fk_column);

Always index your foreign keys.
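SQLite behaves the same way, which makes for a quick sketch: the FK column is not indexed until you create the index yourself, and EXPLAIN QUERY PLAN (SQLite's simple plan viewer) shows the full scan turning into an index search. Table and index names are made up:

```python
import sqlite3

# Sketch: like SQL Server, SQLite does not index FK columns automatically.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child  (id INTEGER PRIMARY KEY,
                         fk_column INTEGER REFERENCES parent(id));
""")

def plan(sql):
    # Last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(row[-1] for row in
                    conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM child WHERE fk_column = 1")
print("SCAN" in before)    # True: no index, full table scan

conn.execute("CREATE INDEX IX_child_fk ON child(fk_column)")
after = plan("SELECT * FROM child WHERE fk_column = 1")
print("SEARCH" in after)   # True: the new index is used
```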

Execution Plans (PostgreSQL/MS SQL)

In SQL Server we use SET SHOWPLAN_ALL or SET STATISTICS PROFILE to see how a query will run.

In PostgreSQL the equivalent is:

    EXPLAIN <query>;

This shows the planned execution strategy.

To see the actual execution with real timings, use:

    EXPLAIN ANALYZE <query>;

This runs the query and shows:

  • the plan
  • actual row counts
  • actual time spent
  • which steps were slow

This is the PostgreSQL equivalent of SQL Server's "Actual Execution Plan".

Missing Indexes

In Oracle you can do EXPLAIN PLAN; for PostgreSQL we use EXPLAIN ANALYZE. If we had a table as shown with no indexes

  id  | reason_id | start_date
------+-----------+------------
 1869 |         2 | 2023-06-28
 1869 |         2 | 2023-06-23
 1869 |         2 | 2023-06-26
 1869 |         2 | 2023-06-27
 3546 |         2 | 2023-06-30
 3546 |         2 | 2023-06-29
 3546 |         2 | 2023-06-30
 7777 |         2 | 2023-06-23
 7777 |         2 | 2023-06-26
 7777 |         2 | 2023-06-28
(10 rows)

And did

EXPLAIN ANALYZE
SELECT *
FROM logs
WHERE reason_id = 2;

You would see

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Seq Scan on logs  (cost=0.00..35.50 rows=10 width=12) (actual time=0.010..0.030 rows=10 loops=1)
   Filter: (reason_id = 2)
 Planning Time: 0.044 ms
 Execution Time: 0.066 ms

However if you create an index on the reason_id with

CREATE INDEX idx_logs_reason_id ON logs(reason_id);

You will still see the Seq Scan because PostgreSQL prefers a Seq Scan on small tables. I created a bigger table using this

CREATE TABLE big_logs (
    id          INTEGER NOT NULL,
    reason_id   INTEGER NOT NULL,
    start_date  DATE NOT NULL
);

INSERT INTO big_logs (id, reason_id, start_date)
SELECT
    (random() * 5000)::int,              -- random IDs
    (random() * 10)::int,                -- random reason_id 0–10
    date '2023-01-01' + (random() * 365)::int
FROM generate_series(1, 100000);

CREATE INDEX idx_big_logs_reason_id ON big_logs(reason_id);

EXPLAIN ANALYZE
SELECT *
FROM big_logs
WHERE reason_id = 2;

And the EXPLAIN ANALYZE produced

                                                               QUERY PLAN                                               
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on big_logs  (cost=116.27..786.27 rows=10320 width=12) (actual time=0.276..17.008 rows=10120 loops=1)
   Recheck Cond: (reason_id = 2)
   Heap Blocks: exact=541
   ->  Bitmap Index Scan on idx_big_logs_reason_id  (cost=0.00..113.69 rows=10320 width=0) (actual time=0.214..0.215 rows=10120 loops=1)
         Index Cond: (reason_id = 2)
 Planning Time: 0.087 ms
 Execution Time: 31.812 ms

Poor Subqueries in PostgreSQL

PostgreSQL does not always optimise subqueries as aggressively as Oracle or SQL Server. If you nest queries, PostgreSQL can end up executing them literally.

This can cause:

  • repeated scans
  • repeated work
  • no index usage
  • slow nested loops

Subqueries to avoid

1. Subqueries in SELECT

  These run once per row.
  Better: use JOIN.
SELECT
    id,
    (SELECT name FROM users WHERE users.id = logs.user_id)
FROM logs;
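A sketch (invented users/logs tables, via Python's sqlite3) showing the JOIN rewrite returns the same rows; note it has to be a LEFT JOIN to match the scalar subquery's NULL-on-no-match behaviour:

```python
import sqlite3

# Invented tables and data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE logs  (id INTEGER, user_id INTEGER);
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO logs  VALUES (10, 1), (11, 2), (12, 1);
""")

# The per-row scalar subquery version.
sub = conn.execute("""
    SELECT id,
           (SELECT name FROM users WHERE users.id = logs.user_id) AS name
    FROM logs ORDER BY id
""").fetchall()

# The JOIN version: the planner can match users to logs in one pass.
joined = conn.execute("""
    SELECT l.id, u.name
    FROM logs l
    LEFT JOIN users u ON u.id = l.user_id
    ORDER BY l.id
""").fetchall()

print(sub == joined)   # True
print(joined)          # [(10, 'Alice'), (11, 'Bob'), (12, 'Alice')]
```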

2. IN / NOT IN subqueries

  PostgreSQL may materialise the subquery and scan it, and NOT IN
  behaves surprisingly when the subquery can return NULLs.
  Better: JOIN or EXISTS.
SELECT *
FROM logs
WHERE reason_id IN (SELECT reason_id FROM reasons);
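A sketch of the EXISTS rewrite (invented tables, via Python's sqlite3). The two forms return the same rows here; the NULL pitfall only bites the NOT IN / NOT EXISTS pair:

```python
import sqlite3

# Invented tables and data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reasons (reason_id INTEGER);
    CREATE TABLE logs    (id INTEGER, reason_id INTEGER);
    INSERT INTO reasons VALUES (1), (2);
    INSERT INTO logs VALUES (100, 1), (101, 2), (102, 3);
""")

# IN form: matches the page's example.
in_rows = conn.execute("""
    SELECT * FROM logs
    WHERE reason_id IN (SELECT reason_id FROM reasons)
    ORDER BY id
""").fetchall()

# EXISTS form: a correlated probe the planner can turn into a semi-join.
exists_rows = conn.execute("""
    SELECT * FROM logs l
    WHERE EXISTS (SELECT 1 FROM reasons r WHERE r.reason_id = l.reason_id)
    ORDER BY id
""").fetchall()

print(in_rows == exists_rows)   # True
print(exists_rows)              # [(100, 1), (101, 2)]
```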

3. Correlated subqueries

  These run the inner query for every row of the outer query.
  Better: window functions or DISTINCT ON.
SELECT *
FROM logs l
WHERE start_date = (
    SELECT MAX(start_date)
    FROM logs l2
    WHERE l2.id = l.id
);
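A sketch of the window-function rewrite via Python's sqlite3 (window functions need SQLite 3.25+); the invented data keeps only the latest start_date per id, same as the correlated MAX subquery would:

```python
import sqlite3  # window functions require SQLite >= 3.25

# Invented data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (id INTEGER, start_date TEXT);
    INSERT INTO logs VALUES
        (1869, '2023-06-23'), (1869, '2023-06-28'),
        (7777, '2023-06-26'), (7777, '2023-06-23');
""")

# MAX(...) OVER computes the per-id maximum once per partition,
# instead of re-running an inner query for every outer row.
rows = conn.execute("""
    SELECT id, start_date FROM (
        SELECT id, start_date,
               MAX(start_date) OVER (PARTITION BY id) AS max_date
        FROM logs
    )
    WHERE start_date = max_date
    ORDER BY id
""").fetchall()
print(rows)   # [(1869, '2023-06-28'), (7777, '2023-06-26')]
```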

How to spot a poor subquery

In EXPLAIN ANALYZE you will see:

   Nested Loop
     -> Seq Scan
     -> Seq Scan

This means the subquery prevented the optimiser from using indexes.

Rule of thumb

Avoid subqueries unless you need them. JOINs, EXISTS, window functions, and DISTINCT ON are usually faster.

CTEs

A Common Table Expression (CTE) in SQL is a temporary, named result set that you can reference within the execution scope of a single SELECT, INSERT, UPDATE, or DELETE statement. An example of usage can be found below

Fancy PostgreSQL

The one thing you can say about SQL is that, given the same data, a query will return the same result. Queries are pure functions.

Duplicates

So this is an example of using a CTE. Basically they look like subqueries and I guess make debugging a little easier for the developer. I had not seen this until my last role and assumed it was an MS-only thing, but later found it was added to SQL Server back in 2005. So to find duplicates we use ROW_NUMBER() and PARTITION BY. This adds a column rn with an incrementing row number which is reset each time the partition column (here email) changes.

WITH dups AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM customers
)
SELECT *
FROM dups
WHERE rn > 1;

So given the data

music_db=> select * from customers;
 id |       email       |      name
----+-------------------+-----------------
  1 | alice@example.com | Alice
  2 | bob@example.com   | Bob
  3 | alice@example.com | Alice Duplicate
  4 | carol@example.com | Carol
  5 | bob@example.com   | Bob Duplicate
  6 | dave@example.com  | Dave
(6 rows)

The CTE (dups) returns

SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn FROM customers;
 id |       email       |      name       | rn
----+-------------------+-----------------+----
  1 | alice@example.com | Alice           |  1
  3 | alice@example.com | Alice Duplicate |  2
  2 | bob@example.com   | Bob             |  1
  5 | bob@example.com   | Bob Duplicate   |  2
  4 | carol@example.com | Carol           |  1
  6 | dave@example.com  | Dave            |  1
(6 rows)

And the final select gives

 id |       email       |      name       | rn
----+-------------------+-----------------+----
  3 | alice@example.com | Alice Duplicate |  2
  5 | bob@example.com   | Bob Duplicate   |  2
(2 rows)
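A common follow-up is to actually delete the duplicates. A sketch via Python's sqlite3 (ROW_NUMBER needs SQLite 3.25+), keeping the lowest id per email; PostgreSQL accepts the same WITH … DELETE shape:

```python
import sqlite3  # ROW_NUMBER requires SQLite >= 3.25

# Same customers data as above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT, name TEXT);
    INSERT INTO customers VALUES
        (1, 'alice@example.com', 'Alice'),
        (2, 'bob@example.com',   'Bob'),
        (3, 'alice@example.com', 'Alice Duplicate'),
        (4, 'carol@example.com', 'Carol'),
        (5, 'bob@example.com',   'Bob Duplicate'),
        (6, 'dave@example.com',  'Dave');
""")

# Keep the lowest id per email (rn = 1), delete the rest.
conn.execute("""
    WITH dups AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
        FROM customers
    )
    DELETE FROM customers
    WHERE id IN (SELECT id FROM dups WHERE rn > 1)
""")

remaining = [r[0] for r in conn.execute("SELECT id FROM customers ORDER BY id")]
print(remaining)   # [1, 2, 4, 6]
```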

Consecutive Days

At one site I needed to report on consecutive days. So our question is: how many consecutive days are there for id 1869? Here is the original data

  id  | reason_id | start_date
------+-----------+------------
 1869 |         2 | 2023-06-28
 1869 |         2 | 2023-06-23
 1869 |         2 | 2023-06-26
 1869 |         2 | 2023-06-27
 3546 |         2 | 2023-06-30
 3546 |         2 | 2023-06-29
 3546 |         2 | 2023-06-30
 7777 |         2 | 2023-06-23
 7777 |         2 | 2023-06-26
 7777 |         2 | 2023-06-28
(10 rows)

To find this out we derive a grouping value by subtracting the row number (in days) from each start_date. Within a run of consecutive days this value stays constant, so it identifies the group. We do this by using partitioning and ROW_NUMBER().

SELECT
    id,
    reason_id,
    start_date,
    start_date - (
        ROW_NUMBER() OVER (
            PARTITION BY id, reason_id
            ORDER BY start_date
        ) * INTERVAL '1 day'
    ) AS prev_start_date
FROM logs
ORDER BY id, start_date;

The PARTITION BY id, reason_id means each time either of these changes the ROW_NUMBER() is reset back to 1. Looking at just id 7777

 7777 |         2 | 2023-06-23
 7777 |         2 | 2023-06-26
 7777 |         2 | 2023-06-28

We have 3 records so prev_start_date becomes

 
start_date - 1 day
start_date - 2 days
start_date - 3 days

So the result for all records in this query is

 
  id  | reason_id | start_date |   prev_start_date
------+-----------+------------+---------------------
 1869 |         2 | 2023-06-23 | 2023-06-22 00:00:00
 1869 |         2 | 2023-06-26 | 2023-06-24 00:00:00
 1869 |         2 | 2023-06-27 | 2023-06-24 00:00:00
 1869 |         2 | 2023-06-28 | 2023-06-24 00:00:00
 3546 |         2 | 2023-06-29 | 2023-06-28 00:00:00
 3546 |         2 | 2023-06-30 | 2023-06-28 00:00:00
 3546 |         2 | 2023-06-30 | 2023-06-27 00:00:00
 7777 |         2 | 2023-06-23 | 2023-06-22 00:00:00 <- start_date - 1
 7777 |         2 | 2023-06-26 | 2023-06-24 00:00:00 <- start_date - 2
 7777 |         2 | 2023-06-28 | 2023-06-25 00:00:00 <- start_date - 3
(10 rows)

So now we can use this to find consecutive days by grouping on id, reason_id and prev_start_date, and counting the rows in each group

SELECT
    id,
    reason_id,
    MIN(start_date) AS start_date,
    MAX(start_date) AS end_date,
    COUNT(prev_start_date) AS continuous_days
FROM (
    SELECT
        id,
        reason_id,
        start_date,
        start_date
          - (ROW_NUMBER() OVER (
                PARTITION BY id, reason_id
                ORDER BY start_date
            ) * INTERVAL '1 day') AS prev_start_date
    FROM logs
) AS t
GROUP BY
    id,
    reason_id,
    prev_start_date
ORDER BY
    id,
    prev_start_date;
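To sanity-check the idea, here is a sketch of the same gaps-and-islands query via Python's sqlite3, using date(start_date, '-N days') in place of PostgreSQL's INTERVAL '1 day' (a subset of the data above):

```python
import sqlite3  # window functions require SQLite >= 3.25

# Subset of the logs data from above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (id INTEGER, reason_id INTEGER, start_date TEXT);
    INSERT INTO logs VALUES
        (1869, 2, '2023-06-28'), (1869, 2, '2023-06-23'),
        (1869, 2, '2023-06-26'), (1869, 2, '2023-06-27'),
        (7777, 2, '2023-06-23'), (7777, 2, '2023-06-26'),
        (7777, 2, '2023-06-28');
""")

# grp = start_date minus row number: constant within a run of consecutive days.
rows = conn.execute("""
    SELECT id, MIN(start_date), MAX(start_date), COUNT(*) AS days
    FROM (
        SELECT id, start_date,
               date(start_date,
                    '-' || ROW_NUMBER() OVER (
                        PARTITION BY id, reason_id ORDER BY start_date
                    ) || ' days') AS grp
        FROM logs
    )
    GROUP BY id, grp
    ORDER BY id, MIN(start_date)
""").fetchall()
for r in rows:
    print(r)
# (1869, '2023-06-23', '2023-06-23', 1)
# (1869, '2023-06-26', '2023-06-28', 3)   <- the 3-day run
# (7777, '2023-06-23', '2023-06-23', 1)
# (7777, '2023-06-26', '2023-06-26', 1)
# (7777, '2023-06-28', '2023-06-28', 1)
```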

SQL Code Review Guidance

The Long Version

This section covers what to look for when reviewing SQL. The goal is to catch performance issues, readability problems, and logic bugs early.

1. Check the WHERE clause

  • Is the filter selective?
  • Does it match an index?
  • Are there functions on indexed columns? (e.g. LOWER(col))
  • Are implicit casts happening? (e.g. comparing text to int)

2. Look for missing indexes

  • Seq Scan on large tables is a red flag.
  • Foreign key columns should normally be indexed.
  • Composite indexes must match the leftmost column rule.

3. Watch for poor subqueries

  • Subqueries in SELECT run once per row.
  • Correlated subqueries often cause nested loops + seq scans.
  • Prefer JOIN, EXISTS, or window functions.

4. Check JOIN logic

  • Is the join type correct? (INNER vs LEFT)
  • Are join conditions correct and complete?
  • Are unnecessary cross joins avoided?

5. Pagination patterns

  • LIMIT/OFFSET is fine for small pages.
  • Deep OFFSET is slow because skipped rows are still read.
  • Consider keyset pagination for large datasets.
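A sketch of keyset pagination versus OFFSET via Python's sqlite3 (made-up table): both return the same page, but the keyset version can seek straight to the rows via an index instead of reading and discarding the skipped ones:

```python
import sqlite3

# Invented table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO logs VALUES (?)", [(i,) for i in range(1, 101)])

# OFFSET pagination: the database still reads and discards the skipped rows.
page_offset = conn.execute(
    "SELECT id FROM logs ORDER BY id LIMIT 3 OFFSET 50").fetchall()

# Keyset pagination: remember the last id seen and seek past it via the index.
last_seen = 50
page_keyset = conn.execute(
    "SELECT id FROM logs WHERE id > ? ORDER BY id LIMIT 3",
    (last_seen,)).fetchall()

print(page_offset)                 # [(51,), (52,), (53,)]
print(page_offset == page_keyset)  # True
```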

6. Look at the execution plan

  • Seq Scan on large tables = likely missing index.
  • Bitmap Heap Scan = index used but medium selectivity.
  • Index Scan = highly selective, good.
  • Check row estimates vs actual rows.

7. Avoid anti-patterns

  • SELECT * in production queries.
  • OR conditions that kill index usage.
  • Wildcards at the start of LIKE ('%abc').
  • Unbounded queries with no WHERE clause.

8. Check for readability

  • Clear table aliases.
  • Consistent indentation.
  • Logical grouping of conditions.
  • Avoid overly clever SQL; prefer clarity.

9. Validate correctness

  • Are date ranges inclusive/exclusive as intended?
  • Are NULLs handled correctly?
  • Are time zones considered?
  • Are duplicates possible?

10. Keep business logic out of SQL

SQL should query data, not implement business rules. If logic becomes complex, move it to the application layer.

The Short Version

  • Does the WHERE clause match an index?
  • Is there a Seq Scan on a large table?
  • Are JOINs correct and selective?
  • Are subqueries avoidable?
  • Is pagination efficient?
  • Is the SQL readable and maintainable?