How to Remove Duplicates in SQL: DISTINCT vs ROW_NUMBER

Introduction

“Remove duplicates in SQL” sounds like a single task, but it usually hides a more important question: duplicate according to which key, and which row should survive? If you skip that question, it is easy to write a query that is perfectly valid SQL but removes the wrong rows from the business's point of view.

That is why DISTINCT, GROUP BY, and ROW_NUMBER() all seem to compete for the same job. They are not interchangeable. Each solves a different version of the duplicate problem.

The fastest route to a correct fix is to separate three cases: exact duplicate output rows, duplicate business keys where one row should win, and duplicate-looking results that were actually created by a bad join or a broken ingestion path.

If You Only Remember Five Rules

Define the duplicate key before touching the query.
DISTINCT is fine when you truly want unique output rows.
Use ROW_NUMBER() when one preferred row must survive for each business key.
Always make the keep rule deterministic with an explicit ORDER BY.
If duplicates keep returning, the permanent fix is usually upstream, not another cleanup query.

Step 1: Name the Duplicate Key Before You Write the Fix

There are at least three common meanings of duplicate:

The full selected row is repeated exactly.
Several rows represent the same business entity, such as the same customer, session, order, or event.
The base tables are fine, but the query multiplies rows because the join grain is wrong.

If you do not make that distinction first, you risk deleting rows that look duplicated but actually carry distinct information.
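One way to make the distinction concrete is to count rows per candidate key before writing any fix. The sketch below is a minimal illustration using Python's built-in sqlite3 module; the sample rows and the duplicate_groups helper are hypothetical, and the customer_records columns are assumed to match the examples used later in this article.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_records (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        updated_at  TEXT
    );
    INSERT INTO customer_records (id, customer_id, email, updated_at) VALUES
        (1, 10, 'a@example.com', '2024-01-01'),
        (2, 10, 'a@example.com', '2024-02-01'),
        (3, 11, 'b@example.com', '2024-01-15'),
        (4, 12, 'c@example.com', '2024-01-20'),
        (5, 12, 'c@example.com', '2024-01-20');
""")


def duplicate_groups(conn, key_columns):
    """Return (key values..., row_count) for every key that occurs more than once."""
    # Column names are interpolated directly, so pass trusted identifiers only.
    cols = ", ".join(key_columns)
    sql = f"""
        SELECT {cols}, COUNT(*) AS row_count
        FROM customer_records
        GROUP BY {cols}
        HAVING COUNT(*) > 1
        ORDER BY {cols}
    """
    return conn.execute(sql).fetchall()


# Same table, two different definitions of "duplicate".
by_business_key = duplicate_groups(conn, ["customer_id", "email"])
by_full_row = duplicate_groups(conn, ["customer_id", "email", "updated_at"])
```

With this sample data, two customers are duplicated by business key, but only one of them is an exact repeat once updated_at is included. The other pair carries two different timestamps, which is exactly the kind of information a careless cleanup would throw away.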

DISTINCT Is for Exact Duplicate Output Rows

If the goal is simply “show me the unique combinations of these selected columns,” DISTINCT is a direct and honest answer.

SELECT DISTINCT customer_id, order_date, amount
FROM staged_orders;

That is not lazy SQL. It is appropriate when exact row uniqueness is the real output requirement. GROUP BY can express the same idea, but DISTINCT is usually clearer when no extra aggregation is needed.

The mistake is using DISTINCT to hide uncertainty about why the duplicates exist in the first place.
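A small runnable sketch of what DISTINCT does and does not collapse, again using Python's sqlite3 module with made-up rows. The staged_orders table comes from the query above; its column types are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staged_orders (customer_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO staged_orders VALUES
        (1, '2024-03-01', 50.0),
        (1, '2024-03-01', 50.0),   -- exact repeat: collapses
        (1, '2024-03-01', 55.0),   -- differs in amount: survives
        (2, '2024-03-02', 20.0);
""")

# DISTINCT compares every selected column.
distinct_rows = conn.execute("""
    SELECT DISTINCT customer_id, order_date, amount
    FROM staged_orders
    ORDER BY customer_id, order_date, amount
""").fetchall()

# GROUP BY over the same columns yields the same set of rows.
grouped_rows = conn.execute("""
    SELECT customer_id, order_date, amount
    FROM staged_orders
    GROUP BY customer_id, order_date, amount
    ORDER BY customer_id, order_date, amount
""").fetchall()
```

Customer 1 still appears twice in the result because the two rows differ in amount. If the business key is really (customer_id, order_date), DISTINCT cannot decide which amount is correct; that is a keep-one-row problem.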

ROW_NUMBER Is for Keep-One-Row Rules

When multiple rows share the same business key and one row should survive, ROW_NUMBER() is the standard pattern.

WITH ranked AS (
    SELECT id,
           customer_id,
           email,
           updated_at,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, email
               ORDER BY updated_at DESC, id DESC
           ) AS rn
    FROM customer_records
)
SELECT *
FROM ranked
WHERE rn = 1;

The key part is the ORDER BY inside the window. That order expresses the business rule for which row wins. If you want the newest row, say so. If you want the highest-quality row, say so. If ties are possible, add a tie-breaker such as id so the result stays deterministic.
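To see why the tie-breaker matters, the sketch below runs the same ranking query against a few hypothetical rows in which two records share the same updated_at. It assumes timestamps are stored as ISO-8601 text, so that text ordering matches chronological ordering, and that a higher id means a later insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_records (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        updated_at  TEXT
    );
    INSERT INTO customer_records (id, customer_id, email, updated_at) VALUES
        (1, 10, 'a@example.com', '2024-01-01'),
        (2, 10, 'a@example.com', '2024-02-01'),
        (3, 10, 'a@example.com', '2024-02-01'),  -- ties with id 2 on updated_at
        (4, 11, 'b@example.com', '2024-01-15');
""")

# Keys where updated_at alone cannot pick a winner.
tied_keys = conn.execute("""
    SELECT customer_id, email, updated_at, COUNT(*) AS row_count
    FROM customer_records
    GROUP BY customer_id, email, updated_at
    HAVING COUNT(*) > 1
""").fetchall()

# The id tie-breaker makes the survivor the same on every run.
survivors = conn.execute("""
    WITH ranked AS (
        SELECT id, customer_id, email, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, email
                   ORDER BY updated_at DESC, id DESC
               ) AS rn
        FROM customer_records
    )
    SELECT id, customer_id, updated_at
    FROM ranked
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()
```

Without id DESC in the window's ORDER BY, the engine would be free to return either row 2 or row 3 for customer 10, and the choice could change between runs.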

This is closely related to the keep-latest pattern covered in “SQL Latest Row Per Group: ROW_NUMBER, QUALIFY, and Tie-Breakers That Keep Results Deterministic.”

Dialect Shortcuts: DISTINCT ON and QUALIFY

Some systems offer cleaner syntax for common dedup tasks.

PostgreSQL has DISTINCT ON, which can be elegant when you want one row per key and you are already sorting by the preferred winner.
Snowflake and BigQuery support QUALIFY, which lets you filter window-function results without wrapping them in an extra subquery.

SELECT customer_id, email, updated_at
FROM customer_records
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id, email
    ORDER BY updated_at DESC, id DESC
) = 1;

These shortcuts improve readability, but they do not change the real decision. You still need a clear duplicate key and a clear keep rule.

Delete Duplicate Rows Only After a Preview

Never go from suspicion straight to delete. First preview the losing rows. Make sure the ranking rule matches the business rule. Then delete by explicit row identifiers.

WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, email
               ORDER BY updated_at DESC, id DESC
           ) AS rn
    FROM customer_records
)
DELETE FROM customer_records
WHERE id IN (
    SELECT id
    FROM ranked
    WHERE rn > 1
);

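If the table is large or business-critical, run the delete inside a transaction, compare row counts before and after, and keep a copy of the losing IDs for auditability. Not every engine accepts a CTE in front of DELETE; where it is not supported, move the ranking query into the IN subquery instead.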
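The preview, transaction, count check, and audit copy can be combined into one routine. The following is a minimal sketch using Python's sqlite3 module, not a production procedure: the dedup_audit table, the LOSERS_SQL name, and the sample rows are hypothetical, and the connection is opened in autocommit mode so that BEGIN, COMMIT, and ROLLBACK are issued explicitly.

```python
import sqlite3

# isolation_level=None: no implicit transactions; we issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE customer_records (
        id INTEGER PRIMARY KEY, customer_id INTEGER, email TEXT, updated_at TEXT
    );
    CREATE TABLE dedup_audit (id INTEGER PRIMARY KEY);  -- hypothetical audit table
    INSERT INTO customer_records (id, customer_id, email, updated_at) VALUES
        (1, 10, 'a@example.com', '2024-01-01'),
        (2, 10, 'a@example.com', '2024-02-01'),
        (3, 11, 'b@example.com', '2024-01-15'),
        (4, 12, 'c@example.com', '2024-01-20'),
        (5, 12, 'c@example.com', '2024-01-20');
""")

# Every row that is NOT the preferred survivor for its business key.
LOSERS_SQL = """
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, email
                   ORDER BY updated_at DESC, id DESC
               ) AS rn
        FROM customer_records
    ) AS ranked
    WHERE rn > 1
"""

# Step 1: preview the losing rows before changing anything.
preview_ids = [row[0] for row in conn.execute(LOSERS_SQL + " ORDER BY id")]
count_before = conn.execute("SELECT COUNT(*) FROM customer_records").fetchone()[0]

# Step 2: record the losers, delete by explicit id, verify counts, then commit.
conn.execute("BEGIN")
try:
    conn.execute(f"INSERT INTO dedup_audit (id) {LOSERS_SQL}")
    deleted = conn.execute(
        "DELETE FROM customer_records WHERE id IN (SELECT id FROM dedup_audit)"
    ).rowcount
    count_after = conn.execute("SELECT COUNT(*) FROM customer_records").fetchone()[0]
    if deleted != len(preview_ids) or count_after != count_before - deleted:
        raise RuntimeError("row counts do not match the preview; rolling back")
    conn.execute("COMMIT")
except Exception:
    conn.execute("ROLLBACK")
    raise

remaining_ids = [r[0] for r in conn.execute("SELECT id FROM customer_records ORDER BY id")]
audit_ids = [r[0] for r in conn.execute("SELECT id FROM dedup_audit ORDER BY id")]
```

Deleting from the audit list rather than re-running the ranking inside the DELETE means the rows removed are exactly the rows that were recorded, even if the ranking rule is edited later.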

If the Duplicates Came from a Join, Fix the Join

Many duplicate complaints are actually join problems. A one-to-many join can multiply rows even when both input tables are valid. In that case, a dedup query is treating the symptom, not the cause.

That is why it helps to compare the current problem against “Why SQL Joins Create Duplicate Rows and How to Fix Them Without Hiding the Problem.” If the rows only became duplicated after a join, the durable fix is often to pre-aggregate one side or to join at the correct business grain.
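As an illustration of pre-aggregating one side, the sketch below uses two hypothetical tables, orders and payments, where one order can have several payments. Amounts are stored as integer cents to keep the arithmetic exact; none of these names come from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, amount_cents INTEGER);
    CREATE TABLE payments (payment_id INTEGER PRIMARY KEY, order_id INTEGER, paid_cents INTEGER);
    INSERT INTO orders   VALUES (1, 10000), (2, 4000);
    INSERT INTO payments VALUES (1, 1, 6000), (2, 1, 4000), (3, 2, 4000);
""")

# One-to-many join: order 1 appears once per payment, so its amount is counted twice.
naive_total = conn.execute("""
    SELECT SUM(o.amount_cents)
    FROM orders AS o
    JOIN payments AS p ON p.order_id = o.order_id
""").fetchone()[0]

# Pre-aggregate payments to one row per order, then join at the order grain.
fixed_totals = conn.execute("""
    SELECT SUM(o.amount_cents), SUM(p.total_paid_cents)
    FROM orders AS o
    JOIN (
        SELECT order_id, SUM(paid_cents) AS total_paid_cents
        FROM payments
        GROUP BY order_id
    ) AS p ON p.order_id = o.order_id
""").fetchone()
```

Both base tables are valid here. The inflated total comes only from joining an order-grain table to a payment-grain table, and adding DISTINCT to the naive query would hide that without fixing it.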

A similar row-multiplication issue also appears in counting problems such as “SQL COUNT with LEFT JOIN: Zero Rows, DISTINCT, and Many-to-Many Traps.”

Stop the Problem Upstream

If duplicates keep reappearing after cleanup, the cleanup query is not the true fix. Look upstream for one of the usual causes:

An append-only load that should have been an upsert or merge.
A retry process that can ingest the same payload multiple times.
A missing uniqueness constraint on the actual business key.
An unreliable incremental watermark.
A source feed that does not expose a stable event identifier.

The best long-term solution is usually a guardrail: a uniqueness constraint, idempotent merge key, audited staging table, or upstream dedup logic that prevents the bad rows from reaching the analytics table in the first place.
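A guardrail of this kind might look like the following sketch: a uniqueness constraint on the business key plus an idempotent load. It uses SQLite's INSERT ... ON CONFLICT upsert syntax through Python's sqlite3 module; other engines express the same idea with MERGE or their own upsert form. The load helper, the sample batch, and the rule that a newer updated_at wins are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_records (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        email       TEXT NOT NULL,
        updated_at  TEXT NOT NULL,
        UNIQUE (customer_id, email)   -- constraint on the actual business key
    );
""")


def load(conn, rows):
    """Idempotent load: replaying a batch never creates a second row per key."""
    conn.executemany("""
        INSERT INTO customer_records (customer_id, email, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT (customer_id, email) DO UPDATE
            SET updated_at = excluded.updated_at
            WHERE excluded.updated_at > customer_records.updated_at
    """, rows)
    conn.commit()


batch = [(10, 'a@example.com', '2024-02-01'), (11, 'b@example.com', '2024-01-15')]
load(conn, batch)
load(conn, batch)                                   # a retry replays the same payload
load(conn, [(10, 'a@example.com', '2024-03-01')])   # newer version: updates in place
load(conn, [(10, 'a@example.com', '2024-01-01')])   # stale version: ignored

# A plain append that bypasses the upsert is rejected by the constraint.
try:
    conn.execute(
        "INSERT INTO customer_records (customer_id, email, updated_at) "
        "VALUES (10, 'a@example.com', '2024-04-01')"
    )
    constraint_blocked = False
except sqlite3.IntegrityError:
    constraint_blocked = True

final_rows = conn.execute("""
    SELECT customer_id, email, updated_at
    FROM customer_records
    ORDER BY customer_id
""").fetchall()
```

The constraint catches any writer that forgets the merge logic, and the upsert keeps retries harmless, so no cleanup query is needed afterwards.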

FAQ

Is DISTINCT bad for performance?

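Not inherently. It has a cost because the engine typically has to sort or hash the selected rows to find repeats, and that cost grows with the number of rows and the width of the selected columns, but that does not make it wrong. Use it when exact row uniqueness is the actual output requirement.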

How do I choose the ORDER BY for ROW_NUMBER?

Use the rule that defines the correct survivor, such as newest update, highest quality score, or latest ingestion timestamp. Then add a tie-breaker so the outcome stays stable.

Should I deduplicate before or after aggregating?

The business grain decides that. If duplicate source rows would inflate the metric, deduplicate first. If the grouped result is the intended output and raw duplicates are valid inputs, aggregate first.
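A short sketch of the first case, where a duplicated source row inflates a metric. The order_events table and its rows are hypothetical; the second event for order 1 stands in for a retried ingestion of the same payload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_events (
        id INTEGER PRIMARY KEY, order_id INTEGER, amount_cents INTEGER, ingested_at TEXT
    );
    INSERT INTO order_events VALUES
        (1, 1, 5000, '2024-03-01T10:00:00'),
        (2, 1, 5000, '2024-03-01T10:05:00'),  -- retry of the same order
        (3, 2, 2000, '2024-03-01T11:00:00');
""")

# Aggregating the raw rows counts order 1 twice.
raw_total = conn.execute("SELECT SUM(amount_cents) FROM order_events").fetchone()[0]

# Deduplicate to one row per order first, then aggregate.
deduped_total = conn.execute("""
    WITH ranked AS (
        SELECT amount_cents,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id
                   ORDER BY ingested_at DESC, id DESC
               ) AS rn
        FROM order_events
    )
    SELECT SUM(amount_cents)
    FROM ranked
    WHERE rn = 1
""").fetchone()[0]
```

If the two events for order 1 were instead two legitimate partial charges, the raw total would be the correct one, which is why the grain has to be decided before the query is written.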