Removing duplicate rows from a table in DB2 in a single query

I have a table with 3 columns as below:

one   |   two    |  three  |   name
------------------------------------
 A1       B1          C1        xyz
 A1       B1          C1        pqr      -> should be deleted
 A1       B1          C1        lmn      -> should be deleted
 A2       B2          C2        abc
 A2       B2          C2        def      -> should be deleted
 A3       B3          C3        ghi
------------------------------------ 

The table is not having any primary key column. I do not have any control over the table and so I can not add any primary key column.

As shown, I want to delete the rows where the combination of one, two and three column is same. So if A1B1C1 is occurring thrice (as in above e.g.), the other two should be deleted and only one should stay.

How to achieve this through just one query in DB2 ?

My requirement is for a single query as I would be running it through a java program.

Answers


(This assumes you're on DB2 for Linux/Unix/Windows, other platforms may vary slightly)

DELETE FROM
    (SELECT ROWNUMBER() OVER (PARTITION BY ONE, TWO, THREE) AS RN
     FROM SESSION.TEST) AS A
WHERE RN > 1;

Should get you what you're looking for.

The query uses the OLAP function ROWNUMBER() to assign a number for each row within each ONE, TWO, THREE combination. DB2 is then able to match the rows referenced by the fullselect (A) as the rows that the DELETE statement should remove from the table. In order to be able to use a fullselect as the target for a delete clause, it has to match the rules for a deletable view (see "deletable view" under the notes section).

Below is some proof (tested on LUW 9.7):

DECLARE GLOBAL TEMPORARY TABLE SESSION.TEST (
    one CHAR(2),
    two CHAR(2),
    three CHAR(2),
    name CHAR(3)
) ON COMMIT PRESERVE ROWS;

INSERT INTO SESSION.TEST VALUES 
    ('A1', 'B1', 'C1', 'xyz'),
    ('A1', 'B1', 'C1', 'pqr'),
    ('A1', 'B1', 'C1', 'lmn'),
    ('A2', 'B2', 'C2', 'abc'),
    ('A2', 'B2', 'C2', 'def'),
    ('A3', 'B3', 'C3', 'ghi');

DELETE FROM
    (SELECT ROWNUMBER() OVER (PARTITION BY ONE, TWO, THREE) AS RN
     FROM SESSION.TEST) AS A
WHERE RN > 1;

SELECT * FROM SESSION.TEST;

Edit 2 March 2017:

In response to the question from Ahmed Anwar, if you need to capture what was deleted, you can also combine the delete with a "data change statement". In this example you could do something like the following, which would give you the "rn" column, one, two, and three:

SELECT * FROM OLD TABLE (
    DELETE FROM
        (SELECT 
             ROWNUMBER() OVER (PARTITION BY ONE, TWO, THREE) AS RN
            ,ONE
            ,TWO
            ,THREE
         FROM SESSION.TEST) AS A
    WHERE RN > 1
) OLD;

DELETE FROM the_table tt
WHERE EXISTS ( SELECT *
    FROM the_table ex
    WHERE ex.one = tt.one
    AND ex.two = tt.two
    AND ex.three = tt.three
    AND ex.zname < tt.zname -- tie-breaker...
    );

Notes: your SQL-dialect may vary. Note2: "name" is a reserved word on some platforms. Better avoid it.


a variation of @a_horse_with_no_name answer db2 for iseries without using group by clause and in clause. It actually works

DELETE from the_table a 
where rrn(a) < (
select max(rrn(a)) from the_table b 
where a.one = b.one and a.two = b.two and a.three = b.three
)

DELETE  FROM Table_Name
WHERE   Table_Name_ID NOT IN ( SELECT  MAX(Table_Name_ID)
                                    FROM    Table_Name
                                    GROUP BY one ,
                                             two, 
                                             three )

one two threee are your repeated columns and Table_Name_ID is PK


Please take backup of table before deleting the data

Delete from table where Name in (select name from table
group by one,two,three
having count(*) > 2)

You can use

     DELETE from TABLE Group by one,two,three Having count(*) > 2; 

This is a variation of levenlevi's answer that does not require a primary key on the table (Can't test the syntax right now thow)

DELETE FROM the_table
WHERE  rid_bit(the_table) NOT IN (SELECT MAX(rid_bit(the_table))
                                  FROM the_table
                                  GROUP BY one,two,three)

I think on iSeries the rid_bit() is not supported, but rrn() save the same purpose


For other using a very old version of db2 SQL: A combination of these posts helped identify and remove the dups from 2 batches posted twice.

SELECT   * FROM     LIBRARY.TABLE a
WHERE    a.batch in (115131, 115287)
AND      EXISTS ( SELECT 1 from LIBRARY.TABLE d 
    WHERE d.batch in (115131, 115287)
     AND a.one = d.one AND a.two = d.two AND a.three = d.three 
    GROUP BY d.one, d.two, d.three 
    HAVING count(*) <> 1 )

    AND RRN(a) > (SELECT MIN(RRN(b)) FROM LIBRARY.TABLE b 
        WHERE b.batch in (115131, 115287)
        AND a.one = b.one AND a.two = b.two AND a.three = b.three );

Need Your Help

Is there a way to prevent google search terms from matching urls?

google-search

At the moment, I am doing a number of searches which include "html" in them, for example "html rearrange". Unfortunately, I get a lot of hits from sites that include "rearrange" on a .html page bu...

Magento recalculate cart total in observer

session magento observer-pattern cart

I have an observer that removes items from the cart if they are out of stock (i.e. customer returns to their cart ofter x time, and an item in the cart has gone out of stock), and shows a message t...