AWK reporting duplicate lines and count, program explanation

I found the following AWK program on the internet and tweaked it slightly to look at column $2:

{ a[$2,NR]=$0; c[$2]++ }
END {
    for( k in a ) {

       split(k,b,SUBSEP)

       t=c[b[1]] # added this bit to capture count

       if( b[1] in c && t>1 ) { # added && t>1 only print if count more than 1
         print RS "TIMES  ID" RS c[b[1]] "  " b[1] RS
         delete c[b[1]]
       }

       for(i=1;i<=NR;i++) if( a[b[1],i] ) {
          if(t>1){print a[b[1],i]} # added if(t>1) only print lines if count more than 1
          delete a[b[1],i]
       }
    }
}

Given the following file:

abc,2,3
def,3,4
ghi,2,3
jkl,5,9
mno,3,2

The output is as follows when the command is run:

Command: awk -F, -f find_duplicates.awk duplicates

Output:
TIMES  ID
2  2

abc,2,3
ghi,2,3

TIMES  ID
2  3

def,3,4
mno,3,2

This is fine.

I would like to understand what is happening in the AWK program.

I understand that the first line is loading each line into a multidimensional array? So the first line of the file would be a["2",1]="abc,2,3" and so on.

However, I'm a bit confused as to what c[$2]++ does, and also what the significance of split(k,b,SUBSEP) is.

Would appreciate it if someone could explain line by line what is going on in this AWK program.

Thanks.

Answers


The increment operator simply adds one to the value of the referenced variable. So c[$2]++ takes the value of c[$2] and adds one to it; an element that has not been seen before starts out as 0. If $2 is "a" and c["a"] was 3 before, its value will be 4 after this. So c keeps track of how many times you have seen each $2 value.
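As a small illustration (a sketch, not part of the original program), the counting rule can be run on its own against the sample data from the question. The shell variable name counts and the trailing sort are my additions; the sort is only there because the order of a for-in loop is unspecified.

```shell
# Count how often each value of column 2 occurs (sample data from the question).
counts=$(printf '%s\n' 'abc,2,3' 'def,3,4' 'ghi,2,3' 'jkl,5,9' 'mno,3,2' |
  awk -F, '
    { c[$2]++ }                           # an unset c[$2] starts out as 0
    END { for (id in c) print id, c[id] } # for-in order is unspecified
  ' | sort)
printf '%s\n' "$counts"
```

This prints the pairs 2 2, 3 2 and 5 1, one per line: IDs 2 and 3 each occur twice, and ID 5 occurs once.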

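for (k in a) loops over the keys of a. Each key is the value of $2 and the line number, joined by SUBSEP. If the value of $2 on the first line was "a", one of the values of k will be "a" SUBSEP "1" (with 1 being the line number); another will be the combination of the value of $2 from the second line and the line number 2, etc. Note that the order in which for-in visits the keys is not specified, so they do not necessarily arrive in line-number order.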

The split(k,b,SUBSEP) will create a new array b from the compound value in k, i.e. basically reconstruct the parts of the compound key that went into a. The value in b[1] will now be the value which was in $2 when the corresponding value in a was created, and the value in b[2] will be the corresponding line number.
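To see the compound key being taken apart, here is a minimal sketch with a single hand-made array entry that mirrors the first line of the sample file. The shell variable parts and the variable n are just names I picked for the demonstration.

```shell
# Store one element under a compound key, then split the key back into its parts.
parts=$(awk 'BEGIN {
    a["2", 1] = "abc,2,3"          # stored under the key "2" SUBSEP "1"
    for (k in a) {
        n = split(k, b, SUBSEP)    # n is the number of pieces, here 2
        print n, b[1], b[2], a[k]
    }
}')
printf '%s\n' "$parts"
```

This prints 2 2 1 abc,2,3: two pieces, the ID 2 in b[1], the line number 1 in b[2], and finally the stored line.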

The final loop is somewhat inefficient; it loops over all possible line numbers, then skips immediately to the next one if an entry for that ID and line number did not exist. Because this runs inside the outer loop for (k in a) it will be repeated a large number of times if you have a large number of inputs (it will loop over all input line numbers for each input line). It would be more efficient, at the expense of some additional memory, to just build a final output incrementally, then print it all after you have looped over all of a, by which time you have processed all input lines anyway. Perhaps something like this:

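# Replaces only the END block; the main rule { a[$2,NR]=$0; c[$2]++ } stays as it is.
END {
    for (k in a) {
        split (k,b,SUBSEP)                  # b[1] = ID, b[2] = line number
        if (c[b[1]] > 1) {                  # only IDs seen more than once
            # first line seen for this ID: start its output with "count  ID"
            if (! o[b[1]]) o[b[1]] = c[b[1]] "  " b[1] RS
            o[b[1]] = o[b[1]] RS a[k]       # append the input line itself
        }
        delete a[k]
    }
    for (q in o) print o[q] RS              # one block per duplicated ID
}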

Update: Removed the premature deletion of c[b[1]].

