Compare DSV with awk and delete differences by overwriting input file

I'm writing a bash script that, among other things, compares two pipe-delimited files, $OLDFILE and $NEWFILE.

I've been able to append the records that appear only in $NEWFILE to $OLDFILE with the following awk statement:

awk -F "|" 'NR==FNR{a[$4]++}!a[$4]' $OLDFILE $NEWFILE >> $OLDFILE

However, I also want to delete any records in $OLDFILE that aren't in $NEWFILE after first running the above. I hoped I could accomplish this with something like:

awk -F "|" 'NR==FNR{a[$4]++}a[$4]' $OLDFILE $NEWFILE > $OLDFILE

I thought that this would compare the $OLDFILE to the $NEWFILE and overwrite $OLDFILE with only the lines that matched, but awk is appending the output to $OLDFILE instead of overwriting it.

What am I missing?

I'm open to a better way of doing this, if anyone has a suggestion.

Answers


If the fields are known to be in the same order in both files and both files are known to be sorted the same way, use comm, which compares whole lines. (If the files are not known to be sorted, some preprocessing with sort will fix that.)

comm -1 -3 oldfile newfile

This will list lines that appear only in newfile.

comm -1 -2 oldfile newfile

This will list only the lines that appear in both files.

All together now

cat <(comm -1 -2 oldfile newfile) <(comm -1 -3 oldfile newfile) > combined

combined now contains the lines that appear only in newfile plus the lines that appear in both oldfile and newfile.

Note: This is roughly the same as just saying comm -1 oldfile newfile, but without the tab indentation that comm uses to separate its output columns.
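To make the two comm invocations above concrete, here is a minimal sketch on made-up data. The records, the temporary directory, and the variable names only_new and in_both are assumptions for illustration. It sorts into intermediate files rather than using process substitution, and writes -13 and -12 as shorthand for -1 -3 and -1 -2.

```shell
# Hypothetical sample data; file names and records are made up for illustration.
dir=$(mktemp -d)
printf '%s\n' 'c|3|z|k3' 'a|1|x|k1' 'b|2|y|k2' > "$dir/oldfile"
printf '%s\n' 'd|4|w|k4' 'c|3|z|k3' 'b|2|y|k2' > "$dir/newfile"

# comm expects sorted input, so sort both files first under one fixed locale.
LC_ALL=C sort "$dir/oldfile" > "$dir/old.sorted"
LC_ALL=C sort "$dir/newfile" > "$dir/new.sorted"

# -13 hides columns 1 and 3, leaving the lines found only in newfile.
only_new=$(LC_ALL=C comm -13 "$dir/old.sorted" "$dir/new.sorted")
# -12 hides columns 1 and 2, leaving the lines found in both files.
in_both=$(LC_ALL=C comm -12 "$dir/old.sorted" "$dir/new.sorted")

printf 'only in newfile:\n%s\n' "$only_new"
printf 'in both files:\n%s\n' "$in_both"
rm -rf "$dir"
```

Because comm compares whole lines, a record whose fourth field matches but whose other fields differ would count as a different line here.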

Unfortunately you cannot write directly back into oldfile, because the > redirection truncates it, usually before it has been read. Write to combined instead, then mv -f combined oldfile when you're done.
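The write-then-rename pattern described above might look like the following sketch. The sample records, the oldfile and newfile paths, and the tmp variable are hypothetical. The awk filter keys on the fourth field, as in the question, rather than on whole lines.

```shell
# Hypothetical sample files for illustration.
dir=$(mktemp -d)
oldfile="$dir/oldfile"
newfile="$dir/newfile"
printf '%s\n' 'a|1|x|k1' 'b|2|y|k2' > "$oldfile"
printf '%s\n' 'b|2|y|k2' 'c|3|z|k3' > "$newfile"

# Create the temporary file next to oldfile so the final mv stays on one filesystem.
tmp=$(mktemp "$dir/combined.XXXXXX")

# Keep only the oldfile records whose 4th field also occurs in newfile.
if awk -F '|' 'NR==FNR { seen[$4]; next } $4 in seen' "$newfile" "$oldfile" > "$tmp"; then
    mv -f "$tmp" "$oldfile"    # replace oldfile only after awk has finished reading it
else
    rm -f "$tmp"               # leave oldfile untouched if awk failed
fi
```

Redirecting into a separate file means oldfile is still intact while awk reads it, and it is replaced only once the filtered copy is complete.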


Thanks everyone for your input. I was finally able to accomplish this with a mixture of my initial approach and using comm, as suggested by @Sorpigal. Here's my solution for posterity.

# This appends new entries from $NEWFILE to the end of $OLDFILE
awk -F "|" 'NR==FNR{a[$4]++}!a[$4]' "$OLDFILE" "$NEWFILE" >> "$OLDFILE"

# This pulls out entries that are NOT in $NEWFILE but are in
# $OLDFILE and should be deleted. It then outputs the entries to be
# deleted to the $OUTFILE.
awk -F "|" 'NR==FNR{a[$4]++}!a[$4]' "$NEWFILE" "$OLDFILE" > "$OUTFILE"

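# This line removes from $OLDFILE every line that also appears in
# $OUTFILE, thus finally deleting any records not in $NEWFILE.
# -13 prints only the lines unique to the second input, without the
# leading tab that comm uses for its second column.
comm -13 <(sort "$OUTFILE") <(sort "$OLDFILE") > combined.csv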
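As a possible alternative, the three steps above could be folded into a single awk run that never writes to $OLDFILE while reading it. This is only a sketch under assumptions: the sample records, the tmp file, and the pass variable are hypothetical, and unlike the comm step it keeps the original record order instead of sorting.

```shell
# Hypothetical sample files for illustration.
dir=$(mktemp -d)
OLDFILE="$dir/old.dsv"
NEWFILE="$dir/new.dsv"
printf '%s\n' 'a|1|x|k1' 'b|2|y|k2' > "$OLDFILE"
printf '%s\n' 'b|2|y|k2' 'c|3|z|k3' > "$NEWFILE"

tmp=$(mktemp "$dir/merged.XXXXXX")

# awk applies each pass=N assignment just before it opens the file that follows it.
awk -F '|' '
    pass == 1 { innew[$4]; next }        # first read of NEWFILE: collect its keys
    pass == 2 {                          # OLDFILE: keep records whose key is in NEWFILE
        if ($4 in innew) { print; inold[$4] }
        next
    }
    pass == 3 && !($4 in inold)          # second read of NEWFILE: append records new to OLDFILE
' pass=1 "$NEWFILE" pass=2 "$OLDFILE" pass=3 "$NEWFILE" > "$tmp" &&
    mv -f "$tmp" "$OLDFILE"
```

With the sample data, the record keyed k1 is dropped, k2 is kept, and k3 is appended.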

Thanks again to everyone who took a look at this, especially @Sorpigal!

