Removing rows that don't contain strings from csv file, using one-line reg exp grep/sed

I have idsfile.csv, a comma-separated file of ids on a single line (it contains no newline characters), and I would like to keep only the lines of a second file, datafile.txt, that contain one of those ids surrounded by tabs.

Sample idsfile.csv:

000001,000002,000005,000007,000008,000009,000011,000021,000029,000040,...

Sample datafile.txt:

titl e1   000001   description1 
title2   000003   descr iption2 
ti tle3   000021   des cripti on3 
title4   000023   description4 

If I was doing this without having to read in the ids from a file I would try:

grep -Ev '/\t000001\t|\t000002\t|\t000003\t/' datafile.txt > output.txt

but I am unsure how to read in the comma separated values in a way that I could then use them in the regular expression.

Does anyone know how I might assemble this as a one line command query please? Perhaps with textscan?

Edit: Actually, if I changed idsfile.csv to have an id on each line (with a tab before and after), would a line similar to this work, or, as I expect, is the syntax quite wrong?

grep -Evf idsfile.csv datafile.txt > output.txt

Answers


The single line of data in idsfile.csv is hostile to this workflow - you will have to transform it into a series of lines. The Unix toolset is based around lines!

So, we need to transliterate the commas into newlines:

tr , '\012' < idsfile.csv > idsfile.lines
fgrep -f idsfile.lines datafile.txt

A POSIX-compliant 'grep' will also recognize:

grep -F -f idsfile.lines datafile.txt

You might even be able to get away with:

tr , '\012' < idsfile.csv |
grep -F -f - datafile.txt

This tells 'grep' to read the list of names to search for from its standard input.
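
As a rough end-to-end sketch of the pipeline above, the following builds small sample files in a scratch directory (the directory and the shortened id list are assumptions made for illustration, not part of the question) and then runs the same tr and grep combination. It assumes a grep that accepts '-f -' for standard input, as GNU grep does.

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample ids on one comma-separated line (shortened from the question).
printf '000001,000002,000005,000021\n' > idsfile.csv

# Sample data with tab-separated columns, as described in the question.
printf 'titl e1\t000001\tdescription1\n'   >  datafile.txt
printf 'title2\t000003\tdescr iption2\n'   >> datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt
printf 'title4\t000023\tdescription4\n'    >> datafile.txt

# Put one id on each line, then let grep read those lines as fixed-string patterns.
result=$(tr ',' '\n' < idsfile.csv | grep -F -f - datafile.txt)
printf '%s\n' "$result"
```

Only the rows for 000001 and 000021 should be printed. One thing to watch for: an empty line in the pattern list (for example from a trailing comma) matches every line, so it is worth checking the id file for that before relying on the result.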

Finally, if you're using GNU grep, you could add '-w' to search for whole words - it requires each match to be surrounded by non-word characters, that is, anything other than letters, digits and underscores (tabs in the examples). The '-w' option means that if a line in datafile.txt contains

something 000002100  kkkk

the entry '000021' will not select that line (without the '-w', it would be selected).
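
To see the difference that '-w' makes, here is a small sketch that counts matching lines with and without it. The file contents are made up for the demonstration (one line from the question plus the 000002100 line above), and it assumes a grep that supports '-w', such as GNU grep.

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A single id to search for.
printf '000021\n' > idsfile.lines

# One line that merely contains the id inside a longer number,
# and one line where the id stands alone between tabs.
printf 'something\t000002100\tkkkk\n'      >  datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt

# -c prints the number of matching lines instead of the lines themselves.
without_w=$(grep -c -F -f idsfile.lines datafile.txt)
with_w=$(grep -c -F -w -f idsfile.lines datafile.txt)

printf 'without -w: %s\nwith -w: %s\n' "$without_w" "$with_w"
```

Without '-w' both lines are counted, because 000021 appears inside 000002100; with '-w' only the line where the id is a separate word is counted.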


Use sed to convert the contents of idsfile.csv into a regular expression for use with grep.
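
This answer gives no command, so the following is only one possible reading of it, not necessarily what the author had in mind: sed replaces each comma with '|' to form an alternation, and grep -E matches that alternation between two literal tabs. The scratch directory, the shortened id list and the variable names are assumptions for illustration.

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample ids on one comma-separated line (shortened from the question).
printf '000001,000002,000005,000021\n' > idsfile.csv

# Sample data with tab-separated columns, as described in the question.
printf 'titl e1\t000001\tdescription1\n'   >  datafile.txt
printf 'title2\t000003\tdescr iption2\n'   >> datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt
printf 'title4\t000023\tdescription4\n'    >> datafile.txt

# A literal tab character, since grep does not treat \t as a tab.
tab=$(printf '\t')

# Turn "a,b,c" into "a|b|c" so it can be used as an ERE alternation.
alternation=$(sed 's/,/|/g' idsfile.csv)

# Keep only lines where one of the ids sits between two tabs.
result=$(grep -E "${tab}(${alternation})${tab}" datafile.txt)
printf '%s\n' "$result"
```

Because the whole pattern is passed as a command-line argument, a very long id list could run into the system's argument length limit; the pattern-file approach with '-f' avoids that.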


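The following one-liner uses awk to put each field of the csv file on its own line, wrapped in tabs, producing a list of patterns that grep reads via the -f option. Setting RS to ',|\n' relies on awk treating a multi-character RS as a regular expression, which gawk and mawk do but POSIX leaves unspecified. We then use Bash's process substitution syntax <( ) to treat the output of the awk command as a file (named pipe).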

$ grep -w -f <(awk -v 'RS=,|\n' '{print "\t"$0"\t"}' sample.csv) title.txt

Input

$ cat sample.csv
000001,000003,000005,000007,000008,000009,000011,000023,000029

$ cat title.txt
titl e1 000001  description1
title2  000003  descr iption2
ti tle3 000021  des cripti on3
title4  000023  description4

Output

$ grep -w -f <(awk -v 'RS=,|\n' '{print "\t"$0"\t"}' sample.csv) title.txt
titl e1 000001  description1
title2  000003  descr iption2
title4  000023  description4

Note that the line containing 000021 did not match, because 000021 is not in sample.csv. Also, although it is not apparent from the listing, each 6-digit number in title.txt is surrounded by tabs, not spaces.

