Slice specific characters in CSV using python
I have data in tab delimited format that looks like:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
I am only interested in the first 3 characters of each entry (i.e. 0/0 and 0/1). I figured the best way to do this would be to use re.match together with genfromtxt from NumPy. This example is as far as I have gotten:
    import re
    csvfile = 'home/python/batch1.hg19.table'
    from numpy import genfromtxt
    data = genfromtxt(csvfile, delimiter="\t", dtype=None)
    for i in data:
        m = re.match('[0-9]/[0-9]', i)
        if m:
            print m.group(0),
        else:
            print "NA",
This works for the first row of the data, but I am having a hard time figuring out how to extend it to every row of the input file.
Should I make it a function and apply it to each row separately, or is there a more Pythonic way to do this?
Numpy is great when you want to load in an array of numbers. The format you have here is too complicated for numpy to recognize, so you just get an array of strings. That's not really playing to numpy's strength.
Here's a simple way to do it without numpy:
    import re

    # csvfile is the path defined in the question
    result = []
    with open(csvfile, 'r') as f:
        for line in f:
            row = []
            for text in line.split('\t'):
                match = re.search('([0-9]/[0-9])', text)
                if match:
                    row.append(match.group(1))
                else:
                    row.append("NA")
            result.append(row)
    print(result)
    # [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]
on this data:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
---:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
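If you would rather wrap this in a function, as the question suggests, here is one possible sketch. It is not the only way to do it: the names GENOTYPE, genotype, extract_genotypes and sample are made up for illustration, the standard csv module does the tab splitting, and the pattern is anchored with match (rather than search as in the loop above) because only the start of each entry is of interest. The sample is held in memory so the snippet runs without the real file.

```python
import csv
import io
import re

# Hypothetical name: pattern for a genotype such as 0/0 or 0/1.
GENOTYPE = re.compile(r'[0-9]/[0-9]')

def genotype(field, missing="NA"):
    """Return the genotype at the start of one field, or `missing`."""
    match = GENOTYPE.match(field)  # match() only looks at the start
    return match.group(0) if match else missing

def extract_genotypes(lines, missing="NA"):
    """Return one list per row with the leading genotype of each field."""
    reader = csv.reader(lines, delimiter="\t")
    return [[genotype(field, missing) for field in row] for row in reader]

# In-memory stand-in for the two-row sample shown above.
sample = io.StringIO(
    "0/0:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\t"
    "0/0:74:-1.02,-23.28,-301.81:99.00\n"
    "---:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\t"
    "0/0:74:-1.02,-23.28,-301.81:99.00\n"
)
result = extract_genotypes(sample)
print(result)
# [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]
```

For a real file, an open file object can be passed in place of sample, since csv.reader accepts any iterable of lines.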
Unless you really want to use NumPy, try this:
    with open('home/python/batch1.hg19.table') as f:
        for line in f:
            for cell in line.split('\t'):
                print(cell[:3])
This iterates over each line of the file, splits the line on the tab character, and prints the first three characters of each cell.
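If you want to keep the values instead of printing them, the same slicing idea can be collected into a list of lists. This is only a sketch: first_three and sample_lines are hypothetical names, and the sample rows are held in memory rather than read from the file. Keep in mind that plain slicing does no validation, so a malformed entry such as ---:23:... comes back as --- rather than NA.

```python
def first_three(lines):
    """Return the first three characters of every tab-separated cell."""
    return [
        [cell[:3] for cell in line.rstrip("\n").split("\t")]
        for line in lines
    ]

# In-memory stand-in for two rows of the tab-delimited file.
sample_lines = [
    "0/0:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\t"
    "0/0:74:-1.02,-23.28,-301.81:99.00\n",
    "---:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\t"
    "0/0:74:-1.02,-23.28,-301.81:99.00\n",
]
table = first_three(sample_lines)
print(table)
# [['0/0', '0/1', '0/0'], ['---', '0/1', '0/0']]
```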