FASTA: keep only the first n bases of each sequence

I have multiple FASTA files, each containing thousands of sequences of varying length. I would like to keep only the first n bases (say, 200) of each sequence. How can I do this in Perl?

Answers


It is difficult to tell exactly what you mean without seeing an example, but if you only need the first 200 characters of each line, just use cut:

cut -c1-200 file
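
Note that cut trims every line, so a header longer than 200 characters would be shortened too, and the result is only right when each sequence sits on a single line. A minimal sketch of a header-safe variant under that same one-line assumption (the function name cut_seq_lines is made up for illustration, not an existing tool):

```shell
# Hypothetical helper: trim sequence lines to n characters, leave headers alone.
# Assumes each sequence occupies exactly one line.
cut_seq_lines() {
    n="$1"; shift
    awk -v n="$n" '/^>/ { print; next } { print substr($0, 1, n) }' "$@"
}
```

It could be called as: cut_seq_lines 200 file.fasta >newfile.fasta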

If a sequence is wrapped across several physical lines, count across them and print only up through the 200th character. A line starting with > is a header line, which marks the start of a new sequence.

awk '/^>/{ seqlen=0; print; next; }
    seqlen < 200 { if (seqlen + length($0) > 200)
            $0 = substr($0, 1, 200-seqlen);
        seqlen += length($0); print }' file.fasta >newfile.fasta
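
Since the question asks for a general n, the same logic can be wrapped so the cutoff is passed in rather than hard-coded. A sketch, where the function name trim_fasta and the awk variable n are my own choices rather than part of any tool:

```shell
# Hypothetical wrapper: keep the first n residues of every record,
# whether the sequence is on one line or wrapped across several.
trim_fasta() {
    n="$1"; shift
    awk -v n="$n" '
        /^>/ { seqlen = 0; print; next }       # header: reset the counter
        seqlen >= n { next }                   # record already full: drop the line
        {
            if (seqlen + length($0) > n)       # line crosses the cutoff: shorten it
                $0 = substr($0, 1, n - seqlen)
            seqlen += length($0)
            print
        }' "$@"
}
```

It could be called as: trim_fasta 200 file.fasta >newfile.fasta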

Oh, in Perl?

perl -nle 'if (/^>/) { $seqlen = 0; print; next }
    next if ($seqlen >= 200);
    $_ = substr($_, 0, 200-$seqlen) if ($seqlen + length($_) > 200);
    $seqlen += length($_);
    print;' file.fasta >newfile.fasta

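Alternatively, read one whole record at a time by setting the input record separator to '>', join the wrapped sequence lines, and keep only the first 200 bases: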

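$/ = '>';                       # read one FASTA record at a time
<>;                             # discard the empty chunk before the first '>'
while (my $seq = <>) {
    chomp $seq;                 # strip the trailing '>' (the record separator)
    $seq =~ s/^(.*)//;          # peel off the header line
    my $id = $1;
    $seq =~ s/\n//g;            # join wrapped sequence lines
    $seq = substr $seq, 0, 200; # keep the first 200 bases
    print ">$id\n$seq\n";
}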

I recommend BioPerl for this sort of thing: these tasks are easy to accomplish with it, and you don't have to worry about details such as line wrapping. In the script below, the first argument is your FASTA file and the second is the output file that will hold the first 200 bases of each sequence.

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

my $usage = "$0 infile outfile\n";
my $infile = shift or die $usage;
my $outfile = shift or die $usage;

my $seqin = Bio::SeqIO->new(-file => $infile, -format => 'fasta');
my $seqout = Bio::SeqIO->new(-file => ">$outfile", -format => 'fasta');

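while (my $seq = $seqin->next_seq) {
    # subseq() rejects an end coordinate past the sequence length, so clamp it
    my $end = $seq->length < 200 ? $seq->length : 200;
    my $first200 = $seq->subseq(1, $end); # coordinates are 1-based
    my $subseq = Bio::Seq->new(-seq => $first200, -id => $seq->id, -desc => $seq->desc);
    $seqout->write_seq($subseq);
}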

Here's how I solved it, in case anyone is interested in another way to do it. I used fasta_formatter, a tool included in Bio-Linux, to put each sequence on a single line (-w 0), then trimmed with cut as @sudo_O suggested, and finally wrapped the sequences back to a width of 80 characters (-w 80).

fasta_formatter -w 0 < FILE | cut -c1-LENGTH | fasta_formatter -w 80 > TRIMMED_FILE
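
Whichever approach is used, it is cheap to confirm afterwards that no record ended up longer than intended. A sketch that prints the length of the longest sequence in a file (the function name max_seq_len is hypothetical):

```shell
# Hypothetical check: print the length of the longest sequence in a FASTA file.
max_seq_len() {
    awk '
        /^>/ { if (len > max) max = len; len = 0; next }  # close the previous record
        { len += length($0) }                             # add up wrapped lines
        END { if (len > max) max = len; print max + 0 }   # close the last record
    ' "$@"
}
```

Running max_seq_len newfile.fasta on a file trimmed to 200 bases should print a number no larger than 200.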
