Unable to remove newlines from Ensembl FASTA

I'm trying to find protein motifs from an Ensembl FASTA file. I've gotten the bulk of the script done, such as retrieving the sequence ID and the sequence itself, but I am receiving some funny results.

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $motif1 = qr/(HE(\D)(\D)H(\D{18})E)/x;
my $motif2 = qr/(AMEN)/x;
my $input;
my $output;
my $count_total     = 0;
my $count_processed = 0;
my $total_run       = 0;
my $id;
my $seq;
my $motif1_count    = 0;
my $motif2_count    = 0;
my $motifboth_count = 0;

############################################################################################################################
# FILEHANDLING - INPUT/OUTPUT
# User input prompting and handling
print "**********************************************************\n";
print "Question 3\n";
print "**********************************************************\n";

# Opens the input file or kills the script.
# 'or' is used instead of '||' because '||' binds to the filename string, not to open().
open my $fh, '<', "chr2.txt" or die "Error! Cannot open file: $!\n";

#Opens and creates output file previously assigned to variable to new variable or kills script
#open(RESULTS, '>', $output)||die "Error! Cannot create output file:$!\n";

# FILE and DATA PROCESSING
############################################################################################################################

while (<$fh>) {

    if (/^>(\S+)/) {
        $count_total = ++$count_total;    # Plus one to count
        find_motifs($id, $seq) if $seq;   # Passing to subroutine
        $id = substr($1, 0, 15);          # Taking only the first 15 characters for the id
        $seq = '';
    }
    else {
        chomp;
        $seq .= $_;
    }
}

find_motifs($id, $seq) if $seq;    # Process the last record; no later header line triggers it

print "Total proteins: $count_total \n";
print "Proteins with both motifs: $motifboth_count \n";
print "Proteins with motif 1: $motif1_count \n";
print "Proteins with motif 2: $motif2_count \n";

exit;

######################################################################################################################################
# SUBROUTINES
#
# Takes passed variables from special array
# Finds the position of motif within seq
# Checks for motif 1 presence and if found, checks for motif 2. If not found, prints motif 1 results
# If no motif 1, checks for motif 2

sub find_motifs {
    my ($id, $seq) = @_;
    if ($seq =~ $motif1) {
        my $motif_position = index $seq, $1;
        my $motif = $1;
        if ($seq =~ $motif2) {
            $motif1_count    = ++$motif1_count;
            $motif2_count    = ++$motif2_count;
            $motifboth_count = ++$motifboth_count;
            print "$id, $motif_position, \n$motif \n";
        }
        else {
            $motif1_count = ++$motif1_count;
            print "$id, $motif_position,\n $motif\n\n";
        }
    }
    elsif ($seq =~ $motif2) {
        $motif2_count = ++$motif2_count;
    }
}

When a motif spans the end of one line of data and the beginning of the next, the match is returned with the newline still embedded in it. This method of reading in the data has worked well before.

Sample Results:

ENSG00000119013, 6,  HEHGHHKMELPDYRQWKIEGTPLE (CORRECT!)

ENSG00000142327, 123,  HEVAHSWFGNAVTNATWEEMWLSE (CORRECT!) 

ENSG00000151694, 410, **AECAPNEFGAEHDPDGL**

This is the problem. The motif matches, but the output shows the first half, the newline, and then the second half printed on the same line as well. That is a symptom of the larger problem: getting rid of the newline!

Total proteins: 13653  
Proteins with both motifs: 1  
Proteins with motif 1: 12  
Proteins with motif 2: 22

I've tried different approaches, such as @seq =~ s/\r//g or s/\n//g, at different places within the script.

Answers


It's not clear from your description, but "prints the second half on the same line as well" sounds like your output is overlaid on itself because it has a carriage-return character at the end.

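This happens if you are running on a Linux system and you just chomp a line that has come from Windows: the line ends in "\r\n", and chomp removes only the "\n", leaving the carriage return in the string.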

You should replace chomp with s/\s+\z// which will remove all trailing whitespace. And because both carriage return and linefeed count as "whitespace" it will remove all possible termination characters.
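A minimal sketch of the difference, assuming the script runs on Linux (where $/ defaults to "\n") and the input file has Windows CRLF line endings. The sample line is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical line, as read on Linux from a file with
# Windows (CRLF) line endings.
my $line = "HEVAHSWFGNAVTNATWEE\r\n";

# chomp removes only the trailing "\n" (the default value of $/),
# so the carriage return stays in the string.
my $chomped = $line;
chomp $chomped;

# s/\s+\z// removes every trailing whitespace character, "\r" included.
my $stripped = $line;
$stripped =~ s/\s+\z//;

printf "after chomp:     %d characters\n", length $chomped;
printf "after s/\\s+\\z//: %d characters\n", length $stripped;
```

In the question's script the substitution would go where chomp is now, inside the else branch of the read loop, before the line is appended to $seq.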

By the way, you are misunderstanding the purpose of the ++ operator. It modifies the contents of the variable it is applied to, so all you need is ++$motif1_count etc. Your code works as it is because the operator also returns the value of the incremented variable, so $motif1_count = ++$motif1_count first increments the variable and then assigns it to itself.
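As a short illustration, here are the three forms side by side. The variable name is taken from the question; the starting value is arbitrary:

```perl
use strict;
use warnings;

my $motif1_count = 0;

# Pre-increment on its own already updates the variable in place.
++$motif1_count;

# Post-increment does the same when the return value is not used.
$motif1_count++;

# The form used in the question also works, but the assignment is
# redundant: ++ has already stored the new value before it is
# assigned back to the same variable.
$motif1_count = ++$motif1_count;

print "motif1_count is now $motif1_count\n";
```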

Also, you use \D in your regex. Are you aware that this matches any non-digit character? It seems a very vague classification to be useful.
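To make that concrete: \D also matches "\r" and "\n", so a stray line terminator inside the sequence can take part in a match. The sketch below compares the question's first motif (without its capture groups) against a stricter variant. The stricter character class, limited to the 20 standard one-letter amino-acid codes, is an assumption; the question does not say whether ambiguity codes such as X should be allowed. Both test sequences are made up.

```perl
use strict;
use warnings;

# Motif 1 from the question, minus the capture groups:
# \D accepts any non-digit character, including "\r" and "\n".
my $loose = qr/HE\D\DH\D{18}E/;

# Hypothetical stricter variant: only the 20 standard amino-acid letters.
my $aa     = '[ACDEFGHIKLMNPQRSTVWY]';
my $strict = qr/HE(?:$aa){2}H(?:$aa){18}E/;

# A made-up sequence with a carriage return left in the middle of the motif,
# and the same sequence with a residue in its place.
my $broken = 'HEAAH' . ('A' x 9) . "\r" . ('A' x 8) . 'E';
my $clean  = 'HEAAH' . ('A' x 18) . 'E';

for my $case ( [ broken => $broken ], [ clean => $clean ] ) {
    my ($name, $seq) = @$case;
    printf "%-6s  loose: %-3s  strict: %s\n",
        $name,
        ($seq =~ $loose  ? 'yes' : 'no'),
        ($seq =~ $strict ? 'yes' : 'no');
}
```

With the stricter class, a leftover line terminator makes the match fail instead of silently ending up inside the reported motif.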

