Talk:RenameFastaSequences
From Bio.scipy.org
Hi, I changed the program quite a bit... I tried to make it more idiomatic Python. Let me know if you think I butchered it too badly! =) Njm 17:17, 30 October 2007 (CDT) [MDS: and I've revised the discussion and changed to sys.argv[1:]]
I think it is useful to go through the bits of code here and see what they are doing. I am a (micro)biologist with some Unix experience but not much Python, so please feel free to edit/improve this page. OK, here goes...
# rename fasta sequences in file according to user input
# Copyright (c) 2007, Humberto Ortiz Zuazaga
It is always useful to say what the program does, who wrote it, and when. The # prefix starts a comment: the Python interpreter ignores everything after it on that line.
import sys
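Why import the sys module? sys is part of Python's standard library (it is not a Biopython package), and the contents of a module are only available after it has been imported. This program needs sys for sys.argv, which is used in the next line. The books say to import only the modules necessary for the job being done, which keeps start-up quick and makes it clear what the program depends on.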
basename, filename = sys.argv[1:]
The user needs to give the filename of the file to be altered and the basename for the new fasta header names. These two pieces of data are given as command-line arguments when the program is called; they are not typed in on standard input (keyboard) afterwards. The program reads them from sys.argv, a list provided by the sys module: sys.argv[0] is the name of the script, and the remaining items are the arguments. In the usage example below the code, the user calls the program with two arguments (in this discussion, foo is the filename and bar is the basename). sys.argv[1:] means "everything after the script name", and the two values are unpacked in order, so the first argument is assigned to basename and the second to filename. As an alternative, one could make this line clearer by breaking it into two statements:
basename = sys.argv[1]
filename = sys.argv[2]
In the above version, it is easier to see that two arguments, [1] and [2], are being assigned to separate variables. The version used in the program is a more condensed way of doing the same thing, with one difference: it fails with a ValueError unless exactly two arguments are given.
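To see the unpacking without running the real program, here is a small sketch. The helper name parse_args, the script name rename.py and the argument values are made up for illustration; they are not part of the original program. It checks the number of arguments first, so the user gets a usage message instead of a traceback.

```python
import sys

def parse_args(argv):
    """Return (basename, filename) from a list shaped like sys.argv.

    parse_args is a hypothetical helper, not part of the original script.
    """
    args = argv[1:]  # drop argv[0], the script name
    if len(args) != 2:
        # Exit with a short usage message rather than a ValueError traceback
        raise SystemExit('usage: %s basename filename' % argv[0])
    basename, filename = args  # unpacked in order: first, then second
    return basename, filename

# A made-up argument list standing in for the real sys.argv
basename, filename = parse_args(['rename.py', 'bar', 'foo'])
print(basename)
print(filename)
```

In a real script one would call parse_args(sys.argv) instead of passing a hand-written list.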
input = open(filename)
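open() opens the fasta file for reading and returns a file object, which is assigned to input. The contents are not read at this point; lines are read from the file as the loop below asks for them. (Note that input is also the name of a Python built-in function, which this variable hides for the rest of the program.)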
output = open('%s.new' % filename, 'w')
This is interesting. It opens a new file called filename.new, where 'filename' is the name you have given. So in the example, with an input file called foo, it would be foo.new. The % operator substitutes the value of filename for the %s in the string. The file is opened in write mode (that is the 'w'), so you can write the modified fasta sequences to the new file. 'output' is the file object the program writes to; the text itself ends up in the file on disk.
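Here is a small sketch of the two open() calls, assuming Python 3. It works in a temporary directory so it does not touch any real data, and the file name foo and its contents are made up. It also uses a with block, which is not in the original program; the with block closes both files automatically when it ends.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real data
workdir = tempfile.mkdtemp()
filename = os.path.join(workdir, 'foo')  # hypothetical input file

# Create a tiny fasta file to read from
with open(filename, 'w') as handle:
    handle.write('>seq one\nACGT\n')

new_name = '%s.new' % filename  # %s is replaced by the value of filename

# Open the input for reading (the default mode) and the output for writing
with open(filename) as infile, open(new_name, 'w') as outfile:
    outfile.write(infile.read())  # plain copy, no renaming yet

print(os.path.basename(new_name))
```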
count = 1
sets a variable count to an initial value of 1. This will be used (as a counter) later.
for line in input:
This is a for statement. It takes each line of text from 'input', one at a time, assigns it to the variable 'line', and processes it according to the indented statements below. It loops until there are no more lines of text in your fasta file (or rather, in the file object 'input' opened on your fasta file).
Note the colon (:) at the end of the for line. It tells the interpreter that an indented block follows, containing the statements to be carried out for each line.
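A quick way to watch the loop at work is to feed it a few lines held in memory. In this sketch (Python 3 assumed) io.StringIO stands in for the opened fasta file, and the sequences are made up.

```python
import io

# io.StringIO behaves like an opened text file, but lives in memory
fasta = io.StringIO('>first\nACGT\nTTGA\n>second\nGGCC\n')

seen = []
for line in fasta:      # one line per pass, until the file runs out
    seen.append(line)   # each line keeps its trailing newline

print(len(seen))
print(repr(seen[0]))
```

Note that 'line' holds the text of the line, including the newline at the end, which is why the program can write unchanged lines straight to the output.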
if line.startswith('>'):
Takes the line fed to it by the previous for statement and checks to see if the line starts with a greater-than sign, >. In a fasta file, this will be the header line.
HOZ: The previous version used Python regular expressions, which are a bit of overkill for the task. I still think re is a good habit to get into: soon enough the specifications will change, and you'll need more power than simple string manipulations will easily allow. In Python there is a long tradition of "there's almost always one best way to do it", and I think regular expressions are it for sequence manipulation.
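For comparison, here is a sketch of the same header test done both ways. The pattern below is only a guess at what a regular expression version might look like; it is not the code from the previous version of the program, and the sample lines are made up.

```python
import re

# A compiled pattern for "line begins with >" (an illustrative guess,
# not the pattern used in the earlier version of the program)
header = re.compile(r'^>')

lines = ['>gi|123 some description\n', 'ACGT\n', 'AC>GT\n']

by_regex = [bool(header.match(line)) for line in lines]
by_method = [line.startswith('>') for line in lines]

print(by_regex)
print(by_method)
```

Both tests agree on these lines. The regular expression becomes worthwhile once the pattern has to capture parts of the header, for example an identifier, rather than just detect it.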
output.write('>%s%d\n' % (basename, count))
If the line does start with a > sign, control goes to this statement. Instead of the original header, a new header is written to the new file in the form '>basenamecount\n'. This means it has a >, followed by the 'basename' you are using for the renaming (e.g. 'bar'), followed by a number (whatever is held in 'count'), followed by a newline. The original header text is discarded. Initially, count is set to 1, but the next statement below adds 1 to the value of count, so it goes up by 1 every time a new > line is processed.
count += 1
The number variable 'count' is increased by 1 each time this part of the loop is traversed.
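The formatting and the counter can be tried on their own. This sketch builds the headers in a list instead of writing them to a file; the basename 'bar' is the one used in this discussion, and the loop of three passes is arbitrary.

```python
basename = 'bar'
count = 1
headers = []

for _ in range(3):  # pretend three header lines were found
    # %s takes the string basename, %d takes the integer count
    headers.append('>%s%d\n' % (basename, count))
    count += 1

print(headers)
```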
else:
output.write(line)
Where a line does not start with a > sign, it is simply copied to the new file without any modification. The loop then goes back to the for statement to process the next line, until there are no lines left in the object 'input' (the file object opened on your fasta file). Then the program exits.
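Putting the pieces together, the whole loop can be wrapped in a function and tried on text held in memory. This is a sketch assuming Python 3; the function name rename_fasta, the return value and the sample sequences are made up for illustration, while the body of the loop follows the statements discussed above.

```python
import io

def rename_fasta(infile, outfile, basename):
    """Copy infile to outfile, replacing each header with >basenameN.

    rename_fasta is a hypothetical wrapper around the loop discussed above.
    """
    count = 1
    for line in infile:
        if line.startswith('>'):
            # Header line: write the new name and advance the counter
            outfile.write('>%s%d\n' % (basename, count))
            count += 1
        else:
            # Sequence line: copy it unchanged
            outfile.write(line)
    return count - 1  # number of headers renamed

# In-memory stand-ins for the input and output files
source = io.StringIO('>gi|1 first\nACGT\n>gi|2 second\nTTGA\nCCAA\n')
result = io.StringIO()

renamed = rename_fasta(source, result, 'bar')
print(renamed)
print(result.getvalue())
```

Because the function only needs objects that can be looped over and written to, the same code would work with real files opened with open().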
[I would like some feedback on whether this is a useful way to start the discussion. Or do the moderators prefer this type of material to go in the email discussion list for this site??]
line contains the actual text of the line
I very much enjoyed your discussion, I hope it proves useful to you and others. I wanted to point out that the 'line' variable will hold the actual text of the line, not the line number, as you stated in your discussion. [done- MDS]

