A FASTA file is a plain-text file in which each record has:
a header line starting with >
the sequence on the following lines (it may span several lines).
Example of a FASTA file
>My_Target_Protein some optional description
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN
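To make the header/sequence layout concrete, the sketch below parses FASTA text with plain Python. The function name parse_fasta_text is hypothetical and used only for illustration. It assumes well-formed input and is not a substitute for a full parser such as Biopython's SeqIO.

```python
def parse_fasta_text(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text.

    Hypothetical helper for illustration; assumes each record starts with '>'.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The example sequence, split over two lines to show that wrapping is allowed
example = (
    ">My_Target_Protein some optional description\n"
    "MKTAYIAKQRQISFVKSH\n"
    "FSRQLEERLGLIEVQAN\n"
)
for header, sequence in parse_fasta_text(example):
    print("Header:", header)
    print("Sequence:", sequence)
    print("Length:", len(sequence))
```

The first word of the header (here My_Target_Protein) is usually treated as the record ID, and the rest as a free-text description.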
In computer-aided drug design (CADD), FASTA files are used for:
Structure prediction (AlphaFold / ColabFold)
Homology modeling
Target selection pipelines
Sequence checks before modeling
Reading a FASTA File in Google Colab
Let's write a tiny FASTA file programmatically inside Google Colab.
Create and download a FASTA file named "example.fasta" (Figure 1).
Copy/paste into a code cell.
Code:
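# A FASTA record needs a ">" header line before the sequence
fasta_text = ">My_Target_Protein some optional description\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN\n"
with open("example.fasta", "w") as f:
    f.write(fasta_text)
print("Created example.fasta")
# Optional (Colab only): download the file to your computer
# from google.colab import files; files.download("example.fasta")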
Figure 1: Creation of a FASTA file.
Next, read the FASTA file "example.fasta" that we created above using Biopython's SeqIO module (Figure 2).
Copy/paste into a code cell.
Code:
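# If Biopython is not installed in your Colab runtime, run this first: !pip install biopython
from Bio import SeqIO
# Parse every record in the file into a list of SeqRecord objects
records = list(SeqIO.parse("example.fasta", "fasta"))
print("Number of sequences:", len(records))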
Figure 2: Reading a FASTA File.
Now inspect the first record of "example.fasta" to view its ID, description, sequence, and length (Figure 3).
Copy/paste into a code cell.
Code:
record = records[0]
print("ID:", record.id)
print("Description:", record.description)
print("Sequence:", record.seq)
print("Length:", len(record.seq))
Figure 3: Content of the FASTA File.
For any CADD project, it is good practice to export a clean, validated FASTA file.
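What counts as "validated" depends on the project. As one possible sketch, the check below flags any character that is not one of the 20 standard one-letter amino acid codes. The helper name find_invalid_residues is hypothetical, and the choice to reject ambiguity codes such as X, B, Z, or U is an assumption that may be too strict for some pipelines.

```python
# Assumption: only the 20 standard amino acids are accepted
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence):
    """Return (position, character) pairs, 1-based, for non-standard letters.

    Hypothetical helper for illustration only.
    """
    return [
        (position, residue)
        for position, residue in enumerate(str(sequence).upper(), start=1)
        if residue not in STANDARD_AMINO_ACIDS
    ]


sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN"
problems = find_invalid_residues(sequence)
if problems:
    print("Non-standard characters found:", problems)
else:
    print("Sequence passed the check; length:", len(sequence))
```

Because the function converts its input with str(), it should also accept a Biopython Seq object such as record.seq.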
Let's prepare a drug_target.fasta file, starting with a proper SeqRecord (Figure 4).
Copy/paste into a code cell.
Code:
from Bio.SeqRecord import SeqRecord
# Reuse the sequence from the record read earlier, with a new ID and description
clean_record = SeqRecord(
    record.seq,
    id="Drug_Target_Protein",
    description="Validated protein sequence for CADD"
)
Figure 4: Create a proper SeqRecord
Write it to disk (save the FASTA file)
Generate a drug_target.fasta file (Figure 5), which is now a clean input file you can feed into structure prediction or modeling.
Copy/paste into a code cell.
Code:
from Bio import SeqIO
SeqIO.write(clean_record, "drug_target.fasta", "fasta")
print("Saved drug_target.fasta")
Figure 5: Save the FASTA file
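The saved file is ordinary text: one header line followed by the sequence. The sketch below builds the same layout by hand so you know what to expect when you open drug_target.fasta. The function name format_fasta_record is hypothetical, and the 60-character line width is an assumption based on a common FASTA convention.

```python
def format_fasta_record(record_id, description, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width`.

    Hypothetical helper for illustration only.
    """
    header = f">{record_id} {description}".rstrip()
    # Slice the sequence into fixed-width lines
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


text = format_fasta_record(
    "Drug_Target_Protein",
    "Validated protein sequence for CADD",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN",
)
print(text, end="")
```

A short sequence like this one fits on a single line; longer proteins are split across several lines, which any FASTA reader joins back together.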
Now that you have a clean protein FASTA (drug_target.fasta), you can:
Run AlphaFold/ColabFold to predict a structure.
Perform homology modeling.
Align homologs to find conserved residues near the binding site.
Map known mutations (resistance or variants).
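As a small illustration of the last item, the sketch below applies a single substitution written in the common "K2A" style (wild-type residue, 1-based position, mutant residue) and checks that the wild-type residue matches the sequence first. The function name apply_point_mutation and the K2A mutation are hypothetical, chosen only to demonstrate the idea; they do not describe a real resistance mutation.

```python
import re


def apply_point_mutation(sequence, mutation):
    """Apply a substitution such as 'K2A' and return the mutated sequence.

    Hypothetical helper for illustration; positions are 1-based.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation format: {mutation!r}")
    wild_type = match.group(1)
    position = int(match.group(2))
    mutant = match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(
            f"Position {position} is outside the sequence (length {len(sequence)})"
        )
    observed = sequence[position - 1]
    if observed != wild_type:
        # A mismatch usually means the numbering scheme or the isoform differs
        raise ValueError(
            f"Expected {wild_type} at position {position}, found {observed}"
        )
    return sequence[:position - 1] + mutant + sequence[position:]


sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN"
mutated = apply_point_mutation(sequence, "K2A")  # hypothetical example mutation
print("Wild type:", sequence)
print("Mutant:   ", mutated)
```

The wild-type check matters in practice because published mutation positions often follow a reference numbering that may not match the sequence in your FASTA file.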
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009 Jun 1;25(11):1422-3. doi: 10.1093/bioinformatics/btp163.
Biopython Documentation: https://biopython.org/wiki/Documentation