FASTA Files

FASTA is the most common sequence file format in computer-aided drug design (CADD).

A FASTA file is plain text with:

a header line starting with >
the sequence on the following lines.

Example of a FASTA file:

>My_Target_Protein some optional description
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN
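To make the header/sequence layout concrete, here is a minimal sketch that uses only the Python standard library. The function name parse_fasta_text is hypothetical (it is not part of any library), and the sketch assumes well-formed input in which the first word of the header is the identifier and the rest is an optional description.

```python
def parse_fasta_text(text):
    """Split FASTA-formatted text into (identifier, description, sequence) tuples.

    Illustrative helper only: assumes each record starts with a '>' header line.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header: first word is the identifier, the rest is the description
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            records.append([identifier, description, ""])
        elif records:
            # Sequence lines are concatenated onto the most recent record
            records[-1][2] += line
    return [tuple(record) for record in records]


# The same example record, with the sequence split over two lines
example = (
    ">My_Target_Protein some optional description\n"
    "MKTAYIAKQRQISFVK\n"
    "SHFSRQLEERLGLIEVQAN\n"
)
parsed = parse_fasta_text(example)
for identifier, description, sequence in parsed:
    print(identifier, "|", description, "|", sequence, "|", len(sequence))
```

The sequence may span several lines; a parser joins them into one string, which is why the example yields a single 35-residue sequence.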

In CADD, FASTA files are used for:

Structure prediction (AlphaFold / ColabFold)
Homology modeling
Target selection pipelines
Sequence checks before modeling

Reading a FASTA File in Google Colab

Before reading a FASTA file, we need one to read. Let's write a tiny FASTA file programmatically inside Google Colab.

Create a FASTA file named "example.fasta" in the Colab working directory (Figure 1).

Copy/paste into a code cell.

Code:

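# A FASTA record needs a header line starting with ">" before the sequence
fasta_text = ">My_Target_Protein some optional description\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN\n"

with open("example.fasta", "w") as f:
    f.write(fasta_text)

print("Created example.fasta")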

Figure 1: Creation of a FASTA file.

Let us read the FASTA file "example.fasta" that we created above using Biopython's SeqIO module (Figure 2).

Copy/paste into a code cell.

Code:

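# If this import fails in Colab, install Biopython first with: !pip install biopython
from Bio import SeqIO

# Parse every record in the file into a list of SeqRecord objects
records = list(SeqIO.parse("example.fasta", "fasta"))
print("Number of sequences:", len(records))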

Figure 2: Reading a FASTA File.

Let us inspect the first record in "example.fasta" to view its ID, description, sequence, and length (Figure 3).

Copy/paste into a code cell.

Code:

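# Take the first (and here, only) record from the parsed list
record = records[0]

print("ID:", record.id)
# For FASTA input, the description holds the full header line (ID included)
print("Description:", record.description)
print("Sequence:", record.seq)
print("Length:", len(record.seq))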

Figure 3: Content of the FASTA File.

Now, let us write a clean FASTA file for the CADD project.

It is a good habit to export a clean, validated FASTA file for each CADD project.
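What counts as "validated" depends on the project. As one possible check, the sketch below flags characters outside the 20 standard amino-acid one-letter codes. The function name find_invalid_residues is hypothetical, and rejecting ambiguity codes such as X or B is an assumption made for illustration; some pipelines accept them.

```python
# The 20 standard amino-acid one-letter codes
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence):
    """Return (position, character) pairs, 1-based, for non-standard letters."""
    return [
        (index, char)
        for index, char in enumerate(sequence.upper(), start=1)
        if char not in STANDARD_AMINO_ACIDS
    ]


good = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN"
bad = "MKTAYIAKXRQISFVK*"  # made-up sequence containing an X and a stop symbol

print(find_invalid_residues(good))  # empty list: nothing to fix
print(find_invalid_residues(bad))   # positions of the X and the *
```

An empty result means the sequence contains only standard residues; otherwise the reported positions show where to look before modeling.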

First, build a SeqRecord with a clear ID and description (Figure 4). The next step writes it to drug_target.fasta.

Copy/paste into a code cell.

Code:

from Bio.SeqRecord import SeqRecord

# Reuse the sequence from the record read earlier, with a new ID and description
clean_record = SeqRecord(
    record.seq,
    id="Drug_Target_Protein",
    description="Validated protein sequence for CADD"
)

Figure 4: Create a proper SeqRecord

Write it to disk (save the FASTA file)

Save the record as drug_target.fasta (Figure 5). This file is a clean input that you can feed into structure prediction or modeling.

Copy/paste into a code cell.

Code:

from Bio import SeqIO

# Write the cleaned record to a new FASTA file
SeqIO.write(clean_record, "drug_target.fasta", "fasta")
print("Saved drug_target.fasta")

Figure 5: Save the FASTA file
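When Biopython writes FASTA, it puts the ID and description on the header line and wraps long sequences across several lines (60 characters per line by default). The standard-library sketch below imitates that layout so you can see what the saved file should look like. The function name format_fasta and the doubled test sequence are illustrative assumptions, not Biopython's implementation.

```python
def format_fasta(identifier, description, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width` characters."""
    header = f">{identifier} {description}".rstrip()
    # Cut the sequence into fixed-width chunks, one per line
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# 70 residues, built by repeating the example sequence (illustration only)
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN" * 2
fasta_text = format_fasta(
    "Drug_Target_Protein",
    "Validated protein sequence for CADD",
    sequence,
)
print(fasta_text)
```

Line wrapping does not change the sequence: joining the lines after the header gives back the original string.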

Now that you have a clean protein FASTA file (drug_target.fasta), you can:

Run AlphaFold/ColabFold to predict a structure.
Perform homology modeling.
Align homologs to find conserved residues near the binding site.
Map known mutations (resistance or variants).
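As an illustration of the last item, mapping a mutation onto the sequence can be sketched with the standard library alone. The function name apply_point_mutation and the mutation "K2A" are hypothetical, and the sketch assumes that mutation positions are numbered from 1 along the sequence in the FASTA file, which is not always true for published numbering schemes.

```python
import re


def apply_point_mutation(sequence, mutation):
    """Apply a substitution written like 'K2A' (wild type, 1-based position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation format: {mutation!r}")
    wild_type, mutant = match.group(1), match.group(3)
    position = int(match.group(2))
    if not 1 <= position <= len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    # Confirm the sequence really has the stated wild-type residue there
    if sequence[position - 1] != wild_type:
        raise ValueError(
            f"Expected {wild_type} at position {position}, "
            f"found {sequence[position - 1]}"
        )
    return sequence[:position - 1] + mutant + sequence[position:]


target = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAN"
mutated = apply_point_mutation(target, "K2A")  # hypothetical variant
print(mutated)
```

The wild-type check is the useful part: if it fails, the numbering of the mutation and the numbering of your FASTA sequence probably differ.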

