parsing FASTA files using a generator
Published 2025-05-07 • Updated 2025-05-08
This is a pretty common task. The key is to write a generator, so that memory use is bounded by the largest single record rather than by the size of the file (as long as you don't accumulate what the generator yields). Something like:
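```python
def parse_fasta(path: str):
    """Yield (header, sequence) pairs from a FASTA file, one record at a time."""
    with open(path, "r") as f:
        header, seq_lines = None, []
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(seq_lines)
                    seq_lines = []
                header = line[1:]
            else:
                seq_lines.append(line)
        # Emit the final record; an empty file yields nothing.
        if header is not None:
            yield header, "".join(seq_lines)
```

Then treat this as any generator: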
```python
for header, seq in parse_fasta(path):
    foo(seq)
```

Obviously this only helps when the file holds multiple sequences; each record is built in full before it is yielded, so a single large sequence still has to fit in memory.
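To make the "depends on what you do with the generator" point concrete, here is a minimal sketch of a consumer that keeps only running totals, so nothing but the current record is held at once. The names `summarize` and `demo_records` are hypothetical; `demo_records` is just a stand-in that yields the same `(header, sequence)` pairs as `parse_fasta(path)`, so the example runs without a file.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header, sequence), the shape parse_fasta yields


def demo_records() -> Iterator[Record]:
    """Hypothetical stand-in for parse_fasta(path)."""
    yield "seq1 first record", "ACGTACGT"
    yield "seq2 second record", "GGGCCC"
    yield "seq3 third record", "ATAT"


def summarize(records: Iterable[Record]) -> dict:
    """Consume records one at a time, keeping only running totals."""
    count = 0
    total_bases = 0
    gc_bases = 0
    longest = ("", 0)  # (header, length) of the longest record seen so far
    for header, seq in records:
        count += 1
        total_bases += len(seq)
        upper = seq.upper()  # count soft-masked (lowercase) bases too
        gc_bases += upper.count("G") + upper.count("C")
        if len(seq) > longest[1]:
            longest = (header, len(seq))
    return {
        "records": count,
        "total_bases": total_bases,
        "gc_bases": gc_bases,
        "longest": longest,
    }


stats = summarize(demo_records())
print(stats)
```

By contrast, something like `list(parse_fasta(path))` pulls every record into memory at once and gives up the benefit of the generator.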
It can also be beneficial to load the FASTA directly as bytes, rather than converting to strings:
```python
def parse_fasta(path: str):
    """Yield (header, sequence) pairs, with the sequence kept as raw bytes."""
    with open(path, "rb") as fh:  # read in binary mode
        header, seq = None, bytearray()
        for line in fh:
            line = line.rstrip(b"\r\n")
            if line.startswith(b">"):
                if header is not None:
                    yield header, bytes(seq)  # bytes() copies, so clearing below is safe
                header = line[1:].decode()  # only the header is decoded
                seq.clear()  # reset bytearray
            else:
                seq.extend(line)  # append raw bytes
        if header is not None:
            yield header, bytes(seq)
```

I think this approach is better for very large sequences, because it skips the UTF-8 decoding step for the sequence data. It might also work better when streaming into a numpy array, but I’m not sure exactly how that works. If text manipulation is required, though, parsing as a string directly makes more sense.
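For the numpy idea, one possible route is `np.frombuffer`, which wraps an existing `bytes` object as a read-only `uint8` array without copying it. The sketch below assumes numpy is installed and that each sequence arrives as `bytes`, as yielded by the binary-mode parser; `base_counts` is a hypothetical helper, not part of any library.

```python
import numpy as np


def base_counts(seq: bytes) -> dict:
    """Count A/C/G/T in a bytes sequence without decoding it to str."""
    # View the bytes as unsigned 8-bit integers (ASCII codes); no copy is made.
    codes = np.frombuffer(seq, dtype=np.uint8)
    # Tally every possible byte value; minlength=256 guarantees a slot for each.
    counts = np.bincount(codes, minlength=256)
    # Only uppercase bases are reported here; lowercase (soft-masked) bases
    # land in their own slots and are ignored by this sketch.
    return {base: int(counts[ord(base)]) for base in "ACGT"}


example = b"ACGTTGCAAC"
print(base_counts(example))
```

With the parser, this would be called as `base_counts(seq)` inside the `for header, seq in parse_fasta(path)` loop. The array returned by `np.frombuffer` is read-only, so anything that needs to modify the sequence in place would have to copy it first.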