parsing FASTA files using a generator

Published 2025-05-07 • Updated 2025-05-08

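Parsing FASTA is a pretty common task. The key is to write the parser as a generator, so that memory use is bounded by the largest single record rather than by the size of the file (as long as whatever consumes the generator doesn't accumulate the records). Something like: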

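def parse_fasta(path: str):
    """Yield (header, sequence) pairs from a FASTA file, one record at a time."""
    with open(path, 'r') as f:
        header, seq_lines = None, []
        for line in f:
            line = line.rstrip('\n')
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(seq_lines)
                    seq_lines = []
                header = line[1:]
            elif header is not None:
                # Ignore anything that appears before the first header.
                seq_lines.append(line)
        # Emit the last record; an empty or header-less file yields nothing.
        if header is not None:
            yield header, "".join(seq_lines)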

Then consume it like any other generator:

for header, seq in parse_fasta(path):
    foo(seq)  # foo stands in for whatever per-record processing you need
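
To make the memory point concrete, here is a minimal sketch that keeps only running totals while iterating, so at most one record is held at a time. The `summarize` helper, the demo file contents, and the temporary file are all made up for illustration; the parser is repeated so the snippet runs on its own.

```python
import os
import tempfile


def parse_fasta(path):
    """Yield (header, sequence) pairs one record at a time."""
    with open(path, "r") as f:
        header, seq_lines = None, []
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq_lines)
                    seq_lines = []
                header = line[1:]
            elif header is not None:
                seq_lines.append(line)
        if header is not None:
            yield header, "".join(seq_lines)


def summarize(path):
    """Return (record count, total bases, (longest header, its length))."""
    n_records, total_bases, longest = 0, 0, (None, 0)
    for header, seq in parse_fasta(path):
        # Only counters survive each iteration; seq is dropped afterwards.
        n_records += 1
        total_bases += len(seq)
        if len(seq) > longest[1]:
            longest = (header, len(seq))
    return n_records, total_bases, longest


# Hypothetical two-record FASTA written to a temp file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">seq1 demo\nACGT\nAC\n>seq2\nGGGTTTAAA\n")
    demo_path = tmp.name

result = summarize(demo_path)
os.remove(demo_path)
print(result)
```

By contrast, calling `list(parse_fasta(path))` would pull every record into memory at once and lose the benefit.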

This only helps when the file contains many records: each sequence is still assembled in full before it is yielded, so a single very large sequence (a whole chromosome, say) still ends up entirely in memory.
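
If a single record is too large to hold comfortably, one option is to yield the sequence in pieces instead of joining it. A rough sketch, assuming the consumer can work on one sequence line at a time; `iter_fasta_chunks` is a hypothetical name, and it takes an open handle rather than a path only to keep the demo short.

```python
import io


def iter_fasta_chunks(handle):
    """Yield (header, chunk) pairs, one per sequence line, without joining them.

    `handle` is any iterable of text lines (an open file, io.StringIO, ...).
    """
    header = None
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            header = line[1:]
        elif header is not None and line:
            # Skip blank lines and anything before the first header.
            yield header, line


# Count bases per record without ever building a full sequence.
demo = io.StringIO(">chrDemo\nACGTACGT\nACGT\n>other\nGG\n")
lengths = {}
for header, chunk in iter_fasta_chunks(demo):
    lengths[header] = lengths.get(header, 0) + len(chunk)
print(lengths)
```

The trade-off is that the consumer has to track record boundaries itself, and a header with no sequence lines produces no output at all in this sketch.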

It can also be beneficial to load the FASTA directly as bytes, rather than converting to strings:

def parse_fasta(path: str):
    with open(path, "rb") as fh:  # read in binary mode
        header, seq = None, bytearray()
        for line in fh:
            line = line.rstrip(b"\r\n")
            if line.startswith(b">"):
                if header is not None:
                    yield header, bytes(seq)
                header = line[1:].decode()
                seq.clear()  # reset bytearray
            else:
                seq.extend(line)  # append raw bytes
        if header is not None:
            yield header, bytes(seq)

I think this approach is better for very large sequences, because it skips the UTF-8 decoding step for the sequence data (only the header is decoded). It might also work better when streaming into a NumPy array, but I’m not sure exactly how that works. If text manipulation is required, though, parsing as a string directly makes more sense.
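
On the NumPy point, one possible route is `np.frombuffer`, which wraps an existing bytes object as a `uint8` array without copying. This is only a sketch and hasn't been benchmarked here; the `seq` value is a stand-in for one sequence yielded by the bytes-based parser, and the GC count is just an example of a vectorized operation.

```python
import numpy as np

# Stand-in for one sequence yielded by the bytes-based parser.
seq = b"ACGTNNACGG"

# Wrap the bytes as an array of byte values; no copy is made and the
# resulting array is read-only because bytes objects are immutable.
codes = np.frombuffer(seq, dtype=np.uint8)

# Vectorized GC count: compare byte values against ord("G") and ord("C").
gc_count = int(np.count_nonzero((codes == ord("G")) | (codes == ord("C"))))
print(codes.shape, gc_count)
```

If the array needs to be modified in place (masking bases, for instance), `codes.copy()` gives a writable version at the cost of one copy.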