This is a plug for Pysam: a well designed and feature rich wrapper for HTSlib. I’ve used pysam for a few years now and it has gotten better with time. It is now a very efficient and convenient library for reading and writing BAM files, reading VCF, FASTQ and FASTA files. For VCF and FASTA files it can use the tabix index to read regions.
I first used Pysam to read BAM files, which was the most feature complete part when I started. I started to use it for writing BAM files when the quirks were ironed away. I used to use PyVCF for reading use my own (inefficient) reader for FASTA and FASTQ files.
In it’s current state (0.9.1.4) Pysam is a very effective solution to reading, in addition, VCF, FASTA and FASTQ files. It transparently handles gzipped files and BCF files, so you don’t have to keep track of the file type yourself.
Not only is it able to use the VCF and FASTA indices it also returns the data nicely encapsulated. My only complaint is that I feel the VCF record format is a tad over-complicated (it uses a similar interface to PyVCF). I always end up with very complicated accessor expressions when I wish to get alleles and genotypes from a record.
What I really liked about the VCF reader is that it intelligently handles the case of deletions crossing a region boundary. So for example, if you ask for variants in the range 10 to 20 and there is a 5 base deletion starting at position 8, it will be captured because it crossed into the requested regions. It’s very nice not to have to write your own logic to handle such cases.
I’ve also found VCF reading efficient enough for my purposes. When reading a single sample from multi-sample (population) VCF files it helps to set the filter, so that only the required sample is loaded.