Dear all,
In KCRI's Illumina Basecall workflow, I always used to run FastQScreen
(with GRCh38 for human, and UniVec_Core for contaminants) to detect
contamination.
It's slow, but it gives easily interpretable output (including graphs and
MultiQC integration), and if desired you can very easily make it "bucket"
the reads into the database(s) that they map to.
When I copied the job into the Nanopore basecall workflow I was in for a
surprise. The jobs took forever or ran out of requested HPC memory (if I
remember well, even 96G didn't cut it for some runs).
Clearly, like FastQC, FastQScreen is an oldie, and especially read
mapping (which is effectively what it does) has since been optimised a
lot. My intuition would be that e.g. KMA would do this in a fraction of
the time. (cc-ing Philip in case he's not on this list)
The only thing that FastQScreen effectively adds to the mapping is the
user-friendly table and graph.
Anyone keen to add a newer mapper to FastQScreen (it now has BWA and
Bowtie2 as options, if I remember well)? Alternatively, suggestions for
an alternative to FastQScreen?
During the CoP meeting, more suggestions were made:
- Use Kraken (with the added advantage of getting much more
information than the "yes/no contamination" from FastQScreen), including
quantification of cross-sample/species contamination
- Why detect contamination on reads, when it's much easier to do this
on the assembly? Contaminants will come "falling out" anyway
- Also: for assemblies there are tools such as CheckM (for quantifying
both completeness and contamination, including within-species)
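To make the Kraken suggestion a bit more concrete, here is a minimal sketch of how the "quantification" could be read off a report. It assumes the standard six-column, tab-separated Kraken2 report (percent of reads, clade reads, direct reads, rank code, NCBI taxid, indented name) with no header line. The function name, the 1% threshold and the example numbers are made up for illustration and do not come from any real run.

```python
from typing import List, Tuple


def species_above_threshold(
    report_text: str, min_percent: float = 1.0
) -> List[Tuple[str, float]]:
    """Return (species name, percent of reads) pairs at or above min_percent.

    Assumes the standard six-column Kraken2 report layout (an assumption,
    reports written with extra minimizer columns would need other indices).
    """
    hits = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        try:
            percent = float(fields[0])
        except ValueError:
            continue  # skip anything that is not a data row
        rank = fields[3]
        name = fields[5].strip()  # names are indented by taxonomic depth
        if rank == "S" and percent >= min_percent:
            hits.append((name, percent))
    # Most abundant species first
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up example report, for illustration only
EXAMPLE_REPORT = "\n".join(
    [
        "  5.00\t50\t50\tU\t0\tunclassified",
        " 95.00\t950\t0\tR\t1\troot",
        " 80.00\t800\t800\tS\t562\t      Escherichia coli",
        " 14.50\t145\t145\tS\t9606\t      Homo sapiens",
        "  0.50\t5\t5\tS\t1280\t      Staphylococcus aureus",
    ]
)

hits = species_above_threshold(EXAMPLE_REPORT)
for name, percent in hits:
    print(f"{name}\t{percent:.2f}%")
```

Anything other than the expected species showing up above the threshold would then be a candidate contaminant, which gives roughly the same "yes/no" answer as FastQScreen plus the identity and share of the contaminant.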
Marco