Next-generation sequencing data are widely utilised for various downstream applications in bioinformatics, and numerous techniques have been developed for PCR-deduplication and error-correction to eliminate the bias and errors introduced during sequencing. This study provides, for the first time, a joint overview of recent advances in PCR-deduplication and error-correction on short reads. In particular, we utilise UMI-based PCR-deduplication strategies and sequencing data to assess the performance of solely computational PCR-deduplication approaches and to investigate how error correction affects the performance of PCR-deduplication. Our survey and comparative analysis reveal that the deduplicated reads generated by the solely computational PCR-deduplication and error-correction methods diverge substantially from the sets of reads obtained by the UMI-based deduplication methods. The existing solely computational PCR-deduplication and error-correction tools can eliminate some errors but still leave hundreds of thousands of erroneous reads uncorrected. All the error-correction approaches introduce thousands or more new sequences after correction, and these sequences bring no benefit to the PCR-deduplication process. Based on our findings, we discuss future research directions and make suggestions for improving existing computational approaches to enhance the quality of short-read sequencing data.