Nucleic Acids Res. 2010 Nov;38(21):7400-9.
doi: 10.1093/nar/gkq655. Epub 2010 Jul 29.

Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies


Osvaldo Zagordi et al. Nucleic Acids Res. 2010 Nov.

Abstract

Next-generation sequencing technologies can be used to analyse genetically heterogeneous samples at unprecedented detail. The high coverage achievable with these methods enables the detection of many low-frequency variants. However, sequencing errors complicate the analysis of mixed populations and result in inflated estimates of genetic diversity. We developed a probabilistic Bayesian approach to minimize the effect of errors on the detection of minority variants. We applied it to pyrosequencing data obtained from a 1.5-kb fragment of the HIV-1 gag/pol gene in two control and two clinical samples. The effect of PCR amplification was analysed. Error correction resulted in a two- and five-fold decrease of the pyrosequencing base substitution rate, from 0.05% to 0.03% and from 0.25% to 0.05% in the non-PCR and PCR-amplified samples, respectively. We were able to detect viral clones as rare as 0.1% with perfect sequence reconstruction. Probabilistic haplotype inference outperforms the counting-based calling method in both precision and recall. Genetic diversity observed within and between two clinical samples resulted in various patterns of phenotypic drug resistance and suggests a close epidemiological link. We conclude that pyrosequencing can be used to investigate genetically diverse samples with high accuracy if technical errors are properly treated.

Figures

Figure 1.
Posterior probability of reconstructed haplotypes. The algorithm computes posterior probabilities for the inferred haplotypes and their frequency given the observed reads. The figure shows (for window 3 in the PCR-amplified control experiment) the posterior of the haplotype frequencies in a box-plot (red box-plots) and the posterior of the reconstructed haplotype sequences (blue circles). In most cases, the box-plot height (lower-upper quartile) is invisible on this scale, because the clustering assignment is stable and the number of reads assigned to the cluster in the sampling does not change. The posterior distribution for the frequency is then very peaked. The figure reports estimates for the haplotypes reported by the algorithm without further processing. With additional analysis one finds that haplotypes 12 and 13 differ by one gap only in a homopolymeric region, and that their posterior probabilities sum up to one. Moreover, haplotype 1 consists of four reads and is the result of a recombination event between two of the original haplotypes.
Figure 2.
Precision–recall analysis. We considered the haplotypes inferred in all windows by the clustering algorithm and by the cut-off method based on the minimum number of reads supporting the variant. Red circles represent precision and recall for a set of threshold values chosen in the cut-off method (values from 1 to the number of reads in the most-supported haplotype); red arrows annotate points for a cut-off equal to 50. We performed a similar analysis on the output of the clustering algorithm, considering haplotypes whose confidence value (posterior probability) was greater than or equal to a given threshold. Blue squares represent threshold values from 0.01 to 1, with blue arrows annotating the values obtained when the threshold is 0.9. Dashed lines and arrows are used for points obtained in the non-PCR-amplified sample, solid lines and arrows for points in the PCR-amplified one. In the non-PCR-amplified sample, precision is perfect (no false positives) and recall is very good. In the PCR-amplified sample, some false positives are found. In both cases, the clustering method outperforms the cut-off method. Results for individual windows can be found in Supplementary Data.
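The threshold sweep described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the toy sequences, the scores, and the convention that an empty call set has precision 1 are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def precision_recall(called: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a set of called haplotypes against the known ones."""
    true_pos = len(called & truth)
    # Assumed convention: with no calls there are no false positives, so precision is 1.
    precision = true_pos / len(called) if called else 1.0
    recall = true_pos / len(truth) if truth else 1.0
    return precision, recall


def sweep(
    scores: Dict[str, float], truth: Set[str], thresholds: Iterable[float]
) -> List[Tuple[float, float, float]]:
    """For each threshold, keep haplotypes with score >= threshold and evaluate them."""
    rows = []
    for t in thresholds:
        called = {hap for hap, score in scores.items() if score >= t}
        precision, recall = precision_recall(called, truth)
        rows.append((t, precision, recall))
    return rows


# Hypothetical toy data: sequences, posteriors and read counts are invented.
truth = {"ACGT", "ACGA", "TCGT"}
posterior = {"ACGT": 0.99, "ACGA": 0.95, "TCGT": 0.60, "ACTT": 0.30}
read_support = {"ACGT": 400, "ACGA": 120, "TCGT": 8, "ACTT": 55}

pr_clustering = sweep(posterior, truth, [0.5, 0.9])  # posterior-probability threshold
pr_cutoff = sweep(read_support, truth, [1, 50])      # minimum-read-count cut-off

for name, rows in (("clustering", pr_clustering), ("cut-off", pr_cutoff)):
    for t, p, r in rows:
        print(f"{name:10s} threshold={t:<5} precision={p:.2f} recall={r:.2f}")
```

The same `sweep` function serves both methods because each one reduces to "keep a haplotype if its score reaches the threshold"; only the score differs (posterior probability versus number of supporting reads).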
Figure 3.
Frequency estimation with the clustering method. In each window, the true frequency of the haplotypes was estimated by aligning the raw reads to the original sequences (direct mapping) for the non-PCR-amplified sample (a) and the PCR-amplified sample (b). Haplotypes were then reconstructed and checked against the originals for identity and frequency. Circles represent perfect matches with one of the original haplotypes; triangles indicate an imperfect match. Except for a few spurious cases at low frequencies, there is good agreement in both identity and frequency between inferred and actual haplotypes.
Figure 4.
HIV protease amino acid allele frequency spectra of two patient samples. We analysed the frequency of amino acid substitutions in the protease for two patients suspected to be in the same infection chain. Both present the drug resistance mutation I54V. Their consensus sequences differ at position 10, with Patient 1 showing an isoleucine and Patient 2 a valine. The horizontal line shows the 20% threshold typical of Sanger sequencing.
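As a rough illustration of how such a spectrum could be derived from reconstructed haplotypes, the sketch below sums haplotype frequencies per position and residue, then lists the alleles that fall under the 20% line. The input format (aligned amino acid strings paired with frequencies), the function names, and the three-residue toy sequences are assumptions for the example and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def allele_spectrum(haplotypes: List[Tuple[str, float]]) -> List[Dict[str, float]]:
    """Per-position amino acid frequencies from (aligned sequence, frequency) pairs."""
    length = len(haplotypes[0][0])
    columns = [defaultdict(float) for _ in range(length)]
    for seq, freq in haplotypes:
        if len(seq) != length:
            raise ValueError("sequences must be aligned to the same length")
        for pos, residue in enumerate(seq):
            columns[pos][residue] += freq
    return [dict(col) for col in columns]


def below_threshold(
    spectrum: List[Dict[str, float]], threshold: float = 0.20
) -> List[Tuple[int, str, float]]:
    """Alleles with frequency under the threshold, as (1-based position, residue, freq)."""
    hits = []
    for pos, col in enumerate(spectrum, start=1):
        for residue, freq in sorted(col.items()):
            if 0 < freq < threshold:
                hits.append((pos, residue, freq))
    return hits


# Hypothetical three-residue haplotypes with made-up frequencies.
haps = [("ILV", 0.70), ("VLV", 0.25), ("ILI", 0.05)]
spectrum = allele_spectrum(haps)
minor = below_threshold(spectrum)  # alleles under the 20% line
print(minor)
```

In this toy input only the isoleucine at position 3 falls under the threshold; the valine at position 1 sits at 25% and so lies above the line.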
Figure 5.
Structure of the viral quasispecies and predicted resistance to lopinavir. Circles represent detected haplotypes translated into amino acid sequences. The size reflects the frequency of the amino acid sequences, while the fill colour indicates the predicted resistance to the PI lopinavir. Green indicates higher and red lower levels of predicted drug susceptibility. The circles are positioned in the plot such that their distance approximately preserves the Hamming distance of the amino acid sequences. The number and the letter next to each circle denote, respectively, the patient and the protease sequence reported in the Supplementary Data.
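The layout in this figure depends on pairwise Hamming distances between amino acid sequences. The sketch below computes such a distance matrix; the sequences are invented placeholders, and the two-dimensional embedding step is left out because the caption does not say which method was used.

```python
from itertools import combinations
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: List[str]) -> List[List[int]]:
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = hamming(seqs[i], seqs[j])
    return dist


# Hypothetical five-residue sequences, for illustration only.
seqs = ["PQITL", "PQVTL", "PQVTW"]
dist = distance_matrix(seqs)
for row in dist:
    print(row)
```

A matrix like `dist` could then be passed to any distance-preserving embedding to place the circles; which embedding to use is a separate choice.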
