Analyse de l’épissage alternatif dans les données RNAseq

Analyse de l’épissage alternatif dans les données
RNAseq

FaRLINE and KisSplice

Figure 1 shows the two pipelines that we are comparing. While STAR and TopHat are third-party softwares, we developed the other methods ourselves. KisSplice was introduced in [Sacomoto et al., 2012], kissDE was introduced in [Lopez-Maestre et al., 2016]. KisSplice2RefGenome and FaRLine are methods we introduce in this paper. For the sake of self-containment, we explain all methods here.

KisSplice KisSplice is a local transcriptome assembler

As most short reads transcriptome assemblers [Grabherr et al., 2011, Schulz et al., 2012, Robertson et al., 2010], it relies on a De Bruijn graph (DBG). Its originality lies in the fact that it does not try to assemble full-length transcripts. Instead, it assembles the parts of the transcripts where there is a variation in the exon content. By aiming at a simpler goal, it can afford to be more exhaustive and identify more splicing events. The key concept on which KisSplice is built is that variations in the nucleotide content of the transcripts will cor16 respond to specific patterns in the DBG called bubbles. KisSplice’s main algorithmic step therefore consists in enumerating all the bubbles in the graph built from the reads. The sequences corresponding to the two paths of each bubble are then aligned to the reference genome using STAR, and the result of the alignment is analysed using KisSplice2RefGenome to annotate the event.

Alternative splicing events are bubbles in the DBG

Supplementary figure S6 gives a schematic example of two alternative transcripts which differ by the inclusion of one exon. For the sake of simplicity, the example is given for words of length 3, but the reasoning holds for any word length. Each distinct word of length k is called a k-mer and corresponds to a node of the DBG. There is a directed edge from a node u to a node v if the last k − 1 nucleotides of u are identical to the first k − 1 nucleotides of v. Each transcript will therefore correspond to a path in the DBG. A pair of internally node-disjoint paths with a common source and target is called a bubble. The smaller path of the bubble corresponds to the exclusion isoform and is composed of all k-mers which overlap the junction between the exons flanking the skipped exon. It is therefore usually composed of k − 1 k-mers. In the special case where the skipped exon shares a prefix with its 3’ flanking exon, or a suffix with its 5’ flanking exon, then the lower path is composed of less than k −1 k-mers and the k-mer which is the source (resp. target) does not correspond anymore to an exonic k-mer, but to a junction k-mer. In practice, the DBG is built from the reads, not from the transcripts. The reads stem from possibly all genes expressed in the studied conditions. Two difficulties arise: reads contain sequencing errors, and repeats may be shared across genes.

Dealing with sequencing errors

As originally described in [Pevzner et al., 2004] and later in [Zerbino and Birney, 2008], sequencing errors generate recognisable structures in De Bruijn graphs, which can be identified and removed. Their systematic removal however prevents assemblers from studying SNPs. A compromise consists in discarding rare k-mers from the graph. This is the strategy we use in KisSplice, where we remove all k-mers seen only once. This idea is however not sufficient in the context of transcriptome assembly, where the coverage is very uneven and mostly reflects expression levels. For highly expressed genes, several reads may have errors at the same site, generating k-mers with a coverage larger than an absolute threshold. We therefore also use a relative cut-off, which we set to 2%. These cut-offs we introduce to remove sequencing errors have an impact on the running time and on the sensitivity. Decreasing them allows to discover rarer isoforms, at the expense of a longer running time.

Dealing with repeats

Repeats are notoriously difficult to assemble in DNAseq data, and were initially thought to be much less problematic in RNAseq, since they are mostly located in introns and intergenic regions. In practice, mRNA extraction protocols are not perfect, and a fraction of pre-mRNA remains (typically 5% for total polyA+ RNA [Tilgner et al., 2012]). Each intron is covered by few reads, but if a repeat is present in many introns, then this repeat will obtain a high coverage and will correspond to very dense subgraphs in the De Bruijn graph built from the reads. The traversal of such subgraphs to 18 enumerate all the bubbles they contain is long and mostly fruitless. We showed in [Sacomoto et al., 2014] that an effective strategy to deal with this issue is to enumerate only bubbles which have at most b branches. In practice, we set b to 5. Increasing b will increase the running time, but allow to find more repeat-associated alternative splicing events. Bubbles which do not correspond to true AS events can be filtered out at the mapping step.

Annotating the events with KisSplice2RefGenome

Bubbles found by KisSplice are mapped to the reference genome using STAR, with its default settings, which means that in case of multi-mappings, STAR reports all equally best matches. The mapping results are then analysed by KisSplice2RefGenome. At this stage, bubbles are classified in sub-types depending on the number of blocks obtained when mapping each path of the bubble to the genome (Supplementary Figure S7). For exon skippings, the longer path of the bubble corresponds to 3 blocks, while the lower path corresponds to 2 blocks. The splice sites are located and compared to the annotations. Events with novel splice sites are reported explicitly in the output of the program. In the case where the bubble corresponds to a genomic insertion or deletion, it exhibits a specific pattern in terms of block numbers and is reported separately. In the case where the bubble maps to two locations on the genome, a distinction is made between the case of exact repeats where both paths map to both locations and inexact repeats where each path maps to a distinct location (Supplementary Figure S8). The cases of exact repeats corresponds to recent paralogs.

FaRLine

FasterDB EnsEMBL r75 annotation FasterDB RNAseq Pipeline, FaRLine, use the FasterDB-based EnsEMBL r75 annotation database. FasterDB is a database containing all annotated human splicing variants [Mallinjoud et al., 2014]. The genomic exons are defined by projecting the transcript exons (Supplementary Figure S9). First, the transcript exons are grouped by position. Then each group of exons define a projected exon with the following rules: • The start is the smallest start of the non-first-exon of the group. • The end is the highest end of the non-last-exon of the group that ends before the start of the next group of exons. When the most frequent event annotated in the transcrits is an intron retention, the projected genomic exon is defined as a combination of the two exons the intron retained. In supplementary figure S9, the exons 5 and 6 and the intron 5 are considered as one unique exon. As events included in an exon are overseen, this results in some events being missed. Mapping The first step of FaRLine is to map the reads to a reference genome. This step is done using Tophat-2.0.11 [Trapnell et al., 2012]. tophat –min-intron-length 30 –max-intron-length 1200000 \ -p 8 [–solexa1.3-quals for Sknsh_rep1 and Sknsh_rep2] \ –transcriptome-index A transcriptome index has been built by TopHat using EnsEMBL r75 annotations in gtf format. When a transcriptome index is used, the mapping steps are modified: instead of aligning first to the genome, which is the 20 default behavior, TopHat uses Bowtie to align the reads to the transcript sequences first, then align the remaining unmapped reads to the genome. Minimal and maximal intron lengths have been modified (default 70 and 500000) to maximize the number of junctions detected, according to the statistics provided by FasterDB EnsEMBL r75 annotations. The resulting alignment files have been filtered using samtools 0.1.19 [Li et al., 2009]. samtools view -F 260 -f 1 -q 10 -b With this step, only the primary alignments are kept. The minimum read alignment quality was set up so that multi-mapping reads were removed from the alignment file. Annotation and quantification of alternative splicing events We wrote custom perl scripts, based on the FasterDB-based EnsEMBL r75 annotation database. For each gene, all the reads with at least one base overlapping the gene from the start to the end coordinates are retrieved. CIGAR strings are then used to retrieve the alignments blocks. Junction reads are identified by the presence of at least one ’N’ letter in the CIGAR. Junction reads were filtered if: • More than 10% of soft-clipping was detected in the alignment (it should not be the case with TopHat) • An indel was close to the junction site, as it would make the junction position uncertain Junction read alignments are then processed block by block sequentially from left to right. Alignment blocks under 4bp on read extremities are removed from the reads as we considered it is not sufficient to identify cor21 rectly the mapping localization. Then each block is compared to FasterDB annotations to check if the block boundaries correspond to known exons annotated in FasterDB, or to a putative new acceptor or donor site. First and last alignment blocks for each read must overlap one and only one exon for a read to be considered. For the inner blocks, if alignment blocks map to a succession of exons and introns, it is considered as an intron retention. However, as the read size is only 76bp, this should not happen often. For the acceptors and donors, we also added a supplementary filter. If a new donor is identified within a junction, we check if the junction also has an acceptor identified of the same length +/-1bp on the other side of the junction, showing most probably a problem of mapping. Once all the blocks are identified, the block annotations are used to annotate putative alternative splicing events: alternative skipped exon, multiple exon skipping, acceptor, or donor sites.

Table des matières

1 Introduction
I L’épissage
A Mécanismes
1 Réaction d’épissage
2 Régulation de l’épissage
3 Les différents types d’événements d’épissage
B Importance de l’épissage alternatif
C Épissage alternatif et pathologies
II Les données à large échelle : le séquençage
A Les technologies de séquençage haut-débit
1 Les principes généraux du séquençage de seconde génération
2 Les technologies de séquençage de seconde génération
3 Les technologies de séquençage de troisième génération
B Les différentes applications
C L’analyse des données haut débit
D Apports des données large échelle
III Méthodes d’analyse de l’épissage
A Historique de l’analyse de l’épissage : du cas par cas au large échelle
1 La RT-PCR
2 La PCR haut débit
3 Les puces à ADN
B Le séquençage du transcriptome : RNA-Seq
C L’analyse de l’épissage dans les données RNA-seq
1 Pipeline d’analyse de l’épissage alternatif
2 Contrôle qualité
3 Alignement
4 Assemblage
5 Identification des événements d’épissage ou des isoformes
6 Quantification
7 Analyse différentielle
8 Visualisation
9 Pour aller plus loin
2 Objectifs
3 Résultats
I Développement d’un pipeline d’analyse de l’épissage alternatif dans les données RNA-Seq
A FasterDB
B FaRLine
C Pipeline complet
D Visualisation RNA-Seq
E Validation de la méthode
F Valorisation de la méthode
G Discussion
II Comparaison d’une méthode basée sur l’alignement à une basée sur l’assemblage
A Publication soumise à Genome Research
B Discussion
III Études de l’impact fonctionnelle de l’épissage alternatif
A Annotation protéique à l’échelle des exons
B Analyse des voies de signalisation
C Discussion
4 Conclusion et perspectives
5 Annexes
I Application à des données de collaborateurs
II Exon ontologie