Commit a5116640 authored by Laure QUINTRIC

add trimreads option and remove rename remaining options

parent b5197d24
@@ -2,42 +2,39 @@
## What for?
Classify, clean and sort forward and reverse reads from Genoscope sequencing data
Extract and organize forward and reverse reads from ligation sequencing data
Here is an example of raw sequencing files provided by the Genoscope :
For Sample 12BA308 :
* R1 : CCR_AAFDOSTA_2_1_HGWCMBCX2.12BA380_clean.fastq.gz
* R2 : CCR_AAFDOSTA_2_2_HGWCMBCX2.12BA380_clean.fastq.gz
For Sample :
* R1 : sample-R1.fastq.gz
* R2 : sample-R2.fastq.gz
4 files will be generated :
* R1F : ABYSS_NAME_R1F.fastq.gz for forward reads extracted from file 2_1
* R1R : ABYSS_NAME_R1R.fastq.gz for reverse reads extracted from file 2_1
* R2F : ABYSS_NAME_R2F.fastq.gz for forward reads extracted from file 2_2
* R2R : ABYSS_NAME_R2R.fastq.gz for reverse reads extracted from file 2_2
* R1F : sample_R1F.fastq.gz for forward reads extracted from file R1
* R1R : sample_R1R.fastq.gz for reverse reads extracted from file R1
* R2F : sample_R2F.fastq.gz for forward reads extracted from file R2
* R2R : sample_R2R.fastq.gz for reverse reads extracted from file R2
## Requirements
* The script runs on raw sequencing files (fastq.gz) of one marker (e.g. 18S-V1) for several samples
* properties.ini : contains the forward and reverse primer sequences and the number of allowed mismatches, used to calculate the cutadapt primer error rate for each marker
* extract.ini : path configuration and marker selection
* A CSV file with two columns containing : GENOSCOPE_SAMPLE_NAME;ABYSS_SAMPLE_NAME
* All the fastq files must be in the same input directory
* Python 3.6 is required
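
The requirement above says that the cutadapt primer error rate is calculated from the number of allowed mismatches. A minimal sketch of one plausible calculation, assuming the rate is simply the allowed mismatches divided by the primer length (the function name and the example primer are hypothetical, not taken from the project):

```python
def primer_error_rate(primer: str, allowed_mismatches: int) -> float:
    """Return allowed_mismatches / len(primer) as a fraction.

    Assumption: this is the value later passed to cutadapt's -e option;
    the actual formula used by extract.py is not shown in the README.
    """
    if not primer:
        raise ValueError("primer sequence is empty")
    if not 0 <= allowed_mismatches <= len(primer):
        raise ValueError("allowed_mismatches must be between 0 and the primer length")
    return allowed_mismatches / len(primer)


# Hypothetical 20-nt primer with 2 allowed mismatches
rate = primer_error_rate("ACGTACGTACGTACGTACGT", 2)
```

Because cutadapt interprets the error rate relative to the length of the matched primer, a rate derived this way should allow roughly the configured number of mismatches on a full-length match.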
## Steps
* [Cutadapt](https://cutadapt.readthedocs.io/en/stable/) is run to separate reads matching the forward primer from reads matching the reverse primer in raw sequencing files. For each file (R1 and R2), two files are created : R1F, R1R and R2F, R2R. The primers are removed (--trimreads option) during this operation.
* For each sample, the 4 files are renamed according to the Genoscope/ABYSS names CSV file : ABYSS_NAME_R1F-cutadapt.tar.gz and ABYSS_NAME_R1R-cutadapt.tar.gz, ABYSS_NAME_R2F-cutadapt.tar.gz and ABYSS_NAME_R2R-cutadapt.tar.gz
* Reads from file ABYSS_NAME_R1R-cutadapt.tar.gz are all renamed with /2 extension instead of /1
* Reads from file ABYSS_NAME_R2F-cutadapt.tar.gz are all renamed with /1 extension instead of /2
* Files ABYSS_NAME_R1F-cutadapt.tar.gz and ABYSS_NAME_R2R-cutadapt.tar.gz are re-paired using [BBMAP repair](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/) script in order to remove singletons and to sort.
* Files ABYSS_NAME_R2F-cutadapt.tar.gz and ABYSS_NAME_R1R-cutadapt.tar.gz are re-paired using [BBMAP repair](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/) script in order to remove singletons.
* [Cutadapt](https://cutadapt.readthedocs.io/en/stable/) is run to separate reads matching the forward primer from reads matching the reverse primer in raw sequencing files. For each file (R1 and R2), two files are created : R1F, R1R and R2F, R2R.
* Reads from file sample_R1R-cutadapt.tar.gz are all renamed with /2 extension instead of /1
* Reads from file sample_R2F-cutadapt.tar.gz are all renamed with /1 extension instead of /2
* Files sample_R1F-cutadapt.tar.gz and sample_R2R-cutadapt.tar.gz are re-paired using [BBMAP repair](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/) script in order to remove singletons and to sort.
* Files sample_R2F-cutadapt.tar.gz and sample_R1R-cutadapt.tar.gz are re-paired using [BBMAP repair](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/) script in order to remove singletons.
* Read names are modified : the /1 and /2 extensions are removed so that each read has the same name in file R1 as in file R2, which dada2 requires in order to recognize the pairs of reads.
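
The read renaming described in the steps above (swapping /1 and /2, then dropping the extension) can be sketched as follows. This is an illustration only, assuming plain 4-line FASTQ records whose read name ends in /1 or /2; the function names are hypothetical and this is not the code used by extract.py:

```python
import re
from typing import Iterable, Iterator, Optional

# Matches a trailing /1 or /2 mate extension on a read name
_MATE_SUFFIX = re.compile(r"/[12]$")


def rename_mate(header: str, mate: Optional[int]) -> str:
    """Replace the /1 or /2 extension of a FASTQ header.

    mate=1 or mate=2 sets that extension; mate=None removes it.
    Any description after the first space is kept unchanged.
    """
    name, sep, rest = header.partition(" ")
    name = _MATE_SUFFIX.sub("", name)
    if mate is not None:
        name = "{}/{}".format(name, mate)
    return name + sep + rest


def rename_fastq(lines: Iterable[str], mate: Optional[int]) -> Iterator[str]:
    """Rename every record header in a 4-line-per-record FASTQ stream."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        # Only the first line of each record is a header; quality lines
        # may also start with '@', so position is used instead of prefix.
        yield rename_mate(line, mate) if index % 4 == 0 else line
```

For example, `rename_mate("@read1/1", 2)` gives `@read1/2`, and `rename_mate("@read1/2", None)` gives `@read1`.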
Final output files are named :
* ABYSS_NAME_R1F.fastq.gz
* ABYSS_NAME_R1R.fastq.gz
* ABYSS_NAME_R2F.fastq.gz
* ABYSS_NAME_R2R.fastq.gz
* sample_R1F.fastq.gz
* sample_R1R.fastq.gz
* sample_R2F.fastq.gz
* sample_R2R.fastq.gz
* frogs/samples.tar archive for frogs
## How to run (IF RUN OUTSIDE ABYSS-PIPELINE) ?
**extract.ini** : configuration file to edit
@@ -48,7 +45,6 @@ Final output files are names :
* BARCODE : name of the marker (e.g. 18S-V1) (this marker must be listed in the PROPERTIES file)
* SAMPLENAME : path to the CSV file containing abyss and genoscope sample names
* TRIMREADS : set to True if you want to perform all abyss preprocessing (cutadapt, bbmap, renaming) (option : True/False)
* RENAME : set to True if you only want to rename original samples with their abyss names without trimming reads (option : True/False)
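
As an illustration of how such a configuration could be read, here is a minimal sketch using Python's standard `configparser`. The section name `[extract]`, the example path and the function name are assumptions made for this example; the real layout of extract.ini is not shown here:

```python
import configparser

# Hypothetical file content: the [extract] section name and the path
# are placeholders, only the key names come from the README.
EXAMPLE_INI = """
[extract]
BARCODE = 18S-V1
SAMPLENAME = /path/to/samples.csv
TRIMREADS = True
"""


def load_settings(text: str, section: str = "extract") -> dict:
    """Parse the BARCODE, SAMPLENAME and TRIMREADS keys from INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    cfg = parser[section]
    return {
        "barcode": cfg["BARCODE"],
        "samplename": cfg["SAMPLENAME"],
        # getboolean accepts True/False, yes/no, on/off and 1/0
        "trimreads": cfg.getboolean("TRIMREADS"),
    }


settings = load_settings(EXAMPLE_INI)
```

Using `getboolean` rather than comparing strings avoids treating the text "False" as a truthy value.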
**extract.sh** : script which runs extract.py on the configuration file **extract.ini**. Each sample (and its two "paired-end" files) is parsed separately.
**extract.py** : python script that reads the extract.ini file and launches the extractR1R2.pbs calculation for each sample. The check.pbs script is run at the end of the process to verify that all files have been created for each sample.
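
The final verification can be pictured with a short sketch. It assumes the four output names listed earlier (sample_R1F.fastq.gz and so on) and a single output directory; the helper names are hypothetical and the actual check.pbs script may work differently:

```python
from pathlib import Path
from typing import List

# Suffixes of the four per-sample output files named in the README
SUFFIXES = ("R1F", "R1R", "R2F", "R2R")


def expected_outputs(sample: str, outdir: Path) -> List[Path]:
    """Build the four expected output paths for one sample."""
    return [outdir / "{}_{}.fastq.gz".format(sample, suffix) for suffix in SUFFIXES]


def missing_outputs(sample: str, outdir: Path) -> List[Path]:
    """Return the expected output files that do not exist yet."""
    return [path for path in expected_outputs(sample, outdir) if not path.is_file()]
```

An empty list from `missing_outputs` means all four files were found for that sample.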
......