Sam’s Notebook: Quality Trimming – C.gigas Larvae OA BS-Seq Data

Below is an excerpt from Sam’s Lab Notebook; it provides some insight into our genomics-focused ocean acidification studies.

NBviewer: 20150414_C_gigas_Larvae_OA_Trimmomatic_FASTQC.ipynb


http://ift.tt/1Dlv80h

from Sam’s Notebook http://ift.tt/1DKlW74

Sam’s Notebook: Sequence Data Analysis – C.gigas Larvae OA BS-Seq Data

Below is an excerpt from Sam’s Lab Notebook; it provides some insight into our genomics-focused ocean acidification studies.

Compared the total amount of data generated from each index. The commands below pipe the output of ‘ls -l’ to awk. Awk sums the file sizes, found in the 5th field ($5) of the ‘ls -l’ output, then prints the sum divided by 1024^3 to convert from bytes to gigabytes (strictly speaking, gibibytes).

Index: CTTGTA

$ ls -l 2212_lane2_[C]* | awk '{sum += $5} END {print sum/1024/1024/1024}'
5.33341

Index: GCCAAT

$ ls -l 2212_lane2_[G]* | awk '{sum += $5} END {print sum/1024/1024/1024}'
7.00596

The GCCAAT files contain ~1.3x as much data as the CTTGTA files (7.00596 / 5.33341 ≈ 1.31).
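
The same comparison can be wrapped in a few small helper functions. The sketch below is an illustration only: the function names `total_bytes`, `bytes_to_gib`, and `ratio` are hypothetical and do not appear in the original notebook, and it assumes, as described above, that `ls -l` reports the file size in its 5th field.

```shell
# Hypothetical helpers (not from the original notebook).

# Print the combined size, in bytes, of the files given as arguments.
# Assumes the size is the 5th field of `ls -l` output.
total_bytes() {
  ls -l "$@" | awk '{sum += $5} END {printf "%.0f\n", sum}'
}

# Convert a byte count to GiB (bytes / 1024^3), five decimal places.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN {printf "%.5f\n", b / 1024 / 1024 / 1024}'
}

# Print the ratio of two totals, two decimal places.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN {printf "%.2f\n", a / b}'
}
```

For example, `bytes_to_gib "$(total_bytes 2212_lane2_[G]*)"` would report the GCCAAT total, and `ratio 7.00596 5.33341` compares the two totals reported above.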

 

Ran FASTQC on the following files downloaded earlier today:

2212_lane2_CTTGTA_L002_R1_001.fastq.gz
2212_lane2_CTTGTA_L002_R1_002.fastq.gz
2212_lane2_CTTGTA_L002_R1_003.fastq.gz
2212_lane2_CTTGTA_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_001.fastq.gz
2212_lane2_GCCAAT_L002_R1_002.fastq.gz
2212_lane2_GCCAAT_L002_R1_003.fastq.gz
2212_lane2_GCCAAT_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_005.fastq.gz
2212_lane2_GCCAAT_L002_R1_006.fastq.gz

 

The FASTQC command is below. This command runs FASTQC in a for loop over any files that begin with “2212_lane2_C” or “2212_lane2_G” and outputs the analyses to the Arabidopsis folder on Eagle:

$ for file in /Volumes/nightingales/C_gigas/2212_lane2_[CG]*; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done
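
A slightly more defensive version of this loop is sketched below. It is an illustration, not the command that was actually run: the function name `run_fastqc` and the `FASTQC_BIN` variable are hypothetical, and the sketch assumes that `fastqc` is on the PATH and that the output directory should be created if it is missing (FastQC expects it to exist already).

```shell
# Hypothetical wrapper (not from the original notebook).
# FASTQC_BIN can be overridden, e.g. with "echo", for a dry run.
FASTQC_BIN="${FASTQC_BIN:-fastqc}"

# Usage: run_fastqc OUTDIR FILE...
run_fastqc() {
  outdir="$1"
  shift
  # Create the output directory first if it does not exist yet.
  mkdir -p "$outdir" || return 1
  for file in "$@"; do
    # Skip arguments that do not exist, e.g. a glob that matched nothing.
    [ -e "$file" ] || continue
    "$FASTQC_BIN" "$file" --outdir="$outdir"
  done
}
```

A call equivalent to the loop above would be `run_fastqc /Volumes/Eagle/Arabidopsis /Volumes/nightingales/C_gigas/2212_lane2_[CG]*`.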

 

From within the Eagle/Arabidopsis folder, I renamed the FASTQC output files to prepend today’s date:

$ for file in 2212_lane2_[GC]*; do mv "$file" "20150413_$file"; done
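
The rename pattern used here can be generalized. The sketch below is a hypothetical helper (the name `prepend_prefix` is not from the notebook) that adds a prefix to each file name, skips names that already carry the prefix, and refuses to overwrite an existing target.

```shell
# Hypothetical helper (not from the original notebook).
# Usage: prepend_prefix PREFIX FILE...
prepend_prefix() {
  prefix="$1"
  shift
  for path in "$@"; do
    [ -e "$path" ] || continue            # e.g. a glob that matched nothing
    dir=$(dirname "$path")
    name=$(basename "$path")
    case "$name" in
      "$prefix"*) continue ;;             # already prefixed, leave as-is
    esac
    target="$dir/$prefix$name"
    if [ -e "$target" ]; then
      echo "skipping $path: $target exists" >&2
      continue
    fi
    mv "$path" "$target"
  done
}
```

With this helper, the rename above could be written as `prepend_prefix 20150413_ 2212_lane2_[GC]*`.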

 

Then I unzipped the .zip files generated by FASTQC to get access to the images, which eliminates the need for screenshots in this notebook entry:

$ for file in 20150413_2212_lane2_[CG]*.zip; do unzip "$file"; done

 

The unzip output retained the old naming scheme, so I renamed the unzipped folders:

$ for file in 2212_lane2_[GC]*; do mv "$file" "20150413_$file"; done

 

The FASTQC results are linked below:

20150413_2212_lane2_CTTGTA_L002_R1_001_fastqc.html
20150413_2212_lane2_CTTGTA_L002_R1_002_fastqc.html
20150413_2212_lane2_CTTGTA_L002_R1_003_fastqc.html
20150413_2212_lane2_CTTGTA_L002_R1_004_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_001_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_002_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_003_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_004_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_005_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_006_fastqc.html

 

from Sam’s Notebook http://ift.tt/1Exd9WS

Sam’s Notebook: Sequence Data – C.gigas OA Larvae BS-Seq Demultiplexed

Below is an excerpt from Sam’s Lab Notebook; it provides some insight into our genomics-focused ocean acidification studies.

I had previously contacted Doug Turnbull at the Univ. of Oregon Genomics Core Facility for help demultiplexing this data, as it was initially returned to us as a single data set with “no index” (i.e. barcode) set for any of the libraries that were sequenced. As it turns out, when multiplexed libraries are sequenced on the Illumina platform, an index read step needs to be “enabled” on the machine. Otherwise, the machine does not perform the index read (since it wouldn’t be necessary for a single library). Surprisingly, the sample submission form for the Univ. of Oregon Genomics Core Facility doesn’t ask whether a submitted sample has been multiplexed. However, they enable the index read step on all sequencing runs by default, so once I provided the barcodes they were able to demultiplex the data after the fact.
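
One way to confirm which index the reads in a demultiplexed file carry is to tally the last colon-separated field of each FASTQ header line. The sketch below rests on an assumption the notebook does not confirm: that the headers follow the Illumina CASAVA 1.8+ layout, in which the header line ends with the index sequence (e.g. `... 1:N:0:CTTGTA`). The function name `count_indexes` is hypothetical.

```shell
# Hypothetical helper (not from the original notebook).
# Assumes CASAVA 1.8+ style headers ending in ":<index sequence>".
# Usage: count_indexes FILE.fastq.gz
count_indexes() {
  # Header lines are every 4th line starting at line 1; selecting by line
  # number avoids mistaking quality lines that begin with "@" for headers.
  gzip -dc "$1" |
    awk 'NR % 4 == 1 {n = split($0, parts, ":"); count[parts[n]]++}
         END {for (idx in count) print idx, count[idx]}' |
    sort
}
```

Each output line is an index sequence followed by the number of reads carrying it.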

I downloaded the new, demultiplexed files to Owl/nightingales/C_gigas:

lane2_CTTGTA_L002_R1_001.fastq.gz
lane2_CTTGTA_L002_R1_002.fastq.gz
lane2_CTTGTA_L002_R1_003.fastq.gz
lane2_CTTGTA_L002_R1_004.fastq.gz
lane2_GCCAAT_L002_R1_001.fastq.gz
lane2_GCCAAT_L002_R1_002.fastq.gz
lane2_GCCAAT_L002_R1_003.fastq.gz
lane2_GCCAAT_L002_R1_004.fastq.gz
lane2_GCCAAT_L002_R1_005.fastq.gz
lane2_GCCAAT_L002_R1_006.fastq.gz

Notice that the file names now contain the corresponding index!

Renamed the files to prepend the order number to the file names:

$ for file in lane2*; do mv "$file" "2212_$file"; done

New file names:

2212_lane2_CTTGTA_L002_R1_001.fastq.gz
2212_lane2_CTTGTA_L002_R1_002.fastq.gz
2212_lane2_CTTGTA_L002_R1_003.fastq.gz
2212_lane2_CTTGTA_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_001.fastq.gz
2212_lane2_GCCAAT_L002_R1_002.fastq.gz
2212_lane2_GCCAAT_L002_R1_003.fastq.gz
2212_lane2_GCCAAT_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_005.fastq.gz
2212_lane2_GCCAAT_L002_R1_006.fastq.gz

Updated the checksums.md5 file to include the new files (the command is written to exclude the previously downloaded files that are named “2212_lane2_NoIndex_”; the [^N] glob pattern excludes any files that have a capital ‘N’ at that position in the file name):

$ for file in 2212_lane2_[^N]*; do md5 "$file" >> checksums.md5; done
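
`md5` is the macOS checksum tool; on Linux the equivalent is usually `md5sum`. The sketch below is one hypothetical way to make the checksum step portable. The function names `file_md5` and `append_checksums` are not from the notebook, and the output format (hash, two spaces, file name) is an assumption chosen to match `md5sum`; it differs from the default output of macOS `md5`.

```shell
# Hypothetical helpers (not from the original notebook).

# Print only the MD5 hash of a file, using whichever tool is available.
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'
  else
    md5 -q "$1"   # macOS/BSD: -q prints only the hash
  fi
}

# Append "<hash>  <file>" lines for each regular file to OUTFILE.
# Usage: append_checksums OUTFILE FILE...
append_checksums() {
  outfile="$1"
  shift
  for file in "$@"; do
    [ -f "$file" ] || continue
    printf '%s  %s\n' "$(file_md5 "$file")" "$file" >> "$outfile"
  done
}
```

A call such as `append_checksums checksums.md5 2212_lane2_[!N]*` would cover the same files as the loop above; `[!N]` is the POSIX spelling of the negated bracket pattern, and bash also accepts `[^N]`.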

Updated the readme.md file to reflect the addition of these new files.

from Sam’s Notebook http://ift.tt/1CI7nM1