Conservation Pipeline

Installation

Follow the steps below to install linc2function_validation_data on your computer.

Change directory

From the home directory, which is open by default in a new terminal, change to the directory where the utility should be installed. In this tutorial we use the workspace directory.

username@hostname:~$ cd workspace

Clone

In the workspace directory, clone the current version of the linc2function_validation_data repository from GitLab.

username@hostname:~/workspace$ git clone https://gitlab.com/yram0006/linc2function_validation_data.git

Open linc2function_validation_data

Change into the linc2function_validation_data directory created by the clone.

username@hostname:~/workspace$ cd linc2function_validation_data

Open ucr_overlaps

Change into the ucr_overlaps directory inside the cloned repository.

username@hostname:~/workspace/linc2function_validation_data$ cd ucr_overlaps

Download data

Download the multiple sequence alignment file hg38.7way.maf.gz from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz7way/ by running the following command:

username@hostname:~/workspace/linc2function_validation_data/ucr_overlaps$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz7way/hg38.7way.maf.gz

From this file, we can obtain the regions that are well conserved among 7 species:

Human
Chimp
Rhesus
Dog
Mouse
Rat
Opossum

Create .bed files

Conserved Regions

The multiple sequence alignment file containing the conserved regions must be converted into .bed format, which holds the start and end coordinates on the reference genome (Human), so that it can be compared with the predicted binding sites. The command below takes the decompressed file hg38.7way.maf as input, so decompress the download first (for example with gunzip hg38.7way.maf.gz). The following utility script performs the conversion:

username@hostname:~/workspace/linc2function_validation_data/ucr_overlaps$ python maf_to_bed.py hg38.7way.maf hg38.7way.bed
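To illustrate what such a conversion involves, here is a minimal Python sketch. It is not the repository's maf_to_bed.py, whose implementation and output columns are not described here. It assumes that the reference row is the first "s" line of each alignment block, that its source name starts with "hg38.", and that one output row per aligned non-reference species is wanted. The function name maf_to_bed_rows and the four-column layout are hypothetical.

```python
def maf_to_bed_rows(lines, ref_prefix="hg38."):
    """Yield (chrom, start, end, species) for each aligned species in a MAF block.

    The coordinates are those of the block's reference row. The reference row
    is assumed to come before the other 's' lines of the same block.
    """
    ref = None  # (chrom, start, end) of the current block's reference row
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "a":
            ref = None  # a blank line or an 'a' line marks a block boundary
            continue
        if fields[0] != "s":
            continue  # skip the header, comments, and 'i'/'e'/'q' lines
        src, start, size = fields[1], int(fields[2]), int(fields[3])
        strand, src_size = fields[4], int(fields[5])
        if strand == "-":
            # MAF minus-strand starts are relative to the reverse complement
            start = src_size - start - size
        species, _, chrom = src.partition(".")
        if src.startswith(ref_prefix):
            ref = (chrom, start, start + size)  # zero-based, half-open, as in BED
        elif ref is not None:
            yield (*ref, species)
```

Each tuple can be written as one tab-separated line, for example with "\t".join(map(str, row)). MAF start positions are zero-based, so they map directly onto BED start positions.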

Binding Sites

Next, the predicted binding sites must be converted into .bed format, which holds the start and end coordinates on the reference genome (Human), so that they can be compared with the conserved regions. The following utility script performs the conversion.

Before running the command, make sure that the predicted binding sites for each test sequence are saved under Sequence_lncRNA and that the directory Sequence_lncRNA_bed has been created.

username@hostname:~/workspace/linc2function_validation_data/ucr_overlaps$ python binding_sites_to_bed.py
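The directory requirements above can be checked up front. The sketch below is an optional helper and not part of the repository. The function name prepare_bed_dirs is hypothetical, and it assumes both directories live directly under the directory passed as base (ucr_overlaps in this tutorial).

```python
from pathlib import Path


def prepare_bed_dirs(base="."):
    """Check that Sequence_lncRNA exists and create Sequence_lncRNA_bed if needed."""
    base = Path(base)
    source = base / "Sequence_lncRNA"
    if not source.is_dir():
        # The predicted binding sites must already be in place
        raise FileNotFoundError(f"expected input directory {source}")
    target = base / "Sequence_lncRNA_bed"
    target.mkdir(exist_ok=True)  # no error if the directory already exists
    return source, target
```

Calling prepare_bed_dirs() from inside ucr_overlaps before running binding_sites_to_bed.py would fail early if the input directory is missing.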

Install bedtools

To compute the intersection of the two sets of .bed files, which gives the binding sites that are well conserved, we use the bedtools library. Installation instructions can be found here: https://bedtools.readthedocs.io/en/latest/content/installation.html

Run bedtools

To find the overlaps for the binding sites of every transcript, we have created another utility script. It iterates through all the files in the Sequence_lncRNA_bed directory and invokes the bedtools intersection command on each of them. Make sure the Sequence_lncRNA_intersect directory has been created before running this script, as all results are saved there.

username@hostname:~/workspace/linc2function_validation_data/ucr_overlaps$ sh run_bed_tools.sh
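For readers who prefer to see the loop spelled out, here is a rough Python sketch of the same idea. It is not a transcription of run_bed_tools.sh, whose exact bedtools options are not shown here. It assumes each binding-site file is passed as -a and the conserved-region file hg38.7way.bed as -b, with no further options. The function names build_intersect_jobs and run_jobs are hypothetical.

```python
import subprocess
from pathlib import Path


def build_intersect_jobs(bed_dir, conserved_bed, out_dir):
    """Return (command, output_path) pairs, one per file in bed_dir."""
    jobs = []
    for bed_file in sorted(Path(bed_dir).iterdir()):
        if not bed_file.is_file():
            continue  # ignore sub-directories
        # Assumption: binding sites as -a, conserved regions as -b
        cmd = ["bedtools", "intersect", "-a", str(bed_file), "-b", str(conserved_bed)]
        jobs.append((cmd, Path(out_dir) / bed_file.name))
    return jobs


def run_jobs(jobs):
    """Run each command and save its standard output to the paired path."""
    for cmd, out_path in jobs:
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "w") as handle:
            subprocess.run(cmd, stdout=handle, check=True)
```

A call such as run_jobs(build_intersect_jobs("Sequence_lncRNA_bed", "hg38.7way.bed", "Sequence_lncRNA_intersect")) would write one result file per input file. Building the commands separately from running them makes it easy to inspect them first.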

Analyse output

The output of the above command gives the overlaps for each binding site individually. To aggregate the results at the transcript level and obtain the number of overlapping binding sites predicted for each species, the following command can be run:

for f in ./*; do echo "$f"; awk -F'\t' '{print $4}' "$f" | sort | uniq -c; done
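The same aggregation can be sketched in Python, which may be convenient when the counts feed into further analysis. This is only an illustrative equivalent of the shell loop. The function names are hypothetical, and unlike the awk command it skips lines with fewer than four tab-separated fields instead of counting them as an empty value.

```python
from collections import Counter
from pathlib import Path


def count_fourth_column(path):
    """Count how often each value appears in the fourth tab-separated column."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 4:  # skip short or empty lines
                counts[fields[3]] += 1
    return counts


def summarise_directory(directory="."):
    """Map each file name in the directory to its fourth-column counts."""
    return {
        path.name: count_fourth_column(path)
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }
```

Running summarise_directory() on the directory that holds the intersection results returns one Counter per result file.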