Friday, July 28, 2017

BioSmalltalk workflow for ancestral haplotype analysis published in Animals


We recently published an article in the journal Animals analyzing the Bovine Leukocyte Antigen (BoLA) region of Brangus cattle. We used LAMP-LD (Local Ancestry Inference in Admixed Populations), a window-based algorithm built on a hierarchical Hidden Markov Model that represents the haplotypes in a population and allows estimating the enrichment from two parental populations. The ancestral haplotype is the one that appeared first in the parental populations. Since we do not know which one it is, we assume it is the most abundant haplotype across both populations, also considering its association with the mixing ratio.

In the script below, we analyze MHC regions (chromosome 23 in cattle) and generate a histogram plot, annotating the ranges of these regions. The script was adapted for general usage, and each step is commented.


To replicate this workflow in your own bioinformatics experiments, you will need the reference alleles of your SNPs. Here, we extracted them from the Affymetrix TXT report exported from the Axiom Analysis Suite software. The PLINK wrapper is used to filter the families of interest and the genotyping error rate. Our cattle families are labeled 'AA' for Angus, 'BRANG' for Brangus and 'BRAH' for Brahman. If you do not have family information in your PED file, you will have to add it manually or with tools like the UNIX command-line utilities (cut, paste, etc.). We also calculated the effective population size for each family (commonly referred to as "Ne" in the literature) and used it as a parameter to ShapeIt.

Plotting was performed using the Roassal visualization engine, for which I should thank Alexandre Bergel for his tireless help with its impressive API.

| shapeItSamplesDir baseDir reportFile alleleAFile plinkFile |

" Base directory for all input and output files "
baseDir := '.'.
" Axiom Analysis Suite TXT Report "
reportFile := baseDir , 'AAS_report.txt'.
alleleAFile := baseDir , 'allele_A.txt'.
plinkFile := baseDir , 'Bos1_REF-FINAL'.

" Extract Allele_A column from Affymetrix report file to use it as reference alleles "
BioAffyTXTFormatter new
    inputFile: reportFile;
    outputFilename: alleleAFile;
    execute.

" Build the PLINK BED file with reference alleles "
" The SNPs_BoLA_Bos1-annotdb32.txt is just a text file used to extract the SNPs of interest in the MHC region from the Microarray output "
BioPLINK2Wrapper new
    bfile: plinkFile;
    out: baseDir , 'Bos1_REF-FINAL_REFA';
    referenceAlleles: alleleAFile;
    extract: baseDir , 'SNPs_BoLA_Bos1-annotdb32.txt';
    execute.

" Apply families of interest filter "
BioPLINK2Wrapper new
    bfile: baseDir , 'Bos1_REF-FINAL_REFA';
    out: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_r1';
    keepFams: #('AA' 'BRANG' 'BRAH');
    execute.

" Apply basic genotyping error rate filter  "
BioPLINK2Wrapper new
    bfile: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_r1';
    out: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_Atsm_GINO-chr23_MHC';
    geno: 0.05;
    execute.

" Regenerate BED for duplicated BRAH lines "
" Split data set into 3 separate BED files by family "
#('AA' 'BRANG' 'BRAH') do: [ : famName |
    BioPLINK2Wrapper new
        file: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        out: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        keepFam: famName;
        execute ].

" Run ShapeIt with Ne parameters "
{ 'AA' -> 100 . 'BRANG' -> 106 . 'BRAH' -> 159 } do: [ : assoc |
    BioShapeIt2WrapperR644 new
        inputBinarized: baseDir , 'Bos1_REF-FINAL-' , assoc key , '_r1';
        outputMax: baseDir , 'Bos1_REF-FINAL-' , assoc key , '_r1';
        effectiveSize: assoc value;
        execute ].

" Generate .GENO file for LAMP-LD "
BioLAMPLDGenotypeFormatter new
    bimFilePath: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.bim';
    pedFilePath: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.ped';
    outputFilename: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.geno';
    execute.
" Extract positions to build .POS file for LAMP-LD "
(baseDir , '') asFileReference
    cut: #(4)
    to: (baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos') asFileReference writeStream.

" Transpose HAPS files "
" ... to be included "

" Run LAMP-LD "
shapeItSamplesDir := 'ShapeItFiles' asFileReference.
BioLAMPLDWrapper new
    windowSize: 100;
    nrFounders: 2;
    positionsFile: (shapeItSamplesDir / 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos') fullName;
    addPopFile: (shapeItSamplesDir / 'Bos1_REF-FINAL-AA_Atsm_GINO-chr23_MHC-ShapeIt2-Ne100.haps.ref') fullName atOrder: 1;
    addPopFile: (shapeItSamplesDir / 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC-ShapeIt2-Ne150.haps.ref') fullName atOrder: 2;
    genosFile: (shapeItSamplesDir / 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.geno') fullName;
    outputFile: baseDir , 'lamp-ld_example1.out';
    execute.
" Convert positions to mbases "
BioLAMPLDWrapper new
    positionsFile: baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos';
    convertPositionsToMb. " final selector assumed; check the wrapper's protocol "

BioLAMPLD2WayVisualizer new
    lineWidth: 2;
    posFile: baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC_1000.pos';
    population1Name: 'Angus' color: Color red;
    population2Name: 'Brahman' color: Color blue;
    readExpanded: baseDir , 'postlampld_ws-100.txt' title: 'Angus versus Brahman';
    addGenomicRangeBelowXFrom: 27862 to: 27989 label: 'C I';
    addGenomicRangeBelowXFrom: 25350 to: 25592 label: 'C IIa';
    addGenomicRangeBelowXFrom: 7011 to: 7534 label: 'C IIb';
    addGenomicRangeBelowXFrom: 26973 to: 27576 label: 'C III';
    open. " final message assumed "

The resulting image can be viewed below:

Depending on your version of ShapeIt, if you have too few samples for a population, ShapeIt v2 will refuse to run with the following error message:

src/modes/phaser/phaser_algorithm.cpp:147: void phaser::phaseSegment(int): Assertion `conditional_index[segment].size() > 5' failed.

Of course you should get more samples, but to keep testing the workflow you can simply duplicate lines and change the IDs of the duplicated samples. This is how to do it:

" First, split the data set into 3 separate PED files by family, recoding to textual format (PED) "
#('AA' 'BRANG' 'BRAH') do: [ : famName |
    BioPLINK2Wrapper new
        bfile: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_Atsm_GINO-chr23_MHC';
        out: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        keepFam: famName;
        recode;
        execute ].

" Create a new temporary file "
(baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped') asFileReference
    copyTo: (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') asFileReference.

" Duplicate BRAH lines with different IDs "
(FileStream fileNamed: baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped')
    ifNotNil: [ : stream |
        stream setToEnd.
        (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped') asFileReference readStream contents linesDo: [ : line |
            | tkl newId |
            tkl := line findTokens: Character tab.
            newId := 'A' , tkl second.
            tkl at: 2 put: newId.
            stream
                nextPutAll: (tkl joinUsing: Character tab);
                nextPutTerminator ].
        stream close ].

" Rename again to the original file "
(baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') asFileReference renameTo: baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped'.

Monday, June 19, 2017

Convert rehh output to UCSC Bed file

The rehh (Relative Extended Haplotype Homozygosity) R package provides several scores for detecting recent positive natural selection, taking SNP data as input. Resulting regions with high -log10(p-values) can be considered candidates for selective sweeps.

The following BioSmalltalk script reads rehh output files from the ihs and rsb functions (ihh2ihs() or ies2rsb()) and, for each chromosome, collects the peaks given by the p-values to generate a UCSC BED (Browser Extensible Data) file. BED files can be used to view genome annotations in the significant regions using genome browsers, such as the Animal Genome Browser, through the "add custom track" option.

The collector object first filters the valid (non-null) -log10(p-values) greater than the ylim parameter; this is done to keep only significant p-values. It then scans the resulting collection to detect peak ranges, checking whether the difference between two adjacent SNPs is greater than the nbases parameter. Finally, the output is exported to the BED file format (#dumpAsUCSCBedFile, #dumpAsEnsembleBedFile or #dumpAsEnsembleBedFileBothStrands), ensuring each feature is written in the correct order.

| minChr maxChr cwd nbases |
minChr := 1.
maxChr := 29.
cwd := 'c:\...'.
nbases := 1000000.

minChr to: maxChr do: [ : chrNumber | 
 | inputFilename outputFilename |
 inputFilename := 'chr_' , chrNumber asString , '.rsb.txt'.
 outputFilename := 'peaks_chr' , chrNumber asString , '.bed'.
 BioRehhLogPValueCollector new
  cwd: cwd;
  ylim: 2.0;
  nbases: nbases;
  chrNumber: chrNumber;
  inputFile: inputFilename;
  outputFilename: outputFilename;
  dumpAsEnsembleBedFileBothStrands ]
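The peak-detection step described above can be sketched in a few lines. This is an illustrative simplification, not the actual BioRehhLogPValueCollector implementation; the associations below are made-up position -> -log10(p-value) pairs already filtered by ylim:

```smalltalk
| positions nbases peaks current |
positions := { 100 -> 2.5 . 1500 -> 3.1 . 2500000 -> 2.2 . 2500500 -> 4.0 }.
nbases := 1000000.
peaks := OrderedCollection new.
current := OrderedCollection with: positions first.
positions allButFirst do: [ : assoc |
    " a gap larger than nbases closes the current peak range "
    (assoc key - current last key) > nbases
        ifTrue: [ peaks add: current.
                  current := OrderedCollection with: assoc ]
        ifFalse: [ current add: assoc ] ].
peaks add: current.
" peaks now holds two ranges: 100-1500 and 2500000-2500500 "
```

Each inner collection then maps directly to one BED feature, using the first and last positions of the range as the feature's start and end.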

Friday, August 26, 2016

A ShapeIt2 wrapper is available


One of the latest additions to BioSmalltalk is a wrapper for running the well-known ShapeIt2 software (that is, ShapeIt v2). ShapeIt is a fast and accurate method for the estimation of haplotypes (a.k.a. phasing) from a set of SNP genotypes (.ped format or its .bed/.bim/.fam binary version) and a genetic map (.map format). It produces as output either a single set of estimated haplotypes, or a haplotype graph that encapsulates the uncertainty about the underlying haplotypes. The software is currently only available for Unix-like operating systems.


To use the wrapper, the program binary must be in the system PATH environment variable, and all input files, whether binarized PLINK (bed, bim, fam) or textual PLINK (ped, map), must share the same base name. The following expression launches ShapeIt2 from BioSmalltalk, setting several parameters:
  • The number of burn-in MCMC iterations
  • The input file name (without extension)
  • The output file name for the best haplotypes
  • The number of threads, to use the multi-threading capabilities
 BioShapeIt2WrapperR727 new
  burn: 10;
  inputBinarized: 'input_brangus';
  outputMax: 'output_brangus';
  thread: 8;
  execute.
If you prefer to explicitly specify textual PLINK input files and a genetic map:
 BioShapeIt2WrapperR644 new
  burn: 10;
  inputTextual: 'input_brangus.ped';
  inputMap: '';
  outputMax: 'output_brangus';
  thread: 8;
  execute.


Now BioShapeIt2Wrapper is a superclass with specialized subclasses, each one representing a particular release of ShapeIt2. When I started the wrapper, the binary executable of ShapeIt2 was named "shapeit.v2.r644.linux.x86_64"; later I found that "shapeit.v2.r727.linux.x64" had been released, but it cannot be run on CentOS 6.x. So you may want to keep the older version, and also know which binaries are supported (which does not mean they are installed on your system, of course):
BioShapeIt2Wrapper releases 
"an OrderedCollection('shapeit.v2.r644.linux.x86_64' 'shapeit.v2.r727.linux.x64')"

Wednesday, August 17, 2016

PhyloclassTalk was used to solve a homicide

PhyloclassTalk, an open-source phylogeographic text-mining system based on BioSmalltalk, was used in veterinary forensics to solve a homicide! The September 2016 issue of Legal Medicine includes an article which fully describes the case. PhyloclassTalk was used to narrow down BLASTed sequences of the species (Canis familiaris) and extract the proper metadata (breed names) from NCBI's GenBank. A hand-crafted database of dog breeds was built and integrated into a PhyloclassTalk repository to classify the sequences by breed name and identify the breeds located in Argentina, where the sample of an individual was found at a crime scene. Finally, it was also used to build and export the results to Arlequin format. The PhyloclassTalk paper is almost complete; meanwhile, a beta release of the software can be downloaded from its web site.

Thursday, August 20, 2015

Browsing over 1.2 million formal scientific names from the NCBI Taxonomy Database

The contents of this post do not require loading or installing BioSmalltalk or PhyloclassTalk; a plain Pharo image with the FastTable package is enough.

As part of the PhyloclassTalk project I wanted to add a feature to browse all the formal scientific names found in the full NCBI taxonomy database. The recently published FastTable package in the Pharo mailing list made me wonder how well it would perform opening a FastTable Morphic window with such contents. You can also download the taxonomy dump list I used for this experiment. I filtered the original file to remove "noise" (synonyms, authorities). Using a Sony Vaio i3 at 2.40 GHz it takes just 4 seconds, and you get a fully scrollable list, without pagination and without lags. First we open the FastTable widget with a basic benchmark:
Smalltalk garbageCollect.
| speciesDumpReader speciesDumpList |
speciesDumpReader := 'scientific_names.dmp' asFileReference readStream.
speciesDumpList := speciesDumpReader contents lines.
[ FTEasyListMorph new
  extent: 300@550;
  elements: speciesDumpList;
  openInWorld ] timeToRun.
"0:00:00:03.968" "0:00:00:04.249"
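The filtering I mention above can be sketched like this. The '|'-separated column layout of the NCBI names.dmp file, with the name class in the fourth field, is an assumption you should check against your dump file:

```smalltalk
| input output |
input := 'names.dmp' asFileReference readStream.
output := 'scientific_names.dmp' asFileReference writeStream.
input contents linesDo: [ : line |
    | fields |
    fields := (line findTokens: '|') collect: [ : f | f trimBoth ].
    " keep only formal scientific names, dropping synonyms and authorities "
    (fields size >= 4 and: [ fields fourth = 'scientific name' ])
        ifTrue: [ output nextPutAll: fields second; nextPut: Character lf ] ].
output close.
input close.
```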
Now let's go for a more functional species "browser" by adding a callback to open the Google results for the selected taxon:
| speciesDumpReader speciesDumpList |
speciesDumpReader := 'scientific_names.dmp' asFileReference readStream.
speciesDumpList := speciesDumpReader contents lines.
FTEasyListMorph new
 header: 'NCBI Taxonomy Database List';
 extent: 300 @ 550;
 elements: speciesDumpList;
 menu: [ : taxaName |
  MenuMorph new
   add: 'Open External Browser'
    target: NBWin32Shell
    selector: #shellBrowse:
    argument: '' , taxaName;
   yourself ];
 openInWorld.
and of course, a screenshot:
I hope to see more of these cool packages coming to the Pharo Smalltalk community. Enjoy!

Friday, March 20, 2015

BioSmalltalk now available through GitHub

I have created the BioSmalltalk repository in GitHub so you can clone and contribute from there.

I hope this will make it easy for interested parties to contribute to this code or to specialize it to their own needs. Regular distributions will still be made at Google Code (for now) but if you want the absolute latest changes, GitHub will be the place to go.

If you are interested, please feel free to get involved.




Monday, December 22, 2014

Download a human chromosome in one line of code

Let's write plain Smalltalk code to download the human chromosome 22 FASTA from the NCBI servers (about 9.6 MB gzip compressed):
| client fileName fStream |

fileName := 'hs_alt_HuRef_chr22.fa.gz'.
[ client := (FTPClient openOnHostNamed: '')
                loginUser: 'anonymous' password: '';
                changeDirectoryTo: 'genomes/H_sapiens/CHR_22'.
(FileStream newFileNamed: fileName)
        nextPutAll: (client getFileNamed: fileName);
        close ]
on: NetworkError, LoginFailedException
do: [ : ex | self error: 'Connection failed' ].

fStream := fileName asFileReference readStream.
(ByteArray streamContents: [ : stream |
    FLSerializer serialize: fStream binary contents on: stream ]) storeString.
That seems like a lot of typing for a bioinformatics library in the Smalltalk tradition. That's why I wrote a genome downloader class which makes it really easy to download the latest build:
BioHSapiensGD new downloadChromosome: 22.
If you don't want the expression to block, you can easily download in the background by setting the process priority:
[ BioHSapiensGD new downloadChromosome: 22 ] 
 forkAt: Processor userBackgroundPriority 
 named: 'Downloading Human Chromosome...'.
Results will be downloaded into the directory where the virtual .image and .changes files are located. But why stop at the human species? There are subclasses for Bos taurus (from the UMD, Center for Bioinformatics and Computational Biology, University of Maryland, and The Bovine Genome Sequencing Consortium), Gallus gallus (International Chicken Genome Sequencing Consortium) and Mus musculus (Celera Genomics and Genome Reference Consortium), and others can be built by specializing very few methods. You can download any available assembled genome with just one line of code. Enjoy.
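As a sketch of what specializing such a subclass could look like: the superclass name BioGenomeDownloader and the hook selectors #remoteDirectory and #filePrefix below are assumptions for illustration; check the actual superclass of BioHSapiensGD and its protocol in the image.

```smalltalk
" Hypothetical downloader for a new organism. The superclass name and the
  overridden selectors are assumptions, not the actual BioSmalltalk protocol. "
BioGenomeDownloader subclass: #BioCFamiliarisGD
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'BioSmalltalk-GenomeDownloader'.

BioCFamiliarisGD >> remoteDirectory
    " NCBI FTP directory containing the per-chromosome FASTA files "
    ^ 'genomes/C_familiaris'

BioCFamiliarisGD >> filePrefix
    " File name prefix shared by the chromosome FASTA files "
    ^ 'cfa_ref_chr'
```

With those two methods in place, `BioCFamiliarisGD new downloadChromosome: 1` would follow the same one-line pattern shown above.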

Tuesday, January 14, 2014

Arlequin format writer


Arlequin is a well-known software package for population genetics data analysis. The file format is well documented in the Arlequin manual, so I will not duplicate that information here. Writing an Arlequin file basically consists of generating a customized INI file with both a Profile and a Samples section.
Now you can use the API provided in BioSmalltalk to write Arlequin files programmatically. In its most naive form, the API pattern looks like this:
arlequinFile := BioArlequinFile new.
arlequinFile profileSection
        addTitle: 'Sample Title';
        " ... profile configuration messages ... ".

arlequinFile samplesSection
        addSampleName: '"SAMPLE1"';
        addSampleSize: '8';
        addSampleData: " ... sample data 1 ... ";

        addSampleName: '"SAMPLE2"';
        addSampleSize: '8';
        addSampleData: " ... sample data 2 ... ";       

        " ... you guessed it ... "

It seems pretty simple, but in practice you will not type the hundreds of samples of a typical Arlequin data set by hand. You will want to iterate over your input data.

Building the Samples Collection
If you observe the pattern above, each sample contains three pieces of information: sample name, sample size and sample data. Basically, there are two input layouts. In the first, each piece of population data comes from a separate collection, i.e.:
| arlequinFile samplesSection samplesCollection idCollection frqCollection |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

idCollection := #('OT1' 'B1' 'A1' 'CH1' 'J1' 'USA1' 'OT2' 'OT3' 'B2' 'A2' 'A3' 'A4' 'USA2' 'USA3' 'USA4' 'USA5' 'USA6' 'USA7' 'OT4' 'B3' 'B4' 'B5' 'A5' 'J2' 'J3' 'USA8' 'USA9' 'USA10' 'USA11' 'USA12' 'USA13' 'B6' 'C1' 'J4' 'USA14' 'OT5' 'OT6' 'B7' 'CH2' 'CH3' 'A6' 'CH4' 'A7').
frqCollection := #(5 5 6 3 2 11 1 2 1 1 1 1 1 2 1 1 1 1 5 2 1 1 1 1 1 1 1 4 1 1 1 3 1 1 2 4 3 1 1 1 1 1 1).
samplesSection addSamples: (BioA31SampleCollection forDNA
 identifiers: idCollection;
 frequencies: frqCollection;
 sequences: samplesCollection;
 yourself).

" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
In the second layout, the population data comes as triplets. This could be the case after you have grouped your input by alignment and calculated the frequencies. In that case you may use #triplesDo: to take each population as a 3-element group and build your Arlequin file like this:
| arlequinFile samplesSection populations |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

populations triplesDo: [ : id : freq : seq |
  samplesSection
    addSampleName: id;
    addSampleSize: freq;
    addSampleData: seq ].
" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
Don't forget to check the BioArlequinFile convenience methods for building profiles for different data types: #buildHaplotypicDataDNAProfileTitle:groups:missingData: and #buildHaplotypicDataFrequencyProfileTitle:groups:missingData:. And let me know any suggestions for improving the Arlequin API.
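A sketch of calling one of these convenience methods; the title, group count and missing-data character below are made-up illustrative values:

```smalltalk
| arlequinFile |
arlequinFile := BioArlequinFile new.
" Configure a haplotypic DNA profile; all parameter values are illustrative "
arlequinFile
    buildHaplotypicDataDNAProfileTitle: 'Bovine BoLA haplotypes'
    groups: 1
    missingData: $?.
```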

Saturday, February 23, 2013

PhyloclassTalk preview

In this post I want to present a preview of PhyloclassTalk, an application for phylogenetic analysis built on the BioSmalltalk environment with Pharo 1.4. The main parts are presented through the Metro-style UI popularized by Windows 8. The following screenshot shows the main application window:

Except for the icons, the layout was generated programmatically with simple, plain Morphic objects. The "Territory Builder" uses a wizard library called Merlin and is based on a Territorial library, which is basically a Composite pattern implementation for building complex territorial objects. I integrated the Help in just one hour, based on the HelpSystem, without any previous knowledge of that library.

The main module window is a "Case Study Browser" implemented with the OmniBrowser framework. From the browser one can create and associate several phylogenetic data to a species case study, classify according to defined territories and then export results into formats like Arlequin, Google Fusion Tables or Fluxus Network.

The following screenshot describes the "Blast Query Builder", which enables the dynamic generation and execution of Blast queries, producing filtered objects from the XML results which can later be loaded in the case study browser for further associations. Filtered results can be cumulative, meaning that each new execution is applied on the previous results.

Detailed features such as the rule engine protocol and the post-curation of classified data are going to be described in the upcoming paper. I will also provide new posts on this front as I prepare a release, so stay tuned.

Saturday, January 26, 2013

BioSmalltalk 0.4 Release

Time to talk about the new 0.4 release. The BioSmalltalk virtual images include basic development libraries for doing bioinformatics, for example XML APIs, OS services, HTTP requests, serialization, refactoring, etc. Both BioPharo and BioSqueak are editions of BioSmalltalk and include platform-specific packages, as each Smalltalk platform evolves on its own.

There are separate downloads for different OSes: Windows, Linux and Mac OS X. And there are different "editions", which currently are Pharo 1.4 and Squeak 4.4. Main development is happening in Pharo 1.4. The Squeak edition is experimental and does not contain menu entries for targeted browsing and updating.

Please feel free to comment or register into the mailing list provided by Google Code hosting.