PROMPT-Protein Mapping and Comparison Tool

Case study: Comparison of isoelectric point distributions

Keywords:

computable sequence properties, visualisations (built in and generic), comparison of numeric distributions

Initial situation:

We are interested in how the extremophile Halobacterium salinarum and Buchnera sp. APS are adapted to their environments. H. salinarum is a halophilic organism that lives at high salt concentrations, and Buchnera is an endosymbiont of aphids.

Additionally, we want to compare the two human pathogens H. pylori and E. coli and see how each is adapted to its specific environment. H. pylori lives in the acidic stomach, whereas E. coli is found in the basic intestine.

Questions:

Do the protein pI distributions differ depending on the environmental needs?

Data:

We have multi FASTA files with the protein sequences downloaded from NCBI:

File	Genbank identifier
Halobacterium_salinarum.fasta	NC_002607
Buchnera_sp_APS.fasta	AP001118 + AP001119
ecoli.fasta	NC_000913
hpylori.fasta	AE001439

Steps

Step 1: Data import

Import the datasets into PROMPT with the FASTA import feature. Choose “Import -> FASTA -> File” and select “protein” as the sequence type in the following dialog.
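For readers who want to check the input files outside PROMPT, the following is a minimal sketch of reading a multi FASTA file into memory. It is not PROMPT's importer; the function name `parse_fasta` and the example records are hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for the records of a multi FASTA stream."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical two-record example instead of a real genome file.
example = """>prot1 hypothetical protein
MKT
AYI
>prot2
MDE
""".splitlines()
proteins = parse_fasta(example)
```

To read one of the files listed above, the same function could be called as `parse_fasta(open("hpylori.fasta"))`.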

Step 2: Analysis & Results

Select both inputs (keep the CTRL key pressed while clicking on both input lines).

Both datasets contain the amino acid sequences of the proteins. Therefore we can simply let PROMPT calculate the isoelectric point of these proteins and see how the two organisms differ in this respect.

From the PROMPT menu choose

"Analyze -> Computable Sequence Properties -> pICompare"

this is equivalent to

"Analyze -> Generic Annotations -> Compare annotations between 2 sets -> Numeric feature comparison" and choosing "pI" in the following dialogs
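This page does not describe how PROMPT computes the pI internally. As a rough illustration of what such a computable sequence property involves, the sketch below estimates the pI as the pH at which the net charge of the sequence is zero, using the Henderson-Hasselbalch equation and bisection. The pKa values are an illustrative assumption (one commonly used set); PROMPT's own table may differ, so the numbers will not necessarily match PROMPT's output.

```python
# Illustrative pKa values (an assumption, not taken from PROMPT).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a protein at a given pH (Henderson-Hasselbalch)."""
    # Free amino terminus (positive when protonated).
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # Free carboxy terminus (negative when deprotonated).
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Find the pH with zero net charge by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        # The net charge decreases monotonically with pH.
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


acidic_pi = isoelectric_point("DDDDEEEE")  # acidic toy peptide
basic_pi = isoelectric_point("KKKKRRRR")   # basic toy peptide
```

Applying `isoelectric_point` to every sequence of a dataset yields the numeric distribution that is then compared between the two organisms.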

PROMPT automatically applies the Mann-Whitney test and the Kolmogorov-Smirnov test to the complete numeric distributions. The Mann-Whitney test (MW-test) is a rank test whose null hypothesis is that both samples are drawn from the same distribution; it is mainly sensitive to a shift in location, that is, to one set tending to have larger values than the other. The Kolmogorov-Smirnov test (KS-test) compares the empirical cumulative distribution functions of the two datasets and therefore also responds to differences in spread or shape. Both tests have the advantage of making no assumption about the distribution of the data; technically speaking, they are non-parametric and distribution-free. Note, however, that this generality comes at some cost: other tests (for example Student's t-test) may be more sensitive if the data meet their requirements. In addition, descriptive statistics such as the median, standard deviation, minimum and maximum of both distributions are returned and can be displayed in the PROMPT spreadsheet viewer, as shown in Figure 1.
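To make the two test statistics concrete, here is a minimal sketch in plain Python. It is not PROMPT's implementation: the function names are hypothetical, the Mann-Whitney p-value uses the normal approximation with tie correction and no continuity correction, and for the Kolmogorov-Smirnov test only the statistic D is computed (a p-value would additionally be derived from D and the two sample sizes). The pI values are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic of x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold tied values; they share the average of ranks i+1..j.
        avg_rank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


def ks_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest vertical distance between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        v = min(xs[i], ys[j])
        # Step both CDFs past every occurrence of the current value.
        while i < n1 and xs[i] == v:
            i += 1
        while j < n2 and ys[j] == v:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    return d


# Made-up pI values for two small protein sets.
pi_a = [4.2, 4.5, 4.8, 5.0, 5.1]
pi_b = [9.0, 9.5, 9.8, 10.1, 10.4]
u_stat, mw_p = mann_whitney_u(pi_a, pi_b)
ks_d = ks_statistic(pi_a, pi_b)
```

In this toy example the two sets do not overlap at all, so U is 0 and D is 1; real proteomes overlap heavily and give intermediate values.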

Figure 1. Screenshot of PROMPT's spreadsheet viewer showing the results of a generic comparison of numeric distributions. For each result, a short description explains the meaning of the respective value.

In addition to the statistical tests between the distributions, a histogram is calculated with the absolute count in each bin as well as the relative fraction. The binning can be done automatically or customized with a dialog-guided wizard. This makes it possible to detect local differences between two distributions that an overall analysis would miss. Statistical significance is assessed for each bin separately by a Chi-Square test and a Mann-Whitney test. The Chi-Square test shows whether the difference in frequency between the two sets is significant. The Mann-Whitney test indicates whether the values that fall within the bin are distributed differently in the two sets.
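The per-bin frequency comparison can be sketched as follows. This is one plausible reading, not necessarily PROMPT's exact procedure: for each bin, a 2x2 table (in bin versus not in bin, set A versus set B) is tested with Pearson's Chi-Square test with one degree of freedom and no continuity correction. All function names, the bin edges and the data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def bin_counts(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count values per bin [edges[k], edges[k+1]); the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            is_last = k == len(counts) - 1
            if edges[k] <= v < edges[k + 1] or (is_last and v == edges[-1]):
                counts[k] += 1
                break
    return counts


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square and p-value (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty row or column carries no evidence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def compare_bins(set_a: Sequence[float], set_b: Sequence[float],
                 edges: Sequence[float]) -> List[Dict[str, object]]:
    """Absolute counts, relative fractions and a chi-square p-value for every bin."""
    counts_a, counts_b = bin_counts(set_a, edges), bin_counts(set_b, edges)
    rows = []
    for k in range(len(counts_a)):
        chi2, p = chi_square_2x2(counts_a[k], len(set_a) - counts_a[k],
                                 counts_b[k], len(set_b) - counts_b[k])
        rows.append({
            "bin": (edges[k], edges[k + 1]),
            "count_a": counts_a[k], "count_b": counts_b[k],
            "frac_a": counts_a[k] / len(set_a), "frac_b": counts_b[k] / len(set_b),
            "chi2": chi2, "p": p,
        })
    return rows


# Made-up pI values: set A is mostly acidic, set B mostly basic.
set_a = [4.1] * 8 + [9.5] * 2
set_b = [4.3] * 2 + [9.7] * 8
rows = compare_bins(set_a, set_b, edges=[3, 5, 7, 9, 11])
```

Testing many bins separately increases the chance of false positives, so per-bin p-values are best read as pointers to regions worth a closer look.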

To visualize the results, select the histogram, right-click on the PICompare or Compare:numeric result to open the pop-up menu, and choose the Visualisation option.

Figure 2. Comparison of the isoelectric point distributions of the proteins of Halobacterium and Buchnera. The Y-axis shows the fraction of proteins whose pI falls within the respective bin, relative to the number of proteins in the Halobacterium or Buchnera set. The [X] in an interval label indicates that a Mann-Whitney test returned a significant p-value at a significance level of 0.05 for the values within this bin. The stars on top of the red bars show that the observed frequency differs significantly from the expectation as tested by a Chi-Square test (p-values: * <0.05, ** <0.01, *** <0.001).
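The annotation scheme of Figure 2 can be expressed as two small helper functions. The thresholds are those given in the caption; the function names and the exact label layout are hypothetical and only serve as illustration.

```python
def significance_stars(p: float) -> str:
    """Map a Chi-Square p-value to the star notation of Figure 2."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


def bin_label(low: float, high: float, mw_p: float, alpha: float = 0.05) -> str:
    """Interval label; [X] marks a significant Mann-Whitney test within the bin."""
    marker = " [X]" if mw_p < alpha else ""
    return f"{low:g}-{high:g}{marker}"


stars = [significance_stars(p) for p in (0.2, 0.03, 0.004, 0.0002)]
label = bin_label(4, 5, mw_p=0.01)
```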

Summary:

PROMPT can automatically calculate a multitude of sequence-based properties such as pI or molecular mass.
PROMPT can compare any numerical distributions:
As shown in the external data example, this type of analysis can also be applied to any numeric external data.
PROMPT tests for statistical significance automatically.
PROMPT provides various ready-to-use visualisations.

Further exercises:

Compare the pI distributions of E. coli and H. pylori. What would you expect, and why is the result surprising only at first glance?