### 2007 week 01: Articles in Maths

In systems biology, biologically relevant quantitative modelling of physiological processes requires the integration of experimental data from diverse sources. Recent developments in high-throughput methodologies enable the analysis of the transcriptome, proteome, interactome, metabolome and phenome on an unprecedented scale, thus contributing to the deluge of experimental data held in numerous public databases. In this review, we describe some of the databases and simulation tools that are relevant to systems biology and discuss a number of key issues affecting data integration and the challenges these pose to systems-level research.

**Strategies for dealing with incomplete information in the modeling of molecular interaction networks**

Modelers of molecular interaction networks encounter the paradoxical situation that while large amounts of data are available, these are often insufficient for the formulation and analysis of mathematical models describing the network dynamics. In particular, information on reaction mechanisms and numerical values of kinetic parameters is usually available for only a few well-studied model systems. In this article we review two strategies that have been proposed for dealing with incomplete information in the study of molecular interaction networks: parameter sensitivity analysis and model simplification. These strategies are based on the biologically justified intuition that essential properties of the system dynamics are robust against moderate changes in the values of kinetic parameters, or even in the rate laws describing the interactions. Although advanced measurement techniques can be expected to relieve the problem of incomplete information to some extent, the strategies discussed in this article will retain their interest as tools providing an initial characterization of essential properties of the network dynamics.
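The robustness intuition above can be illustrated with a minimal sketch of parameter sensitivity analysis on a toy gene-expression model, dx/dt = k_syn − k_deg·x. The model, rate constants and finite-difference scheme are illustrative assumptions, not taken from the reviewed article.

```python
# Toy sensitivity analysis for dx/dt = k_syn - k_deg * x
# (hypothetical rate constants, chosen only for illustration).

def simulate(k_syn, k_deg, x0=0.0, dt=0.01, steps=5000):
    """Integrate dx/dt = k_syn - k_deg * x with forward Euler until steady state."""
    x = x0
    for _ in range(steps):
        x += dt * (k_syn - k_deg * x)
    return x

def sensitivity(k_syn, k_deg, eps=1e-6):
    """Normalized sensitivity d(ln x_ss)/d(ln k_deg), by finite differences."""
    base = simulate(k_syn, k_deg)
    pert = simulate(k_syn, k_deg * (1 + eps))
    return (pert - base) / (base * eps)

# Analytically x_ss = k_syn / k_deg, so the normalized sensitivity
# with respect to k_deg should be close to -1.
s = sensitivity(k_syn=2.0, k_deg=0.5)
print(round(s, 2))  # -> -1.0
```

A sensitivity near ±1 flags a parameter whose value strongly shapes the output; values near 0 indicate the kind of robustness the article argues models can exploit.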

Systems biology applies quantitative, mechanistic modelling to study genetic networks, signal transduction pathways and metabolic networks. Mathematical models of biochemical networks can look very different. An important reason is that the purpose and application of a model are essential for the selection of the best mathematical framework. Fundamental aspects of selecting an appropriate modelling framework and a strategy for model building are discussed.

Concepts and methods from system and control theory provide a sound basis for the further development of improved and dedicated computational tools for systems biology. Identifying the network components and rate constants that are most critical to the output behaviour of the system is one of the major problems raised in systems biology. Current approaches and methods of parameter sensitivity analysis and parameter estimation are reviewed. It is shown how these methods can be applied in the design of model-based experiments, which iteratively yield models that become progressively less wrong and gain predictive power.
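The parameter-estimation step in such an iterative loop can be sketched in its simplest form: least-squares recovery of a rate constant from (here noise-free, synthetic) time-course observations. The exponential-decay model and the grid-search optimizer are assumptions for illustration only.

```python
# Hedged sketch of parameter estimation by least squares: recover a
# decay rate k from synthetic observations of x(t) = exp(-k t).
import math

true_k = 0.7
times = [0.5 * i for i in range(10)]
data = [math.exp(-true_k * t) for t in times]  # noise-free toy data

def sse(k):
    """Sum of squared residuals between model and observations."""
    return sum((math.exp(-k * t) - d) ** 2 for t, d in zip(times, data))

# Coarse grid search over k in (0, 2]; a real pipeline would use a
# proper optimizer (e.g. Levenberg-Marquardt) instead.
best = min((k / 1000 for k in range(1, 2001)), key=sse)
print(round(best, 2))  # -> 0.7
```

In the iterative model-based design the review describes, the residual structure of such a fit would then guide which experiment to perform next.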

**Estimating the parameters of a model for protein-protein interaction graphs**

We find accurate approximations for the expected number of three-cycles and unchorded four-cycles under a stochastic distribution for graphs that has been proposed for modelling yeast two-hybrid protein–protein interaction networks. We show that unchorded four-cycles are characteristic motifs under this model and that the count of unchorded four-cycles in the graph is a reliable statistic on which to base parameter estimation. Finally, we test our model against a range of experimental data, obtain parameter estimates from these data and investigate possible improvements in the model. Characterization of this model lays the foundation for its use as a prior distribution in a Bayesian analysis of yeast two-hybrid networks that can potentially aid in identifying false-positive and false-negative results.
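The motif statistics discussed above (three-cycles and unchorded four-cycles) are easy to count directly on a small graph. The toy edge list below is invented for illustration; it is not the yeast two-hybrid data.

```python
# Count 3-cycles and chordless (unchorded) 4-cycles in a small
# undirected toy graph.
from itertools import combinations

edges = {(1, 2), (2, 3), (3, 4), (4, 1), (4, 5), (5, 6), (6, 4)}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def triangles():
    """Count unordered vertex triples that are pairwise connected."""
    return sum(1 for a, b, c in combinations(sorted(adj), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def chordless_squares():
    """Count 4-vertex induced subgraphs that form a cycle with no chord:
    exactly 4 internal edges, every vertex of internal degree 2."""
    count = 0
    for quad in combinations(sorted(adj), 4):
        s = set(quad)
        degs = [len(adj[v] & s) for v in quad]
        if sum(degs) // 2 == 4 and all(d == 2 for d in degs):
            count += 1
    return count

print(triangles(), chordless_squares())  # -> 1 1
```

Counting by brute force over vertex quadruples is fine at this scale; for genome-scale interaction graphs one would enumerate cycles via neighbourhood intersections instead.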

**A new approach to intensity-dependent normalization of two-channel microarrays**

A two-channel microarray measures the relative expression levels of thousands of genes from a pair of biological samples. In order to reliably compare gene expression levels between and within arrays, it is necessary to remove systematic errors that distort the biological signal of interest. The standard approach is to smooth "MA-plots" to remove intensity-dependent dye bias and array-specific effects. However, MA methods require strong assumptions, which limit their general applicability. We review these assumptions and derive several practical scenarios in which they fail. The "dye-swap" normalization method has been used much less frequently because it requires two arrays per pair of samples. We show that dye-swap normalization is accurate under general assumptions, removing even intensity-dependent dye bias from a single pair of samples. Based on a flexible model of the relationship between mRNA amount and single-channel fluorescence intensity, we demonstrate the general applicability of a dye-swap approach. We then propose a common array dye-swap (CADS) method for the normalization of two-channel microarrays. We show that CADS removes both dye bias and array-specific effects, and preserves the true differential expression signal for every gene under the assumptions of the model.
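The core algebra behind dye-swap normalization fits in a few lines: the biological log-ratio changes sign when the labels are exchanged, while a gene-specific dye bias does not, so half the difference of the two arrays' log-ratios cancels the bias. The numbers below are made up for the sketch; the full CADS method adds a model for array-specific effects on top of this.

```python
# Toy illustration of why averaging over a dye swap cancels dye bias.
true_log_ratio = 1.0   # true log2 fold change of a gene (assumed)
dye_bias = 0.4         # gene-specific dye bias on the log2 scale (assumed)

# Forward array: sample A in Cy5, sample B in Cy3.
m_forward = true_log_ratio + dye_bias
# Dye-swap array: labels exchanged, so the biological signal flips
# sign while the dye bias does not.
m_swap = -true_log_ratio + dye_bias

# Dye-swap estimate: half the difference of the two log-ratios.
m_hat = (m_forward - m_swap) / 2
print(round(m_hat, 6))  # -> 1.0
```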

**Regularized linear discriminant analysis and its application in microarrays**

In this paper, we introduce a modified version of linear discriminant analysis, called "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) classifier (*and others*, 2003).
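The shrinkage idea that NSC contributes, and that SCRDA builds on, is soft-thresholding: per-gene class centroids are pulled toward the overall centroid, and genes whose difference falls below the threshold drop out of the classifier entirely. The numbers below are invented; the full SCRDA method additionally regularizes the covariance estimate, which this sketch omits.

```python
# Soft-thresholding step behind "nearest shrunken centroids" (toy data).

def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values within [-delta, delta] become 0."""
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0

# Per-gene differences between one class centroid and the overall centroid.
diffs = [2.5, -0.3, 0.8, -1.7, 0.1]
shrunken = [round(soft_threshold(d, 0.5), 2) for d in diffs]
print(shrunken)  # -> [2.0, 0.0, 0.3, -1.2, 0.0]
```

Genes shrunk exactly to zero contribute nothing to classification, which is how the method performs feature selection as a side effect.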

**Are clusters found in one dataset present in another dataset?**

In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare four versions of the validation procedure, which all use the IGP but differ in the way the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and that one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is available in a package called "clusterRepro" through The Comprehensive R Archive Network.

**A topologically related singularity suggests a maximum preferred size for protein domains**

A variety of protein physicochemical and topological properties demonstrate scaling behavior relative to chain length. Many of these scalings can be modeled as power laws that are qualitatively similar across the examples. In this article, we suggest a rational explanation for these observations on the basis of both protein connectivity and hydrophobic constraints on residue compactness relative to surface volume. Unexpectedly, in examining these relationships, a singularity was found near a chain length of 255-270 residues, which may be associated with an upper limit for domain size. Evaluation of related G-factor data points to a wide range of conformational plasticity near this point. In addition to its theoretical importance, we show, by applying it to CASP experimental and predicted structures, that the scaling is a practical filter for protein structure prediction. Proteins 2007. © 2006 Wiley-Liss, Inc.
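Power-law scalings of the kind the article examines, y = a·x^b, are conventionally detected by linear regression on log-log axes, where the slope recovers the exponent b. The synthetic data below stand in for the protein measurements and assume a known exponent purely to check the fit.

```python
# Log-log least-squares fit recovering a power-law exponent
# (synthetic, noise-free data; exponent 0.75 is an assumption).
import math

a, b = 2.0, 0.75
xs = [10, 30, 100, 300, 1000]          # stand-in for chain lengths
ys = [a * x ** b for x in xs]          # stand-in for the measured property

lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]
n = len(xs)
mean_x = sum(lx) / n
mean_y = sum(ly) / n
slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
         / sum((u - mean_x) ** 2 for u in lx))
print(round(slope, 3))  # -> 0.75
```

A singularity such as the one reported near 255-270 residues would show up as a systematic break in this otherwise straight log-log trend.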

Dynamic programming (DP) and its heuristic algorithms are the most fundamental methods for similarity searches of amino acid sequences. Their detection power has been improved by including supplemental information, such as homologous sequences in the profile method. Here, we describe a method, probabilistic alignment (PA), that gives improved detection power but, like the original DP, uses only a pair of amino acid sequences. Receiver operating characteristic (ROC) analysis demonstrated that the PA method is far superior to BLAST, and that its sensitivity and selectivity approach those of PSI-BLAST. Particularly for orphan proteins having few homologues in the database, PA exhibits much better performance than PSI-BLAST. On the basis of this observation, we applied the PA method to a homology search of two orphan proteins, Latexin and the Resuscitation-promoting factor domain. Their molecular functions have been described based on structural similarities, but sequence homologues had not been identified by PSI-BLAST. PA successfully detected sequence homologues for the two proteins and confirmed that the observed structural similarities are the result of an evolutionary relationship.
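The ROC comparison mentioned above reduces, in its simplest form, to the area under the curve (AUC): the probability that a randomly chosen true homologue outscores a randomly chosen non-homologue. The scores below are invented to show the computation, not results from the paper.

```python
# Minimal ROC AUC via the pairwise-comparison (Mann-Whitney) identity,
# on invented alignment scores.
pos = [0.9, 0.8, 0.7, 0.4]   # scores for true homologues (assumed)
neg = [0.6, 0.3, 0.2, 0.1]   # scores for non-homologues (assumed)

pairs = [(p, n) for p in pos for n in neg]
# Each positive/negative pair contributes 1 if the positive wins,
# 0.5 for a tie, 0 otherwise; the mean is the AUC.
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0
          for p, n in pairs) / len(pairs)
print(auc)  # -> 0.9375
```

An AUC of 1.0 means perfect separation of homologues from non-homologues; claims like "PA approaches PSI-BLAST" are statements about such areas (or partial areas at low false-positive rates).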

The motif prediction problem is to predict short, conserved subsequences shared by a family of sequences, and it is an important biological problem. Gibbs was one of the first successful motif algorithms; it runs very fast compared with other algorithms, and its search behavior is based on well-studied Gibbs random sampling. However, motif prediction is a very difficult problem, and Gibbs may fail to predict true motifs in some cases. Thus, the authors explored the possibility of improving the prediction accuracy of Gibbs while retaining its fast runtime. In this paper, the authors consider Gibbs only for proteins, not for DNA binding sites. They have developed iGibbs, an integrated motif search framework for proteins that employs two of their previous techniques: one guiding motif search by clustering sequences and the other by pattern refinement. These two techniques are combined into a new double-clustering approach to guiding motif search.

...

Tests on the PROSITE database show that their framework improved the prediction accuracy of Gibbs significantly. Compared with more exhaustive search methods like MEME, iGibbs predicted motifs more accurately and ran one order of magnitude faster.
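The Gibbs sampling strategy underlying both Gibbs and iGibbs can be sketched compactly: hold out one sequence, build a position frequency model from the current motif positions in the others, and resample the held-out sequence's motif position in proportion to the model score. Everything below (the three-letter alphabet, the planted "ABAB" motif, the pseudocounts) is a toy assumption; the real algorithms are far more elaborate.

```python
# Highly simplified Gibbs motif sampler on toy sequences with a
# planted width-4 motif "ABAB".
import random

random.seed(0)
W = 4  # motif width (assumed known here)
seqs = ["CCABABCC", "ABABCCCC", "CCCCABAB"]

def column_counts(held_out, positions):
    """Pseudocounted symbol counts per motif column, excluding one sequence."""
    counts = [{c: 1.0 for c in "ABC"} for _ in range(W)]
    for i, s in enumerate(seqs):
        if i == held_out:
            continue
        for k in range(W):
            counts[k][s[positions[i] + k]] += 1.0
    return counts

def score(window, counts):
    """Probability of a window under the column frequency model."""
    p = 1.0
    for k, ch in enumerate(window):
        p *= counts[k][ch] / sum(counts[k].values())
    return p

positions = [random.randrange(len(s) - W + 1) for s in seqs]
for _ in range(200):
    i = random.randrange(len(seqs))          # hold one sequence out
    counts = column_counts(i, positions)     # model from the rest
    weights = [score(seqs[i][j:j + W], counts)
               for j in range(len(seqs[i]) - W + 1)]
    positions[i] = random.choices(range(len(weights)), weights=weights)[0]

print([seqs[i][p:p + W] for i, p in enumerate(positions)])
```

iGibbs's contribution, per the abstract, is to steer exactly this stochastic search with sequence clustering and pattern refinement rather than to change the sampling core.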


Labels: cluster, dynamic system, hydrophobic, microarray, modeling, probability, protein, proteome, resources, stochastic, topology
