# Are hierarchical Dirichlet processes useful in practice

### Project descriptions

**Project A1: Experimental designs for model identification in crossover studies**

**Prof. Dr. Joachim Kunert**

Optimal experimental designs for crossover studies usually aim to estimate the direct treatment effects as efficiently as possible. In models with carryover effects, this sometimes means that some effects do not occur at all in the optimal designs. This is the case, for example, in the model with mixed and self-carryover effects; see Kunert and Stufken (2007). Druilhet and Tinsson (2007) therefore determine designs that, in the same model, estimate the "permanent effects", i.e. the sum of direct effects and self-carryover effects, as well as possible. However, this only makes sense if the model is actually correct. If the traditional model holds, in which a treatment always produces the same carryover effect, different "permanent effects" result. It is therefore of interest to determine experimental designs that can discriminate between the two models. One possible approach is to identify designs that yield the best possible estimates of the differences between the mixed and self-carryover effects.

In doing so, however, an additional condition must be taken into account: once the model has been chosen, the permanent effects must still be estimated as efficiently as possible. The goal is therefore to determine designs that are efficient under several criteria at the same time.
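
For orientation, the two competing models can be written out as follows. This is only a sketch in assumed notation: the indices (period $i$, subject $j$) and the symbols $\alpha$, $\beta$, $\tau$, $\rho$, $\chi$, $\pi$ are not taken from the text above, and Kunert and Stufken (2002) should be consulted for the authoritative formulation.

```latex
% Model with self and mixed carryover effects (notation assumed).
% y_{ij}: response of subject j in period i
% d(i,j): treatment given to subject j in period i
% No carryover term is present in the first period.
y_{ij} = \alpha_j + \beta_i + \tau_{d(i,j)} +
\begin{cases}
  \chi_{d(i-1,j)} & \text{if } d(i,j) = d(i-1,j) \quad \text{(self-carryover)} \\
  \rho_{d(i-1,j)} & \text{if } d(i,j) \neq d(i-1,j) \quad \text{(mixed carryover)}
\end{cases}
+ \varepsilon_{ij}

% Permanent effect of treatment t: direct effect plus self-carryover effect
\pi_t = \tau_t + \chi_t

% Traditional model: the carryover effect does not depend on the current treatment
\chi_t = \rho_t \quad \text{for all } t
```

In this notation, discriminating between the two models amounts to estimating the differences $\rho_t - \chi_t$, which vanish under the traditional model.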

**Literature**

- Druilhet, P., and Tinsson, W. (2007): Optimal Repeated Measurements Designs in a Model with Partial Interactions. Preprint.
- Kunert, J., and Stufken, J. (2002): Optimal Crossover Designs in a Model with Self and Mixed Carryover Effects. Journal of the American Statistical Association 97, 898-906.
- Kunert, J., and Stufken, J. (2007): Optimal Crossover Designs for Two Treatments in the Presence of Mixed and Self Carryover Effects. Preprint.

There are links to projects A2 (crossover designs), B2 (analysis of variance models), B6 (violation of the assumption of independence) and C6 (experimental designs).

**Project A2: Optimal crossover designs for comparing treatments with a control**

**Prof. Dr. Joachim Kunert**

Most of the published literature on the optimality of crossover designs deals with the situation in which all contrasts are of equal interest. When determining optimal crossover designs, the technical difficulty arises that no closed form can be given for the trace of the information matrix. The method of Kushner (1997) has led to great progress on this difficulty. A-optimal designs for comparing treatments with a control usually have a different structure from designs in which all contrasts are of equal interest. The few papers that attempt to determine crossover designs for comparison with a control, see Hedayat and Yang (2005, 2006), cannot use the Kushner method. When searching for A-optimal designs for comparing treatments with a control, the trace of the information matrix alone is not sufficient; the sum of all elements of the information matrix is also needed.

Bailey and Kunert (2006) succeeded in transferring the Kushner method to A-optimality. It can be assumed that this adaptation of the Kushner method can be extended analogously to A-optimality for comparison with a control. The aim of this project is to transfer the Kushner method to A-optimality for the comparison of treatments with a control in a model with carryover effects, and to determine optimal designs with this method. Results are to be obtained for different models, and the robustness of the designs to changes in the model is to be examined. Of interest, for example, is the efficiency of designs that are optimal under one model when another possible model holds.

**Literature**

- Bailey, R.A., and Kunert, J. (2006): On Optimal Crossover Designs When Carryover Effects are Proportional to Direct Effects. Biometrika 93, 613-625.
- Hedayat, A.S., and Yang, M. (2005): Optimal and Efficient Crossover Designs for Comparing Test Treatments with a Control Treatment. Annals of Statistics 33, 915-943.
- Hedayat, A.S., and Yang, M. (2006): Efficient Crossover Designs for Comparing Test Treatments with a Control Treatment. Discrete Mathematics 306, 3112-3124.
- Kushner, H.B. (1997): Optimal Repeated Measurements Designs: the Linear Optimality Equations. Annals of Statistics 25, 2328-2344.

There are links to projects A1 (crossover design) and C11 (model robustness).

### Project B1: Non-parametric Bayesian regression with qualitative structural assumptions

**Prof. Dr. Katja Ickstadt**

The development of computationally intensive Monte Carlo methods in recent decades has made it possible to analyze entirely new classes of models. In recent years, non-parametric Bayesian models in particular have come to the fore: they avoid unnecessary parametric assumptions while still allowing existing prior knowledge to be incorporated. The basic building blocks of non-parametric Bayesian models are usually stochastic processes, e.g. Gaussian, Dirichlet or Lévy processes. For an overview of the growing literature on non-parametric Bayesian methods, see Müller and Quintana (2004).

Within the scope of this project, non-parametric Bayesian regression models are to be developed, which take qualitative prior knowledge of the underlying application problem into account. One such application is, for example, in the field of clinical studies in pharmaceutical drug development. The prior knowledge here often consists in the fact that the relationship between two variables (e.g. dose and effect, time and concentration) is monotonic or unimodal. Prior knowledge about a specific parameterization of the underlying curves is typically not available, so that the development of non-parametric methods with prior information about the curve shape is necessary.

A general approach to non-parametric Bayesian regression is to represent the function to be modeled as a linear combination (or mixture) of parametric kernel functions, with the underlying mixing measure based a priori on a Lévy process. This approach was first implemented by Wolpert and Ickstadt (1998) for count data and an application in spatial statistics. Clyde and Wolpert (2007) provide an overview of other applications of the model and embed it in the formal framework of so-called "overcomplete dictionaries", a flexible class of kernel functions.

Within the scope of this project, the model is to be extended to take into account qualitative structural assumptions such as unimodality or monotonicity. A suitable Markov chain Monte Carlo method must be developed for the efficient analysis of the extended model class. The performance of the resulting model is ultimately to be evaluated using data from a clinical study and in a simulation study.
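
To make the idea of a kernel mixture with a built-in shape constraint concrete, the following minimal sketch draws a random non-decreasing function as a positively weighted sum of sigmoid kernels. It is an illustration under stated assumptions, not the project's model: the logistic kernel, the finite number of terms, the independent gamma weights standing in for the random mixing measure, and all parameter values are hypothetical choices.

```python
import math
import random


def logistic_kernel(x, location, scale):
    """Sigmoid kernel; non-decreasing in x for any location and positive scale."""
    return 1.0 / (1.0 + math.exp(-(x - location) / scale))


def draw_monotone_function(n_terms, rng, x_range=(0.0, 10.0)):
    """Draw f(x) = sum_j w_j * k(x; loc_j, scale_j) with positive random weights.

    Positive weights on non-decreasing kernels give a non-decreasing f,
    so the qualitative assumption holds for every draw by construction.
    """
    low, high = x_range
    terms = [
        (
            rng.gammavariate(0.5, 1.0),   # positive weight (assumed gamma prior)
            rng.uniform(low, high),       # kernel location
            rng.uniform(0.2, 1.0),        # kernel scale (steepness)
        )
        for _ in range(n_terms)
    ]

    def f(x):
        return sum(w * logistic_kernel(x, loc, s) for w, loc, s in terms)

    return f


rng = random.Random(42)
f = draw_monotone_function(n_terms=15, rng=rng)
grid = [0.5 * i for i in range(21)]
values = [f(x) for x in grid]
print([round(v, 3) for v in values])
```

A unimodal shape could be obtained in a similar spirit, for example by combining a non-decreasing and a non-increasing component, but how best to encode such constraints is exactly the open question of the project.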

**Literature**

- Clyde, M.A., and Wolpert, R.L. (2007): Nonparametric Function Estimation Using Overcomplete Dictionaries. In: Bernardo, J.M., et al. (Eds.): Bayesian Statistics 8. Oxford University Press, 91-114.
- Müller, P., and Quintana, F.A. (2004): Nonparametric Bayesian Data Analysis. Statistical Science 19, 95-110.
- Wolpert, R.L., and Ickstadt, K. (1998): Poisson / Gamma Random Field Models for Spatial Statistics. Biometrika 85, 251-267.

There are links to projects B4 (regularization in regression), C4 (qualitative assumptions, spatial statistics), C5 (Lévy processes as a modeling component), C6 (process modeling and analysis using MCMC methods) and D4 (implementation of the algorithms).

**Project B2: Statistical Measurement Models and Generalized Statistical Inference**

**Prof. Dr. Joachim Hartung, Dr. Guido Knapp**

In almost all empirical sciences, knowledge is gained from data by means of suitable measurement models, which are then meant to yield "statistically reliable" results using what are mostly complicated mathematical-statistical methods of analysis. The aim of the research project is to develop these methods further for a wide range of relevant open problems, such as those arising in technology and industry in quality assessment and assurance, in medicine in clinical trials and epidemiological studies, or in economics in econometric studies.

In a number of measurement models, assumptions are made in such a way that an exact (and optimal) statistical analysis is possible for testing hypotheses or constructing confidence intervals. If model assumptions are violated, however, usually only approximate solutions can be given. Tsui and Weerahandi (1989) proposed the concept of generalized P-values for testing hypotheses when, for example, nuisance parameters do not allow the construction of exact tests. Building on this, Weerahandi (1993) presented the general construction of generalized confidence intervals.

These two concepts have already been used successfully in some measurement models. Among others, Hamada and Weerahandi (2000) used them to analyze repeatability and reproducibility in measurement systems. For models with repeated measurements and unequal variances, Ho and Weerahandi (2007) showed how the two concepts can be used and what advantages these methods have over classical methods that rest on more restrictive model assumptions. These two newer concepts of generalized statistical inference now leave room for further applications in measurement models in which, for example, nuisance parameters make an exact classical statistical analysis difficult.

In regression models with an ANOVA error structure, as can occur with panel data, cf. Knapp (2002), suitable estimation and test procedures are still lacking for unbalanced sample sizes and/or heteroscedastic error variances. Tests on the covariance parameters and the influence of the covariance parameter estimates on the tests on the regression parameters have not yet been adequately researched.
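
As a small illustration of a generalized P-value, the sketch below treats the comparison of two normal means with unknown, unequal variances (the Behrens-Fisher problem), where the two variances are the nuisance parameters. This example is chosen here as a standard textbook case and is not taken from the project description; the function names and the Monte Carlo approximation are assumptions. It uses the generalized pivotal quantity $R_i = \bar{x}_i - T_i \, s_i / \sqrt{n_i}$ with $T_i \sim t_{n_i - 1}$, where $\bar{x}_i$ and $s_i^2$ are the observed mean and unbiased sample variance.

```python
import math
import random


def t_variate(df, rng):
    """Draw from Student's t with df degrees of freedom as Z / sqrt(chi2 / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)


def generalized_p_value(mean1, var1, n1, mean2, var2, n2, draws=20000, seed=0):
    """Monte Carlo generalized P-value for H0: mu1 <= mu2 against H1: mu1 > mu2.

    var1 and var2 are the unbiased sample variances. The P-value is the
    probability that the generalized pivotal quantity for mu1 - mu2 is <= 0.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(draws):
        r1 = mean1 - t_variate(n1 - 1, rng) * math.sqrt(var1 / n1)
        r2 = mean2 - t_variate(n2 - 1, rng) * math.sqrt(var2 / n2)
        if r1 - r2 <= 0.0:
            count += 1
    return count / draws


# Hypothetical summary statistics of two measurement series
print(generalized_p_value(mean1=10.4, var1=1.2, n1=8, mean2=9.1, var2=4.5, n2=12))
```

Small values speak against the null hypothesis, as with an ordinary P-value; the difference is that no exact classical test statistic free of the nuisance parameters is required.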

#### Literature

- Hamada, M., and Weerahandi, S. (2000): Measurement System Assessment via Generalized Inference. Journal of Quality Technology 32, 241-253.
- Ho, Y.Y., and Weerahandi, S. (2007): Analysis of Repeated Measures under Unequal Variances. Journal of Multivariate Analysis 98, 493-504.
- Knapp, G. (2002): Variance Estimation in the Error Components Regression Model. Communications in Statistics - Theory and Methods 31, 1499-1514.
- Tsui, K.-W., and Weerahandi, S. (1989): Generalized P-Values in Significance Testing of Hypotheses in the Presence of Nuisance Parameters. Journal of the American Statistical Association 84, 602-607.
- Weerahandi, S. (1993): Generalized Confidence Intervals. Journal of the American Statistical Association 88, 899-905.

There are links to projects A1 (measurement models) and C4 (deviations from measurement models).

### Project B3: Multi-criteria optimization of correlated quality features with the help of the desirability index

**Prof. Dr. Claus Weihs**

The desirability index introduced by Harrington (1965) is a method from the field of multi-criteria quality optimization of (production) processes. It has found a high level of acceptance in practice, especially since the 1990s (see e.g. Carro and Lorenzo, 2001; Basu et al., 2002; Shyy et al., 2001; Parker and DeLoach, 2002).

Given the influencing variables and quality characteristics of the process under consideration, the desirability index transforms the multi-criteria problem into a univariate one in several steps. First, the relationship between the quality features and the influencing variables is described by mathematical models, mostly using statistical design-of-experiments methods. Experts then determine specification limits and a so-called desirability function for each feature. This function maps the feature onto the interval [0, 1]: the higher the desirability, the better the process quality with respect to the quality feature under consideration. In this way, the quality of the individual features can be compared directly; different units of measurement no longer matter.

The desirability index then combines the individual desirabilities into a global univariate quality measure. It again takes values in the interval [0, 1] and is mostly defined as the geometric mean of the desirability functions. Alternatively, a maximin approach (cf. Kim and Lin, 2000) is used, i.e. the aim is to maximize the minimum process quality over the individual quality features. The interpretation is intuitive: the larger the desirability index, the more desirable, i.e. the better, the overall process quality. Because the quality features depend on the influencing variables, the desirability index, and thus the global process quality, can then be optimized over these variables using non-linear optimization methods.
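
The two aggregation rules described above can be sketched in a few lines. This is a minimal illustration: the function names, the parameters `b0` and `b1`, and the example values are hypothetical, and the one-sided Harrington function $d(y) = \exp(-\exp(-(b_0 + b_1 y)))$ is shown only as one possible desirability function.

```python
import math


def harrington_one_sided(y, b0, b1):
    """One-sided Harrington desirability d(y) = exp(-exp(-(b0 + b1 * y))), in (0, 1)."""
    return math.exp(-math.exp(-(b0 + b1 * y)))


def geometric_mean_index(desirabilities):
    """Desirability index as the geometric mean of the individual desirabilities."""
    if any(d == 0.0 for d in desirabilities):
        return 0.0  # one fully undesirable feature makes the whole process undesirable
    log_sum = sum(math.log(d) for d in desirabilities)
    return math.exp(log_sum / len(desirabilities))


def maximin_index(desirabilities):
    """Maximin variant: the index is the worst individual desirability."""
    return min(desirabilities)


# Hypothetical quality features, already modeled as functions of the process settings
features = [1.5, 0.3, 2.2]
d = [harrington_one_sided(y, b0=0.0, b1=1.0) for y in features]
print("desirabilities:", [round(v, 3) for v in d])
print("geometric mean index:", round(geometric_mean_index(d), 3))
print("maximin index:", round(maximin_index(d), 3))
```

Both indices stay in [0, 1]. The geometric mean lets good features partly compensate for weaker ones, while the maximin index is driven entirely by the weakest feature.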

Theoretical research in this area has long been neglected. Knowledge of the statistical distribution of the optimized desirability index is of central importance here, above all for assessing the uncertainty of the optimization result. The optimization results are biased mainly when the process under consideration is not described in sufficient detail by the mathematical models and/or is highly variable. The functional relationships between influencing factors and quality features generally result from an experimental design phase and are obtained as model estimates. The resulting models, however, always contain an error that is usually not taken into account during optimization. As a result, the quality gain expected from optimization cannot be guaranteed in the running process, which can lead to serious process fluctuations and even to a deterioration in process quality.

Using the distribution of the desirability index (see Steuer, 2005; Trautmann and Weihs, 2006), in contrast, it was possible to develop optimization algorithms that take this uncertainty into account by optimizing the expected value of the desirability index (see Steuer, 2005), and to construct prediction intervals for the optimized value of the desirability index (cf. Trautmann, 2004). It could also be shown that, with the help of desirabilities, expert opinions can be used to restrict the set of Pareto optima (cf. Mehnen and Trautmann, 2006; Mehnen et al., 2007; Weihs and Trautmann, 2007; Trautmann and Mehnen, 2008).

In the optimization approach described, correlations between the individual quality features are generally not taken into account. In practice, however, such correlations are to be expected and, precisely because of the multiplicative structure of the geometric mean, can bias the optimization results. In order to apply the approach of optimizing the expected value of the desirability index, which is more robust to model errors, to correlated quality features as well, the aim of the dissertation project is to derive or approximate the distribution of desirability indices for correlated quality features. Initial approaches can be found in Trautmann (2004) and Henkenjohann (2006). To enable an analytical determination of the density and distribution function, and in particular of the expected value of the desirability index, it might be useful to modify the desirability functions originally introduced by Harrington (1965) and Derringer and Suich (1980). Alternatively, correlations could be incorporated through an alternative definition of the desirability index.
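
The effect of correlation on the expected desirability index can be made visible with a small Monte Carlo experiment. The sketch below is purely illustrative and rests on assumptions not found in the text: two standard normal quality features with correlation `rho`, one-sided Harrington desirabilities with parameters `b0 = 0` and `b1 = 1`, and a geometric-mean index.

```python
import math
import random


def harrington_one_sided(y, b0=0.0, b1=1.0):
    """One-sided Harrington desirability d(y) = exp(-exp(-(b0 + b1 * y)))."""
    return math.exp(-math.exp(-(b0 + b1 * y)))


def expected_index(rho, draws=20000, seed=0):
    """Monte Carlo estimate of E[sqrt(d1 * d2)] for two correlated quality features."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        z1 = rng.gauss(0.0, 1.0)
        # Second feature with correlation rho to the first
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        d1 = harrington_one_sided(z1)
        d2 = harrington_one_sided(z2)
        total += math.sqrt(d1 * d2)  # geometric mean of two desirabilities
    return total / draws


for rho in (0.0, 0.4, 0.8):
    print(f"rho = {rho:.1f}: estimated expected index = {expected_index(rho):.3f}")
```

In this setting the expected index changes with the correlation even though the marginal distributions of the two features stay the same, so an optimization that assumes independence targets the wrong expected value.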

#### Literature

- Basu, S., Gaur, R., Gomes, J., et al. (2002): Effect of Seed Culture of Solid-State Bioconversion of Wheat Straw by Phanerochaete chrysosporium for Animal Feed Production. Journal of Bioscience and Bioengineering 93 (1), 25-30.
- Carro, A.M., and Lorenzo, R.A. (2001): Simultaneous Optimization of the Solid-Phase Extraction of Organochlorine and Organophosphorus Pesticides Using the Desirability Function. Analyst 126, 1005-1010.
- Derringer, G.C., and Suich, D. (1980): Simultaneous Optimization of Several Response Variables. Journal of Quality Technology 12 (4), 214-219.
- Harrington, J. (1965): The Desirability Function. Industrial Quality Control 21 (10), 494-498.
- Henkenjohann, N. (2006): An adaptive sequential procedure for the efficient optimization of the CNC-controlled spinning process. Dissertation, Technical University Dortmund, Faculty of Statistics. http://hdl.handle.net/2003/23260.
- Kim, K.-J., and Lin, D.K.J. (2000): Simultaneous Optimization of Mechanical Properties of Steel by Maximizing Desirability Functions. Applied Statistics 49 (3), 311-326.
- Mehnen, J., and Trautmann, H. (2006): Integration of Expert's Preferences in Pareto Optimization by Desirability Function Techniques. In: Teti, R. (Ed.): Proceedings of the 5th CIRP International Seminar on Intelligent Computation in Manufacturing Engineering (CIRP ICME '06), Ischia, Italy, July 25-28, 2006, 293-298, ISBN: 88-95028-01-5, 978-88-95028-01-9.
- Mehnen, J., Trautmann, H., and Tiwari, A. (2007): Introducing User Preference using Desirability Functions in Multi-objective Evolutionary Optimization of Noisy Processes. In: Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Singapore, September 24-28, 2007 (published).
- Parker, P.A., and DeLoach, R. (2002): Structural Optimization of a Force Balance Using a Computational Experiment Design. (Invited), 40th AIAA Aerospace Sciences Meeting and Exhibit, American Institute of Aeronautics and Astronautics, Nevada, AIAA-2002-0540, http://techreports.larc.nasa.gov/ltrs//PDF/2002/aiaa/NASA- aiaa-2002-0540.pdf.
- Shyy, W., Paila, N., Vaidyanathan, R., et al. (2001): Global Design Optimization for Aerodynamics and Rocket Propulsion Components. Progress in Aerospace Sciences 37, 59-118.
- Steuer, D. (2005): Statistical properties of multi-criteria optimization by means of desirabilities. Dissertation, Technical University Dortmund, Faculty of Statistics. http://hdl.handle.net/2003/20171.
- Trautmann, H. (2004): Quality control in industry based on control cards for desirability indices - field of application warehouse management. Dissertation, Technical University Dortmund, Faculty of Statistics. http://hdl.handle.net/2003/2794.
- Trautmann, H., and Weihs, C. (2004): Uncertainty of the Optimum Influence Factor Levels in Multicriteria Optimization Using the Concept of Desirability. Technical Report 23/04, SFB 475, University of Dortmund.
- Trautmann, H., and Mehnen, J. (2008): Preference-Based Pareto-Optimization in Certain and Noisy Environments. Engineering Optimization (submitted).
- Trautmann, H., and Weihs, C. (2006): On the Distribution of the Desirability Index Using Harrington's Desirability Function. Metrika 63 (2), 207-213.
- Weihs, C., and Trautmann, H. (2007): Parallel Universes: Multi-Criteria Optimization. In: Berthold, M.R., Morik, K., and Siebes, A. (Eds.): Parallel Universes and Local Patterns. http://drops.dagstuhl.de/opus/volltexte/2007/1255/.

There are links to projects B6 (distribution of desirability indices for correlated quality characteristics), C1 (test statistics that include correlations), C4 (multi-criteria problem) and C8 (dependency measures).

**Project B4: Regularization methods for robust variable selection in the linear model**

**Prof. Dr. Ursula Gather**, **PD Dr. Sonja Kuhnt**

Today, high-dimensional data are often available to answer questions from various applied disciplines. The number of explanatory variables that can be used to explain, control or forecast a dependent variable can be very large. In some cases, as often happens in the life sciences, it even considerably exceeds the number of available observations. In multiple linear regression, the classical least squares estimator quickly becomes unusable when explanatory variables are strongly correlated or superfluous. Unnecessary variables must therefore be removed or their influence limited. As an alternative to variable selection using t-tests or AIC, regularization methods have been proposed that stabilize the estimate by penalizing the size of the parameter vector in a suitable norm. The best-known variant, ridge regression, penalizes the Euclidean norm of the parameter vector (Gruber, 1998). More recent methods such as the LASSO (Tibshirani, 1996) or the recently proposed "Dantzig selector" (Candes and Tao, 2007) use the L1 norm to enforce a sparse parameter vector. Possible dissertation topics are the investigation of how outliers and violations of the model assumptions affect the result of such regularization methods, as well as the development of robust alternatives.
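
The difference between the two penalties can be illustrated with a short, dependency-free sketch. It uses cyclic coordinate descent, which is one common way to compute such estimates and is chosen here for brevity; it is not claimed to be the method of the cited papers. The function names, the scaling of the objective by $1/(2n)$ and the simulated data are assumptions, and the code is not robust to outliers, which is precisely the issue the project addresses.

```python
import random


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def coordinate_descent(X, y, lam, penalty, sweeps=200):
    """Cyclic coordinate descent for a penalized least squares fit without intercept.

    penalty="l1": (1/(2n)) * ||y - X b||^2 + lam * ||b||_1        (LASSO)
    penalty="l2": (1/(2n)) * ||y - X b||^2 + (lam/2) * ||b||_2^2  (ridge)
    X is a list of rows, y a list of responses.
    """
    if penalty not in ("l1", "l2"):
        raise ValueError("penalty must be 'l1' or 'l2'")
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b for the starting value b = 0
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # (1/n) * x_j^T (residual with the contribution of coordinate j added back)
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            if penalty == "l1":
                new = soft_threshold(rho, lam) / col_sq[j] if col_sq[j] > 0 else 0.0
            else:
                new = rho / (col_sq[j] + lam)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Simulated data (hypothetical): only the first and fourth variables matter
rng = random.Random(1)
true_beta = [3.0, 0.0, 0.0, -2.0, 0.0]
X = [[rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(50)]
y = [sum(b * v for b, v in zip(true_beta, row)) + rng.gauss(0.0, 0.5) for row in X]

print("LASSO:", [round(b, 2) for b in coordinate_descent(X, y, 0.5, "l1")])
print("ridge:", [round(b, 2) for b in coordinate_descent(X, y, 0.5, "l2")])
```

The soft-thresholding step is what allows the L1 penalty to set coefficients exactly to zero, whereas the ridge update only shrinks them.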

#### Literature

- Candes, E., and Tao, T. (2007): The Dantzig Selector: Statistical Estimation when p is Much Larger than n. With Discussion. To appear in: Annals of Statistics.
- Gruber, M.H.J. (1998): Improving Efficiency by Shrinkage. Dekker, New York.
- Tibshirani, R. (1996): Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

There are links to projects B1 (regularization in regression), B5 (robust modeling), B6 (robust modeling), C1 (dimension reduction), C2 (LASSO regression), C5 (model selection) and D1 (robust model selection for linear time series models).

**Project B5: Robust Classification**

**Prof. Dr. Ursula Gather**, **PD Dr. Sonja Kuhnt**

Parametric classification methods are used when assumptions can be made about the class densities or their likelihood ratios. Classic examples of such methods are linear discriminant analysis, quadratic discriminant analysis and logistic regression. A violation of the assumptions underlying a method can significantly impair the quality of the classification. Robust classification procedures are therefore needed when outliers or other deviations from the model occur. Some proposals for robustifying classification procedures already exist in the literature (Croux and Dehon, 2001; He and Fung, 2000; Joossens, 2006). Dissertation topics will address the sensitivity of existing methods, the development of new, robust classification methods and the determination of their statistical properties, including the case of more than two classes, and the classification of time-dependent data. This is to be done partly in cooperation with Hannu Oja (University of Tampere, Finland).

#### Literature

- Croux, C., and Dehon, C. (2001): Robust Linear Discriminant Analysis Using S-Estimators. The Canadian Journal of Statistics 29, 473-492.
- He, X.M., and Fung, W.K. (2000): High Breakdown Estimation for Multiple Populations with Applications to Discriminant Analysis. Journal of Multivariate Analysis 72, 151-162.
- Joossens, K. (2006): Robust Discriminant Analysis. PhD Thesis, Faculty of Economics and Applied Economics, Katholieke Universiteit Leuven.

There are links to projects B4 (robust modeling), B6 (robust modeling), C4 (robust classification), D2 and D3 (classification methods) and D4 (numerical problems of robust classification).

**Project B6: Robustness of statistical procedures against violations of independence**

**Prof. Dr. Ursula Gather**, **PD Dr. Sonja Kuhnt**

Many statistical methods rely on assumptions about the underlying random variables that guarantee desirable properties such as optimality or unbiasedness. One of the most common assumptions about a sequence of random variables is that they are independent and identically distributed, as in the one-sample t-test or the one-sample Kolmogorov-Smirnov test. If stochastic independence is violated, however, the usefulness of many statistical methods is questionable. The effects of violations of independence on a selection of frequently used statistical methods are to be examined in a dissertation. Based on these findings, alternatives can then be developed for those statistical procedures that prove unstable in this sense. Restricting the violation of stochastic independence by means of mixing coefficients offers one way to model dependence structures for which asymptotic results are already known from the literature. Dedecker and Prieur (2007) characterize the asymptotic behavior of empirical processes under dependence. Baklanov (2006) gives results on the asymptotic behavior of L-statistics for strictly stationary and ergodic sequences. In addition to analytical investigations, this subject area also requires experimental investigations that give practical insight into the problem for finite and small samples. Choosing suitable dependence structures and developing methods for generating random sequences with these structures is a further challenge.
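
A simulation of the kind mentioned above can be sketched as follows. The example estimates how often the two-sided one-sample t-test rejects a true null hypothesis when the observations come from a stationary Gaussian AR(1) sequence. The AR(1) model, the sample size of 20, the number of repetitions and the function names are assumptions made for illustration; the critical value 2.093 is the 97.5% quantile of the t distribution with 19 degrees of freedom.

```python
import math
import random


def ar1_sequence(n, phi, rng):
    """Stationary Gaussian AR(1) sequence with mean 0, variance 1 and lag-1 correlation phi."""
    x = rng.gauss(0.0, 1.0)
    sequence = [x]
    innovation_sd = math.sqrt(1.0 - phi * phi)
    for _ in range(n - 1):
        x = phi * x + innovation_sd * rng.gauss(0.0, 1.0)
        sequence.append(x)
    return sequence


def t_statistic(sample):
    """One-sample t statistic for H0: mean = 0."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((v - mean) ** 2 for v in sample) / (n - 1)
    return mean / math.sqrt(variance / n)


def rejection_rate(phi, n=20, reps=5000, critical_value=2.093, seed=1):
    """Share of simulated samples in which the t-test rejects the (true) null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        if abs(t_statistic(ar1_sequence(n, phi, rng))) > critical_value:
            rejections += 1
    return rejections / reps


for phi in (0.0, 0.3, 0.5):
    print(f"phi = {phi:.1f}: empirical level = {rejection_rate(phi):.3f}")
```

With `phi = 0` the observations are independent and the empirical level should lie close to the nominal 5%; with positive dependence the test rejects noticeably more often, which is the kind of instability the project sets out to quantify.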

**Literature**

- Baklanov, E.A. (2006): The Strong Law of Large Numbers for L-Statistics with Dependent Data. Siberian Mathematical Journal 47, 975-979.
- Dedecker, J., and Prieur, C. (2007): An Empirical Central Limit Theorem for Dependent Sequences. Stochastic Processes and their Applications 117, 121-142.

There are links to projects A1 (violation of the assumption of independence), B3 (distribution of desirability indices for correlated quality characteristics), B4 (robust modeling), B5 (robust modeling), C1 (procedures that are robust against correlations), C10 (effects of dependencies on statistical procedures) and D1 (robust modeling of dependencies).

**Project C1: Dimension reduction in high-dimensional genetic measurements using gene group tests**

**Prof. Dr. Jörg Rahnenführer**

With microarray experiments, the gene expression of thousands of genes is measured simultaneously. The measurements describe the gene activity at certain times or under certain experimental conditions. They usually serve as a starting point for investigating the underlying biology. Long lists of differentially expressed genes are usually analyzed. The integration of structural, regulatory or enzymatic properties of the associated proteins leads to a significant improvement in the functional interpretation of the results.

In biology and medicine, it is now common practice to analyze not only the most interesting individual genes but also statistically conspicuous gene groups that share a given, mostly functional, context. If many genes of a biologically defined group, e.g. all genes that play a role in the immune defense, are expressed significantly differently between a patient group and a control group, this suggests that the corresponding function, here the immune defense, plays an important role in the disease (Goeman and Bühlmann, 2007). In a second step, the corresponding members of this gene group are then examined more closely. The calculation of statistical significance for gene groups also provides global functional profiles with high biological interpretability.

An important problem in calculating the significance of gene groups is the high level of redundancy, since many gene groups overlap strongly. This leads to highly correlated test statistics and thus, given the typically large number of tested groups, to the loss of significant results when adjusting for multiple testing. In recent years, various algorithms have been developed for determining the relevance of gene groups in microarray experiments (Alexa et al., 2006; Mansmann and Meister, 2005). The groups examined were 'Gene Ontology' (GO) classes. The GO provides an assignment of genes to biological processes and molecular functions, which are arranged hierarchically. Our own methods have exploited the special structure of the Gene Ontology to heuristically decorrelate the test statistics. It has been shown that these algorithms identify more biologically relevant processes than classical, established methods (Alexa et al., 2006). The methods have also been successfully applied to prostate cancer data.

There are now more than 30 publications in the literature that describe methods for testing groups of genes. However, the decorrelation of test statistics arising from overlapping gene groups has only been addressed for the specific application to GO classes. This project aims to set two priorities. On the one hand, methods are to be developed that can also be applied to other gene groups, such as genes that belong to a common metabolic pathway or to defined parts of genetic networks. Since the subset relationships from the hierarchical structure of the GO are then lost, new concepts have to be developed. A first approach could be a sequential procedure that iteratively finds the most significant gene group from the overall list, given the partial list of previously identified gene groups. Here, methods of sequential testing need to be investigated. On the other hand, different test statistics for calculating the significance of gene groups are to be compared. The most popular methods use only the ranking of the individual genes by significance, but dependencies between genes can be captured, for example, by test statistics that include the correlation between genes (Rahnenführer et al., 2004).
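
As a point of reference for what "significance of a gene group" can mean, the sketch below computes a simple over-representation P-value from the hypergeometric distribution and applies a Bonferroni adjustment across groups. This is one widely used baseline, assumed here for illustration; it is not the decorrelation method described above, and the group names and counts are invented.

```python
from math import comb


def enrichment_p_value(n_genes, group_size, n_selected, overlap):
    """P(X >= overlap) for a hypergeometric X.

    X counts how many of n_selected randomly chosen genes (out of n_genes)
    fall into a group of size group_size.
    """
    upper = min(group_size, n_selected)
    favourable = sum(
        comb(group_size, k) * comb(n_genes - group_size, n_selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(n_genes, n_selected)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical study: 10000 measured genes, 200 of them differentially expressed
groups = {
    "immune response": (50, 8),            # (group size, differentially expressed members)
    "immune response, subgroup": (30, 7),  # nested in the group above
    "unrelated process": (40, 1),
}
raw = [enrichment_p_value(10000, size, 200, hits) for size, hits in groups.values()]
for name, p, p_adj in zip(groups, raw, bonferroni(raw)):
    print(f"{name}: raw p = {p:.2e}, adjusted p = {p_adj:.2e}")
```

The nested subgroup largely repeats the evidence of its parent group, yet it counts as a separate test in the adjustment. With thousands of overlapping groups this redundancy inflates the multiple-testing penalty, which is the motivation for decorrelating the test statistics.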

The project is carried out in close cooperation with the Max Planck Institute for Computer Science in Saarbrücken. The existing cooperation with Prof. Mansmann from the LMU Munich is to be expanded.

#### Literature

- Newton, M.A., Quintana, F.A., Den Boon, J.A., Sengupta, S., and Ahlquist, P. (2007): Random-Set Methods Identify Distinct Aspects of the Enrichment Signal in Gene-Set Analysis. The Annals of Applied Statistics 1 (1), 85-106.
- Schulz, W.A., Alexa, A., Jung, V., Hader, C., Hoffmann, M.J., Yamanaka, M., Fritzsche, S., Wlazlinski, A., Müller, M., Lengauer, T., Engers, R., Florl, A.R., Wullich, B., and Rahnenführer, J. (2007): Factor Interaction Analysis for Chromosome 8 and DNA Methylation Alterations Highlights Innate Immune Response Suppression and Cytoskeletal Changes in Prostate Cancer. Molecular Cancer 6, Article 14.
- Goeman, J.J., and Bühlmann, P. (2007): Analyzing Gene Expression Data in Terms of Gene Sets: Methodological Issues. Bioinformatics 23 (8), 980-987.
- Alexa, A., Rahnenführer, J., and Lengauer, T. (2006): Improved Scoring of Functional Groups from Gene Expression Data by Decorrelating GO Graph Structure. Bioinformatics 22 (13), 1600-1607.
- Mansmann, U., and Meister, R. (2005): Testing Differential Gene Expression in Functional Groups. Goeman's Global Test versus an ANCOVA Approach. Methods of Information in Medicine 44 (3), 449-453.
- Rahnenführer, J., Domingues, F.S., Maydt, J., and Lengauer, T. (2004): Calculating the Statistical Significance of Changes in Pathway Activity from Gene Expression Data. Statistical Applications in Genetics and Molecular Biology 3 (1), Article 16.

There are links to projects B3 (each with competing goals), B4 (dimension reduction), B6 (methods that are robust against correlations), C2 (bioinformatics data), C10 (dimension reduction), D3 (cluster and classification methods for dimension reduction) and D4 (numerical problems with high dimension).

**Project C2: Statistical models for the dependence of survival times on complex genetic markers**

**Prof. Dr. Jörg Rahnenführer**

New experimental techniques in molecular biology have resulted in a deluge of new genetic data in recent years. This often high-dimensional data enables a better understanding of the biological processes that trigger and control diseases. In cancer research in particular, there is hope that, as a result of better models for the development and progression of tumors, more reliable diagnosis and better therapy decisions can be made. One example is better classifications of different forms of cancer with the help of microarray data.

In recent years we have developed a biostatistical model for genetic progression in human tumors (Beerenwinkel et al., 2005) and evaluated it statistically and clinically in a variety of ways (Rahnenführer et al., 2005; Toloşi, 2006; Bogojeska, 2007). In this model, progression is described by the irreversible, mostly sequential accumulation of somatic changes in cancer cells. Our mixture model of oncogenetic trees is characterized by high interpretability and enables the introduction of a genetic progression score that quantifies the genetic progress of a patient's disease in a univariate manner. Using Cox models from survival analysis, it was possible to demonstrate that for patients with prostate cancer or with various types of brain tumors, a higher genetic score correlates with a shorter time to relapse or death (Rahnenführer et al., 2005; Ketter et al., 2007).

The clinical meaningfulness of such a tumor progression model depends on the one hand on the stability of the statistical model and on the other hand on the prediction quality of the derived scores for the survival times of interest. Simulation studies have already shown that the topology of our progression model and thus also the derived scores cannot always be reliably estimated (Bogojeska, 2007). The forecast quality has yet to be assessed using methods of estimating optimism in the models (Schumacher et al., 2007).

The aim of this project is to use genetic data from tumor samples to develop markers that enable classification of the associated patients with significantly different survival prognoses. An appropriate compromise should be found between model interpretability, model stability and prediction quality. The data can be expression measurements, CGH data or epigenetic measurements. Special questions will be the appropriate selection of characteristics from the genetic measurements, the combination of characteristics to interpretable scores and the adaptation of methods to evaluate the correlation with survival times.

While our models can be unstable due to their complexity, the simple counting of genetic changes, which is popular in medicine, leads to scores that are too simple and classify poorly. In this project, compromises are to be found. A starting point will be weighted sums of genetic changes in which the number of relevant changes is kept small by regularization approaches, similar to the application of a LASSO regression. An alternative is the adaptation of our progression models so that they relax the strict assumption of the sequential accumulation of genetic changes. This can be done either by adding a noise term in the models or by using other model estimation methods.

Another desirable property of the new models is robustness against outliers in both genetic data and survival times.

#### literature

- Ketter, R., Urbschat, S., Henn, W., Kim, Y.-J., Feiden, W., Beerenwinkel, N., Lengauer, T., Steudel, W.-I., Zang, KD, and Rahnenführer, J. (2007): Application of Oncogenetic Trees Mixtures as a Biostatistical Model of the Clonal Cytogenetic Evolution of Meningiomas. International Journal of Cancer 121 (7), 1473-1480.
- Bogojeska, J. (2007): Stability Analysis for Oncogenetic Trees. Master's thesis, Saarland University.
- Schumacher M., Binder H., and Gerds T. (2007): Assessment of Survival Prediction Models Based on Microarray Data. Bioinformatics 23, 1768-1774.
- Toloşi, L. (2006): Analysis of ArrayCGH Data for the Estimation of Genetic Tumor Progression. Master's thesis, Saarland University.
- Rahnenführer, J., Beerenwinkel, N., Schulz, W.A., Hartmann, C., von Deimling, A., Wullich, B., and Lengauer, T. (2005): Estimating Cancer Survival and Clinical Outcome Based on Genetic Tumor Progression Scores. Bioinformatics 21 (10), 2438-2446.
- Beerenwinkel, N., Rahnenführer, J., Däumer, M., Hoffmann, D., Kaiser, R., Selbig, J., and Lengauer, T. (2005): Learning Multiple Evolutionary Pathways from Cross-Sectional Data. Journal of Computational Biology 12 (6), 584-598.

There are links to projects B4 (robustness, LASSO regression), C1 (bioinformatic data), C9 (censored length of stay, Cox regression), C11 (survival times), D1 (outliers), D2 (classification with more than two classes) and D4 (Numerical Problems with High Dimension).

**Project C4: Statistical Modeling of Music: From their creation to their perception**

**Prof. Dr. Claus Weihs**

Music can be understood as a time series of vibrations that change not only in time but also spatially. Models for musical sounds typically relate to a small period of time and a point in space, e.g. at which the human ear is located. In fact, such sounds undergo manifold changes from their generation to perception, which lead to model changes that have not yet been investigated as an overall process.

In this project, the entire process of generation, resonance, spatial transmission and perception of musical sounds is to be modeled. Based on scientific models (see e.g. Roederer, 2000), the statistical fluctuations of the musical signals should first be modeled. As an example of a typical fluctuation, the vibrato will be examined in different musical instruments. Models for musical sounds that contain vibrato (see e.g. Rossignol et al., 1999; Weihs et al., 2006) should be compared before and after transformations through resonance, spatial sound and perception, i.e. the transformation of the spectral distributions, the deterministic model parts and the error distributions is examined.

The aim of this project is to investigate the perception of monophonic and polyphonic music played in different instrument-room listening situations. In addition to understanding the transformation of statistical distributions, an essential role is played by understanding the physics of sound generation, resonance and room acoustics (see e.g. Roederer, 2000), the physiological processes during hearing (see e.g. Szepannek et al., 2006) and the neurological processing of what is heard (cf. e.g. Petkov et al., 2006). The hearing processes should be converted into classification models to identify the output signal classes (e.g. pitches and durations of notes).

#### literature

- Petkov, C., Kayser, C., Augath, M., and Logothetis, N. (2006): Functional Imaging Reveals Numerous Fields in the Monkey Auditory Cortex. PLoS Biology 4, 1213-1226.
- Roederer, J.G. (2000): Physical and psychoacoustic basics of music. 3rd ed., Springer, Berlin.
- Rossignol, S., Depalle, P., Soumagne, J., Rodet, X., and Colette, J.-L. (1999): Vibrato: Detection, Estimation, Extraction, Modification. In: Proceedings of the COST-G6 Workshop on Digital Audio Effects (DAFx-99).
- Szepannek, G., Harczos, T., Klefenz, F, Katai, A., Schikowski, P., and Weihs, C. (2006): Vowel Classification by a Perceptually Motivated Neurophysiologically Parameterized Auditory Model. In: Decker, R., and Lenz, H. (Eds.): Advances in Data Analysis. Springer, Heidelberg, 653-660.
- Weihs, C., Ligges, U., and Sommer, K. (2006): Analysis of Music Time Series. In: Rizzi, A., and Vichi, M. (Eds.): COMPSTAT-2006 - Proceedings in Computational Statistics. Physica, Heidelberg, 147-159.

There are links to projects B1 (qualitative assumptions, spatial statistics), B2 (deviations from measurement models), B3 (multi-criteria problem), B5 (robust classification), C5 (signal transmission networks), C6 (spatial signal modeling), D1 (robust time series analysis), D2 (multi-class problem) and D3 (spectra).

**Project C5: Modeling of signal transmission networks**

**Prof. Dr. Roland Fried, Prof. Dr. Katja Ickstadt**

Signal transduction is the process of converting extracellular signals into intracellular signals, which stimulate functional responses in the cell. The basic questions here are how a particular response is triggered by a given stimulus and how this response is regulated.

Mathematical modeling using systems of ordinary differential equations has made important contributions to a better understanding of cellular signal transmission networks. However, the determination and interpretation of the model parameters (e.g. association and dissociation rates, protein concentrations, etc.) can be very difficult with a large number of molecules to be considered. The newer approach of the Modular Response Analysis (MRA) divides the network into functional modules and derives the interaction strengths between the modules from experimental data. The successful use of this methodology on a small signal transmission network recently allowed the identification of a positive feedback that controls the development of adrenal pheochromocytoma (PC12) cells in rats (Santos, Verveer and Bastiaens, 2007). This approach, which can be classified as an inverse problem, proves to be successful on small scales, but is difficult to transfer to large systems and is also deterministic.

In reality, the functional modules and their connections are to be regarded as stochastic. Amounts of protein vary from cell to cell. Consequently, both the functional modules, which represent protein concentrations, and the connections between the modules, which describe the composite effects of protein levels and network parameters, will have a probability distribution. Thus, the alternative approach aimed at in this project, to take into account intercellular variability using statistical methods to determine the model parameters, appears promising.

As a stochastic alternative to the MRA approach, Bayesian networks have already been used in the literature to determine causal protein signal transmission networks from cell data (Sachs et al., 2005). In this modeling approach, however, feedbacks, which are essential properties of cellular signal transmission networks, cannot be taken into account.

In this project, therefore, such graph-based methods are to be extended to include feedback. Within the functional modules, feedback is to be modeled as an inverse problem under uncertainty, for which an adaptation of the Bayesian approach based on Lévy processes by Wolpert and Ickstadt (2004) is planned (see also Wolpert, Ickstadt and Hansen, 2003). The functional modules should then be linked by means of graphical modeling (see Fried and Didelez, 2003, 2005).

#### literature

- Fried, R., and Didelez, V. (2003): Decomposability and Selection of Graphical Models for Multivariate Time Series. Biometrika 90, 251-267.
- Fried, R., and Didelez, V. (2005): Latent Variable Analysis and Partial Correlation Graphs for Multivariate Time Series. Statistics & Probability Letters 73, 287-296.
- Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D.A., and Nolan, G.P. (2005): Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308, 523-529.
- Santos, S.D.M., Verveer, P.J., and Bastiaens, P.I.H. (2007): Growth Factor-Induced MAPK Network Topology Shapes Erk Response Determining PC-12 Cell Fate. Nature Cell Biology 9, 324-330.
- Wolpert, R.L., and Ickstadt, K. (2004): Reflecting Uncertainty in Inverse Problems: A Bayesian Solution Using Lévy Processes. Inverse Problems 20, 1759-1771.
- Wolpert, R.L., Ickstadt, K., and Hansen, M.B. (2003): A Nonparametric Bayesian Approach to Inverse Problems (with discussion). In: Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M., and West, M. (Eds.): Bayesian Statistics 7, Oxford University Press, Oxford, 403-417.

There are links to the projects B1 (Lévy processes as a modeling module), B4 (model selection), C4 (signal transmission networks), C6 (common application area signals) and D1 (robust modeling of dependencies).

**Project C6: Spatial modeling of cellular signals**

**Prof. Dr. Katja Ickstadt**

When it comes to signal transmission from cells, it is assumed that spatial effects such as gradients, spatial trends or cluster formation play a decisive role. Molecular clusters in the plasma membrane are one example. Regulatory GTPase Ras, a small, plasma membrane-resident protein that plays an important role in signal transmission and tumor development, forms clusters of 4 to 10 Ras proteins in small (10-20 nm) areas, depending on activation status and interactions with other proteins. The cluster sizes in turn influence the signal transmission by means of these proteins and are therefore of great interest for biomedical research.

It is also assumed that stochastic effects play a role in cellular signal transmission, e.g. in describing the dynamics of small protein clusters of the order of 4 to 10 molecules.

So far, cellular transmission networks have been modeled with the help of ordinary differential equations. To understand the spatial and stochastic effects, these models can be expanded to include a spatial component (partial differential equations) or a stochastic component (stochastic differential equations) (van Zon and ten Wolde, 2005, Ander et al., 2004).

In this project a different approach is chosen in which the spatial effects as well as the stochastic phenomena are modeled directly. For this purpose, models from spatial statistics, e.g. hierarchical Poisson / Gamma models (Wolpert and Ickstadt, 1998) and cluster models (see e.g. Knorr-Held and Raßer, 2000) are adapted for the application field of cellular signal transmission. The various proteins can be described using a marked point process (Ickstadt and Wolpert, 1999). In a further step, the dynamics of the signal transmission can be recorded by generalizing the spatial models into spatial-temporal models.
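The hierarchical Poisson / Gamma idea can be illustrated with a small simulation. The following is only a sketch under strong simplifications and not the model of Wolpert and Ickstadt (1998): space is reduced to a one-dimensional grid, and the grid size, the gamma parameters, the Gaussian kernel and its bandwidth are all assumed values chosen for illustration. Latent gamma weights are smoothed by the kernel into a positive intensity per cell, and the observed count in each cell is Poisson distributed given that intensity.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for the moderate intensities used in this sketch.
    """
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_poisson_gamma(n_cells=50, shape=0.5, scale=4.0, bandwidth=1.5, seed=1):
    """Simulate counts on a 1-D grid from a simplified Poisson / Gamma hierarchy.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Latent layer: independent gamma weights, one per grid location.
    weights = [rng.gammavariate(shape, scale) for _ in range(n_cells)]
    intensities = []
    for s in range(n_cells):
        # Normalized Gaussian kernel centered at cell s.
        kern = [math.exp(-0.5 * ((s - u) / bandwidth) ** 2) for u in range(n_cells)]
        total = sum(kern)
        intensities.append(sum(k * w for k, w in zip(kern, weights)) / total)
    # Observation layer: Poisson counts given the smoothed intensities.
    counts = [sample_poisson(lam, rng) for lam in intensities]
    return intensities, counts


intensities, counts = simulate_poisson_gamma()
```

Because a small gamma shape parameter produces a few large weights among many small ones, the simulated counts tend to concentrate in a few neighboring cells, which loosely mimics cluster formation.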

Another aspect of cellular signal transmission is the difficulty of measurement, which has to be taken into account for successful modeling. Protein clusters may be too small to be observed with a fluorescence microscope and, in addition to the stochastic effects mentioned above, noise plays a decisive role. In the course of this project, experiments and statistical evaluations will ideally improve each other, i.e. the results of the statistical analysis guide future experiments, which in turn increase the data quality for new analyses.

#### literature

- Ander, M., Beltrao, P., Di Ventura, B., Ferkinghoff-Borg, J., Foglierini, M., Kaplan, A., Lemerle, C., Tomás-Oliveira, I., and Serrano, L. (2004): SmartCell, a Framework to Simulate Cellular Processes that Combines Stochastic Approximation with Diffusion and Localization: Analysis of Simple Networks. Systems Biology 1, 129-138.
- Ickstadt, K., and Wolpert, R.L. (1999): Spatial Regression for Marked Point Processes. In: Bernardo, J.M., Berger, J.O., Dawid, A.P. and Smith, A.F.M. (Eds.): Bayesian Statistics 6, Oxford University Press, Oxford, 323-341.
- Knorr-Held, L., and Raßer, G. (2000): Bayesian Detection of Clusters and Discontinuities in Disease Maps. Biometrics 56, 13-21.
- van Zon, J.S., and ten Wolde, P.R. (2005): Green’s Function Reaction Dynamics: A Particle-Based Approach for Simulating Biochemical Networks in Time and Space. Journal of Chemical Physics 123, 234910-1-234910-16.
- Wolpert, R.L., and Ickstadt, K. (1998): Poisson / Gamma Random Field Models for Spatial Statistics. Biometrika 85, 251-267.

There are links to the projects A1 (test plans), B1 (modeling using stochastic processes and MCMC methods), C4 (spatial signal modeling), C5 (common application area signals) and D3 (clusters).

**Project C7: The spread of financial crises in international capital markets**

**Prof. Dr. Walter Krämer**

As recent economic history shows, financial crises that arise in one country seem to spread "by contagion", so to speak, to other economies. Examples are the East Asian crisis of the late 1990s, the Mexican peso crisis of 1994 or the Russian crisis of 1998. Is this a "normal" process or is it the consequence of a structural break in an otherwise stable system of economic equations? Or, from the technical point of view of econometrics: can such phenomena be explained by a model with constant structure, or does one need to assume a structural break in the coefficients of a relevant model?

To clarify this question, different models for multivariate return distributions are first estimated (vector autoregressive processes, factor models). The literature has shown that many of these models cannot be exactly identified because of the large number of coefficients. Here, additional identifying restrictions should be looked for. Existing structural break tests are then to be applied to these models and new ones are to be developed.

Another research question in this context concerns the connection between structural breaks and long memory. As a special methodological innovation, the null distribution of selected test statistics is also to be derived for cases in which returns do not have finite higher moments.

#### literature

- Arestis, P., Caporale, G., Cipolini, A., and Spagnolo, N. (2005): Testing for Financial Contagion between Developed and Emerging Markets during the 1997 East Asian Crisis. International Journal of Finance and Economics, 10, 359-367.
- Dungey, M., and Tambakis, D. (2005): Identifying International Financial Contagion. Oxford University Press.
- Forbes, K., and Rigobon, R. (2002): No Contagion, only Interdependence: Measuring Stock Market Comovements. Journal of Finance, 57, 2223-2261.
- Zeileis, A., Kleiber, C., Krämer, W., and Hornik, K. (2003): Testing and Dating of Structural Changes in Practice. Computational Statistics & Data Analysis, 44, 109-123.

There are links to projects C8 (measures of dependency for capital market returns) and C9 (homogeneous Markov processes with structural breaks between individual process phases).

**Project C8: Time-variable dependencies in the returns of high-risk securities**

**Prof. Dr. Walter Krämer**

In bad economic times, the dependence between capital market returns seems to increase compared to upswing phases. This is worrying in several ways. In particular, the diversification effect of large portfolios is lost precisely when it is needed most.

The project aims to explain this phenomenon economically and to model it statistically. In a first step, suitable measures of dependence must be found. It is well known that the Bravais-Pearson correlation coefficient, which is almost always used unquestioningly as a measure of dependence in applications, has various disadvantages when returns are not jointly normally distributed. In particular, it contains hardly any information about tail dependence, which is particularly important in times of crisis. Natural candidates here are measures of dependence based on copulas, which are invariant under monotone transformations of the returns and, depending on their construction, more sensitive to dependence in the tails.
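The invariance property can be checked on a toy example. The sketch below compares the Bravais-Pearson coefficient with Spearman's rho, used here as one simple copula-based (rank-based) measure; the five data points and the exponential transformation are made up for illustration, and the rank function assumes that there are no ties.

```python
import math


def pearson(x, y):
    """Bravais-Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data, not real returns.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

# A strictly increasing transformation of both variables.
tx = [math.exp(v) for v in x]
ty = [math.exp(v) for v in y]

print("Pearson :", pearson(x, y), "->", pearson(tx, ty))
print("Spearman:", spearman(x, y), "->", spearman(tx, ty))
```

Spearman's rho depends only on the ranks and is therefore unchanged by the transformation, whereas the Pearson coefficient changes. Spearman's rho does not by itself measure tail dependence; it only illustrates the invariance.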

Given a specific measure of dependence, it must then be checked whether the empirically observed higher values in times of crisis are merely an artifact of statistical conditioning: it has long been well known that the conditional correlation, given large absolute values of one variable, can dramatically overestimate the unconditional correlation. The same effect must also be checked for competing measures of dependence and, if necessary, eliminated.
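This conditioning effect can be reproduced in a small simulation. The sketch below assumes jointly normal "returns" with a constant correlation of 0.5; the sample size, the threshold of 1.5 standard deviations and the seed are arbitrary illustrative choices. Although the dependence structure never changes, the correlation computed only from observations with large absolute values of the first variable comes out clearly higher.

```python
import math
import random


def correlation(pairs):
    """Bravais-Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def simulate_returns(n, rho, seed):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        pairs.append((x, rho * x + math.sqrt(1.0 - rho ** 2) * noise))
    return pairs


pairs = simulate_returns(n=20000, rho=0.5, seed=42)
unconditional = correlation(pairs)

# "Crisis" subsample: condition on large absolute values of the first variable.
extreme = [(x, y) for x, y in pairs if abs(x) > 1.5]
conditional = correlation(extreme)

print("unconditional correlation:", round(unconditional, 3))
print("conditional correlation  :", round(conditional, 3))
```

A naive comparison of correlations between calm and turbulent periods would thus suggest increased dependence even in a model whose structure is constant.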

#### literature

- Campbell, R., Forbes, C., Koedijk, K., and Kofman, P. (2008): Diversification Meltdown or just Fat Tails. To appear in Empirical Finance.
- Falk, M., and Michel, R. (2006): Testing for Tail Dependence in Extreme Value Models. Annals of the Institute of Statistical Mathematics (AISM), 58, 261-290.
- Härdle, W. (2007): Copulae in Tempore Varientes. Lecture at the DAG-Stat conference "Statistics under one roof", Bielefeld, March 27-30, 2007.
- King, M., Sentana, E., and Wadhwani, S. (1994): Volatility and Links between National Stock Markets. Econometrica, 62, 901-933.
- Longin, F., and Solnik, B. (1995): Is the Correlation in International Equity Returns Constant: 1960-1990? Journal of International Money and Finance, 14, 3-26.
- Schmid, F., and Schmidt, R. (2007): Multivariate Conditional Versions of Spearman Rho and Related Measures of Tail Dependence. Appears in: Journal of Multivariate Analysis.
- Siburg, K.F., and Stoimenov, P. (2007): A Measure of Mutual Complete Dependence. Working Paper, Technical University of Dortmund.
- Siburg, K.F., and Stoimenov, P. (2007): A Scalar Product for Copulas. Working Paper, Technical University of Dortmund.

There are links to projects B3 (dependency measures), C7 (models for multivariate return distributions) and C10 (time-variable dependencies for portfolio risks).

**Project C9: The formation of rating models using empirical processes**

(Project currently not available)

In the capital market, credit risk is measured using ratings. This project looks for models that are statistically valid on the one hand and economically plausible on the other. The homogeneous Markov process with constant migration intensities is an accepted description (Krämer et al., 2007). Works such as Kiefer and Larson (2007), Weißbach and Dette (2007) and Weißbach et al. (2007), however, point to inhomogeneities. Before an inhomogeneity is assumed, it must first be clarified whether it is also economically relevant. For this purpose, the existing tests should be adapted to test for relevant differences or for equivalence, see Munk and Weißbach (1999) or Weißbach and Hothorn (2002). Relevant inhomogeneous migration intensities can be estimated non-parametrically with empirical processes. In the simplest case of rating systems with only one solvent rating class, this is possible for the cumulative intensities using the Nelson-Aalen estimator. If the intensities can be assumed to be smooth, Weißbach (2006) develops quite general consistency results for estimating this intensity by kernel smoothing. For the bandwidth selection required in this case, Weißbach et al. (2007) propose a data-adaptive method. However, there are no results on consistency and bandwidth selection if the rating system has several states, as is usual. Follow-up questions, for example whether parametric alternatives can be identified or whether covariates in a semi-parametric Cox regression help to explain the intensities, are likely to exceed the scope of the project, but should be kept in mind from the outset.
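For the simplest case mentioned above, a single solvent rating class with default as the only transition, the Nelson-Aalen estimator of the cumulative intensity can be sketched in a few lines. The function name, the input format and the four example observations are assumptions made for illustration: each observation is a pair of the time until the debtor leaves the observation and a flag indicating whether a default was observed (otherwise the observation is censored).

```python
def nelson_aalen(observations):
    """Nelson-Aalen estimate of the cumulative intensity.

    observations: list of (time, event) pairs, where event is True if the
    transition (here: default) was observed at that time and False if the
    observation was censored.
    Returns a list of (time, cumulative_intensity) at each distinct event time.
    """
    obs = sorted(observations)
    n_at_risk = len(obs)
    cumulative = 0.0
    estimate = []
    i = 0
    while i < len(obs):
        t = obs[i][0]
        events = 0
        leaving = 0
        # Collect everything that happens at time t (handles tied times).
        while i < len(obs) and obs[i][0] == t:
            if obs[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            # Increment: number of events divided by number still at risk.
            cumulative += events / n_at_risk
            estimate.append((t, cumulative))
        n_at_risk -= leaving
    return estimate


# Hypothetical durations in years; the second debtor is censored.
example = [(1.0, True), (2.0, False), (3.0, True), (4.0, True)]
print(nelson_aalen(example))
```

With four debtors at risk, the first default contributes 1/4; after one censoring, the remaining defaults contribute 1/2 and 1/1, so the estimate steps through 0.25, 0.75 and 1.75. The intensity itself, as opposed to its cumulative version, would require an additional smoothing step such as the kernel smoothing mentioned in the text.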

The cooperation with mathematics must be expanded in this project, as Prof. Dette in particular has long documented expertise on empirical processes. In the well-established cooperation with banks, the data availability and the discussion about business implications are to be promoted in order to ultimately make a contribution to financial market stability.

#### literature

- Kiefer, N.M., and Larson, C.E. (2007): A Simulation Estimator for Testing the Time Homogeneity of Credit Rating Transitions. Journal of Empirical Finance 14, 818-835.
- Krämer, W., Caasjens, S., Kramer, F., Mollenhauer, R., and Walter, R. (2007): The optimal combination of internal and external ratings. In: Schimmelmann, W., and Franke, G. (Eds.): Internal and External Ratings. FAZ, Frankfurt a. M., 123-162.
- Munk, A., and Weißbach, R. (1999): 1-α Equivariant Confidence Rules for Convex Alternatives are α / 2-Level Tests - with Applications to the Multivariate Assessment of Bioequivalence. Journal of the American Statistical Association 94, 1311-1319.
- Weißbach, R. (2006): A General Kernel Functional Estimator with General Bandwidth - Strong Consistency and Applications. Journal of Nonparametric Statistics 18, 1-12.
- Weißbach, R., and Dette, H. (2007): Kolmogorov-Smirnov-type Testing for Partial Homogeneity of Markov Processes - with Application to Credit Risk. Applied Stochastic Models in Business and Industry 23, 223-234.
- Weißbach, R., and Hothorn, T. (2002): Assessing Equivalence Tests with Respect to their Expected p Value. Biometrical Journal 44, 1015-1027.
- Weißbach, R., Pfahlberg, A., and Gefeller, O. (2007): Double Smoothing in Kernel Hazard Rate Estimation. Methods of Information in Medicine, to appear.
- Weißbach, R., Tschiersch, P., and Lawrenz, C. (2007): Testing Homogeneity of Time-Continuous Rating Transitions after Origination of Debt. Empirical Economics. In revision.

There are links to projects C2 (censored retention periods, Cox regression), C7 (structural break), C10 (credit portfolio risk) and C11 (Cox regression).

**Project C10: The Impact of Estimation Errors on Portfolio Credit Risk**

(Project currently not available)

Klein and Bawa (1976) documented that estimation uncertainty has a noticeable effect on portfolio returns. Its impact on portfolio risk is therefore a natural subject of research. Triggered by the drastic events of recent years, credit risk has established itself as a research focus. Our aim is to investigate the estimation uncertainty in the parameters of portfolio credit risk. Since Gordy (2000) has shown that common portfolio approaches can be converted into one another, the results do not depend on the approach used. The most important parameter of credit risk is the rating (Krämer, 2005). For the credit portfolio return, the model of the dependencies, usually a correlation matrix, is the parameter that limits diversification (Bürgisser et al., 1999).

There are numerous studies on the estimation of rating models (see Weißbach et al., 2007, and references therein); I am not aware of an analysis of the effects of estimation uncertainty on loan portfolio measures. When estimating the dependency structure, the definition of the target variable is particularly controversial: Bürgisser et al. (1999) use conditional default probabilities, whereas Rosenow and Weißbach (2005) and Weißbach and Rosenow (2007) consider default rates.

First results on the impact of the correlation model on portfolio risk are available (Rosenow et al., 2004, 2007).

One goal of the project is the transition from the primarily computer-aided approaches of assessment to analytical findings as already indicated in Heuer (2007). The latest analytical results from Höse (2007) are to be used for the simultaneous estimation of the dependency structure and the probability of default.

One possible simplification, as in the correlation estimation of Weißbach and von Lieres und Wilkau (2005, 2006), is to restrict attention initially to portfolios of non-performing loans.

In cooperation with international credit risk management and through interdisciplinary academic cooperation, this should contribute to financial econometrics.

#### literature

- Bürgisser, P., Kurth, A., Wagner, A., and Wolf, M. (1999): Integrating Correlations. Risk Magazine 12, 57-60.
- Gordy, M. (2000): A Comparative Anatomy of Credit Risk Models. Journal of Banking and Finance 24, 119-149.
- Heuer, C. (2007): Effect of the correlation coefficient and its estimate on economic capital. Diploma thesis, Technical University of Dortmund, Faculty of Statistics.
- Höse, S. (2007): Statistical accuracy in the simultaneous estimation of dependency structures and default probabilities in loan portfolios. Dissertation at the Faculty of Economics at the University of Dresden.
- Klein, R., and Bawa, V. (1976): The Effect of Estimation Risk on Optimal Portfolio Choice. Journal of Financial Economics 3, 215-231.
- Krämer, W. (2005): On the Ordering of Probability Forecasts. Sankhyā 67, 662-669.
- Rosenow, B., and Weißbach, R. (2005): Conservative Estimation of Default Rate Correlations. In: Takayasu, H. (Ed.): Practical Fruits of Econophysics. Heidelberg, Springer, 272-276.
- Rosenow, B., Weißbach, R., and Altrock, F. (2004): Modeling PD Correlation - with Application to CreditRisk+. SFB475 discussion paper, 5, Technical University of Dortmund.
- Rosenow, B., Weißbach, R., and Altrock, F. (2007): Modeling Correlation in Portfolio Credit Risk II. SFB475 discussion paper, 6, Technical University of Dortmund.
- Weißbach, R., and Rosenow, B. (2007): Smooth Correlation Estimation - with Application to Portfolio Credit Risk. In: Weihs, C., and Gaul, W. (Eds.): Classification: The Ubiquitous Challenge. Heidelberg, Springer, 474-481.
- Weißbach, R., and von Lieres und Wilkau, C. (2005): On Partial Defaults in Portfolio Credit Risk - A Poisson Mixture Approach. SFB475 discussion paper, 6, Technical University of Dortmund.
- Weißbach, R., and von Lieres und Wilkau, C. (2006): On Partial Defaults in Portfolio Credit Risk: Comparing Economic and Regulatory View. SFB475 discussion paper, 2, Technical University of Dortmund.

There are links to projects B6 (violation of independence), C1 (dimension reduction), C8 (diversification in large portfolios) and C9 (parameter rating).

**Project C11: Modeling the development of mesothelioma through fiber exposure**

**Prof. Dr. Joachim Kunert**

The proven harmfulness of asbestos to humans has meant that mineral insulation materials have to be examined for their effects. This investigation is carried out in animal experiments on rats.

The article by Rödelsperger (2004) triggered a discussion as to whether the results of animal experiments on rats on the toxicity of mineral fibers can be transferred to humans at all. Rödelsperger (2004) refers to an article by Berry (1999), which attempts to derive a model for the survival time up to the formation of mesothelioma from the properties of the substances, taking into account in particular the biological decomposition of the inhaled fibers in the lungs. Since the biological decomposition in the lungs proceeds at the same rate in rats as in humans, but the lifespan of humans is much longer, applying the Berry model to predictions for humans would mean that no substance with a finite decomposition time leads to mesothelioma in humans, even if it has been shown to have harmful effects in animal experiments on rats.

With asbestos fibers, the decomposition time is infinite. Therefore, the proven effect of asbestos on humans cannot be used as a counterargument. The project is intended to contribute to this discussion by first investigating whether the Berry model can meaningfully describe the data from animal experiments. If animal experiments cannot be modeled well with this model, it certainly cannot be used for predictions in humans. Furthermore, alternative models for survival times are to be investigated and fitted to the animal data. What consequences would these models have for humans if they were extrapolated? Are there predictions from animal experiments for humans that are robust against model variations?

#### literature

- Bernstein, D.M., Riego Sintes, J.M., Ersboell, B.K., and Kunert, J. (2001): Biopersistence of Synthetic Mineral Fibers as a Predictor of Chronic Inhalation Toxicity in Rats. Inhalation Toxicology 13, 823-849.
- Bernstein, D.M., Riego Sintes, J.M., Ersboell, B.K., and Kunert, J. (2001): Biopersistence of Synthetic Mineral Fibers as a Predictor of Chronic Intraperitoneal Injection Tumor Response in Rats. Inhalation Toxicology 13, 851-875.
- Berry, G. (1999): Models for Mesothelioma Incidence Following Exposure to Fibers in Terms of Timing and Duration of Exposure and the Biopersistence of the Fibers. Inhalation Toxicology 11, 111-130.
- Rödelsperger, K. (2004): Extrapolation of the Carcinogenic Potency of Fibers from Rats to Humans. Inhalation Toxicology 16, 801-807.

There are links to the projects A2 (model robustness), C2 (survival times) and C9 (Cox regression).

**Project D1: Robust time series analysis**

**Prof. Dr. Roland Fried**

The statistical analysis of time series of continuous variables is mostly based on strong model assumptions, such as a Gaussian process, i.e. (multivariate) normally distributed variables. Since real data often deviate from such simplifying standard assumptions, in particular through heavy-tailed distributions or outliers, consistency studies have been carried out for traditional approaches based on quadratic or absolute deviations, likelihood or moments (Mikosch et al., 1995; Pan, Wang and Yao, 2007). In addition, various robust analysis approaches have been developed, such as robust trend estimators (Davies, Fried and Gather, 2004; Fried, Einbeck and Gather, 2007), robust autocorrelation estimators (Masarotto, 1987; Ma and Genton, 2000), (generalized) M-estimators for ARMA models (Bustos and Yohai, 1986), and iterative procedures for outlier identification and elimination (Chen and Liu, 1993; Gather, Bauer and Fried, 2002). The latter, however, assume an underlying Gaussian process with only a few outliers. Robust approaches to identifying a suitable model and robust parameter estimates usually stand loosely side by side. What is desirable instead are robust methods that are coordinated with one another and enable an integrated analysis of contaminated data, consisting of steps such as model selection, parameter estimation and model diagnosis, including the calculation of reliable forecast intervals.
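To illustrate what robustness means for trend estimation, the following sketch compares an ordinary least squares line with the (unweighted) repeated median fit in a single time window. It is a minimal illustration and not the weighted repeated median procedure of the cited papers; the window of eleven observations and the two outliers are invented for the example.

```python
from statistics import median


def repeated_median_fit(y):
    """Repeated median line for observations y[0], ..., y[n-1] at times 0..n-1.

    Slope: median over i of the median over j != i of the pairwise slopes.
    Level: median of the residuals y[i] - slope * i.
    """
    n = len(y)
    slope = median(
        median((y[j] - y[i]) / (j - i) for j in range(n) if j != i)
        for i in range(n)
    )
    intercept = median(y[i] - slope * i for i in range(n))
    return intercept, slope


def least_squares_fit(y):
    """Ordinary least squares line for the same design, for comparison."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


# Hypothetical window: exact linear trend 1 + 2t, then two gross outliers.
window = [1 + 2 * t for t in range(11)]
window[9] = 100
window[10] = 120

print("repeated median (level, slope):", repeated_median_fit(window))
print("least squares   (level, slope):", least_squares_fit(window))
```

The repeated median recovers level 1 and slope 2 despite the two contaminated observations, while the least squares slope is pulled far away from 2.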

The aim of this project is to develop a toolkit of compatible methods for robust time series analysis. For this purpose, existing methods from the literature must be reviewed and compared, and instruments for the various analysis steps must be coordinated. Analytical properties such as consistency under general assumptions, influence functions and maxbias curves are used for this, as well as extensive simulation studies. Ultimately, the aim is to implement the best existing and newly developed methods in generally available statistical software.

#### literature

Bustos, O.H., and Yohai, V.J. (1986): Robust Estimation for ARMA Models. Journal of the American Statistical Association 81, 155-168.

Chen, C., and Liu, L.M. (1993): Joint Estimation of Model Parameters and Outlier Effects in Time Series. Journal of the American Statistical Association 88, 284-297.

Davies, P.L., Fried, R., and Gather, U. (2004): Robust Signal Extraction for On-line Monitoring Data. Journal of Statistical Planning and Inference 122, 65-78.

Fried, R., Einbeck, J., and Gather, U. (2007): Weighted Repeated Median Smoothing and Filtering. Journal of the American Statistical Association 102 (480), 1300-1308.

Gather, U., Bauer, M., and Fried, R. (2002): The Identification of Multiple Outliers in Online Monitoring Data. Estadística 54, 289-338.

Ma, Y., and Genton, M.G. (2000): Highly Robust Estimation of the Autocovariance Function. Journal of Time Series Analysis 21, 663-684.

Masarotto, G. (1987): Robust Identification of Autoregressive Moving Average Models. Applied Statistics 36, 214-220.

Mikosch, T., Gadrich, T., Klüppelberg, C., and Adler, R.J. (1995): Parameter Estimation for ARMA Models with Infinite Variance Innovations. Annals of Statistics 23, 305-326.

Pan, J.Z., Wang, H., and Yao, Q.W. (2007): Weighted Least Absolute Deviations Estimation for ARMA Models with Infinite Variance. Econometric Theory 23, 852-879.

There are links to projects B4 (robust model selection for linear time series models), B6 (dependency structures), C2 (outliers), C4 (robust time series analysis), C5 (robust modeling of dependencies), C7 (time series) and D3 (spectral analysis).

**Project D2: Problem-specific optimization of an ECOC class binarization for multi-class classification problems**

**Prof. Dr. Claus Weihs**

When binary classification methods are generalized to multi-class problems (see e.g. Szepannek et al., 2007; a popular example is the generalization of support vector machines), the approaches "one-against-all" (oaa) and "one-against-one" (oao) are often used (for SVMs see e.g. Vapnik, 1995; Vogtländer and Weihs, 2000).

The ECOC approach (error-correcting output codes) represents a generalization (Dietterich and Bakiri, 1995): the original k-class problem is converted into n binary problems by combining classes (cf. Gebel and Weihs, 2007). Each class receives a codebook entry consisting of its n binary class labels. A new observation is assigned to the class with the most similar codebook entry. Kong and Dietterich (1995) show that the method reduces bias. Allwein et al. (2000) show that both the oaa and the oao approach can be expressed using error-correcting output codes.
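The decoding step described above can be sketched in a few lines of Python. The snippet assigns an observation to the class whose codebook entry has the smallest Hamming distance to the vector of the n binary predictions. The codebook `CODEBOOK`, the class names and the function names are illustrative assumptions and are not taken from the cited papers.

```python
def hamming(a, b):
    """Number of positions in which two equally long label vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(codebook, bits):
    """Return the class whose codebook entry is closest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(codebook[label], bits))

# Hypothetical codebook for k = 4 classes and n = 7 binary problems.
# Column j defines binary problem j: classes with a 1 against classes with a 0.
CODEBOOK = {
    "c1": (1, 1, 1, 1, 1, 1, 1),
    "c2": (0, 0, 0, 0, 1, 1, 1),
    "c3": (0, 0, 1, 1, 0, 0, 1),
    "c4": (0, 1, 0, 1, 0, 1, 0),
}

# Outputs of the 7 binary classifiers for a new observation:
# the entry of c3 with the first bit predicted wrongly.
predicted = (1, 0, 1, 1, 0, 0, 1)
print(ecoc_decode(CODEBOOK, predicted))  # prints c3
```

Because all pairs of entries in this example codebook differ in four positions, one wrong binary prediction still leads to the correct class.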

The algorithms for generating the codebook vectors of the classes (i.e. for selecting the binary classification problems) have so far aimed at the greatest possible Hamming distance between the classes, i.e. at the best possible differentiation between the classes based on the predictions of the n binary classifiers (see Kuncheva, 2005). Little attention is paid to the specific classification problem at hand. The question arises as to which classes can be combined so that a good binary classification becomes possible. Pujol and Vitria (2006) present a first heuristic. Another potential approach is the criterion "ability to separate" by Garczarek (2004), which characterizes the separability of the classes by means of a classification rule. The aim of the research topic is to develop an algorithm for generating the codebook vectors of the classes that is efficient with regard to both runtime (see e.g. Pumplün et al., 2005) and classification result.
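The Hamming-distance criterion can be made concrete with a small Python sketch. It computes the minimum pairwise Hamming distance d of a codebook; a code with minimum distance d tolerates up to floor((d - 1) / 2) wrong binary predictions. The helper names and the 7-column example code are assumptions chosen for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions in which two equally long label vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def min_hamming_distance(codebook):
    """Smallest Hamming distance between any two codebook entries."""
    return min(hamming(a, b) for a, b in combinations(codebook, 2))

def one_against_all(k):
    """oaa written as a code: problem j separates class j from all others."""
    return [tuple(1 if i == j else 0 for j in range(k)) for i in range(k)]

# Hypothetical 7-column code for four classes.
LONGER_CODE = [
    (1, 1, 1, 1, 1, 1, 1),
    (0, 0, 0, 0, 1, 1, 1),
    (0, 0, 1, 1, 0, 0, 1),
    (0, 1, 0, 1, 0, 1, 0),
]

for name, code in [("oaa", one_against_all(4)), ("7-column code", LONGER_CODE)]:
    d = min_hamming_distance(code)
    print(name, "min distance:", d, "correctable errors:", (d - 1) // 2)
```

The oaa code has minimum distance 2 and therefore cannot correct any wrong binary prediction, while the longer code has minimum distance 4 and can correct one. Neither number says anything about how easy the individual binary problems are, which is the gap the project description points to.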

#### literature

Allwein, E., Schapire, R., and Singer, Y. (2000): Reducing Multiclass to Binary: a Unifying Approach for Margin Classifiers. Proceedings of the 17th International Conference on Machine Learning, 9-16.

Dietterich, T., and Bakiri, G. (1995): Solving Multi-class Learning Problems via Error Correcting Output Codes. Journal of Artificial Intelligence Research 2, 263-286.

Garczarek, U. (2004): Classification Rules in Standardized Partition Spaces. Dissertation, Faculty of Statistics, Technical University of Dortmund.

Gebel, M., and Weihs, C. (2007): Calibrating Margin-Based Classifier Scores into Polychotomous Assessment Probabilities. Proceedings of the GfKl meeting 2007 in Freiburg; to appear.

Kong, E., and Dietterich, T. (1995): Error-Correcting Output Coding Corrects Bias and Variance. Proceedings of the International Conference on Machine Learning, 313-321.

Kuncheva, L. (2005): Using Diversity Measures for Generating Error-Correcting Output Codes in Classifier Ensembles. Pattern Recognition Letters 26, 83-90.

Pujol, O., and Vitria, J. (2006): Discriminant ECOC: A Heuristic Method for Application Dependent Design of Error Correcting Output Codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (6), 1007-1012.

Pumplün, C., Weihs, C. and Preußer, A. (2005): Experimental Design for Variable Selection in Data Bases. In: Weihs C., and Gaul, W. (Eds.): Classification - the Ubiquitous Challenge, Springer, Heidelberg, 192-199.

Szepannek, G., Bischl, B., and Weihs, C. (2007): On the Combination of Pairwise Locally Optimal Classifiers, In: Perner, P. (Ed.): Machine Learning and Data Mining in Pattern Recognition, Springer LNAI 4571, Heidelberg, 104-116.

Vapnik, V. (1995): The Nature of Statistical Learning Theory. Springer-Verlag, London.

Vogtländer, K., and Weihs, C. (2000): Business Cycle Prediction Using Support Vector Methods. Technical report 21/2000, SFB 475, Faculty of Statistics, Technical University of Dortmund.

There are links to projects B5 (classification procedure), C2 (classification with more than two classes), C4 (multi-class problem), D3 (classification with more than two classes) and D4 (quality of classification algorithms).

**Project D3: Cluster and Classification Methods in Spectral Analysis**

**JProf. Dr. Uwe Ligges**

The use of statistical spectral analysis (Walker, 1996; Bloomfield, 2000) is becoming more and more necessary in many areas of application. Analysis of image data, statistical methods in mass spectrometry (Massart et al., 1997), applications in econometrics and music analysis (Weihs et al., 2007) require statistical spectral analysis.

The object of the project is to adapt and optimize cluster and classification procedures (Hastie et al., 2001) for use in spectral analysis. For the use of cluster methods, suitable distance measures and methods are to be found that make it possible to group spectra better than before. In the field of music analysis, for example, clusters of similar sounding tones should be found (Weihs et al., 2005a, 2006). This is not possible with conventional distances (such as the Euclidean distance).

For classification methods, it is important to optimize existing procedures for the large amounts of data typical of spectral analysis applications and to make them robust against the often very strong background noise. One obvious area of application is the classification of multiple tones in polyphonic music time series for transcription (Ligges, 2006; Weihs and Ligges, 2005).

In addition to the development of suitable methodology, its implementation in algorithms and its sustainable implementation in software packages are essential aspects, for example as components of R (R Development Core Team, 2007; Ligges, 2007) or of packages such as klaR (Weihs et al., 2005b) or tuneR (Ligges, 2006).

#### literature

Bloomfield, P. (2000): Fourier Analysis of Time Series: An Introduction. 2nd Edition; Wiley, New York.

Hastie, T.J., Tibshirani, R.J., and Friedman, J. (2001): The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.

Ligges, U. (2006): Transcription of monophonic vocal time series. Dissertation, Faculty of Statistics, Technical University of Dortmund, http://hdl.handle.net/2003/22521.

Ligges, U. (2007): Programming with R. 2nd, revised and updated edition; Springer-Verlag, Heidelberg, ISBN 3-540-36332-7.

Massart, D.L., Vandeginste, B.G.M., Buydens, L.M.C., De Jong, S., Lewi, P.J., and Smeyers-Verbeke, J. (1997): Handbook of Chemometrics and Qualimetrics. Parts A + B; Elsevier, Amsterdam.

R Development Core Team (2007): R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org.

Walker, J.S. (1996): Fast Fourier Transforms. 2nd Edition; CRC Press, Boca Raton.

Weihs, C., and Ligges, U. (2005): From Local to Global Analysis of Music Time Series. In: Morik, K., Boulicaut, J.-F., and Siebes, A. (Eds.). Local Pattern Detection, Lecture Notes in Artificial Intelligence 3539, Springer-Verlag, Berlin, 217-231.

Weihs, C., Reuter, C., and Ligges, U. (2005a): Register Classification by Timbre. In: Weihs, C., and Gaul, W. (Eds.). Classification: The Ubiquitous Challenge. Springer-Verlag, Berlin, 624-631.

Weihs, C., Ligges, U., Lübke, K., and Raabe, N. (2005b): klaR Analyzing German Business Cycles. In: Baier, D., Decker, R., and Schmidt-Thieme, L. (Eds.). Data Analysis and Decision Support. Springer-Verlag, Berlin, 335-343.

Weihs, C., Szepannek, G., Ligges, U., Lübke, K., and Raabe, N. (2006): Local Models in Register Classification by Timbre. In: Batagelj, V., Bock, H.-H., Ferligoj, A., and Ziberna, A. (eds.): Data Science and Classification, 315-322, Springer-Verlag, Berlin.

Weihs, C., Ligges, U., Mörchen, D., and Müllensiefen, D. (2007): Classification in Music Research. Advances in Data Analysis and Classification; Springer, Berlin (submitted).

There are links to the projects B5 (robust classification method), C1 (cluster and classification method for dimension reduction), C4 (music as an application area for the classification of spectra), C6 (clusters), D1 (spectral analysis), D2 (classification with more than two classes) and D4 (numerical properties of the learning process on spectra).

**Project D4: Numerical properties of algorithms for statistical learning methods**

**JProf. Dr. Uwe Ligges**

On the way from statistical models to statistical algorithms, the numerical properties (Lange, 1999) of algorithms have to be taken into account if they are to be implemented in programming languages for digital computers (Knuth, 1998). There are two aspects to consider: accuracy and speed. The aim is to achieve a high speed with the highest possible precision, or to find a compromise for very computationally intensive processes.

In the area of classical linear models, extensive analyses have been carried out for least squares estimation, and various algorithms have been proposed; e.g. the problem of inaccurate inversion of the design matrix in ill-conditioned problems is avoided with the help of the QR decomposition. Updating algorithms have also been introduced for cases in which large amounts of data cannot be processed in one go.
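As a minimal sketch of the QR route, the following Python function solves a least squares problem by factorizing the design matrix X = QR with modified Gram-Schmidt and then solving the triangular system R b = Qᵀy by back substitution, so that XᵀX is never formed or inverted. Forming XᵀX squares the condition number of the problem, which is what the QR approach avoids. The function name and the small data set are assumptions for illustration; production code would normally rely on a Householder-based library routine, and full column rank of X is assumed.

```python
import math

def qr_least_squares(X, y):
    """Least squares coefficients via modified Gram-Schmidt QR.

    X is a list of n rows with p entries each (full column rank assumed),
    y is a list of n responses.
    """
    n, p = len(X), len(X[0])
    # Work on the columns of X; they are turned into the columns of Q.
    Q = [[float(X[i][j]) for i in range(n)] for j in range(p)]
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        R[j][j] = math.sqrt(sum(v * v for v in Q[j]))
        Q[j] = [v / R[j][j] for v in Q[j]]
        # Remove the component along q_j from all remaining columns.
        for k in range(j + 1, p):
            R[j][k] = sum(a * b for a, b in zip(Q[j], Q[k]))
            Q[k] = [b - R[j][k] * a for a, b in zip(Q[j], Q[k])]
    # z = Q^T y
    z = [sum(q * yi for q, yi in zip(Q[j], y)) for j in range(p)]
    # Back substitution for the upper triangular system R b = z.
    b = [0.0] * p
    for j in reversed(range(p)):
        s = z[j] - sum(R[j][k] * b[k] for k in range(j + 1, p))
        b[j] = s / R[j][j]
    return b

# Straight line y = 1 + 2x observed without noise.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]
print(qr_least_squares(X, y))
```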

In the area of statistical learning methods (Hastie et al., 2001), corresponding numerical studies exist for only a few common algorithms. In many cases, design matrices and, in particular, covariance matrices are inverted, which can lead to numerical problems. For example, in the case of quadratic discriminant analysis, regularization has been proposed to circumvent such problems (Friedman, 1989).
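One simple form of such regularization shrinks a covariance estimate toward a multiple of the identity matrix, which makes a singular or nearly singular estimate invertible. The Python sketch below shows only this shrinkage step and is not a full implementation of regularized discriminant analysis; the parameter name `gamma`, the function names and the 2×2 example are assumptions.

```python
def shrink_covariance(S, gamma):
    """Return (1 - gamma) * S + gamma * (trace(S) / p) * I for a p x p matrix S."""
    p = len(S)
    scale = sum(S[i][i] for i in range(p)) / p
    return [
        [(1 - gamma) * S[i][j] + (gamma * scale if i == j else 0.0)
         for j in range(p)]
        for i in range(p)
    ]

def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Covariance estimate of two perfectly correlated variables: singular.
S = [[1.0, 1.0], [1.0, 1.0]]
S_reg = shrink_covariance(S, gamma=0.5)

print(det2(S))      # 0.0, so S cannot be inverted
print(det2(S_reg))  # 0.75, so the regularized estimate can
```

The price of the improved conditioning is a bias in the estimate, so `gamma` is usually chosen in a data-dependent way, for example by cross-validation.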

The aim of the project is to examine the numerical properties of common algorithms for statistical learning methods and to suggest improvements where necessary. The numerous implementations of statistical learning methods in the statistical software R (R Development Core Team, 2007; Ligges, 2007), for example in the packages e1071 (Dimitriadou et al., 2006), klaR (Weihs et al., 2005), MASS (Venables and Ripley, 2002), rda (Guo et al., 2005) and several others, serve as a basis. The numerical instability and slowness of algorithms for calculating naive Bayes classifiers are often particularly noticeable. In particular, the analysis of the algorithms is to be integrated into current research on classification methods for local models (Weihs and Ligges, 2005; Weihs et al., 2006).
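One common source of numerical trouble in naive Bayes classifiers is the product of many small per-feature likelihoods, which underflows to zero in floating point arithmetic. The following Python sketch contrasts this with a computation in log space using the log-sum-exp trick. It is a generic illustration and does not describe the implementation in any of the R packages named above; the function name and the artificial likelihood values are assumptions.

```python
import math

def log_posterior(log_priors, log_likelihoods):
    """Log posterior class probabilities of a naive Bayes model.

    log_priors[c] is log P(c); log_likelihoods[c] is a list with
    log p(x_j | c) for every feature j.
    """
    joint = [lp + sum(ll) for lp, ll in zip(log_priors, log_likelihoods)]
    # log-sum-exp: subtract the maximum before exponentiating.
    m = max(joint)
    log_norm = m + math.log(sum(math.exp(v - m) for v in joint))
    return [v - log_norm for v in joint]

# Two classes, 1000 features, artificial per-feature likelihoods.
n_features = 1000
likelihoods = [[0.1] * n_features, [0.2] * n_features]

# Naive product: both class scores underflow to 0.0.
naive = []
for per_class in likelihoods:
    prod = 0.5
    for value in per_class:
        prod *= value
    naive.append(prod)
print(naive)

# Log space: the posterior is well defined.
post = log_posterior(
    [math.log(0.5), math.log(0.5)],
    [[math.log(v) for v in per_class] for per_class in likelihoods],
)
print([math.exp(v) for v in post])
```

With the naive product the two classes cannot be distinguished, because both scores are zero, while the log-space version assigns essentially all posterior mass to the second class.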

#### literature

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., and Weingessel, A. (2006): e1071: Misc Functions of the Department of Statistics (e1071). TU Vienna. R package version 1.5-16.

Friedman, J.H. (1989): Regularized Discriminant Analysis. Journal of the American Statistical Association 84 (405), 165-175.

Guo, Y., Hastie, T., and Tibshirani, R. (2005): rda: Shrunken Centroids Regularized Discriminant Analysis. R package version 1.0.

Hastie, T.J., Tibshirani, R.J., and Friedman, J. (2001): The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.

Knuth, D.E. (1998): The Art of Computer Programming. Addison-Wesley.

Lange, K. (1999): Numerical Analysis for Statisticians. Springer-Verlag, New York.

Ligges, U. (2007): Programming with R. 2nd, revised and updated edition; Springer-Verlag, Heidelberg, ISBN 3-540-36332-7.

R Development Core Team (2007): R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org.

Venables, W.N., and Ripley, B.D. (2002): Modern Applied Statistics with S. 4th edition; Springer, New York.

Weihs, C., and Ligges, U. (2005): From Local to Global Analysis of Music Time Series. In: Morik, K., Boulicaut, J.-F., and Siebes, A. (Eds.): Local Pattern Detection, Lecture Notes in Artificial Intelligence 3539, Springer-Verlag, Berlin, 217-231.
