
Data Mining for High Dimensional Data in Drug Discovery and Development  

Lee, Kwan R. (GlaxoSmithKline, Research & Development, Data Exploration Sciences, 1250 South Collegeville Road, Collegeville)
Park, Daniel C. (GlaxoSmithKline, Research & Development, Data Exploration Sciences, 1250 South Collegeville Road, Collegeville)
Lin, Xiwu (GlaxoSmithKline, Research & Development, Data Exploration Sciences, 1250 South Collegeville Road, Collegeville)
Eslava, Sergio (GlaxoSmithKline, Research & Development, Data Exploration Sciences, 1250 South Collegeville Road, Collegeville)
Abstract
Data mining differs from traditional data analysis primarily along one important dimension: the scale of the data. For this reason, principles from computer science as well as statistics are needed to extract information from large data sets. In this paper we briefly review data mining, its characteristics, typical data mining algorithms, and potential and ongoing applications of data mining in the biopharmaceutical industry. The distinguishing characteristics of data mining lie in its understandability, its scalability, its problem-driven nature, and its analysis of retrospective or observational data rather than data from designed experiments. At a high level, one can identify three types of problems for which data mining is useful: description, prediction, and search. The data mining algorithms briefly reviewed include decision trees and rules, nonlinear classification methods, memory-based methods, model-based clustering, and graphical dependency models. The application areas covered are discovery compound libraries, clinical trial and disease management data, genomics and proteomics, structural databases for candidate drug compounds, and other applications of pharmaceutical relevance.
Keywords
data mining; high dimensional data; genomics; proteomics; pharmacogenomics