Data Mining for BioInformatics at Ewha CSE
Dec. 14, 2001
Hwan-Seung Yong
(Gene: ACTGAAAGGGCTCTCAAA)
Dept. of Computer Science & Engineering
Ewha Womans Univ.
BioInformatics and Computer Science
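• Computer: binary (base-2) system (0/1), designed by humans
• Living things: quaternary (base-4) system (A/G/C/T), designed by nature
• Advances in computer technology
– Data analysis + databases = data mining (at present)
– High-performance parallel computing
– Distributed processing and Web/XML technology
– Emergence of knowledge management technology
• Why humans built computers
– To seek the secrets of life written in base 4
– To challenge the realm of God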
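The base-2 versus base-4 comparison above can be made concrete with a small sketch. It reads a DNA string as a base-4 number, so each nucleotide costs at most two bits. The digit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions made for illustration; the slides do not define a mapping.

```python
# Hypothetical digit assignment; the slides do not specify one.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def dna_to_int(seq: str) -> int:
    """Read a DNA string as a base-4 number, most significant digit first."""
    value = 0
    for base in seq.upper():
        value = value * 4 + CODE[base]
    return value


def int_to_dna(value: int, length: int) -> str:
    """Inverse of dna_to_int; `length` restores leading 'A' (zero) digits."""
    digits = []
    for _ in range(length):
        value, digit = divmod(value, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))


gene = "ACTGAAAGGGCTCTCAAA"  # the sequence shown on the title slide
packed = dna_to_int(gene)
print(packed.bit_length() <= 2 * len(gene))   # at most 2 bits per base
print(int_to_dna(packed, len(gene)) == gene)  # round trip is lossless
```

The length is passed to `int_to_dna` because leading `A` digits are zeros and would otherwise be lost.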
For BioInformatics
BioInformatics and Computer Science
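• BioInformatics
– Development of DNA code readers (biotechnology) and alignment techniques
• The complete gene sequence has only just been produced
– The task now is to find meaning (genes, etc.) in it.
– Comparable to recovering source code from a binary object
• Requires experts in disassemblers and reverse engineering
– Data mining is an important applicable technique.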
[Figure: analogy between a computer system (binary code, assembly code, source code) and living things in nature (DNA sequence, gene, protein)]
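As a rough illustration of "finding meaning in the sequence", the sketch below scans the three forward reading frames for open reading frames (an ATG start codon followed in-frame by a stop codon). This is a deliberately simplified stand-in, not the gene-finding method of the talk: it ignores the reverse strand, introns, and any statistical model, and the function name and `min_codons` parameter are hypothetical.

```python
# Simplifying assumptions: forward strand only, ATG start, standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, orf) for each ATG...stop run in the forward frames.

    `min_codons` is the minimum number of codons before the stop codon,
    counting the ATG itself. `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # close it and keep scanning this frame
    return orfs


print(find_orfs("CCATGAAATGGTAACC"))
```

Like a disassembler that must guess where instructions begin, the scan has to try every frame, because the sequence itself does not mark where a coding region starts.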
Why Ewha CSE is appropriate for BioInformatics
• Recent focus of CSE's research areas
– As a BK Project Plan: Knowledge Engineering Framework
– Data Warehousing and OLAP
– Data Mining
– XML Technology
– Knowledge Engineering Enabling Technology
– Knowledge Engineering Application
• Electronic Commerce
• BioInformatics
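• Related research institutes at Ewha
– Graduate School of Molecular Life Sciences (BK)
– Korea Science and Engineering Foundation SRC (Cell Signal Transduction Center)
– Ministry of Information and Communication Computer Graphics / Virtual Reality Research Center
• Prior related work (direct)
– Prosecutors' Office: development of a gene search and automated analysis program
– National Institute of Scientific Investigation: development of a genetic information management system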
Genetic band recognition and code registration program
Automated gene analysis program
DNA Locus Registration Interface
Data Warehousing, OLAP and Data Mining
• Data Warehousing and OLAP
– ETL Methodology (Extraction, Transformation and Loading)
– Data Warehouse Architecture
– OLAP Server Development
– Multidimensional Data Processing
– Metadata Handling
– Data Quality Control
• Data Mining
– Classification and Analysis of Data Mining Techniques
– Clustering Algorithm
– Association Algorithm
– Classification Algorithm
– CRM Application based on Web Log Mining
– Text Mining for XML Data
XML and Supporting Technology
• XML Related Area
– XML Server Development
• Query Processing and Storage System
– XML Document Mining
• Knowledge Enabling Technology
– Multimedia High-Speed Network
– Component based Software Engineering
– Security
– Multimedia DBMS
– Natural Language Processing
– Computer Graphics and Virtual Reality
Research Requirement for BioInformatics
• Large volume of data, including multimedia data
• High-Performance Computing System
– Massively Parallel Processing Hardware and Software
• XML-related work is important
– For exchange of bio data
– Gene Annotation
• Web-based collaborative system
– Requires web-based interoperable applications and standards
– Distributed processing techniques
• CORBA, SOAP, Microsoft .NET framework
• Data Mining
– For Gene Prediction, Functional Genomics
Bio Data Mining Research
• XML Standard for Bio Data
• Graphical User Interface for XML Data
• Data Converter to XML
– Convert existing bio data to an XML standard
– Convert between XML standards
• Integration Methodology with Existing DB
– SOAP (Simple Object Access Protocol)
– WSDL (Web Services Description Language)
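The "Data Converter to XML" item can be sketched as follows: parse FASTA text and serialize the records as XML with Python's standard library. The element and attribute names (`sequences`, `sequence`, `description`, `residues`) are hypothetical placeholders chosen for illustration; they do not follow the AGAVE, BSML, or GAME schemas, and a real converter would target one of those.

```python
import xml.etree.ElementTree as ET


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fasta_to_xml(text: str) -> str:
    """Serialize FASTA records as XML using made-up element names."""
    root = ET.Element("sequences")
    for header, seq in parse_fasta(text):
        parts = header.split(maxsplit=1)
        seq_id = parts[0] if parts else ""
        rec = ET.SubElement(root, "sequence", id=seq_id, length=str(len(seq)))
        if len(parts) > 1:
            ET.SubElement(rec, "description").text = parts[1]
        ET.SubElement(rec, "residues").text = seq
    return ET.tostring(root, encoding="unicode")


print(fasta_to_xml(">gene1 demo gene\nACTGAAAGG\nGCTCTCAAA\n"))
```

Keeping the parser separate from the serializer means the same records could be written to a different XML standard by swapping only `fasta_to_xml`.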
XML Standard for Bio Data
• Before
– FASTA format, GenBank format, GFF (General Feature Format)
• XML Format
– AGAVE (Architecture for Genomic Annotation, Visualization and Exchange)
• Developed by DoubleTwist, Inc.
• Released in June 2000
• Open-source license in August 2001
• AGAVE version 3.2 with Prophecy 3.0 in Sept. 2001
• See http://www.agavexml.org
• Genome XML Viewer by LabBook
– BSML
XML standard for Bio Data
• BioXML Standard and GAME
– An open-source/free software organization dedicated to providing a set of standard XML formats for the exchange of biological data
• GAME (Genomic Annotation Markup Language)
– Created at BDGP (Berkeley Drosophila Genome Project)
– Current version 1.1, released in March 2000
– http://www.bioxml.org
– Follows the WikiWeb scheme
• Collaborative web site that can be edited by anyone
• Community documentation system
• Anyone can edit the shared web pages
annotation
Computer Theory and Security Laboratory
Phylogenetic Tree Visualization
• Tree drawing algorithms
• Graph drawing algorithms
New algorithm design
• Simulated annealing
• Other optimization techniques
Known gene
• Sequence similarity
Unknown gene
• Neural networks
• Hidden Markov models
Unknown gene prediction
Microarray data analysis
Data mining tools
Two samples comparison
Clustering classification tools
Multiple samples comparison
Phylogenetic prediction
Phylogeny inference
Phylogenetic analysis
Comparative genomics
Whole genome sequence
Open Source Project
• Open BioInformatics Foundation
– http://www.open-bio.org
– Umbrella group for the various bio*.org groups
• bioxml.org, bioperl.org, biopython.org, biojava.org, biocorba.org
• biopathways.org
• bio-ensembl.org
– Annotation for the human genome
– The first Bioinformatics Open Source Conference (BOSC'2001) was held in August 2001 in San Diego.
– Many open-system activities
Vision and Future Prediction
• Ewha will
– Contribute to the bio data mining area
– Have a bioinformatics institute or research center
– Have a strong relationship with the bio industry
• Closing Comment
ATGCCGTCGGGCCCCGGGGC => "Thank You" expressed in base 4