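Title: Learning Protein Structural Fingerprints under the Label-Free Supervision of Domain Knowledge
Speaker: Xuefeng Cui (崔学峰), Research Fellow (Tsinghua University)
Time: May 13, 2019, 3:00 PM
Venue: Room 311 (Central), South Building No. 1
Host: Linqiang Pan (潘林强)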
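About the speaker:
Xuefeng Cui is a specially appointed Research Fellow at the Institute for Interdisciplinary Information Sciences, Tsinghua University. He received his bachelor's, master's, and doctoral degrees from the University of Waterloo, Canada. His doctoral advisor was University Professor Ming Li, a recipient of the Killam Prize (Canada's highest research award), Fellow of the Royal Society of Canada, ACM Fellow, and IEEE Fellow. He completed his postdoctoral research at King Abdullah University of Science and Technology in Saudi Arabia. His main research area is bioinformatics, where he designs machine learning and parallel algorithms to solve biological problems closely tied to human life. As first author, he has published 3 papers at Intelligent Systems for Molecular Biology (ISMB), a top bioinformatics conference that accepts only about 40 papers per year. His research has been covered repeatedly by international media such as Bio-Techniques and Science X.
Abstract: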
Finding homologous proteins is the indispensable first step in many protein biology studies. Building highly efficient “search engines” for protein databases is therefore highly desirable in protein bioinformatics. Here, we propose a novel method, ContactLib-DNN, to quickly scan structure databases for homologous proteins. The core idea is to build structural fingerprints for proteins and to perform alignment-free comparisons on those fingerprints. Specifically, the fingerprints are low-dimensional vectors representing the contact groups within the proteins. Notably, the Cartesian distance between two fingerprint vectors closely matches the RMSD between the two corresponding contact groups. This is achieved by using RMSD as the domain knowledge that supervises the training of the deep neural network, so no manually assigned labels are needed. Compared with existing methods, ContactLib-DNN achieves the highest average AUROC, 0.959. Moreover, the best candidate found by ContactLib-DNN has a 70.0% probability of being a true positive, a significant improvement over 56.2%, the best result produced by existing methods.
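The supervision idea in the abstract can be illustrated with a minimal sketch. It is not the ContactLib-DNN implementation: it assumes that "Cartesian distance" means the ordinary Euclidean norm, that RMSD is computed after optimal superposition (the standard Kabsch procedure), and that the training objective is a squared difference between the two quantities. The function names `kabsch_rmsd` and `rmsd_supervised_loss` are hypothetical and chosen only for illustration.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    # Remove translation by centering both point sets.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Kabsch: SVD of the covariance matrix gives the optimal rotation.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))


def rmsd_supervised_loss(fp_a, fp_b, rmsd):
    """Squared gap between fingerprint distance and the RMSD target.

    Assumed form of the objective: the RMSD is computed from the
    structures themselves, so no human-provided label is involved.
    """
    dist = np.linalg.norm(fp_a - fp_b)
    return float((dist - rmsd) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    group_a = rng.normal(size=(8, 3))                         # toy "contact group"
    group_b = group_a + rng.normal(scale=0.3, size=(8, 3))    # perturbed copy
    target = kabsch_rmsd(group_a, group_b)
    # Stand-in fingerprints; a trained network would produce these vectors.
    fp_a, fp_b = rng.normal(size=16), rng.normal(size=16)
    print("RMSD target:", target)
    print("loss:", rmsd_supervised_loss(fp_a, fp_b, target))
```

Once fingerprints are trained this way, a database scan needs only vector distances, which is what makes the comparison alignment-free at query time.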