Knowledge Discovery in Social Sciences: A Data Mining Approach (Spring 2018)

Led by Xiaoling Shu, this proseminar will introduce new developments in knowledge discovery and data mining to graduate students from various disciplines.

Spring 2018 | 3:10 - 5:00 p.m. every other Tuesday (April 3, 17 & May 1, 15, 29) | L.J. Andrews Conference Room (2203 SS&H) | SOC 298 | CRN 79617 | Flyer (coming soon) | Syllabus (coming soon)


This seminar will give students state-of-the-art knowledge to keep pace with the research opportunities arising from the rapid growth of large data sets and from a multidisciplinary field that joins statistics and computer science, and it will explore how these new perspectives and tools apply to social science issues. Only in recent years have social scientists started to use the new tools of data mining to advance their research and train their students. The availability of enormous amounts of data from the Internet and from data recording devices has provided unprecedented opportunities to study human behaviors and attitudes. Researchers need new skills and knowledge in preparing, processing, and mining data to make new discoveries. Data mining is a multidisciplinary field at the confluence of statistics, computer science, machine learning, artificial intelligence, database technology, and pattern recognition. The seminar will focus on the following topics:

• Implications and role of data mining in the scientific research process;

• Data mining models and their approaches to causality;

• Application of data mining techniques to real data;

• Advantages and disadvantages of data mining compared to conventional statistical techniques;

• Data mining model assessment;

• Model selection and validation; and

• Contribution of newly discovered knowledge to our understanding of theories and concepts regarding human behaviors and attitudes.


The seminar offers a new appreciation of fundamental issues in the scientific research process: causality, the relationship between theory and data, how the data mining approach differs from statistical modeling, and model assessment and validity. It enables graduate students to gain new knowledge, skills, and insights in big data, data science, computational social science, statistics, AI, machine learning, and computer science. This experience will strengthen their position in a job market that now features many new academic positions, often cluster hires, in big data and quantitative methods.


This proseminar will meet five times in Spring Quarter 2018.

Week I explains the concepts and development of data mining and knowledge discovery, and the role they play in social science research. We will define Data Mining, Knowledge Discovery, Big Data, and Computational Social Science, and identify the key features of these concepts. We will describe the process of scientific research as theory-driven, confirmatory hypothesis testing, and the impact that the new approach of Data Mining and Knowledge Discovery has on this process.

Week II deals with data preprocessing. We will elaborate on data issues such as privacy, security, data collection, data cleaning, and missing data, as well as data transformation. We will also cover data visualization, including graphical summaries of single-variable, bivariate, and complex data.

Week III is devoted to the methods of unsupervised learning: clustering and associations. On clustering, we will cover the types of cluster analysis, similarity measures, hierarchical clustering, and cluster validity. We will then turn to associations, including association rules, the usefulness of association rules, and local patterns vs. global models.
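As a small illustration of the association rules mentioned for Week III, the sketch below computes the two standard measures of a rule "A implies B": support (the share of records containing both A and B) and confidence (the share of records containing A that also contain B). It is a minimal sketch and not seminar material; the function names and the toy survey records are assumptions made up for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0


# Hypothetical toy data: topics each respondent reports following.
transactions = [
    {"news", "sports"},
    {"news", "politics"},
    {"news", "sports", "politics"},
    {"sports"},
]

rule_support = support(transactions, {"news", "sports"})
rule_confidence = confidence(transactions, {"news"}, {"sports"})
print(rule_support, rule_confidence)
```

A rule with high confidence but very low support describes a local pattern in a few records rather than a global regularity, which is one way to read the "local patterns vs. global models" distinction.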

Week IV continues with another topic of machine learning: supervised learning, which includes Bayesian methods, classification and decision trees, and neural networks. We will cover Bayesian methods and regression, inductive machine learning, decision trees, and the types of algorithms used in classification and decision trees. We will then focus on neural networks, including biological neurons and models, learning rules, and neural network topologies.
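To make the decision tree topic for Week IV more concrete, the sketch below shows one common way a classification tree scores a candidate split: Gini impurity, weighted by the size of each branch. This is only one of several splitting criteria (entropy is another), and the function names and toy labels are assumptions introduced for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two branches of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Hypothetical outcome labels before and after a candidate split.
parent = ["yes", "yes", "no", "no"]
pure_split = split_impurity(["yes", "yes"], ["no", "no"])
mixed_split = split_impurity(["yes", "no"], ["yes", "no"])
print(gini(parent), pure_split, mixed_split)
```

A tree-growing algorithm of this kind would prefer the first split, because it lowers impurity from the parent node while the second leaves it unchanged.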

Week V ends the seminar with the topic of model assessment. We will discuss and explain important methods and measures for model selection and model assessment, such as cross-validation and the bootstrap, as well as the AIC and BIC indices. We will provide the justification for these methods and show how to use them to evaluate models.
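The Week V measures can be sketched in a few lines. The example below splits observation indices into k folds for cross-validation and computes AIC and BIC from a model's log-likelihood, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. It is a minimal sketch under stated assumptions: the function names are hypothetical, the folds are contiguous rather than shuffled, and the log-likelihood value is a made-up number.

```python
import math


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n observations."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values are preferred."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; penalizes parameters more as n grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)

# Hypothetical fitted model: log-likelihood -100 with 3 parameters, 100 observations.
print(aic(-100.0, 3), bic(-100.0, 3, 100))
```

In practice the data are usually shuffled before folding, and each fold's model is fit on the training indices and scored on the held-out test indices.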

Additional/backup topics: If time permits, we will focus on data mining with text data. We will elaborate on link and network analysis, which draws on social network analysis and its various measures of centrality and prestige within a social network. We will also cover information retrieval and web search, web crawling, opinion mining, and web usage mining.