Knowledge Discovery in Social Sciences: A Data-Mining Approach (Spring 2018)

Led by Xiaoling Shu, this proseminar will introduce new developments in knowledge discovery and data mining to graduate students from various disciplines.

Spring 2018 | 10:00 a.m. - 12:00 p.m. every other Tuesday (April 3, 17 & May 1, 15, 29) | L.J. Andrews Conference Room (2203 SS&H) | SOC 298 | CRN 79617 | Flyer | Syllabus (coming soon)


This proseminar will provide students with state-of-the-art knowledge to keep pace with the emergent research opportunities associated with big data, statistics, and computer science, and to apply these new perspectives and tools to social science issues.

Only in recent years have social scientists started to use the new tools from data-mining science to advance their research and train their students. The availability of enormous amounts of data, from the Internet and data recording devices, has provided unprecedented opportunities to study human behaviors and attitudes. Researchers need new skills and knowledge in preparing, processing, and mining data to make new discoveries. Data mining is a multi-disciplinary field at the confluence of statistics, computer science, machine learning, artificial intelligence, database technology, and pattern recognition.

The seminar will focus on the following topics:

• Implications and role of data mining in the scientific research process;

• Data mining models and their approaches to causality;

• Application of data mining techniques to real data;

• Advantages and disadvantages of data mining compared to conventional statistical techniques;

• Data mining model assessment;

• Model selection and validation; and

• Contribution of newly discovered knowledge to our understanding of theory and concept regarding human behaviors and attitudes. 


The seminar offers a new appreciation of fundamental issues in the scientific research process: causality, the relationship between theory and data, how the data mining approach differs from statistical modeling, and model assessment and validity. It enables graduate students to gain new knowledge, skills, and insights in big data, data science, computational social science, statistics, AI, machine learning, and computer science. This experience will strengthen their position in a job market that features a large number of new academic positions in the form of cluster hires in big data, quantitative methods, and related areas.


This proseminar will meet five times in Spring Quarter 2018.

Week I explains the concepts and development of data mining and knowledge discovery, and the role they play in social science research. We will define Data Mining, Knowledge Discovery, Big Data, and Computational Social Science, along with the key features of these concepts. We will describe the process of scientific research as theory-driven confirmatory hypothesis testing and the impact of the new approach of Data Mining and Knowledge Discovery on this process.

Week II deals with data preprocessing. It elaborates on data issues such as privacy, security, data collection, data cleaning, and missing data, as well as data transformation. It also covers data visualization, including graphical summaries of single-variable, bivariate, and complex data.

Week III is devoted to the methods of unsupervised learning: clustering and associations. We will cover the types of cluster analysis, similarity measures, hierarchical clustering, and cluster validity. We will then turn to associations, including association rules, the usefulness of association rules, and local patterns vs. global models.

Week IV continues with another machine learning topic, supervised learning, which includes Bayesian methods, classification and decision trees, and neural networks. We will cover Bayesian methods and regression, inductive machine learning, decision trees, and the types of algorithms used in classification and decision trees. We will then focus on neural networks, including biological neurons and models, learning rules, and neural network topologies.

Week V ends the seminar with the topic of model assessment. We will discuss and explain important model selection and model assessment methods and measures, such as cross-validation and the bootstrap, as well as the AIC and BIC indices. We will provide justification for these methods and show ways to use them to evaluate models.

Additional/backup topics: If time permits, we will focus on data mining with text data. We will elaborate on link/network analysis, which draws on social network analysis and its various measures of centrality and prestige within a social network. We will also elaborate on information retrieval and web search, link analysis, web crawling, opinion mining, and web usage mining.


Proseminar blog


1. Sequence Analysis & Optimal Matching Methods in Sociology

April 3, 2018 

Jingjing Chen & Madeline Craft

Before the workshop we were assigned two readings on Sequence Analysis (SA), covering its newest developments and promising opportunities for future research in social science. The first author of the first paper (Abbott & Tsay, 2000) is Andrew Abbott, an American sociologist and one of the first scholars to introduce SA to sociology. This paper answered the most basic questions about sequence analysis, such as the types of research questions the method answers, the forms of data it requires, the methods, techniques, and steps used to conduct the analysis, and the advantages and weaknesses of SA.

A decade later, Aisenbrey & Fasang (2010) discussed the recent development of SA in life course studies in sociology and argued that the debate launched by Abbott & Tsay (2000) in Sociological Methods & Research triggered a second wave of SA, with new technical implementations responding to the critiques made of the method since 2000. Their paper focused on life course research and used synthetic example data to show how SA, especially optimal matching, contributed to a more holistic and trajectory-oriented perspective and to broader applications of richer, multi-dimensional data analysis.

Dr. Tim Liao, a professor at the University of Illinois at Urbana-Champaign, gave a lecture on SA in the workshop. He first presented us with a state distribution plot and an entropy plot to help us envision the type of data that is appropriate for SA. He followed the plots with an example from Aisenbrey & Fasang (2010), in which an individual’s life course was characterized by the states of having 0 children, 1 child, 2 children, etc. Using these characterizations, an individual’s life course could be represented by the following sequence of data: 0011112223333333.
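To make the two plots more concrete, here is a minimal Python sketch of the quantities they are usually built from: the share of people in each state at every time point, and the Shannon entropy of that distribution. Only the first sequence comes from the lecture example; the other three are hypothetical toy data we made up for illustration, and the function names are our own, not taken from any SA package.

```python
import math
from collections import Counter

# One string per person, one character per time point, coded as number of children.
# The first sequence is the lecture example; the others are made-up toy data.
sequences = [
    "0011112223333333",
    "0000011111222222",
    "0001111111111111",
    "0000000000000000",
]

def state_distribution(seqs, t):
    """Share of sequences found in each state at time position t."""
    counts = Counter(seq[t] for seq in seqs)
    n = len(seqs)
    return {state: count / n for state, count in sorted(counts.items())}

def cross_sectional_entropy(seqs, t):
    """Shannon entropy (natural log) of the state distribution at position t."""
    return sum(-p * math.log(p) for p in state_distribution(seqs, t).values())

length = len(sequences[0])
# One distribution and one entropy value per time point: the raw material
# for a state distribution plot and an entropy plot.
distributions = [state_distribution(sequences, t) for t in range(length)]
entropies = [cross_sectional_entropy(sequences, t) for t in range(length)]

for t in range(length):
    print(t, distributions[t], round(entropies[t], 3))
```

In this toy data everyone starts childless, so the entropy at the first time point is zero; it rises as the four life courses diverge.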

Dr. Liao next summarized the steps of SA using the Aisenbrey & Fasang (2010) example. We were taught that the first step is to theoretically specify the state space and the “cost” of transforming one sequence into another. One strategy for specifying the “cost” of sequence transformation is called optimal matching. The next step is to implement a method to produce pairwise distances between subjects, followed by scaling or clustering the distances, as outlined by Aisenbrey & Fasang (2010). The final step is to analyze the clusters.
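As a rough illustration of the distance step, the sketch below computes an optimal-matching-style distance as the cheapest series of insertions, deletions, and substitutions that turns one sequence into another, and then builds the pairwise distance matrix. The costs (1 per insertion or deletion, 2 per substitution) are assumptions chosen for the example rather than values from the lecture, and all sequences except the first are hypothetical toy data.

```python
def om_distance(a, b, indel=1.0, sub_cost=None):
    """Minimal total cost of turning sequence a into sequence b.

    indel is the assumed cost of one insertion or deletion; sub_cost(x, y)
    is the assumed cost of substituting state x with state y.
    """
    if sub_cost is None:
        # Illustrative default: a constant substitution cost of 2.
        sub_cost = lambda x, y: 0.0 if x == y else 2.0
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of turning a[:i] into b[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                           # delete a[i-1]
                d[i][j - 1] + indel,                           # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute or match
            )
    return d[n][m]

def pairwise_distances(seqs, **kwargs):
    """Square matrix of distances between all pairs of sequences."""
    return [[om_distance(a, b, **kwargs) for b in seqs] for a in seqs]

# The first sequence is the lecture example; the others are made-up toy data.
toy_sequences = [
    "0011112223333333",
    "0000011111222222",
    "0001111111111111",
]
distance_matrix = pairwise_distances(toy_sequences)
for row in distance_matrix:
    print(row)
```

The resulting matrix is the kind of input that a scaling or clustering routine would take in the following step. Changing the assumed costs changes the distances, which is why the cost specification has to be justified theoretically.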

SA is a complicated statistical method that cannot be completely understood in just one workshop. However, we were given a broad overview of the topic and specific resources to further our understanding, such as the book Advances in Sequence Analysis: Theory, Method, Applications. Dr. Liao also recommended software packages such as TraMineR and TraMineRextras for R, and SADI and sq for Stata.