# Knowledge Discovery in Social Sciences: A Data-Mining Approach (Spring 2018)

### Spring 2018 | 10:00 a.m - 12:00 p.m. every other Tuesday (April 3, 17 & May 1, 15, 29) | L.J. Andrews Conference Room (2203 SS&H) | SOC 298 | CRN 79617 | Flyer

This proseminar will provide students with state-of-the-art knowledge to keep pace with the emergent research opportunities associated with big data, statistics, and computer science, and to apply these new perspectives and tools to social science questions.

Only in recent years have social scientists started to use the new tools from data-mining science to advance their research and train their students. The availability of enormous amounts of data, from the Internet and data recording devices, has provided unprecedented opportunities to study human behaviors and attitudes. Researchers need new skills and knowledge in preparing, processing, and mining data to make new discoveries. Data mining is a multi-disciplinary field at the confluence of statistics, computer science, machine learning, artificial intelligence, database technology, and pattern recognition.

The seminar will focus on the following topics:

• Implications and role of data mining in the scientific research process;

• Data mining models and their approaches to causality;

• Application of data mining techniques to real data;

• Advantages and disadvantages of data mining compared to conventional statistical techniques;

• Data mining model assessment;

• Model selection and validation; and

• Contribution of newly discovered knowledge to our understanding of theory and concept regarding human behaviors and attitudes.

**Interdisciplinarity**

The seminar offers a fresh appreciation of fundamental issues in the scientific research process: causality, the relationship between theory and data, how the data mining approach differs from statistical modeling, and model assessment and validity. It enables our graduate students to gain new knowledge, skills, and insights in big data, data science, computational social science, statistics, AI, machine learning, and computer science. This experience will strengthen their position in a job market with a growing number of new academic positions, such as cluster hires in big data and quantitative methods.

**Outline**

This proseminar will meet five times in Spring Quarter 2018.

Week I explains the concepts and development of data mining and knowledge discovery, and the role they play in social science research. We will define Data Mining, Knowledge Discovery, Big Data, and Computational Social Science, and identify the key features of these concepts. We will then describe the process of scientific research as theory-driven confirmatory hypothesis testing and consider the impact of the new Data Mining and Knowledge Discovery approach on this process.

Week II deals with data preprocessing. It elaborates on data issues such as privacy, security, data collection, data cleaning, and missing data, as well as data transformation. It provides information on data visualization that includes graphical summary of single, bivariate, and complex data.

Week III is devoted to the methods of unsupervised learning: clustering and associations. We will cover the types of clustering analysis, similarity measures, hierarchical clustering, and cluster validity. We will then concentrate on associations, including association rules, the usefulness of association rules, and local patterns vs. global models.

Week IV continues with another topic in machine learning: supervised learning, including Bayesian methods, classification and decision trees, and neural networks. We will cover Bayesian methods and regression, inductive machine learning, decision trees, and the types of algorithms for classification and decision trees. We will then focus on neural networks, including biological neurons and their models, learning rules, and neural network topologies.

Week V ends the seminar with the topic of model assessment. We discuss and explain important model selection and model assessment methods and measures, such as cross-validation and the bootstrap, as well as the AIC and BIC indices, and justify when and how to use these methods to evaluate models.

Additional/backup topics: If time permits, we will focus on data mining with text data. We will elaborate on link/network analysis, drawing on social network analysis and its measures of centrality and prestige within the social network. We will also elaborate on information retrieval and web search, link analysis, web crawling, opinion mining, and web usage mining.

________________________________________________________________________________

## Proseminar blog

## 1. Sequence Analysis & Optimal Matching Methods in Sociology

### April 3, 2018

*Jingjing Chen & Madeline Craft*

Before the workshop we were assigned two readings on Sequence Analysis (SA), covering its newest developments and promising opportunities for future research in social science. The first author of the first paper (Abbott & Tsay, 2000) was Andrew Abbott, an American sociologist and one of the first scholars to introduce SA to sociology. The paper answered the most basic questions about sequence analysis: the types of research questions the method answers, the forms of data it requires, the various methods, techniques, and steps for conducting the analysis, and the advantages and weaknesses of SA.

A decade later, Aisenbrey & Fasang (2010) discussed recent developments of SA in life course studies in sociology and argued that the debate launched by Abbott & Tsay (2000) in Sociological Methods & Research triggered a second wave of SA, with new technical implementations responding to the critiques of the method since 2000. Their paper focused on life course research and used synthetic example data to show how SA, especially optimal matching, contributes to a more holistic, trajectory-oriented perspective and to broader applications of richer, multi-dimensional data analysis.

Dr. Tim Liao, a professor at the University of Illinois at Urbana-Champaign, gave a lecture on SA in the workshop. He first presented us with a state distribution plot and an entropy plot to enable us to envision the type of data that is appropriate for SA. He followed the plots with an example from Aisenbrey & Fasang (2010), in which an individual’s life course was characterized by the states of having 0 children, 1 child, 2 children, etc. Using these characterizations, an individual’s life course could be represented by the following sequence of data: 0011112223333333.

Dr. Liao next summarized the steps of SA using the Aisenbrey & Fasang (2010) example. We were taught that the first step is to theoretically specify the state space and “cost” of transforming one sequence into another. One strategy for specifying the “cost” of sequence transformation is called optimal matching. The next step is to implement a method to produce pairwise distances between subjects, followed by scaling or clustering the distances, as outlined by Aisenbrey & Fasang (2010). The final step is to analyze the clusters.
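
The transformation-cost idea behind optimal matching can be illustrated with a minimal edit-distance sketch (unit insertion/deletion/substitution costs are assumed here for simplicity; real analyses specify theoretically motivated cost matrices, e.g. via TraMineR):

```python
# Minimal sketch of an optimal-matching-style distance between two state
# sequences, using unit indel/substitution costs. Real OM analyses would
# use theoretically specified costs rather than these defaults.

def om_distance(a, b, indel=1.0, sub=1.0):
    """Dynamic-programming edit distance between sequences a and b."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + indel,      # deletion
                          d[i][j - 1] + indel,      # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

# Two hypothetical childbearing sequences (number of children per period):
print(om_distance("0011112223333333", "0001111222333333"))  # 2.0
```

Computing such distances for every pair of subjects yields the pairwise distance matrix that is then scaled or clustered in the later steps.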

SA is a complicated statistical method that cannot be completely understood in just one workshop. However, we were given a broad overview of the topic and specific resources to further our understanding, such as the book *Advances in Sequence Analysis: Theory, Method, Applications*. Dr. Liao also recommended software packages such as TraMineR and TraMineRextras in R, and SADI and SQ in Stata.

## 2. Knowledge Discovery in the Social Sciences: An Overview

### April 17, 2018

*Courtney Caviness, Jared Joseph, and Joshua Hayes*

One of the primary motivations for pursuing Big Data (BD), Knowledge Discovery in Databases (KDD), and Data Science is the exponential increase in available data in present society. While data availability used to pose a major hindrance to academic research, the growing challenge now is to make sense of the sheer glut of data in existence. This new challenge is reflected in rapidly growing segments of the economy, a wealth of new job opportunities, new academic fields (such as bioinformatics), and, right here at U.C. Davis, a new Designated Emphasis (DE) in “Computational Social Sciences.”

Researchers in these emerging fields must contend with competing definitions and blurry overlaps between concepts and terms as they work to refine and justify new techniques and methods. “Big Data” conventionally referred to data that was over a terabyte in size. In the past, such large amounts of data could not be contained in a single place, so data of this size was essentially synonymous with non-traditional analysis techniques that involved interfacing multiple computers. Laney (2001) was the first to define the term in an academic publication and attempted to root the definition in “the three V’s”: data that is large in Volume, grows at a high Velocity, and has a great amount of Variety. A subsequent *Science* forum in 2011 popularized the term “Big Data.” McKinsey (2011) and Dumbill (2013) each made subsequent additions to the definition as the realization of the social, economic, and academic importance of big data grew. The latest definition, offered by De Mauro et al. (2016), broadens “Big Data” to refer to more than just data: it also encompasses the methods, technology, and social impact of the data revolution on current society.

“Knowledge Discovery in Databases” is a similarly slippery term. While on its face the term seems to refer to inductive insights gleaned directly from data, Fayyad et al. (1996) define it as the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” That is, KDD is a dialectic process between inductive and deductive research programs that aims to *describe* observable data while also being able to *predict* patterns in future data. KDD is often misunderstood, or misrepresented, as exclusively data-driven and/or exploratory analysis. However, it more closely resembles Grounded Theory. Like all social science research, KDD requires conscious and critical reflection, combining field-specific knowledge and theory, to make informed social analyses.

The misunderstood nature of these two terms is a challenge for “Computational Social Sciences” (CSS) in general. As an interdisciplinary and newly emerging field, CSS is only just beginning to gain institutional credibility and understanding—though its potential is recognized almost universally. Part of the challenge is that CSS contains both a substantive and theoretical dimension, as well as an instrumental dimension. CSS scholars need to have deep knowledge of the social phenomena they hope to study, as well as having advanced skills for the analytical programs and coding necessary to work with big data.

The tension between theory and practice, inductive and deductive reasoning, and data-driven or theory-driven research agendas is common to every discipline, and CSS is no different. As scholars seeking to do work in this exciting new area, we will have to engage critically with these challenges.

## 3. Data preprocessing: avoiding data issues and working with cluster analysis

### May 1, 2018

*Lisa Huang, Savannah Hunter, Yining Malloch, and Erin Winters*

Professor Shu emphasized the importance of considering data privacy issues when working with big data. The code of ethics requires researchers to protect data privacy, data integrity, and human subjects’ right to know how the data will be used. However, data mining techniques make it difficult to anonymize data. Also, one major data mining function is to classify or discriminate among people, and some forms of discrimination are unethical or illegal, such as discrimination based on race/ethnicity, sex, religion, and other protected statuses.

Because machine-learning strategies rely so heavily on the dataset, having clean, complete data is of great importance: “garbage in, garbage out” is very much the theme of data prep in machine learning. Unlike some other theory-driven statistical methods, machine learning is not robust to missing data. Because of this, it is necessary to either remove incomplete data (assuming data is missing completely at random) or use one of several methods for replacing missing values. These methods include using a measure of central tendency (although this will cause standard errors and confidence intervals to be too narrow), using a random value (although this adds to the “garbage in, garbage out” problem), or using multiple imputation based on other characteristics in the data. Multiple imputation is the best choice but also the most difficult. Choosing multiple imputation will also affect your standardization procedure: you will need to impute values multiple times, compute your standardized values in each imputed data set, and average the results in the final analysis.
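
A minimal sketch of the simplest (and, as noted above, flawed) of these strategies, mean imputation, with made-up values:

```python
# Toy sketch of mean imputation for one numeric variable (None = missing).
# Mean imputation shrinks standard errors; multiple imputation would instead
# draw several plausible values per gap and pool the analyses.

def mean_impute(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [21, None, 25, 27, None, 31]
print(mean_impute(ages))  # missing entries replaced by the observed mean, 26.0
```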

In order to meaningfully analyze and interpret data, measurement scales must be normalized so that each variable's effect on the outcome is considered on an equal footing. There are a number of normalizing procedures, some more complicated than others. One is to transform variables into standardized z-scores. Another, decimal scaling, rescales values by moving the decimal point (dividing by a power of ten). Skewed variables can also be brought closer to a normal distribution with transformations such as the log, square-root, and Box-Cox transformations. Another strategy is binning: for variables that are discrete or ordered, you can “bin” groups of values together to form meaningful discrete categories, with bins based on substantive meaning, equal width, clustering, or predictive value. Like all things, the choices we make in preparing data for analysis require careful consideration. You should think about what you are trying to learn and what interpretations you intend to make before deciding on any one procedure.
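
The z-score and equal-width binning procedures can be sketched in a few lines (the income values are made up for illustration):

```python
# Sketch of two normalization steps: z-score standardization and
# equal-width binning of a numeric variable.
import statistics

def zscores(xs):
    """Center on the mean and scale by the (population) standard deviation."""
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

def equal_width_bins(xs, k):
    """Assign each value to one of k bins of equal width over the range."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    return [min(int((x - lo) / width), k - 1) for x in xs]

incomes = [20, 30, 40, 50, 100]
print(zscores(incomes))            # mean of the z-scores is 0
print(equal_width_bins(incomes, 2))  # [0, 0, 0, 0, 1]
```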

Cluster analysis is another step in the data mining process and is a form of unsupervised learning. It separates data into groups that contain cases similar to one another but distinct from cases in other clusters. Cluster analysis is unsupervised, meaning that grouping information and decision rules are not provided by the researcher. Clustering algorithms identify clusters by maximizing the distance between the central nodes of different clusters and minimizing the distance among observations within a single cluster. One measure of distance is the Euclidean distance, which is simply the shortest distance between two observations plotted on a coordinate grid. Another is the Manhattan distance, which measures the distance between two observations along horizontal and vertical paths. Using a right triangle as an example, the distance between the points that define the two acute angles can be measured either by the length of the hypotenuse (i.e., the Euclidean distance) or by the sum of the lengths of the two legs (i.e., the Manhattan distance). Other distance measures include the Jaccard distance, which measures the proportion of non-shared features between observations; the cosine distance, which can be used to analyze the similarity among text documents; and the edit distance, which measures the similarity of sequences.
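
Several of these distance measures can be computed directly; a short sketch using the right-triangle example from the text:

```python
# Sketch of the distance measures described above, for numeric points
# and for binary feature sets.
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard_distance(s, t):
    """Proportion of features NOT shared between two sets."""
    return 1 - len(s & t) / len(s | t)

def cosine_distance(p, q):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1 - dot / norm

# Right triangle with legs 3 and 4:
print(euclidean((0, 0), (3, 4)))   # hypotenuse: 5.0
print(manhattan((0, 0), (3, 4)))   # sum of the legs: 7
print(jaccard_distance({"a", "b"}, {"b", "c"}))  # 1 - 1/3
```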

Garip (2016) demonstrates the power of cluster analysis in social science research to uncover the diversity in migrant experiences in the United States. Garip (2016) shows how the composition of Mexican migrants to the United States has changed over time, identifying four distinct groups (i.e., “clusters”), each with unique social circumstances and migration concentrated in specific time periods. Cluster analysis can assist social scientists in grouping individuals or cases into homogeneous categories. However, as with any tool, our role as social scientists is to ensure that clusters are meaningful and interpretable.

## 4. A Map of Knowledge

### May 15, 2018

*Ipek Tugce Bahceci and Tanaya Dutta Gupta*

Dr. Zachary Pardos, an assistant professor at UC Berkeley in a joint position between the School of Information and the Graduate School of Education, gave a talk titled “A Map of Knowledge.” He explained how a *connectionist* model is used to represent objects as a function of the frequency distribution of other objects occurring in the same sequence segments, with a hidden layer added to capture regularities. He also explained his own work on class enrollments at UC Berkeley. He finished by stating that big data is not the phenomenon itself: big data is the light, data science the instrument, and behavior the phenomenon under study.

Dr. Pardos started his talk by defining maps as 2D projections of information. For instance, a map projects geographical information onto a 2D space. A map of knowledge, similarly, is the mapping (projection) of big behavioral data onto a vector space. Dr. Pardos then discussed two central questions related to these maps. First, what can sequences tell us about the semantics of their content? Second, how can maps be used to help achieve this goal? He stated that big data consisting of sequences can help us infer the semantics of variables, just as we infer the semantics of words from the context of sentences.

Dr. Pardos explained two models of knowledge representation: the symbolic model and the connectionist model. Symbolic models, such as the Bayesian Knowledge Tracing model, are used to evaluate cognitive and standardized tests such as the GRE. Yet a symbolic model may not always be tractable, especially when the data are very large with many branches. The connectionist (continuous/distributed) approach is more useful in such situations (e.g., Hinton's work on neural networks). It lets us work in a space of n dimensions rather than 2, using continuous vector representations. With this approach, we can not only find themes in the space but also query it.

Dr. Pardos further discussed word embedding, that is, the mapping (embedding) of words into a vector space. He explained the skip-gram model, which takes a word, projects it into an n-dimensional vector space, and predicts its context; by predicting the context, it infuses the word with meaning. Dr. Pardos then showed findings from his own project, which explores whether course semantics can be inferred from course enrollment sequences. He demonstrated course2vec, which uses enrollment-context data to capture analogies between courses. The findings indicate that the semantic fidelity obtained by training the model on course enrollments exceeds that obtained from course descriptions, so this approach can be useful for effective pathfinding in higher education.
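
The idea of inferring semantics from sequences can be illustrated without the full skip-gram machinery: a minimal sketch using plain co-occurrence counts and cosine similarity over made-up course sequences (course2vec itself trains a neural skip-gram model, which this does not reproduce):

```python
# Courses that appear in similar enrollment contexts end up with similar
# co-occurrence vectors. Course names and sequences are hypothetical.
import math
from collections import Counter, defaultdict

enrollments = [
    ["CALC1", "CALC2", "STATS"],
    ["CALC1", "CALC2", "PHYSICS"],
    ["SOC1", "SOC2", "STATS"],
    ["SOC1", "SOC2", "THEORY"],
]

# Count how often each course co-occurs with every other course.
cooc = defaultdict(Counter)
for seq in enrollments:
    for c in seq:
        for other in seq:
            if other != c:
                cooc[c][other] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# Courses sharing enrollment contexts come out more similar:
print(cosine(cooc["CALC1"], cooc["CALC2"]) > cosine(cooc["CALC1"], cooc["SOC1"]))  # True
```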

## 5. Supervised Learning: Classification, Decision Trees, and Neural Network

### May 29, 2018

*Dylan Antovich, Yiwan Ye, and Konrad Franco*

**Unsupervised Learning**

*Cluster Analysis*

Cluster analysis attempts to group individual cases into relatively distinct classes whose members are similar across dimensions defined by a set of predictor variables. This differs from classification in that it is unsupervised: no outcome/class variable or training set is provided to the algorithm. The general approach is to minimize inter-group similarity while maximizing intra-group similarity. Similarity can be defined using several distance measures: Euclidean, Manhattan (grid-based distance), Jaccard (proportion of non-shared features), cosine (angle-based difference, e.g. between word-count vectors of texts), or edit distance (the number of edits required to turn one sequence into another).

One common approach, K-means cluster analysis, chooses a predetermined number (K) of random centroid values and, through an iterative process, converges on a set of clusters. In hierarchical cluster analysis, a hierarchy of clusters is formed, ranging from one cluster encompassing all cases down to clusters containing a single case; this hierarchy can be displayed as a dendrogram. Divisive clustering builds the hierarchy top-down, iteratively splitting off the most dissimilar cases from the group, while agglomerative clustering builds it bottom-up, beginning with each case as an individual cluster and combining cases recursively until a criterion is met. Linkage functions define the distance between clusters, for example using extreme cases or group averages. In all cases a pseudo *F*-test can be used to determine the optimal number of clusters (K). The appropriateness of the resulting clusters can be tested using measures of within-cluster cohesion and between-cluster separation, which can be summarized by a silhouette value (values > .5 provide support for the presence of clusters). There are many applications in which non-parametric approaches such as cluster analysis are valuable, given the many assumptions that must be met for more traditional approaches to be valid.
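
The assignment-and-update loop of K-means can be sketched in a few lines (one-dimensional toy data with K = 2; a real analysis would use a library implementation with multiple random restarts):

```python
# Bare-bones K-means (K = 2) on 1-D data, illustrating the iterative
# converge-on-centroids logic. Data and starting centers are made up.
def kmeans_1d(xs, centers, iters=20):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[], []]
        for x in xs:
            clusters[0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1].append(x)
        # Update step: move each center to its cluster mean.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(data, centers=[0.0, 5.0])
print(sorted(centers))  # converges near the two group means, 1.0 and 9.5
```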

Dr. Shu has applied cluster analysis in her research examining gender ideology. In a recent project, she identified several gender ideology classes using cluster analysis, then examined differences in the distribution of these clusters within different political/geographic areas.

*Associations*

Association discovery is a rule-based machine learning method for discovering interesting relations between variables in large databases. In our seminar Dr. Shu reviewed market basket analysis, a popular application of the method. The example drew on past research that attempted to discover regularities between products in large-scale supermarket transaction data recorded by point-of-sale systems; for example, one supermarket's sales data yielded the rule that a customer who buys diapers is also likely to buy beer. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement. Moreover, this form of unsupervised “knowledge discovery” can be usefully applied to social science research, since it may be preferable to logistic regression when several variables contain unordered categories.

One limitation of the standard approach to discovering associations is that searching massive numbers of possible associations for collections of items that appear to be associated carries a large risk of finding many spurious associations. Many algorithms for generating association rules have been proposed. The Apriori algorithm uses a “breadth-first” search strategy and is commonly used with categorical and numerical data. Key concepts from this section of the seminar include support, confidence, lift, and leverage.
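
The support, confidence, and lift measures can be computed directly on a toy set of transactions (reusing the diapers-and-beer example from above):

```python
# Toy market-basket data; the transactions are made up for illustration.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset):
    """Share of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs): how often the rule holds when its left side occurs."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence relative to the baseline frequency of rhs (>1 = positive association)."""
    return confidence(lhs, rhs) / support(rhs)

print(support({"diapers", "beer"}))       # 2/4 = 0.5
print(confidence({"diapers"}, {"beer"}))  # 2/3
print(lift({"diapers"}, {"beer"}))        # (2/3) / (1/2) = 4/3
```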

**Supervised Learning**

*Decision Trees & Classification*

Decision trees are a supervised, nonparametric way of classifying samples. The method uses “divide and conquer” mechanics, such as Chi-squared Automatic Interaction Detection (CHAID), to divide the sample into subgroups that are as distinct as possible based on certain criteria; the model then further divides each subgroup into more specific and distinct subgroups. According to Dr. Shu, decision trees reiterate the divisions until the subgroups are nearly homogeneous or a group is too small to be divided (< 20 cases). The method is highly recursive, so the divisions and pathways resemble tree branches. Besides CHAID, researchers can also use the information gain or variance reduction method to decide which variables to use and where to split the sample. For instance, the information gain method aims to reduce the entropy of the outcome variable to achieve homogeneity, whereas variance reduction is similar to the criterion used in regression.
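
The information gain criterion can be sketched by computing the entropy of the outcome before a split and subtracting the weighted entropy after it (the labels below are made up):

```python
# Information gain = entropy(parent) - weighted entropy of the child nodes.
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in {l: labels.count(l) for l in set(labels)}.values())

def information_gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# A perfectly separating split recovers all of the parent's entropy:
parent = ["yes", "yes", "no", "no"]
print(entropy(parent))                                           # 1.0 bit
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```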

Like many other data mining methods, decision trees are good at detecting peculiar patterns hidden among variables, which improves a model's goodness of fit. Another main advantage is that decision trees are nonparametric: the method makes no assumptions about the distribution of the data or the classifying scheme, making it very flexible for various types of datasets. In addition, the decision tree method allows researchers a degree of control over the learning, such as specifying the minimum number of observations in a node for splitting or the maximum number of splits. Lastly, Dr. Shu suggests that decision trees are an intuitive way to show the relationships between sample characteristics and the probability of an outcome, making the models more appealing to the general public than a t-test or ANOVA.

There is also an advanced version of decision trees called the random forest. Random forests and decision trees share similar mechanics, but a random forest aggregates over a series of decision trees and takes the prediction that is most common or prevalent across the “trees” in the “forest.” A random forest has higher predictive power than a single decision tree. However, random forest models are usually large and complex, and it is difficult to interpret why cases are classified in a certain way. Dr. Shu suggests that researchers need to be careful not to overfit their models when using decision trees or random forests, and that they must decide whether their findings are local or universal. Nevertheless, for researchers who mainly use regression methods, decision trees are a good alternative tool for testing hypotheses.

*Artificial Neural Networks*

Artificial neural networks are biologically inspired and designed to simulate the way the human brain processes information. These networks gather their “knowledge” by detecting patterns and relationships in data and learn (or are trained) through exposure to data points (i.e., observations). A simple artificial neural network is equivalent to a logistic regression, though most networks perform much better than logistic regression. Each input node represents a single variable and passes its value to the hidden layer, where a weight is assigned to each connection; the weighted values are then summed and transformed by a mathematical (activation) function, which produces an output.
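
The forward pass just described, reduced to a single output node, can be written in a few lines; with a logistic activation it is exactly the logistic-regression form mentioned above (inputs and weights below are arbitrary):

```python
# A single artificial neuron: weighted sum of inputs plus bias, passed
# through a logistic (sigmoid) activation. With one node this is the
# logistic-regression form; real networks stack layers of such units.
import math

def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))  # logistic activation, output in (0, 1)

print(neuron([1.0, 2.0], weights=[0.0, 0.0], bias=0.0))  # 0.5
```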

Neural networks have advantages and disadvantages relative to other approaches. For instance, they are quite robust to noisy data and often have superior predictive capacity compared with regression models and decision trees. With the help of multiple layers of hidden nodes, neural networks handle nonlinear relationships more efficiently and accurately than regressions. However, these models are relatively opaque to human interpretation, tend to overfit the data, and can generate different models even when using identical training data sets, variables, settings, and validation data sets (because of their iterative learning procedure).