Social Network Analysis for Social Scientists (Spring 2017)

This proseminar, co-led by ISS Fellows Chris Smith and Robert Faris, provided a broad overview of social network analysis, beginning with basic network visualization and customization and concluding with advanced modeling and the visualization of complex data.

ISS Fellows: Chris Smith (Sociology) and Robert Faris (Sociology).

Social network analysis (SNA) is a method for investigating social structures through the use of network and graph theories. It is used across a wide range of disciplines, from biology to sociology. The techniques covered in this proseminar will be applicable to any number of data types and disciplines. In addition to the primer on social network data, visualization, and analysis, this proseminar will rely heavily on the use of the free statistical and graphical platform R.

SOC 298 | Wednesdays, 3:40 - 5:40 p.m. | Andrews Room, 2203 SS&H | Spring 2017 | CRN 89045 | Flyer | Syllabus

 

Topics will include:

-       Introduction to SNA and its application across the social sciences and beyond

-       R: installation, syntax, etc.

-       Relational Data (common data structures, including sociomatrices, edgelists, and affiliation data)

-       Basic Visualization (including techniques for customization)

-       Graph-Level Indices (components, density, centralization measures)

-       Node-Level Indices (degree, k-core, distance, betweenness)

-       Two-Mode Networks

-       Exponential Random Graph Models (ERGM) (parameters, convergence, goodness of fit)

-       Advanced Modeling, including Stochastic Actor-Oriented Models (R-Siena)

-       Big Data and Advanced Visualization

_____________________________________________________________________________

 

Proseminar blog

 

1. Introduction

April 5, 2017

The students in our group represent departments and graduate groups including: Agriculture and Resource Economics (ARE), Anthropology, Communication, Geography, Linguistics, Political Science, Psychology, and the Graduate School of Management.

After introducing ourselves, our research, and our experience with R coding (ranging from never having downloaded the software to having used it for years), our fearless leaders taught us the nuts and bolts of social network analysis. Using example networks of friendships, criminology, and sexually transmitted diseases, we learned SNA keywords including: nodes, edges, components, and isolates (Figure 1).

We then played Six Degrees of Kevin Bacon. Try it out: go to Google and type “Bacon Number” followed by a name. The challenge is to find someone who has a Bacon Number of more than 4. We succeeded in finding two such names in our class: Adolf Hitler and Ivanka Trump. In the case of Ivanka, for example, she and Jamie Johnson appeared in Born Rich together; Jamie Johnson and Paul Weaver appeared together in Arbitrage; Paul Weaver and Sarah Jessica Parker appeared together in New Year’s Eve; and Sarah Jessica Parker and Kevin Bacon appeared together in Footloose.

Following a review of the seminar’s schedule, we had lab time to download and/or update RStudio, work through some syntax and basic R coding, and install and/or update the Statnet package, which we’ll be using later in the course. RStudio pro-tip from class: you know how lines in a script can run on forever? You can change your settings so that long lines always wrap to fit the width of your script window (the wrapping adjusts as you resize it). To make this change, open the Tools dropdown menu, choose “Global Options”, click on “Code”, and check the box next to “Soft-wrap R source files”. Boom! See you next week!

References

Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris (2003). statnet: Software tools for the Statistical Modeling of Network Data. URL http://statnetproject.org

RStudio Team (2016). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA. URL http://www.rstudio.com/

 

2. Relational Data

April 12, 2017

This week’s seminar introduced students to relational data. We started by reviewing nodes and the difference between undirected and directed ties. Undirected ties are represented by lines between nodes, with no distinction between the two nodes (a tie is simply present or absent). Directed ties are represented by arrows and indicate which node is the sender and which node is the receiver. For example, if the arrow points from node A to node B (Figure 2a), node A is the sender; if the arrow points from node B to node A (Figure 2b), node A is the receiver; and if the arrow is double-headed and points to both node A and node B (Figure 2c), both nodes are senders and receivers. A pair of nodes in a directed network can therefore be asymmetric (one sender, one receiver), mutual (both nodes send and receive), or absent (no tie in either direction).

We were then introduced to two data structures used to store relational information for social network analysis: the sociomatrix and the edgelist. The sociomatrix is square, with the nodes listed as both row names and column names, in the same order. Each cell in the sociomatrix indicates the presence (1) or absence (0) of a tie. When nodes cannot be tied to themselves, the diagonal of the matrix consists of 0s. Also important to note (for directed ties) is that rows represent the senders and columns represent the receivers. So, for example, if node A and node B share a mutual directed tie, the cells (A,B) and (B,A) will both contain a 1. The other data structure covered was the edgelist, which is a series of rows, each one representing one tie in the network. The edgelist has two columns, Sender and Receiver. As in the previous example, if node A and node B share a mutual directed tie, the edgelist will contain two rows, the first ordered (A,B) and the second ordered (B,A). For undirected ties, the order of the nodes within a row does not matter.
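To make the two structures concrete, here is a minimal sketch of converting between them. It is written in plain Python rather than the R packages we used in lab, and the three-node network and the function names are made up for illustration.

```python
# Hypothetical directed network: A and B share a mutual tie, and A sends a tie to C.
nodes = ["A", "B", "C"]
edgelist = [("A", "B"), ("B", "A"), ("A", "C")]  # (sender, receiver)

def edgelist_to_sociomatrix(nodes, edgelist):
    """Build a square 0/1 matrix; rows are senders, columns are receivers."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for sender, receiver in edgelist:
        matrix[index[sender]][index[receiver]] = 1
    return matrix

def sociomatrix_to_edgelist(nodes, matrix):
    """Read every 1 in the matrix back out as a (sender, receiver) row."""
    return [(nodes[i], nodes[j])
            for i in range(len(nodes))
            for j in range(len(nodes))
            if matrix[i][j] == 1]

sociomatrix = edgelist_to_sociomatrix(nodes, edgelist)
for name, row in zip(nodes, sociomatrix):
    print(name, row)
```

The mutual tie shows up as a 1 in both the (A,B) and (B,A) cells, while the one-way tie from A to C fills only the (A,C) cell.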

We also touched briefly on attributes that can be assigned to the social network, noting that the most important piece of the attribute table is assigning unique IDs that correspond to the edgelist ties. Attributes must be in the same order as the edgelist information. Finally, we talked about where to find social network data, which is basically everywhere! Examples include: the internet, observations, surveys, archival records, firm rosters, official police records, etc. After our lecture, we worked through a lab, practicing networking in R using both sociomatrices and edgelists.

RStudio pro-tip of the week: You know how you make comments in R scripts using “#”, and if you put four #, followed by your comment, followed by four #, it creates a little drop-down arrow so you can collapse or expand code? Well, did you know there’s an outline feature in RStudio?! In your RStudio window, at the top of your script, click on the little outline icon and an outline with your nested headings/notes will pop out (Figure 3). Click on a heading to automatically jump to that section of your script. AHH-mazing!

Student Spotlight

This week we also started student and faculty spotlights. Matt Thompson, a graduate student in Sociology, presented briefly on work relating to institutional nomination data. He created networks of colleges and universities using publicly available information and data that included attributes such as geography, size, race, gender, ranking, etc. I don’t want to give away all of Matt’s secrets, so if you’re interested in the awesome stuff he’s doing, you can contact him here: mthomp@ucdavis.edu

 

3. Two-Mode Networks

April 19, 2017

This week, we continued exploring and practicing network analysis, moving from one-mode networks, such as connections between people, to two-mode networks, such as connections between people (mode 1) and the events they attend (mode 2). Examples of two-mode networks include kids at the same birthday party, states involved in the same international crisis, and people arrested together in the same police operation. In a two-mode network, people are connected to events and events are connected to people, and we infer that co-presence at the same event means that the actors are connected. In other words, an edge, or connection, between two actors exists only if they were present at the same event. Be aware, however, that there is a potential weakness in this approach: if, for example, you and I were in the same class this week, this does not necessarily mean that we became acquainted; and if you and I were not in the same class this week, we may still be friends. Despite this potential weakness of treating co-presence at an event as a connection, in many cases there is simply no better, practically feasible way to identify connections between actors of interest.

To make the two-mode data useful, we also learned how to make projections from two-mode data to one-mode data, and back. As mentioned before, in a two-mode network we have people connected to events, and events connected to people. When we make a projection to a one-mode network, we can choose to portray a network of people connected to people, a network of events connected to events, or both. Our professor demonstrated this projection using examples from her own research and data on a network of co-arrests.

We also practiced making the projections in R using the igraph package. Following a very detailed and friendly R script (even for R novices), we converted raw data on actors and events into a two-mode network, projected it into a one-mode network of actors and a one-mode network of events, and visualized each of the networks to see how actors and events are interconnected (the two-mode network), how actors are connected (a one-mode network), and how events are connected (a one-mode network). All this was done using only ten rows and two columns of raw data. The events and actors in that data had no names or any empirical identification, so everyone in our multi-disciplinary class could apply what we learned to his or her own research. Network analysis might seem complicated, but we are only in the third week and can already make some of the magic ourselves!
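For readers who want to see the logic of a projection spelled out, here is a rough sketch in plain Python (not the igraph script from lab). The attendance rows and the `project` helper are hypothetical; the idea is simply that two actors are tied if they share an event, and two events are tied if they share an actor.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical raw data: one row per (actor, event) attendance record.
rows = [
    ("a1", "e1"), ("a2", "e1"), ("a3", "e1"),
    ("a3", "e2"), ("a4", "e2"),
    ("a5", "e3"),
]

def project(rows, mode="actors"):
    """Project two-mode data onto one mode.

    Returns a dict mapping each tied pair to the number of events
    (or actors) the pair has in common.
    """
    groups = defaultdict(set)
    for actor, event in rows:
        if mode == "actors":
            groups[event].add(actor)   # group actors by the event they attended
        else:
            groups[actor].add(event)   # group events by the actor who attended
    ties = defaultdict(int)
    for members in groups.values():
        for u, v in combinations(sorted(members), 2):
            ties[(u, v)] += 1
    return dict(ties)

actor_ties = project(rows, mode="actors")
event_ties = project(rows, mode="events")
print(actor_ties)
print(event_ties)
```

In this toy data, actor a5 attended an event alone and so ends up as an isolate in the actor projection, and events e1 and e2 are tied only because a3 attended both.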

 

4. Visualization

April 26, 2017

This week’s seminar was all about visualization: colors, shapes, widths, line types, sizes, background colors, legends, etc. There are a ton of customizable features in social networks and this week’s seminar covered a lot of them! Some helpful tips for visualization:

1)    Aim for readability and parsimony

2)    In large networks, minimize details

3)    Minimize edge and node overlaps

4)    Be aware that node placement is not always useful

5)    Be aware that edge lengths are not always useful

In our discussions, we looked at both one-mode and two-mode networks, noting that customization applied to two-mode networks can be carried over into one-mode projections (Figure 1).

In lab, we worked primarily with igraph, but we also tried our hand at customization in statnet. Using a variety of data sources (some supplied by our fearless leaders and some from our own research experiences), we held an ugly and pretty network competition. Interestingly enough, the majority of the class entered the ugly competition but backed away from the pretty competition. What does this mean!? What does it say about our class or networking in general? Are we afraid that our ‘pretty’ networks will be ugly to others? Or is it that we just enjoy making ugly networks more (plus it’s easier)?! That’s a discussion for another time. For now, take a look at some of our ugly networks. Who do you think wins the prize?

This week’s R pro-tips:

Working with visualization and need some color ideas? Check out these sites:

            http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

            http://research.stowers.org/mcm/efg/R/Color/Chart/ColorChart.pdf

            https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf

 

5. Graph-Level Indices

May 3, 2017 

This week’s seminar discussed using graph-level indices as a means for comparing networks either over time or against each other. There are different metrics on which to base comparisons of networks. The three we covered in seminar are:

1)    Size

2)    Components

3)    Density

The easiest metric is size, which requires knowing the number of nodes in the network. This can be easily obtained through summary statistics in R. The ‘components’ metric considers either a count of the number of components in the network or the size of the components in the network. Finally, the ‘density’ metric considers the connectedness of the network. It calculates how dense or sparse the network is by comparing the actual connections present with the total number of connections possible. For an undirected network, for example, the total number of possible connections is n(n - 1) / 2, where n is the number of nodes. By using density, you can better compare networks of different sizes. Figure 1 shows two networks, each with the same density.
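As a quick worked example of the density calculation, here is a small sketch in plain Python (not the R functions from lab). The node and tie counts are invented, and the directed case, which uses n(n - 1) possible ties, is my own addition for completeness.

```python
def density(n_nodes, n_ties, directed=False):
    """Observed ties divided by the number of possible ties."""
    if n_nodes < 2:
        raise ValueError("density needs at least two nodes")
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2   # each undirected pair is counted once
    return n_ties / possible

small = density(n_nodes=4, n_ties=3)    # 3 of 6 possible ties
large = density(n_nodes=10, n_ties=9)   # 9 of 45 possible ties
print(small, large)
```

Because the result is a proportion, the two made-up networks can be compared directly even though one has more than twice as many nodes as the other.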

We also talked about centralization of networks. Degree centralization considers how central the most central node is compared to all other nodes in the network. Using degree centralization, we can calculate a single statistic for the whole network. Before heading into lab to practice doing some of these calculations in R, we had another guest lecturer!
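One common way to turn degree centralization into a number is Freeman's formula, sketched below in plain Python. Whether the R functions we used in lab apply exactly this normalization is an assumption on my part; the star and ring networks are hypothetical examples.

```python
def degree_centralization(edges, nodes):
    """Freeman degree centralization for an undirected network without self-ties.

    Sums the gap between the highest degree and every node's degree, then
    divides by the largest possible sum, (n - 1)(n - 2), which a star attains.
    Assumes at least three nodes.
    """
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(nodes)
    top = max(degree.values())
    return sum(top - d for d in degree.values()) / ((n - 1) * (n - 2))

nodes = ["A", "B", "C", "D", "E"]
star = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E")]
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]
star_score = degree_centralization(star, nodes)
ring_score = degree_centralization(ring, nodes)
print(star_score, ring_score)
```

The star, where one node holds every tie, scores as fully centralized, while the ring, where every node has the same degree, scores as not centralized at all.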

Research spotlight

Today’s guest was Dr. Cuihua (Cindy) Shen, an associate professor in the Department of Communication at UC Davis. Cindy talked about the use of social network analysis in online worlds. Specifically, she and her colleagues are interested in studying context collapse and its effects on people’s self-presentation strategies. To do this work, they partnered with a Facebook app, myPersonality. This app, created by Stanford students, was set up so that users agreed to donate their Facebook data in exchange for getting a ‘personality reading’. The research is informed by Communication Accommodation Theory (CAT), which postulates that people adjust their language when they interact with different networks or groups of people. Given that people have several networks on Facebook, Dr. Shen and her team delve into how people ‘speak’ on Facebook by looking at the language of their status updates. Want to dive further into the Facebook world and language research?! Contact Dr. Shen at cuishen@ucdavis.edu. For more information on the myPersonality app, see their website.

 

6. Node-Level Indices

May 10, 2017

This week, we dove into node-level indices. Specifically, we discussed five node-level indices:

1)     Degree

2)     K-core

3)     Distance

4)     Betweenness

5)     Neighborhood

First, the ‘degree’ considers where the action is in the network, who is the most central actor, and who is popular versus not popular. The degree score for undirected networks is the count of ties from each node. As such, every node in a network has a degree score. Histograms are an easy way to visualize the different degree scores found in a network (Figure 1).
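Here is a minimal sketch of counting degree scores and tallying them into a text histogram, in plain Python rather than R. The friendship ties are hypothetical, and note that a node with no ties at all would not appear in an edgelist, so isolates would need to be added separately.

```python
from collections import Counter

# Hypothetical undirected friendship ties.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

def degree_scores(edges):
    """Count the ties attached to each node in an undirected edgelist."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

degree = degree_scores(edges)
histogram = Counter(degree.values())   # degree score -> number of nodes with that score
for score in sorted(histogram):
    print(score, "#" * histogram[score])
```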

In comparison, the k-core is a bit more complicated. It focuses on dense pockets of cohesion: groups in the network in which every node has at least k ties to other members of the group. For example, in Figure 2, the green cluster has a k-core of 3 because the minimum degree score in this cluster is 3. Node degree scores and k-cores relate in that nodes with high degrees can have high k-cores (but they don’t have to). Nodes that have low degree scores will never have high k-cores.
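One way to find each node's k-core is to repeatedly peel away the node with the fewest remaining ties. The sketch below does this in plain Python; it is a simple illustration rather than the routine any R package uses, and the network (a tight group of four plus a two-node tail) is hypothetical.

```python
def core_numbers(adjacency):
    """Return each node's core number by repeatedly peeling off the lowest-degree node."""
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])      # core numbers never decrease while peeling
        core[node] = k
        remaining.remove(node)
        for neighbor in adjacency[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core

# Hypothetical network: A, B, C, D are all tied to each other; E hangs off A; F hangs off E.
adjacency = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A", "F"},
    "F": {"E"},
}
cores = core_numbers(adjacency)
print(cores)
```

Node E has a degree of 2 but only a k-core of 1, and node A has the highest degree (4) yet sits in the same 3-core as B, C, and D, which matches the point that a high degree does not guarantee a higher k-core.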

The third property is distance. This refers to how far apart individuals are in the network or how far something must travel to get somewhere else. Every node has a geodesic distance to every other node in its component (not to itself). Typically, when analyzing networks, we look for the distance that is most efficient. In other words, we look for the shortest path possible between two nodes. There can be multiple shortest paths between the same two nodes. However, geodesic distances are only defined within a single component. You cannot measure a geodesic distance across network components.
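Geodesic distances can be found with a breadth-first search, which is also the idea behind the Bacon Number game from week 1. This is a minimal plain-Python sketch with a hypothetical two-component network; for convenience the starting node is recorded with a distance of 0.

```python
from collections import deque

def geodesic_distances(adjacency, source):
    """Breadth-first search: shortest path length from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

# Hypothetical network with two components: A-B-C-D and E-F.
adjacency = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"],
    "E": ["F"], "F": ["E"],
}
from_a = geodesic_distances(adjacency, "A")
print(from_a)
```

Nodes E and F never appear in the result for A, because they sit in a different component and no path reaches them.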

Next, we covered betweenness. Betweenness considers who is on the path between two nodes. It is calculated from the number of geodesics (shortest paths) between other nodes that a node lies on, and it can help highlight which nodes in the network act as ‘brokers’. Brokers are important because they bridge major pieces of the network and span ‘structural holes’ (Figure 3).
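To show how betweenness picks out a broker, here is a sketch of one standard approach, Brandes' algorithm, for unweighted undirected networks. It is written in plain Python as an illustration, not as a description of what the R packages do internally, and the bow-tie network is hypothetical.

```python
from collections import deque

def betweenness(adjacency):
    """Brandes' algorithm for unweighted, undirected networks (unnormalized scores)."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that also counts the shortest paths from source.
        order = []
        predecessors = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    n_paths[neighbor] += n_paths[node]
                    predecessors[neighbor].append(node)
        # Walk back from the farthest nodes, passing credit to predecessors.
        credit = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                credit[pred] += n_paths[pred] / n_paths[node] * (1 + credit[node])
            if node != source:
                score[node] += credit[node]
    # Each undirected pair was counted once from each end.
    return {node: value / 2 for node, value in score.items()}

# Hypothetical bow-tie: C is the only link between the pair A, B and the pair D, E.
adjacency = {
    "A": ["B", "C"], "B": ["A", "C"],
    "C": ["A", "B", "D", "E"],
    "D": ["C", "E"], "E": ["C", "D"],
}
scores = betweenness(adjacency)
print(scores)
```

Node C lies on the only shortest path for each of the four cross-group pairs (A-D, A-E, B-D, B-E), so it scores 4 while everyone else scores 0.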

Finally, we considered neighborhoods. Neighborhoods consider connections to nodes through an increasing number of steps. For instance, who is connected to a node in one step, who is connected to a node in two steps, etc.
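A neighborhood can be built by stepping outward from a node one tie at a time. The plain-Python sketch below uses a hypothetical network and a made-up helper name, and records which nodes are first reached at each step.

```python
def neighborhoods(adjacency, node, max_steps):
    """Return {step: set of nodes first reached in exactly that many steps}."""
    reached = {node}
    frontier = {node}
    result = {}
    for step in range(1, max_steps + 1):
        # Everyone tied to the current frontier, minus anyone already reached.
        frontier = {nbr for current in frontier for nbr in adjacency[current]} - reached
        if not frontier:
            break
        result[step] = frontier
        reached |= frontier
    return result

# Hypothetical undirected network.
adjacency = {
    "A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"},
    "D": {"B", "E"}, "E": {"D"},
}
around_a = neighborhoods(adjacency, "A", max_steps=2)
print(around_a)
```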

Our research spotlight today was our very own Dr. Chris Smith, one of two fearless leaders of the SNA Pro-Seminar this quarter!

Research Spotlight

Dr. Chris Smith is an Assistant Professor in the Department of Sociology at UC Davis. She is also one of the profs leading the Social Network Analysis Pro-Seminar covered in this blog. Dr. Smith’s research focuses on crime, criminal relationships, and criminal organizations. In class, she presented work on how the structure of relationships contributes to power consolidation during exogenous events (such as Prohibition). Within this work, she highlighted different brokering and distance measures related to gender. She found a higher gender gap ratio in the pre-Prohibition period than during Prohibition. Learn more about the fascinating world of organized crime, gender, and violence by contacting Dr. Smith at chmsmith@ucdavis.edu

 

8. Exponential Random Graph Models with Carter Butts

May 23, 2017

This week’s pro-seminar featured a guest lecturer: the man, the myth, the legend… the original designer of the Statnet package: Dr. Carter Butts (buttsc@uci.edu)! Dr. Butts is a professor in the Departments of Sociology, Statistics, and Electrical Engineering and Computer Science, and the Institute for Mathematical Behavioral Sciences at the University of California, Irvine. He received his B.S. from Duke University and his M.S. and Ph.D. from Carnegie Mellon University. Dr. Butts serves as an area editor for Computational and Mathematical Organization Theory, on the editorial board for the Journal of Mathematical Sociology, and on the board of reviewing editors for Science. He has a long list of publications as well as grants and fellowship awards from organizations such as the National Science Foundation, the National Institutes of Health, and the National Oceanic and Atmospheric Administration. Needless to say, Dr. Butts is quite the accomplished researcher. Check out more details on Dr. Butts in his mildly outdated (his words, not mine!) CV or on his Research Webpage.

Dr. Butts joined us to talk about Exponential Random Graph Models, more affectionately known as “ERGMs”. We started off with an introduction to why ERGMs are important. A lot of Social Network Analysis (SNA) work tries to understand and describe the structure of networks, including asking questions like:

  • What’s important about the network? How do we quantify it?
  • How do we test hypotheses?
  • How do different structures form?
  • What causes structures to change?

To answer these and other questions, we need models, because they substantiate theory (in other words, models embody assumptions about the world). There are a variety of models that can be used to try to understand networks, including conditional uniform graphs, Bernoulli graphs, growth models, degree distribution models, etc. Dr. Butts pointed out that a common misconception in this realm is that ERGMs are a type of model. In reality, ERGMs are a representation for models. They are a way of writing down and working with new and existing models.

In framing our discussion of ERGMs, Dr. Butts ran the class through an exercise of talking through how we would conceptually model a network. We talked through what factors would be important in tie formation and how we may predict ties. This led to a technical (for most of us anyway!) run-through of logistic regression and ultimately, how to move beyond logistic regression. The latter step is necessary for a number of reasons including: i) logistic regression can’t model conditional dependence among the edges and ii) logistic regression can’t handle exotic support constraints (like transitive networks where i entails j and j entails k, so i entails k). We then went through some background mathematics relating to exponential families for random graphs with covariates. I will spare you the details of that discussion and instead point you to the statnet wiki page for all things related to these discussions including: tutorials, papers, R packages, presentations, etc.
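For a small taste of the exponential-family idea, the sketch below uses the standard ERGM form, in which the probability of a network y is proportional to exp(theta * g(y)), with the count of edges as the only statistic g(y). With just that one term, ties are independent and the model collapses to a Bernoulli graph, so the chance of any one tie is the logistic function of theta, exactly as in logistic regression. The brute-force enumeration, the four-node size, and the parameter value are illustrative choices of mine in plain Python, not material from the statnet tutorial.

```python
from itertools import combinations, product
from math import exp

def edge_probability(n_nodes, theta):
    """Probability that one particular tie is present when P(y) is proportional to exp(theta * edges(y))."""
    dyads = list(combinations(range(n_nodes), 2))
    total = 0.0      # normalizing constant: sum over every possible undirected graph
    with_tie = 0.0   # the same sum, restricted to graphs that contain the first dyad
    for graph in product([0, 1], repeat=len(dyads)):
        weight = exp(theta * sum(graph))   # the only statistic is the edge count
        total += weight
        if graph[0] == 1:
            with_tie += weight
    return with_tie / total

theta = -1.0   # illustrative parameter value
p = edge_probability(n_nodes=4, theta=theta)
logistic = 1 / (1 + exp(-theta))
print(p, logistic)
```

The two printed numbers agree, which is the sense in which an edges-only ERGM is just logistic regression. Once terms that make ties depend on one another are added, the normalizing constant no longer factors this way, and enumerating every graph becomes impossible for all but tiny networks.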

After Dr. Butts enlightened us on the basics of ERGMs and their usefulness, we moved on to applied work using the ERGM tutorial. As a quick reminder, ERGMs are useful for many things, including:

  • Obtaining maximum-likelihood estimates for parameters of a model
  • Testing individual models for goodness-of-fit
  • Performing model comparisons
  • Simulating additional networks with probability distributions implied by a model

We thank Dr. Carter Butts for imparting his wisdom on our seminar class!

 

9. Visualization with Tarik Crnovrsanin

May 23, 2017

This week’s proseminar focused on advanced visualization and was taught by “Soon-to-be Dr.” Tarik Crnovrsanin. Tarik is finishing his PhD in computer science at the University of California, Davis. Tarik’s work includes networks, movement, radio sensors, and blinking. He has a list of publications on visualizations ranging from layout methods for online dynamic graphs to egocentric storylines. To view Tarik’s list of publications, visit his website. For more information, you can email Tarik at: tecrnovr@ucdavis.edu.

One of the foundations of visualization is understanding that it is an umbrella field. It requires knowledge from other fields of study. Having knowledge about the underlying data and relationships being depicted informs the visualization presented. Tarik began the class by showing a high school friendship network to illustrate the importance of visualization. He noted that the goal of visualization is to lead viewers/readers/users down a path that hopefully comes to the same conclusion as you. In other words, without telling them what the conclusion is, they should be able to get to it through your visualization (and ideally, only the conclusion(s) that you want them to make). If viewers/readers/users can come to the conclusion on their own, it is more believable.

Building off these ideas, Tarik then led the class through several visualizations. For each visualization, we had to work through four components:

  • Identifying visual encoding (i.e., the nodes, edges, colors, sizes, shapes, etc.)
  • Identifying the purpose of the visualization
  • Identifying any problems with the visualization
  • Identifying how to improve the visualization

This helped us get a feel for how useful different visualizations are, as well as gain a deeper understanding of leading viewers towards a conclusion or conclusions. For example, see Figure 1. This network shows circular nodes that are colored based on student grade level. Edges are colored based on the sender node. The purpose is to show relationships between students. It clearly shows the distinct clusters of students by grade, but also shows that students in grades 11 and 12 tend to be more intermingled than other grades, especially 8th.

As for problems with the network and how to improve it, you could relabel the ‘-1’ value so that it’s clearer what it represents (unknown grade). You could also vary the shape of the nodes by gender (male/female) to get more information out of the image. As a researcher, the question(s) you are trying to answer will greatly impact the worth of the visualization. You want to provide enough information but not too much.

After going through several network examples, Tarik summarized the use of visualization quite nicely with these tips:

  1. Good visualization comes with experience; there is no magic bullet
  2. Look to what others are doing to get ideas and inspiration
  3. When working on visualization, ask yourself these four questions:
    1. What am I trying to show?
    2. What conclusions do I want the user to come up with?
    3. What other conclusions can be reached?
    4. How can I improve the visualization?

We then moved into a discussion of layouts. The purpose of layouts is to take something that is hard to read and improve upon it in a way that makes it easier to read. Layouts can very easily drive different conclusions, so caution is necessary when working through layout improvements. Tarik identified a number of layout strategies, focusing specifically on four types (as outlined below).

  • Force-directed: pros: provides good-quality results for 50-500 nodes and is flexible, intuitive, and interactive; cons: strongly influenced by the initial layout and has a high running time
  • Multi-level: the user selects super nodes and repeatedly collapses their neighbors and old edges until reaching an ideal number of nodes or a certain level; pros: better at larger scales and faster
  • Dimension reduction based: involves picking a node, calculating its distance to all others, repeating this for further nodes, and then running PCA; pros: works well on meshes and is faster
  • Clustering attribute based: pros: most helpful with hairball networks and fast; cons: produces similar-looking graphs and requires an ordering, like tree-based or space-filling curves

 

When deciding which layout to use, Tarik suggested that igraph (in R) is a good place to start. From this discussion, Tarik touched on other strategies, including some discussed in previous blog posts: centrality (blog post #5), degree (blog post #6), closeness (similar to neighborhood in blog post #6), betweenness (blog post #6), and PageRank. PageRank counts both the number and the quality of the edges pointing to a node, on the underlying assumption that more edges mean more importance. This means that a high PageRank can occur on otherwise less important nodes that are connected to more important nodes.
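To see how a node with few ties can still earn a high PageRank, here is a minimal power-iteration sketch in plain Python. The damping value of 0.85 is the conventional default rather than something from Tarik's talk, the five-node network is hypothetical, and handling a node with no outgoing ties by spreading its score evenly is one common convention among several.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power iteration on a directed adjacency dict {node: [nodes it points to]}."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            # A node with no outgoing ties spreads its rank evenly over everyone.
            targets = adjacency[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical network: B, C, and D all point to A, and A points only to E.
adjacency = {"A": ["E"], "B": ["A"], "C": ["A"], "D": ["A"], "E": []}
rank = pagerank(adjacency)
for node in sorted(rank, key=rank.get, reverse=True):
    print(node, round(rank[node], 3))
```

Node E receives only one tie, but because that tie comes from the well-connected node A, E ends up ranked far above B, C, and D.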

Before moving into the lab portion, where we attempted to make use of the visualization methods discussed, Tarik imparted some final words of wisdom on when to use degree, closeness, betweenness, and PageRank:

  • If looking at immediate impact, use degree
  • If looking at dissemination of information, use closeness
  • If looking at critical structure to flow, use betweenness
  • If looking at weighting or missing data, use PageRank

See you next week for our last (:() seminar and thanks for following along!

References

Crnovrsanin, T., Muelder, C.W., Faris, R., Felmlee, D., and Ma, K.-L. 2014. Visualization techniques for categorical analysis of social networks with multiple edge sets. Social Networks. 37: 56-64.