GraphXD: Network analysis for data science applications across STEAM

February 26, 2021

Adam G. Anderson

Graphs are models of data, used to analyze and visualize different features (variables) across a dataset. They have proven useful for providing at-a-glance summaries (e.g. ‘signals’) of big and complex datasets. Typically, a Euclidean graph (with X and Y axes, each representing one dimension) can show only two dimensions of a given dataset. To view multiple dimensions in the same graph, one needs non-Euclidean methods, and this is why networks are so useful.

Network analysis allows each datapoint (i.e. ‘node’) to carry an arbitrary number of features (or variables) for a given dataset. These features can be of different data types (e.g. strings, integers, booleans, and decimals such as floats and doubles), and can include specialized data like time intervals and geo-coordinates. More importantly, the graph database structure of a network allows each node to be connected to any other node based on a measurable relation between them. These relations are called ‘edges’ or ‘links’. A sequence of edges that visits every node exactly once is known in graph theory as a ‘Hamiltonian path’.
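To make the node-and-edge structure concrete, here is a minimal sketch in plain Python. The people, feature names, and relation labels are all hypothetical and chosen only for illustration; a real project would more likely rely on a graph library or graph database.

```python
from datetime import date

# Each node ID maps to a dictionary of features with mixed data types
# (all values are made up for illustration).
nodes = {
    "alice": {"age": 34, "height_m": 1.68, "student": False,
              "joined": date(2020, 1, 15), "location": (37.87, -122.27)},
    "bob": {"age": 29, "height_m": 1.80, "student": True,
            "joined": date(2021, 2, 26), "location": (40.71, -74.01)},
    "carol": {"age": 41, "height_m": 1.75, "student": False,
              "joined": date(2019, 6, 1), "location": (51.51, -0.13)},
}

# Each edge links two nodes and can carry its own features.
edges = [
    ("alice", "bob", {"relation": "coauthor", "weight": 3}),
    ("bob", "carol", {"relation": "friend", "weight": 1}),
]


def neighbors(node_id):
    """Return the IDs of all nodes that share an edge with node_id."""
    linked = set()
    for source, target, _attrs in edges:
        if source == node_id:
            linked.add(target)
        elif target == node_id:
            linked.add(source)
    return linked


print(sorted(neighbors("bob")))
```

Because features live in dictionaries, a node can gain a new variable at any time without changing the rest of the structure.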

We’ve all heard of Stanley Milgram’s famous study on the ‘six degrees of separation’. This work found that each of us is connected to everyone else to some ‘degree’ or another. In network terms, degree is a statistical measurement that corresponds to the number of edges extending from a node. Intuitively, each of us is familiar with our immediate first degree, i.e. those whom we know personally by name, but we may not know all of their friends, and we certainly don’t know all the friends of their friends. The number of people reached therefore grows roughly exponentially with each order of degree; this is what makes networks so powerful, but also so difficult to understand outside of a formal model.
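The growth from one order of degree to the next can be sketched with a breadth-first search over a small, entirely hypothetical friendship network. The names and the helper functions `degree` and `people_per_step` are assumptions made for this example; the network is deliberately idealized so that every person introduces two new people.

```python
from collections import deque

# Hypothetical undirected friendship network, stored as an adjacency list.
friends = {
    "you": ["ana", "ben"],
    "ana": ["you", "cy", "dee"],
    "ben": ["you", "eli", "fay"],
    "cy": ["ana", "gus", "hal"],
    "dee": ["ana", "ida", "jo"],
    "eli": ["ben", "kai", "lee"],
    "fay": ["ben", "mo", "ned"],
    "gus": ["cy"], "hal": ["cy"], "ida": ["dee"], "jo": ["dee"],
    "kai": ["eli"], "lee": ["eli"], "mo": ["fay"], "ned": ["fay"],
}


def degree(graph, node):
    """Degree of a node: the number of edges extending from it."""
    return len(graph[node])


def people_per_step(graph, start):
    """Breadth-first search counting how many nodes sit at each distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    counts = {}
    for node, steps_away in distance.items():
        if steps_away > 0:
            counts[steps_away] = counts.get(steps_away, 0) + 1
    return counts


steps = people_per_step(friends, "you")
print(degree(friends, "you"))  # number of direct friends
print(steps)                   # people reached at each order of degree
```

In this toy network the count doubles at every step (2, then 4, then 8). Real social networks overlap far more, so the growth is less tidy.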

Social networks are not the only things networks can model (see this gallery of different network models). We can also use networks to model relations between objects as ‘nodes’ connected by ‘edges’ or ‘links’. More formally, these networks are called graphs, and they form the basis of graph databases. One example of such a graph is the GraphXD logo: the circles are the nodes of that graph, and the lines are the edges. A good real-world example of this type of graph is the Linked Open Data initiative, which is working toward building an ontology for linking important datasets with persistent IDs.

Scientists, researchers, and theorists studying graphs in a variety of fields use different software (e.g., NetworkX, igraph, or Gephi) to present their results. NetworkX has a large library of algorithms for computing centrality measures, including eigenvector-based ones, which can be used to identify prominent nodes such as the ‘leaders’, ‘hubs’, and ‘authorities’ evident in a cluster of nodes, as well as the bridge-nodes that span multiple clusters in a network (see Kleinberg 1997; Freeman 1977).
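As a rough illustration of how such measures separate well-connected nodes from bridge-nodes, the sketch below uses NetworkX's `degree_centrality` and `betweenness_centrality` on a made-up graph of two triangles joined by a single node. The node names are hypothetical, and these two measures are only one possible choice among the centrality algorithms NetworkX provides.

```python
import networkx as nx

# Hypothetical graph: two tightly knit clusters joined by one bridge-node.
G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),   # cluster A (a triangle)
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),   # cluster B (a triangle)
    ("a1", "bridge"), ("bridge", "b1"),         # the only link between clusters
])

# Degree centrality: the fraction of other nodes each node is connected to.
degree_scores = nx.degree_centrality(G)

# Betweenness centrality: how often a node lies on shortest paths
# between pairs of other nodes.
betweenness_scores = nx.betweenness_centrality(G)

# Nodes tied for the highest degree centrality.
top_degree = max(degree_scores.values())
best_connected = sorted(n for n, s in degree_scores.items() if s == top_degree)

# Node with the highest betweenness centrality.
top_bridge = max(betweenness_scores, key=betweenness_scores.get)

print(best_connected)
print(top_bridge)
```

The two measures single out different nodes here: the best-connected nodes sit inside the clusters, while the node with the highest betweenness is the one that joins them.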

GraphXD (Graphs Across Domains) is a BIDS initiative to bring together a community of scholars and practitioners of network science.

Topics