The Data Fellows Program provides faculty members from NCDS academic partner institutions and other UNC System institutions the opportunity to address novel and innovative data science research issues.
The Data Fellows program seeks to enable research, fund prototype development, and/or facilitate activities that support the NCDS vision of unleashing the power of big data by developing and mastering data science. The Fellows program also aims to use the NCDS organizational structure to build relationships between industry, government, and academia; bridge gaps between research and practice; promote innovative approaches to addressing data science challenges; and engage the next generation of data scientists.
In addition to furthering the NCDS vision, NCDS Data Fellows will be expected to generate measurable deliverables such as new methods, models, applications, or prototypes that can be used to develop larger efforts supported with extramural funding.
2015 Fellows shared their research during the 2016 Data Fellows Showcase.
The 2015 NCDS Data Fellows and their projects are:
David Gotz, PhD, associate professor, School of Information and Library Science, UNC Chapel Hill, and assistant director of the Carolina Health Informatics Program. Visual Analytics for Large-scale Temporal Event Data.
Large-scale temporal event data sets can contain vast numbers of long and complex sequences of time-stamped events and are found in a wide range of application domains including social networking activity, security logs, and electronic health records. This project will develop novel visual analytics methods to support exploratory analysis of temporal event data sets, motivated by population health researchers exploring large collections of electronic medical record (EMR) data. More effective methods for deriving insights from temporal event data such as medical diagnoses, procedures performed, lab tests, and medications prescribed can provide evidence to support more personalized medical decision making and better health outcomes for patients. Such methods can also be used in comparative effectiveness studies, epidemiological studies, and patient-centered outcomes research. However, current methods for exploring temporal event data and selecting subgroups for analysis are complicated and time consuming. Gotz plans to develop software for comprehensive visual analytics of these data in a way that is simpler, more intuitive, and much less time consuming for practitioners.
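To make the subgroup-selection task concrete, here is a minimal sketch (not Gotz's software) of how a cohort might be selected from time-stamped event sequences; the patient records and the `matches_sequence` helper are hypothetical illustrations:

```python
from datetime import date

# Each patient record is a list of (timestamp, event_type, code) tuples.
records = {
    "p1": [(date(2015, 1, 5), "diagnosis", "E11.9"),
           (date(2015, 2, 1), "medication", "metformin")],
    "p2": [(date(2015, 3, 2), "medication", "metformin"),
           (date(2015, 4, 9), "diagnosis", "E11.9")],
}

def matches_sequence(events, pattern):
    """True if the event types in `pattern` occur in chronological order
    (not necessarily consecutively) within a patient's event sequence."""
    it = iter(sorted(events))  # one shared iterator enforces ordering
    return all(any(etype == want for _, etype, _ in it) for want in pattern)

# Select the subgroup diagnosed *before* being prescribed medication.
cohort = [p for p, evs in records.items()
          if matches_sequence(evs, ["diagnosis", "medication"])]
```

Here only patient `p1` qualifies, because `p2` received the medication before the diagnosis was recorded.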
Erik Saule, PhD, assistant professor, department of computer science, UNC Charlotte. Toward Machine Oblivious Graph Analysis.
Graphs are a popular tool used to model a wide range of phenomena and to show the relationships among various entities. For example, graphs can be used to model the physical path of city streets or aisles in a store in order to analyze traffic patterns and determine the best locations for businesses or for products within a retail store. In medicine, researchers use graphs to model regulatory pathways and gene expression, predict conditions, and identify the best drugs to use in treatments. Unfortunately, the explosion of digital data has led to a similar explosion in the computational costs of running graph analyses. New algorithms to deal with this challenge are usually inflexible, requiring the researcher to use a specific graph representation or a particular type of computer system for analysis. This project aims to develop a framework for performing efficient graph analysis regardless of the type of analysis being performed or the computer system used.
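To illustrate the idea of representation-independent graph analysis (a toy sketch, not Saule's framework), an algorithm can be written once against an abstract neighbor function and then run unchanged over different underlying layouts, such as adjacency lists and a CSR (compressed sparse row) array:

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Breadth-first search written against an abstract neighbors(u)
    callable, so it runs unchanged over any graph representation."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Representation 1: adjacency lists.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}

# Representation 2: the same graph in CSR layout, where the neighbors
# of u live at targets[offsets[u]:offsets[u + 1]].
offsets = [0, 2, 4, 5, 6]
targets = [1, 2, 0, 3, 0, 1]

dist_a = bfs_distances(lambda u: adj[u], 0)
dist_b = bfs_distances(lambda u: targets[offsets[u]:offsets[u + 1]], 0)
assert dist_a == dist_b == {0: 0, 1: 1, 2: 1, 3: 2}
```

The algorithm never sees which layout is in use; only the small neighbor adapter changes per representation.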
Erjia Yan, PhD, assistant professor, College of Computing and Informatics, Drexel University. Assessing the Impact of Data and Software on Science Using Hybrid Metrics.
In the age of data, the critical components of scientific and industrial research increasingly are data and software. These products can have significant impacts on future scientific discoveries and business innovation. Yet, they can be difficult to discover and assess because new knowledge is still catalogued in the form of published research papers. This project will address the problem of discovering and assessing the impact of data sets and software by identifying referencing patterns and designing hybrid metrics to assess the full impact of data and software. Unlike current data repository indexing, the project aims to provide context-driven, full text data analytics for data and software in order to account for the unsystematic ways in which these products are cited in scientific literature, including hyperlinks to web pages, footnotes, endnotes, and digital object identifiers. Ultimately, the project seeks to develop a system that will comprehensively capture the impact of data and software on knowledge production and discovery.
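As a toy illustration of the kind of full-text scanning this involves (not Yan's actual metrics), informal data and software mentions such as DOIs and hyperlinks can be counted with simple patterns; the regexes and the `mention_counts` helper are hypothetical simplifications:

```python
import re

# Hypothetical patterns for two of the unsystematic citation forms noted
# above: digital object identifiers and plain hyperlinks.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
URL_RE = re.compile(r'https?://[^\s"<>]+')

def mention_counts(full_text):
    """Count candidate dataset/software references in an article's text."""
    return {
        "doi": len(DOI_RE.findall(full_text)),
        "url": len(URL_RE.findall(full_text)),
    }

text = ("Data are available at https://example.org/dataset "
        "(doi:10.5061/dryad.abc123); analysis used the FooBar toolkit.")
counts = mention_counts(text)
```

A real system would also have to resolve footnotes, endnotes, and free-text tool names, which is precisely what makes the problem harder than conventional citation indexing.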
This is the second year of the NCDS Data Fellows Program. NCDS membership dues and supplemental funding from UNC General Administration support the program.
The NCDS also extends its thanks to the members who served on the 2015 Data Fellows selection committee:
- Larry Alexander, Drexel University
- Tom Carsey, Odum Institute, UNC Chapel Hill
- Matthew Drahzal, IBM
- Steve Gustafson, GE
- Russ Gyurek, Cisco
- Craig Hill, RTI International
- John Moore, MCNC
2014 Data Fellows
Rajeev Agrawal, PhD, assistant professor, department of electronics, computer and information technology, North Carolina A&T State University. Designing Sustainable and Domain Neutral Next Generation Data Infrastructure to Advance Big Data Science.
This project will develop the design specifications for a sustainable data infrastructure for data-intensive research problems that is usable by scientists in all research communities. Data-intensive problems, which range from understanding global environmental issues to reverse engineering the brain to sequencing genomes to understand disease, require a technical infrastructure that works across computer platforms and scientific domains, allows collaboration among researchers at different locations, and can manage, analyze, and store huge data sets. The resulting infrastructure could also be a tool for data science education and workforce development.
Jane Greenberg, PhD, professor, School of Information and Library Science, UNC-Chapel Hill, and Director, Metadata Research Center. The Metadata Capital Initiative.
Metadata, or data about data, is crucial if data is to be reused, shared, or repurposed over time. This project will expand on Greenberg’s ongoing work to understand “metadata capital,” or the value of metadata, measured as net gain or loss, and how that value changes over time. The work will use case studies, collaborative workflow modeling, and content analysis to study metadata capital scientifically. Data environments from the National Institute of Environmental Health Sciences, SAS, and RTI, all NCDS member institutions, will be investigated, and data from other NCDS member institutions will be considered.
Blair Sullivan, PhD, assistant professor, department of computer science, North Carolina State University. Tracking Community Evolution in Dynamic Graph Data Using Tree-like Structure.
As the amount of available research data has exploded, methods for managing, analyzing and visualizing that information have not kept up, especially in the case of graph or relational data sets. This work will focus on a key task in improving analysis of graph data: the identification and tracking of overlapping groups of similar entities (e.g. people, samples, genes) over time. Tree-like structures of connections exist in these types of data sets. The research will develop new methods for forming a hierarchy of overlapping groups from a combination of the k-core and tree decompositions of a network, and explore its evolution in time-dependent graph data. The goal is to develop new algorithms that will improve data analysis and workflow in fields as diverse as network analysis, healthcare policy, materials science, climate simulation, fluid dynamics, bioinformatics, and cyber security.
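For readers unfamiliar with the k-core, a minimal sketch follows (an illustration of the standard decomposition, not Sullivan's algorithm): the k-core of a graph is the maximal subgraph in which every node has at least k neighbors, and it can be computed by repeatedly peeling off nodes whose degree falls below k.

```python
from collections import deque

def k_core(adj, k):
    """Return the k-core of an undirected graph given as a dict mapping
    each node to a set of neighbors."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    queue = deque(u for u, vs in adj.items() if len(vs) < k)
    while queue:
        u = queue.popleft()
        if u not in adj:
            continue  # already removed
        for v in adj.pop(u):  # detach u from its remaining neighbors
            nbrs = adj.get(v)
            if nbrs is not None:
                nbrs.discard(u)
                if len(nbrs) < k:
                    queue.append(v)  # v's degree just dropped below k
    return adj

# Example: a triangle (a, b, c) plus a pendant node d attached to a.
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
core = k_core(g, 2)  # the 2-core drops d but keeps the triangle
```

Nesting the cores for increasing k yields the hierarchy of increasingly dense groups that the project combines with tree decompositions.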
Wlodek Zadrozny, PhD, associate professor, College of Computing and Informatics, UNC Charlotte. Searchable Repository of Resilience and Sustainability Technologies.
This project aims to build a searchable data repository of technologies related to resilience and sustainability (R&S) using advanced information retrieval and text processing methods. Initial data will come from a set of U.S. patents and patent applications that contain thousands of solutions to R&S problems. The project will use an approach to semantic analysis and data preparation partly inspired by the IBM Watson project: a task-based document format, semantic search, and multidimensional scoring of search results.
Justin Zahn, department of computer science, North Carolina A&T State University. COMDET: A Novel Community Detection System for Large Networks.
This work seeks to develop a game-theoretic model for community evolution of large networks, including social and biological networks. It will study the structure and dynamics of network communities, with the goal of inventing novel methods for detecting network communities and building predictive models of the behavior of groups of people by using massive data sets, data mining and machine learning. A better understanding of network communities can impact public policy and health strategies, product development and advertising, or, in biological networks, shed light on the functions of cells, proteins and genes.
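As a small illustration of community detection in general (not the game-theoretic model this project proposes), the classic label propagation heuristic has each node repeatedly adopt the label most common among its neighbors until labels stabilize:

```python
import random

def label_propagation(adj, seed=0, max_iters=100):
    """Detect communities by label propagation. `adj` maps each node
    to a list of its neighbors."""
    rng = random.Random(seed)
    labels = {u: u for u in adj}  # start with a unique label per node
    nodes = list(adj)
    for _ in range(max_iters):
        rng.shuffle(nodes)  # visit nodes in random order each pass
        changed = False
        for u in nodes:
            counts = {}
            for v in adj[u]:
                counts[labels[v]] = counts.get(labels[v], 0) + 1
            if not counts:
                continue  # isolated node keeps its own label
            # Most frequent neighbor label, with a deterministic tiebreak.
            best = max(counts, key=lambda lab: (counts[lab], str(lab)))
            if labels[u] != best:
                labels[u] = best
                changed = True
        if not changed:
            break  # labels have stabilized
    return labels

# Two triangles joined by a single bridge edge (3-4).
g = {
    1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [4, 5],
}
labels = label_propagation(g)
```

Nodes sharing a final label form one community; richer methods of the kind this project pursues replace the simple majority rule with models of how groups actually form and evolve.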