The Data Fellows program seeks to enable research, fund prototype development, and/or facilitate activities that support the NCDS vision of unleashing the power of big data by developing and mastering data science. The Fellows program also aims to use the NCDS organizational structure to build relationships between industry, government, and academia; bridge gaps between research and practice; promote innovative approaches to addressing data science challenges; and engage the next generation of data scientists.
In addition to furthering the NCDS vision, NCDS Data Fellows will be expected to generate measurable deliverables such as new methods, models, applications, or prototypes that can be used to develop larger efforts supported with extramural funding.
This Year’s Data Fellows
Robert Chew, research data scientist, RTI International
Project Title: SMART: Smarter Manual Annotation for Resource-constrained collection of Training data
Project Summary: Over the past decade, breakthroughs in artificial intelligence have achieved human-level performance on tasks as diverse as object recognition, speech recognition, and gaming. Many of these achievements have been due less to recent algorithmic innovation and more to the availability of (1) powerful and increasingly inexpensive computing resources and (2) open labeled datasets. Though computational performance has historically grown exponentially, human productivity in annotating labeled data has not. In the research community and industry, it is often acknowledged that the main bottleneck in machine learning adoption is no longer in engineering algorithms or hardware, but in creating sufficiently large labeled data sets.
To address this concern, active machine learning offers a smarter way to get the most gain from data annotation efforts when labels are expensive to obtain but data collection is cheap. The underlying notion behind active machine learning is that not all observations in a training set are equally informative in helping a machine learning model generalize well to new cases. This project will develop an annotation software prototype that leverages elements of active machine learning, gamification, and UI/UX design to help data scientists and researchers reduce manual coding time and effort, making machine learning classification tasks more affordable and widely accessible.
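As a rough illustration of the idea that some observations are more informative than others, the sketch below ranks unlabeled items by the uncertainty of a model's predictions and sends the most uncertain ones to annotators first. This is one common active learning strategy (uncertainty sampling), not necessarily the approach used in the SMART prototype; the function names and the example scores are hypothetical.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a predicted positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_annotation(probabilities, batch_size):
    """Return the indices of the batch_size most uncertain unlabeled items."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical model scores for five unlabeled documents (assumed values).
scores = [0.98, 0.52, 0.10, 0.45, 0.80]
print(select_for_annotation(scores, 2))  # prints [1, 3]
```

Items scored near 0.5 are the ones the model is least sure about, so labeling them first tends to yield more improvement per annotation than labeling items the model already classifies confidently.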
Biography: Rob Chew, MS, is a Research Data Scientist and Program Manager at RTI International, where he uses his expertise in machine learning, text mining, data visualization, and software development to collaborate with subject matter experts on their complex data problems. Mr. Chew’s research interests broadly lie at the intersection of data science and public health, with a recent focus on computational social science. Currently, Mr. Chew is developing machine learning models to classify user types, communities, and latent attributes on Twitter, using deep learning on satellite images to support survey sampling efforts in developing countries, developing dynamic data visualizations to help policymakers better understand results of a Bayesian meta-regression for program evaluation, and creating a software application to allow police departments to quantify and assess local “near repeat” phenomena. As a program manager, Mr. Chew’s role also extends to mentoring fellow data scientists to support their professional development. Part of the NCDS Fellowship funding will help Mr. Chew mentor Jason Nance, a data scientist in RTI’s Center for Data Science, in the shared development of the SMART application.
Chew holds an MS in Analytics from the Institute for Advanced Analytics at North Carolina State University and a BA in Economics and Environmental Studies from Oberlin College.
Samira Shaikh, assistant professor of cognitive science, department of computer science and department of psychology, University of North Carolina at Charlotte
Project Title: Modeling Persuasion and Group Behavior in Big Data
Project Summary: Online social media platforms provide millions of individuals with a means of expressing their views about a plethora of subjects of import to their lives. This massive communicative effort occurring in the online realm has been shown to impact the offline, real world in measurable ways. Such real-world consequences create a need to understand how real-world behaviors correspond to behaviors by people in online platforms and how these can be understood and detected by automated methods. This project investigates the propagation of ideas in the online world and their effect in persuading groups of individuals to take action in the real world. The core of the proposed work relies on the study of language and exploits reliable research practices in psycholinguistics and sociolinguistics to investigate human behavior in online platforms, specifically persuasive behavior. The project will deliver an integrated model of persuasion and group behavior in online communication associated with rapid and broad information diffusion and influence on digital media.
Biography: Samira Shaikh (PhD) is Assistant Professor of Cognitive Science in the Department of Computer Science and faculty member of the Data Science Initiative at UNC Charlotte. Dr. Shaikh’s research expertise is in Computational Sociolinguistics, Data Science, Natural Language Processing and Artificial Intelligence. Her work focuses on computational modeling of human behavior in big data, with strong theoretical underpinnings from social science – including those from psychology, communication and anthropology. Previously, Dr. Shaikh was a lead research scientist for the Research Foundation of the State University of New York, where she worked on several large-scale research projects funded by the U.S. Department of Defense. Dr. Shaikh received her PhD in Computer Science from the State University of New York at Albany.
Marcello Balduccini, PhD, assistant research professor, College of Computing and Informatics, Drexel University
Project Title: Action-centered Information Retrieval
Information retrieval (IR), which includes everything from searching private databases to consulting Wikipedia, is central to improving healthcare, facilitating scientific discovery, and providing industry with competitive advantages. This project aims to develop action-centered IR, a type of IR that can retrieve information about events (for example, the acquisition of a company) and accurately match it to a query about the state of the world that resulted from that event. Action-centered IR could greatly improve the quality of knowledge gained through data mining, business analytics, and cybersecurity-related queries.
Casey Dietrich, PhD, assistant professor, civil, construction, and environmental engineering, North Carolina State University
Project Title: Mapping and Visualization of Coastal Flood Forecasts for Decision Support
Researchers in North Carolina use the Advanced CIRCulation (ADCIRC) model to provide real-time information about storm surge, water inundation, wind speeds, and wave heights during coastal storms. These models are produced constantly during major storms; however, communicating the information in the simulations to end users, such as emergency managers, is more challenging. This project will use visualization techniques to bring ADCIRC model data to emergency managers so they can quickly identify, analyze, and disseminate information about high-risk areas. By incorporating the model data with other data sources, the researchers hope to enable informed decision-making about evacuations and other disaster management efforts.
Shahriar Nirjon, PhD, assistant professor, department of computer science, UNC-Chapel Hill
Project Title: Privacy Analytics of Homegrown IoT Data
The growing network of interconnected devices known as the Internet of Things (IoT) promises to make humans more productive, efficient, and healthy using sensors that monitor everything from the number of steps we take each day to the temperature of our homes. But a major concern about the IoT is its lack of security measures, which creates anxiety about exposing our most personal data. The problem becomes even more complicated when households deploy multiple smart devices—all capable of being hacked and giving up sensitive data. This project seeks to quantify the information being shared among different devices in a smart home environment and prevent secondary information leakage from IoT devices by enforcing privacy policies on all devices without affecting their use.
The 2015 NCDS Data Fellows and their projects are:
David Gotz, PhD, associate professor, School of Information and Library Science, UNC Chapel Hill, and assistant director of the Carolina Health Informatics Program. Visual Analytics for Large-scale Temporal Event Data
Large-scale temporal event data sets can contain vast numbers of long and complex sequences of time-stamped events and are found in a wide range of application domains including social networking activity, security logs, and electronic health records. This project will develop novel visual analytics methods to support exploratory analysis of temporal event data sets that are motivated by population health researchers exploring large collections of electronic medical record (EMR) data. More effective analysis methods for deriving insights from temporal event data such as medical diagnoses, procedures performed, lab tests, and medications prescribed, can provide evidence to support more personalized medical decision making and better health outcomes for patients. It can also be used in comparative effectiveness studies, epidemiological studies, and patient-centered outcomes research. However, current methods for exploring temporal event data and selecting subgroups for analysis are complicated and time consuming. Gotz plans to develop software for comprehensive visual analytics of these data in a way that is simpler, more intuitive, and much less time consuming for practitioners.
Erik Saule, PhD, assistant professor, department of computer science, UNC Charlotte. Toward Machine Oblivious Graph Analysis.
Graphs are a popular tool used to model a wide range of phenomena and to show the relationships among various entities. For example, graphs can be used to model the physical path of city streets or aisles in a store in order to analyze traffic patterns and determine the best locations for businesses or for products within a retail store. In medicine, researchers use graphs to model regulatory pathways and gene expression, predict conditions, and identify the best drugs to use in treatments. Unfortunately, the explosion of digital data has led to a similar explosion in the computational costs of running graph analyses. New algorithms to deal with this challenge are usually inflexible, requiring the researcher to use a specific graph representation or a particular type of computer system for analysis. This project aims to develop a framework for performing efficient graph analysis regardless of the type of analysis being performed or the computer system used.
Erjia Yan, PhD, assistant professor, College of Computing and Informatics, Drexel University. Assessing the Impact of Data and Software on Science Using Hybrid Metrics.
In the age of data, the critical components of scientific and industrial research increasingly are data and software. These products can have significant impacts on future scientific discoveries and business innovation. Yet, they can be difficult to discover and assess because new knowledge is still catalogued in the form of published research papers. This project will address the problem of discovering and assessing the impact of data sets and software by identifying referencing patterns and designing hybrid metrics to assess the full impact of data and software. Unlike current data repository indexing, the project aims to provide context-driven, full text data analytics for data and software in order to account for the unsystematic ways in which these products are cited in scientific literature, including hyperlinks to web pages, footnotes, endnotes, and digital object identifiers. Ultimately, the project seeks to develop a system that will comprehensively capture the impact of data and software on knowledge production and discovery.
This is the second year of the NCDS Data Fellows Program. NCDS membership dues and supplemental funding from UNC General Administration support the program.
The NCDS also extends its thanks to the members who served on the 2015 Data Fellows selection committee:
Larry Alexander, Drexel University
Tom Carsey, Odum Institute, UNC Chapel Hill
Matthew Drahzal, IBM
Steve Gustafson, GE
Russ Gyurek, Cisco
Craig Hill, RTI International
John Moore, MCNC
2014 Data Fellows
Rajeev Agrawal, PhD, assistant professor, department of electronics, computer and information technology, North Carolina A & T State University. Designing Sustainable and Domain Neutral Next Generation Data Infrastructure to Advance Big Data Science.
This project will develop the design specifications for creating a sustainable data infrastructure for data-intensive research problems that is usable by scientists in all research communities. Data-intensive problems, which range from understanding global environmental issues to reverse engineering the brain to genomic sequencing to understand diseases, require a technical infrastructure that works across computer platforms and scientific domains, allows collaboration among researchers at different locations, and can manage, analyze and store huge data sets. The resulting infrastructure could also be a tool for data science education and workforce development.
Jane Greenberg, PhD, professor, School of Information and Library Science, UNC-Chapel Hill, and Director, Metadata Research Center. The Metadata Capital Initiative.
Metadata, or data about data, is crucial if data is to be reused, shared or repurposed for other uses over time. This project will expand on Greenberg’s ongoing work to understand “metadata capital,” or the value—as measured by net gain or loss—of metadata and how that value changes over time. The work will use case studies, collaborative workflow modeling and content analysis to scientifically study metadata capital. Data environments from the National Institute of Environmental Health Sciences, SAS, and RTI, all NCDS member institutions, will be investigated; and data from NCDS member institutions will be considered.
Blair Sullivan, PhD, assistant professor, department of computer science, North Carolina State University. Tracking Community Evolution in Dynamic Graph Data Using Tree-like Structure.
As the amount of available research data has exploded, methods for managing, analyzing and visualizing that information have not kept up, especially in the case of graph or relational data sets. This work will focus on a key task in improving analysis of graph data: the identification and tracking of overlapping groups of similar entities (e.g. people, samples, genes) over time. Tree-like structures of connections exist in these types of data sets. The research will develop new methods for forming a hierarchy of overlapping groups from a combination of the k-core and tree decompositions of a network, and explore its evolution in time-dependent graph data. The goal is to develop new algorithms that will improve data analysis and workflow in fields as diverse as network analysis, healthcare policy, materials science, climate simulation, fluid dynamics, bioinformatics, and cyber security.
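For readers unfamiliar with the k-core decomposition mentioned above, the following is a minimal sketch of the standard peeling procedure: repeatedly remove the lowest-degree node and record the largest degree seen so far as that node's core number. It illustrates only the k-core half of the idea, on a small made-up graph; it is not the project's algorithm, and the function name and example data are hypothetical.

```python
def core_numbers(adjacency):
    """Return each node's core number by peeling off minimum-degree nodes."""
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Pick a node with the smallest remaining degree.
        node = min(remaining, key=lambda n: degree[n])
        # The core number never decreases as peeling proceeds.
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        # Removing the node lowers the degree of its remaining neighbors.
        for neighbor in adjacency[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core


# Hypothetical graph: a triangle (a, b, c) with one extra node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cores = core_numbers(graph)
print(sorted(cores.items()))  # prints [('a', 2), ('b', 2), ('c', 2), ('d', 1)]
```

Nodes with higher core numbers sit in more densely connected regions, which is why nested k-cores give a natural starting point for a hierarchy of groups.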
Wlodek Zadrozny, PhD, associate professor, College of Computing and Informatics, UNC-Charlotte. Searchable Repository of Resilience and Sustainability Technologies.
This project aims to build a searchable data repository of technologies related to resilience and sustainability (R&S) using advanced information retrieval and text processing methods. Initial data will come from a set of U.S. patents and patent applications that contain thousands of solutions to R&S problems. The project will use an approach to semantic analysis and data preparation partly inspired by the IBM Watson project: a task-based document format, semantic search, and multidimensional scoring of search results.
Justin Zahn, department of computer science, North Carolina A & T State University. COMDET: A Novel Community Detection System for Large Networks.
This work seeks to develop a game-theoretic model for community evolution of large networks, including social and biological networks. It will study the structure and dynamics of network communities, with the goal of inventing novel methods for detecting network communities and building predictive models of the behavior of groups of people by using massive data sets, data mining and machine learning. A better understanding of network communities can impact public policy and health strategies, product development and advertising, or, in biological networks, shed light on the functions of cells, proteins and genes.