The NCDS Data Observatory is a core element of the NCDS. Its purpose is to create a diverse repository of very large data sets that NCDS members can use and share in support of the mission to advance data science. The Observatory will provide a place for those interested in the science of data to form a community and exchange tools, approaches, data, and other relevant information. Its shared, distributed infrastructure will give researchers the space to develop the theoretical underpinnings that inform advances in data science.
The Observatory will be the foundation for the NCDS Data Laboratory, a virtual lab that will provide data science researchers with access to the emerging tools and physical infrastructure they need to test radically new techniques for storing, sharing, analyzing, transforming, and visualizing data.
The Data Observatory will evolve as the NCDS grows. NCDS members are encouraged to contribute to the NCDS Data Observatory by suggesting additional resources.
Another mission of the NCDS is teaching. Information about the courses taught as part of the NCDS Data Science Curriculum can be seen here: NCDS Data Pilot Data Science Courses.
NCDS Observatory Dataverse Network
NCDS Observatory iRODS Grid
(Not Yet Released)
The Statewide Digital Elevation Model
These data form the bare-earth digital elevation model (DEM) of the state of North Carolina. The model was commissioned by the North Carolina Floodplain Mapping Program (NCFMP) and was generated from processed LIDAR data. The model is stored in ESRI ASCII files, each containing the DEM information for a grid spanning 1 degree of latitude by 1 degree of longitude.
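As a rough illustration of what working with an ESRI ASCII grid involves, the sketch below parses the standard header keywords (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value) followed by rows of cell values. The function name `read_esri_ascii` and the tiny sample grid are hypothetical and are not taken from the Observatory files; real 1 degree tiles are far larger and may be better handled with a dedicated GIS library.

```python
import io


def read_esri_ascii(stream):
    """Parse an ESRI ASCII grid from a text stream.

    Returns (header, grid), where header maps lower-cased keywords to
    floats and grid is a list of rows, with NODATA cells set to None.
    """
    header = {}
    rows = []
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Header lines look like "keyword value"; data lines start with a number.
        if parts[0][0].isalpha():
            header[parts[0].lower()] = float(parts[1])
        else:
            rows.append([float(value) for value in parts])
    nodata = header.get("nodata_value")
    # Replace the NODATA sentinel so it cannot be mistaken for an elevation.
    grid = [[None if value == nodata else value for value in row] for row in rows]
    return header, grid


# Hypothetical 3 x 2 sample grid, for illustration only.
sample = """ncols 3
nrows 2
xllcorner -80.0
yllcorner 35.0
cellsize 0.5
NODATA_value -9999
10.0 12.5 -9999
11.0 13.0 14.5
"""

header, grid = read_esri_ascii(io.StringIO(sample))
print(header["ncols"], header["nrows"], grid[0])
```

The first data row in the file corresponds to the northern edge of the grid, so row 0 of `grid` is the top of the tile and xllcorner/yllcorner describe its lower-left corner.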
The North Carolina Forecast System
These data are output from the storm surge, wind wave, and tide model ADCIRC, in NetCDF format. RENCI operates ADCIRC in a forecast mode for the North Carolina coastal waters and provides the forecasted water levels to decision makers in North Carolina and surrounding areas.
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.
The First Billion Digits of Π
These data originate from PI World of JA0HXV. The data are organized in a flat text file with the digits arranged in 10 lines of 10 digits.
The U.S. Census: 1 Billion RDF Triples
This is a well-documented data set based on the well-described 2000 US Census data. In addition to being reorganized into an RDF format, the data are exposed via SPARQL.
Tiny Images Dataset
The Tiny Images dataset consists of 79,302,017 images, each a 32×32 color image. The data are stored in large binary files that can be accessed with a Matlab toolbox.
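For readers who want to reach into such a binary file without Matlab, the following is a minimal sketch of fixed-size record access. It assumes (this is not confirmed by the text above) that images are stored back to back with no file header, one byte per color channel per pixel, so each 32×32 color image occupies 32 × 32 × 3 = 3072 bytes. The function name `read_image_bytes` and the in-memory stand-in file are hypothetical; the pixel ordering inside each record is not addressed here.

```python
import io

IMAGE_SIDE = 32
CHANNELS = 3
# Assumed record size: one byte per channel per pixel.
RECORD_BYTES = IMAGE_SIDE * IMAGE_SIDE * CHANNELS


def read_image_bytes(stream, index):
    """Return the raw bytes of the image at a zero-based index.

    Assumes fixed-size records with no file header, so the image
    starts at byte offset index * RECORD_BYTES.
    """
    if index < 0:
        raise IndexError("image index must be non-negative")
    stream.seek(index * RECORD_BYTES)
    record = stream.read(RECORD_BYTES)
    if len(record) != RECORD_BYTES:
        raise IndexError(f"image {index} is past the end of the file")
    return record


# Hypothetical stand-in for a real file: three records filled with 0, 1, 2.
fake_file = io.BytesIO(b"".join(bytes([i]) * RECORD_BYTES for i in range(3)))

record = read_image_bytes(fake_file, 2)
print(len(record), record[0])
```

With a real file, the same function could be called on a handle opened in binary mode, which avoids loading the whole data set into memory.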
Twitter Census – Developer Tools from Infochimps
These data come from more than 24 million tweets scraped between March 2006 and November 2009.
A programmer’s guide to big data: 12 tools to know
SAS on Big Data
AMP Camp – Big Data Minicourse
IBM Tech Article
Submit your own links or data or ask for more information