Examining Data Formats. Data Sourcing



Homework #1 – Data Sourcing – The objective of this assignment is to expose you to some of the issues involved in acquiring data. First, we will examine some of the different formats that are used for data files. Often you can select the format that you will work with, but sometimes a system will force you to accept a fixed data format. Second, you will see firsthand how data are often provided in an untidy format or are just downright “dirty.” Working with data can be messy; it requires a careful, observant, and methodical approach.

This assignment has two parts: (1) Multiple Formats – why are there different formats? You will work with RDF, XML, JSON, and CSV formats. (2) Data Wrangling – identifying issues with a dataset and developing a strategy to manage or correct them.

Here is a link to an online data wrangling tool: http://vis.stanford.edu/wrangler/. Feel free to explore it. You are NOT required to use it.

Part 1: Multiple Formats

Download the four zipcodeDemographics files. These files contain the same data but in different formats. Using a text editor, examine the files and note the differences in their structure. Research each format and write a brief definition of it, citing the source(s) you used. Is any format easier to read than the others? If so, why?
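To make the structural differences concrete, here is a minimal sketch that writes the same two records as CSV, JSON, and XML using only the Python standard library. The field names (zipcode, population) and the values are hypothetical and are not taken from the zipcodeDemographics files. RDF is left out because the standard library has no RDF serializer.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records; the real zipcodeDemographics fields may differ.
records = [
    {"zipcode": "00001", "population": "1200"},
    {"zipcode": "00002", "population": "850"},
]

def to_csv(rows):
    """CSV: one header line, then one comma-separated line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """JSON: a list of objects; field names repeat in every record."""
    return json.dumps(rows, indent=2)

def to_xml(rows):
    """XML: nested elements; every value is wrapped in an open/close tag pair."""
    root = ET.Element("records")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(rec, key).text = value
    return ET.tostring(root, encoding="unicode")

csv_text = to_csv(records)
json_text = to_json(records)
xml_text = to_xml(records)

for label, text in (("CSV", csv_text), ("JSON", json_text), ("XML", xml_text)):
    print(f"--- {label} ---")
    print(text)
```

Comparing the three outputs side by side shows the trade-off: CSV is the most compact but carries no nesting or type information, while JSON and XML repeat the field names for every record and can represent nested structure.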

Pick one format to analyze, then examine that file and characterize the data it contains. Imagine that you must present your findings to an executive committee that wants to know what is important in the data. Prepare a single PowerPoint slide that succinctly characterizes the data.

Part 2: Data Wrangling

Data wrangling is the process of manipulating data into a format that automated/computerized analysis tools can use. There are two major stages: data tidying and data cleaning. Data tidying is the process of restructuring data so that it can be readily processed by automated tools. Data cleaning is the process of correcting the data to ensure, to the greatest extent possible, that it accurately reflects the subject it pertains to.
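The two stages can be illustrated with a small sketch. The table below is invented for illustration (it is not from the assignment's dataset), and the choice to treat any non-numeric value as missing is an assumption. Tidying changes the shape of the table; cleaning changes the values.

```python
# Hypothetical untidy table: one column per year instead of one row per observation.
wide = [
    {"university": " Alpha ", "2019": "10", "2020": "n/a"},
    {"university": "Beta", "2019": "7", "2020": "9"},
]

def tidy(rows, id_col):
    """Tidying: reshape so each output row holds one (id, year, value) observation."""
    out = []
    for row in rows:
        for col, value in row.items():
            if col != id_col:
                out.append({id_col: row[id_col], "year": col, "value": value})
    return out

def clean(rows, id_col):
    """Cleaning: trim whitespace, convert numbers, and mark unparseable values as None."""
    out = []
    for row in rows:
        raw = row["value"].strip()
        out.append({
            id_col: row[id_col].strip(),
            "year": int(row["year"]),
            # Assumption: anything that is not a plain non-negative integer is missing.
            "value": int(raw) if raw.isdigit() else None,
        })
    return out

long_rows = tidy(wide, "university")
clean_rows = clean(long_rows, "university")

for row in clean_rows:
    print(row)
```

Tidying is done first here so that the cleaning rules only need to handle a single value column.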

Download the University returns_for_figshare_FINAL.csv file. Using a tool such as Excel, examine the file and identify as many issues with the dataset as you can. For each issue that you identify, develop a strategy (or strategies) for dealing with it. Report both the issues and the strategies.
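If you would rather script a first pass than scan the file by eye, the following sketch counts three common problems: missing values, rows with the wrong number of fields, and exact duplicate rows. The sample text and the list of tokens treated as missing are assumptions made for illustration; they do not describe the contents of the assignment's file.

```python
import csv
import io
from collections import Counter

# Assumed markers for a missing value; adjust after looking at the real file.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}

def profile_csv(text):
    """Return simple issue counts for CSV text whose first row is the header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    missing = Counter()
    ragged = []
    # Row numbers count the header as row 1.
    for row_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            ragged.append(row_no)
        for name, cell in zip(header, row):
            if cell.strip().lower() in MISSING_TOKENS:
                missing[name] += 1
    counts = Counter(tuple(r) for r in body)
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "rows": len(body),
        "missing": dict(missing),
        "ragged_rows": ragged,
        "duplicate_rows": duplicates,
    }

# Hypothetical sample containing one of each problem.
sample = "name,returns\nAlpha,10\nBeta,\nAlpha,10\nGamma,5,extra\n"
report = profile_csv(sample)
print(report)
```

To run it on a downloaded file, read the file's text (for example with open(path, newline="", encoding="utf-8").read(), where path is wherever you saved the file) and pass it to profile_csv. A report like this only flags candidates; deciding whether a duplicate or blank is really an error still requires looking at the data.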

Submit an MS Word document with your findings from Parts 1 and 2.
