STAT6128: Key Topics in Social Science: Measurement and Data
Week 4 Supplementary worksheet: data transformation in Stata; and using Stata’s documentation as aresource for learning
Learning how to transform/ manipulate data in an appropriate way, to prepare data for subsequent analysis, is a key skill. In the workshops thus far we have assumed that the data is an appropriate structure for you to use. This is not always the case. For example before proceeding with your analysis you may need first to (for example):
• append one dataset to the end of another dataset
• merge datasets- join corresponding observations in different datasets by matching on key
variable(s)
• reshape the data from a ‘long’ to a ‘wide’ structure, or viceversa
In this workshop you will be introduced to a few important ways to transform / manipulate data in Stata. Spend sometime working through the examples on the following pages.
Source of material in this workshop
Please note that all of these examples (the text and the data) are taken directly from Stata's documentation, which includes fully worked examples using downloadable datasets.
See https://www.stata.com/features/documentation/ . In this workshop I have simply directly selected documentation relating to the append, merge and reshape from the data
management manual (https://www.stata.com/bookstore/data-management-reference-manual/). The original sources are:
append https://www.stata.com/manuals/dappend.pdf merge https://www.stata.com/manuals/dmerge.pdf reshape https://www.stata.com/manuals/dreshape.pdf
For more detail on these commands, go directly to these sources. This will also help you gain familiarity with Stata’s documentation, which is a great resource for your subsequent learning.
Append
Example 1
This command appends (joins) a dataset to the end of a dataset already in Stata’s memory. To illustrate we will append two example ‘toy’ datasets from the Stata website. The first dataset, called even.dta, contains the sixth through eighth positive even numbers. The second dataset, called odd.dta, contains the first five positive odd numbers. We’ll have a look at each dataset first. Type:
use http://www.stata-press.com/data/r15/even list
Typing list provides a list of values of variables in the data. We could also have looked at the data in the data window. You should see the following in the Results window:
Let’s have a look at the other dataset:
use http://www.stata-press.com/data/r15/odd1
list
We will append the even data to the end of the odd data. Because the odd data are already in memory (we just used them above),we type
append using http://www.stata-press.com/data/r15/even
list
Because the number variable is in both datasets, the variable was extended with the new data from the file even.dta. Because there is no variable called odd in the new data, the additional observations on odd were forward-filled with missing (.). Because there is no variable called even in the original data, the first observations on even were back-filled with missing.
Example 2
Suppose that we have several datasets containing the populations of counties in various states. We can use append to combine these datasets all at once and use the generate() option to create a variable identifying from which dataset each observation originally came. See the following example:
use http://www.stata-press.com/data/r15/capop, clear list
append using http://www.stata-press.com/data/r15/ilpop http://www.stata-press.com/data/r15/txpop,
generate(state)
label define statelab 0 "CA" 1 "IL" 2 "TX"
label values state statelab
list
Merge
merge is for adding new variables from a second dataset to existing observations (if you wish to add new observations to existing variables, then use append). merge joins corresponding observations from the dataset currently in memory (called the ‘master ’ dataset) with those from filename.dta (called the ‘using ’ dataset), matching on one or more key variables. merge can perform one-to-one, one-to-many, many-to-one, and many-to-many merging. Here we will look at one-to-one merging.
Example
We have two datasets, one of which has information about the size of old automobiles, and the other of which has information about their expense:
use http://www.stata-press.com/data/r15/autosize
list
Use http://www.stata-press.com/data/r15/autoexpense list
We can see that these datasets contain different information about nearly the same cars—the autosize file has one more car. We would like to get all the information about all the cars into one dataset. Because we are adding new variables to old variables, this is a job for the merge command. We need only to decide what type of match merge we need. Looking at the datasets, we see that the make variable, which identifies the cars in each of the two datasets, also identifies individual observations within the datasets. What this means is that if you tell me the make of car, I can tell you the one observation that corresponds to that car. Because this is true for both datasets, we should use a 1:1 merge.
We will start with a clean slate to show the full process. We start with the ‘master’ data and merge in the ‘using’ data. In this example the variable make is the ‘key’/‘matching’ variable.
use http://www.stata-press.com/data/r15/autosize
merge 1:1 make using http://www.stata- press.com/data/r15/autoexpense
list
The merge is successful—all the data are present in the combined dataset, even that from the one car that has only size information. A new variable is created in the merged dataset, _merge, which reports the source of the observation. If we wanted only those makes for which all information is present, we could drop the observations for which merge < 3.
Reshape
reshape converts data from ‘wide’ to ‘long’ form and vice versa. Think of the data as a collection of observations Xij, where i is the logical observation, or group identifier, and j is the subobservation, or within-group identifier. Wide-form data are organized by logical observation, storing all the data on a particular observation in one row. Long-form data are organized by subobservation, storing the data in multiple rows.
Example
For example, we might have data on a person’s ID, gender, and annual income over the years 1980–1982. We have two X ij variables with the data in ‘wide ’ form.
use http://www.stata-press.com/data/r15/reshape1
list
To convert these data to the ‘long ’ form, we type
reshape long inc ue, i(id) j(year)
where i is the logical observation, or group identifier (in this case id), and j is the subobservation, or within-group identifier (in this case year). Inc and ue are the ‘stubs’ of the variable names to be reshaped.
There is no variable named year in our original, ‘wide-form ’ dataset. year will be a new variable in our ‘long ’ dataset. After this conversion, we have
list, sep(3)
We can return to our original, wide-form. dataset by using reshape wide. reshape wide inc ue, i(id) j(year)
list
Converting from wide to long creates the j (year) variable. Converting back from long to wide drops the j (year) variable.