STAT6128
Key Topics in Social Science: Measurement and Data
Computer Workshop 4 - Social Mobility
The data
The data we shall be using today come from the 2006 Programme for International Student Assessment (PISA). These data are designed to be cross-nationally comparable across a wide selection of developed nations. Today we shall focus on occupations. Recall from the lectures that this is the primary outcome of interest for sociologists. However, we cannot measure social mobility itself in PISA: the data are cross-sectional, so we have no information on children’s eventual outcomes. Instead, we shall investigate the relationship between parental occupation and 15-year-old children’s occupational expectations (what job they expect to have when they are 30 years old). So just for today, think of these expectations as if they were actual outcomes. (As an aside, some sociologists and economists claim that expectations mediate the link between social background and attainment during adulthood, so this type of analysis could actually be quite interesting for our understanding of intergenerational mobility.)
Start Stata. Create a do file like last week (use ‘version’ to tell Stata which version you use, use the ‘cd’ command to tell Stata from where to open data files and where to save the do file, use the ‘use’ command to open the Stata dataset PISA_IM, which you first need to download from Blackboard into the folder you name behind the ‘cd’ command.)
Like last week, write the bold command lines into your do file and the italic ones into the command window.
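As a reminder, the start of such a do file might look like the minimal sketch below. The version number and the folder path are placeholders (assumptions, not values given in this workshop); replace them with your own.

```stata
* Placeholder version number: use the version of Stata you are running
version 12
* Placeholder folder: the folder into which you downloaded PISA_IM from Blackboard
cd "C:/STAT6128/workshop4"
* Open the workshop data set; clear drops any data currently in memory
use PISA_IM, clear
```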
Country code and sample size
Once you have opened your data set, type
label list Country
You receive the error message ‘value label Country not found’. This means that the data set does not contain any information on which value refers to which country. I therefore give you the country coding here:
Country code | Country name
208 | Denmark
276 | Germany
352 | Iceland
380 | Italy
410 | Korea
442 | Luxembourg
554 | New Zealand
616 | Poland
620 | Portugal
792 | Turkey
Type
tab Country
You see a table giving the 3-digit country code. Each number in this first column represents one country. The second column gives you the sample size per country, the third column the percentage of the sample per country.
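If you would rather see country names than codes in your output, you could attach the coding from the table above to the variable yourself. This is an optional sketch, not part of the workshop instructions; the label name countrylbl is made up for illustration.

```stata
* "countrylbl" is a made-up label name; codes and names come from the table above
label define countrylbl 208 "Denmark" 276 "Germany" 352 "Iceland" 380 "Italy" 410 "Korea"
label define countrylbl 442 "Luxembourg" 554 "New Zealand" 616 "Poland" 620 "Portugal" 792 "Turkey", add
* Attach the labels to the Country variable and tabulate again
label values Country countrylbl
tab Country
```

The underlying numeric codes are unchanged, so conditions such as Country==410 continue to work.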
Measurement of Occupation (Ganzeboom Index)
As mentioned in the lecture, there are many different ways one can “measure” (or rank) occupations. The main method PISA uses is the Ganzeboom ISEI index of social class. This is a “continuous” measure of occupational prestige, and basically ranks occupations through their impact on people’s income.
To begin, we use this as our measure of occupation. The three variables of interest are:
Father’s occupation is labelled BFMJ
Mother’s occupation is labelled BMMJ
Child’s (expected) occupation is labelled BSMJ
Let us investigate BSMJ first. To find out more about the distribution of the variable BSMJ, type:
sum BSMJ, d
Something is wrong: more than the top 10% of the data are coded at a single value (“99”).
Normally, missing values in Stata are coded as “.” and are therefore excluded by all commands. However, the original data were coded in SPSS, where missing values were given the value 99. Because the SPSS file was not transferred into Stata properly, these missing values now appear as ordinary data points with the value 99.
Type
label list BSMJ
You see that the values 97 and 99 of this variable are labelled as missing values.
If the SPSS data had been transferred properly into Stata format, the missing values should be coded ‘ . ’
We will do that now ourselves.
Type
gen bsmj=BSMJ
(you generate a variable that has exactly the same values as your original BSMJ variable)
replace bsmj=. if BSMJ>96
Now type
sum bsmj, d
Compare this with the sum command beforehand. You see that if missing values are properly coded in Stata (with a ‘.’) then Stata does not show them.
Sometimes you might want to see them though. In this case you can type
tab bsmj, m
The m here tells Stata you want to see the missing values. You see that 17% of values are missing for children’s expected occupation.
The variables BFMJ and BMMJ also use the values 97 and 99 for missing values. Please try on your own to create variables bfmj and bmmj that have the missing values coded properly as ‘ . ’. The solution is given below.
gen bfmj=BFMJ
replace bfmj=. if BFMJ>96
gen bmmj=BMMJ
replace bmmj=. if BMMJ>96
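The same recoding can also be done with Stata’s mvdecode command, which turns listed numeric values into missing values. The sketch below assumes that 97 and 99 are the only missing-value codes in these variables; the names bsmj2, bfmj2 and bmmj2 are hypothetical and are used so that the variables created above are left untouched.

```stata
* Copy each original variable under a new (hypothetical) name, then recode 97 and 99 to "."
foreach v in BSMJ BFMJ BMMJ {
    local new = lower("`v'") + "2"
    clonevar `new' = `v'
    mvdecode `new', mv(97 99)
}
* If 97 and 99 are the only values above 96, this matches the summary of bsmj
sum bsmj2, d
```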
We now want to see how children’s expected occupation is associated with their parents’ occupation. As our measures are “continuous”, we shall use OLS regression.
Firstly, we need to take into account PISA’s complex sampling design. We covered this last week. The PISA survey design uses clustered sampling: first schools are selected and then students within schools. Clustering increases the standard error. We therefore need to tell Stata to take clustering into account.
Type:
svyset SCHOOLID [pw=W_FSTUWT]
This has set up the complex survey design. Now let us perform a regression, relating fathers’ occupation to the child’s expectation. We will estimate this model using all observations from all countries. Type:
svy: regress bsmj bfmj i.Country
The prefix i. before the variable Country indicates that this is a categorical variable. In this case, we have 10 countries (10 categories) in the variable Country. Hence Stata will create 9 dummy variables.
You should get something like the following output:
The table shows you that there are 788 schools in your data (Number of PSUs), the total sample size is 37,560 students.
Now interpret this table. Which country is the reference country? (Tip: look at the table with the country codes given beforehand)
The coefficient of interest is the one associated with bfmj. It is positive and statistically significant. This suggests that a 1 point increase in the father’s Ganzeboom index is associated with a 0.234 point increase in the child’s Ganzeboom index.
Remember, last week we talked in the lecture briefly about how to interpret regression results. The Ganzeboom index lacks a natural metric (scale). How could we give some more meaning to our results here? We could express the change in the Ganzeboom index in terms of standard deviations.
Find the standard deviations of bfmj and bsmj by typing:
svy: mean bfmj
estat sd
svy: mean bsmj
estat sd
You will receive the following results:
Variable | Mean | Standard deviation
bfmj | 42.73 | 15.86
bsmj | 60.59 | 16.81
Question:
If the father’s Ganzeboom index increases by one standard deviation, by how many standard deviations will the child’s index increase? You know that a 1 point increase in the father’s index is associated with a 0.234 point increase in the child’s index.
0.234*15.86=3.71
Hence if the father’s index increases by one standard deviation, the child’s index increases by 3.71 points. We can express the 3.71 points in standard deviations:
3.71/16.81=0.22
Result:
If the father’s Ganzeboom index increases by one standard deviation, the child’s index increases by about 0.2 standard deviations.
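You can let Stata do this arithmetic for you with the display command. The numbers below are simply copied from the regression coefficient and the standard deviations reported above; the two lines should print roughly 3.71 and 0.22.

```stata
* 0.234 is the bfmj coefficient; 15.86 and 16.81 are the standard deviations of bfmj and bsmj
* Change in the child's index (in points) for a one standard deviation increase in the father's index
display 0.234*15.86
* The same change expressed in standard deviations of the child's index
display 0.234*15.86/16.81
```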
In conclusion, from an intergenerational mobility perspective, our regression results show that children of fathers with higher ranking occupations enter (or at least “expect to enter”) better jobs.
How does this vary across developed nations? To get a rough idea (and only this time ignoring the complex sampling design), type:
bysort Country: regress bsmj bfmj
To take the complex sampling design into account instead, type:
tab Country, gen(C)
forval i=1(1)10 {
svy, subpop(C`i'): regress bsmj bfmj
}
This generates a dummy variable for each country (named C1-C10), then uses a loop to execute a svy: regress command for each of these countries in turn.
This has reproduced the analysis for each individual country. Notice the relationship is weakest in Turkey (country 792) and Korea (country 410). It seems that the jobs children “expect” to enter in these countries are not strongly associated with their father’s occupation. On the other hand, in Poland (country 616) the relationship is particularly strong.
Alternative measure of occupation
Perhaps in this case another way of measuring occupation may also be suitable.
The PISA dataset contains an alternative measure of occupation: 4-digit ISCO codes. This is the ILO classification of occupations. For details, look at the following webpage:
http://www.ilo.org/public/english/bureau/stat/isco/index.htm
These data are very interesting because of their detail: occupations are classified into over 300 categories. However, for today we will convert this into a binary measurement (“Professional” and “Non-Professional” jobs). In other words, we will examine the relationship between whether a child is expecting to enter a professional job and whether the child’s parents have a professional job. (We could go further by using logistic regression to investigate this relationship. We will examine logistic regression in a later workshop.)
Let us start with this conversion. Create a variable called Student_Pro, which has the value 1 if the variable Student_Occ_ICSO is below 3000 (that means the student aims to become a “Professional”) and the value 0 if Student_Occ_ICSO is 3000 or above. In addition, give the newly created variable Student_Pro a missing value ‘ . ’ if the value of Student_Occ_ICSO is 9999. First, try to create this variable Student_Pro yourself. If you do not manage, the code is given below.
gen Student_Pro=.
replace Student_Pro=0 if Student_Occ_ICSO>2999
replace Student_Pro=1 if Student_Occ_ICSO<3000
replace Student_Pro=. if Student_Occ_ICSO==9999
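The same variable can be created more compactly in one line, because a logical expression in Stata evaluates to 1 when true and 0 when false. The sketch below uses the hypothetical name Student_Pro2 so that Student_Pro is not overwritten. It also leaves system missing values (‘ . ’) as missing. This matters only if Student_Occ_ICSO contains such values, which is an assumption, since the text above mentions only the code 9999. In that case the step-by-step code would set them to 0, because Stata treats ‘ . ’ as larger than any number.

```stata
* Student_Pro2 is a hypothetical name, chosen so as not to overwrite Student_Pro
* 1 if the code is below 3000, 0 otherwise; missing if the code is 9999 or "."
gen Student_Pro2 = (Student_Occ_ICSO < 3000) if Student_Occ_ICSO != 9999 & !missing(Student_Occ_ICSO)
* Compare the two versions, showing missing values as well
tab Student_Pro Student_Pro2, m
```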
Now create the variable Father_Pro and Mother_Pro using the same specification:
gen Father_Pro=.
replace Father_Pro=0 if Father_Occ_ICSO>2999
replace Father_Pro=1 if Father_Occ_ICSO<3000
replace Father_Pro=. if Father_Occ_ICSO==9999
gen Mother_Pro=.
replace Mother_Pro=0 if Mother_Occ_ICSO>2999
replace Mother_Pro=1 if Mother_Occ_ICSO<3000
replace Mother_Pro=. if Mother_Occ_ICSO==9999
Now type the following:
svy:tabulate Father_Pro Student_Pro , row
svy:tabulate Mother_Pro Student_Pro , row
What do these results show?
Up to now, we have looked at all countries together. Now let’s examine Poland and Korea separately.
Start with Korea. Type:
svy:tabulate Father_Pro Student_Pro if Country==410, row
svy:tabulate Mother_Pro Student_Pro if Country==410, row
Then do the same for Poland (code 616).
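For Poland, only the country code in the if condition changes. The last line is an optional variant, sketched here on the assumption that you want to handle the subsample in the same way as the country loop earlier: it restricts the analysis through subpop() instead of an if qualifier, so that the full survey design is still used when the standard errors are computed.

```stata
* Poland has country code 616
svy: tabulate Father_Pro Student_Pro if Country==616, row
svy: tabulate Mother_Pro Student_Pro if Country==616, row
* Optional variant: restrict to Poland via subpop() rather than an if qualifier
svy, subpop(if Country==616): tabulate Father_Pro Student_Pro, row
```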
What results do you find? Compare the tables.
Measurement Error
We shall finish this part of the workshop by briefly considering the role of measurement error. Firstly, recall from the lectures that children act as proxy respondents for their parents. That is, it is children who report their parents’ education and occupation (not the parents themselves). Children may not always report this correctly.
For this set of countries, however, data has been collected from both the parent and the child (note this was not done for all countries, and was not done in the PISA 2000 or 2003 waves). We can therefore investigate how well children report their parents’ occupation. In particular,
Parent_Report_Father_Occ_ICSO is fathers’ reports of their own occupation
Parent_Report_Father_Pro is fathers’ reports about whether they are a professional
Parent_Report_Mother_Occ_ICSO is mothers’ reports of their own occupation
Parent_Report_Mother_Pro is mothers’ reports about whether they are a professional
Let’s consider whether children can accurately report if their mother or father is a professional. Type (ALL ON ONE LINE):
tab Parent_Report_Father_Pro Father_Pro if Parent_Report_Father_Pro!=. & Father_Pro!=., col
Look at the main diagonal (top left to bottom right). If there were no measurement error, all observations would be in these cells. Instead, we can see some misclassification: some children report their father to be a professional when he is not (and vice versa). This of course assumes that parents accurately report their own occupation.
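To summarise the table in a single number, you could compute the share of cases in which the child’s report and the father’s own report agree. This is a rough sketch and ignores the survey design, like the tab command above; agree_father is a hypothetical variable name.

```stata
* agree_father is a hypothetical variable name
* 1 if the two reports agree, 0 if they differ, missing if either report is missing
gen agree_father = (Parent_Report_Father_Pro == Father_Pro) if !missing(Parent_Report_Father_Pro, Father_Pro)
* The mean of a 0/1 variable is the proportion of cases in which the reports agree
sum agree_father
```

The same lines with Mother in place of Father would give the corresponding share for mothers.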