Merging data is often one of the most simple seeming, but actually complicated tasks in any statistical software program.
In this post, I try to provide a brief introduction to merging data in Stata.
I use three different data sets:
individual.dta, a hypothetical data on individuals
individual2.dta, a hypothetical data set of supplementary individual data
region.dta, a hypothetical data set at the region level.
Quick Look At Each Of The Data Sets
individual.dta
Show the code
use individual.dta, clearlist
| name id age region |
|-----------------------------------|
1. | A. Cesaire 1 40 Caribbean |
2. | K. Nkrumah 2 50 Africa |
3. | L. Senghor 3 60 Africa |
+-----------------------------------+
| region z |
|----------------|
1. | Caribbean 50 |
2. | Africa 60 |
+----------------+
Types of Merges
In Stata, there are four types of merges 1:1 (one to one), 1:m (one to many), m:1 (many to one), and the much less used m:m (many to many).
In this post, I consider the 1:1 and m:1 merges.
Looking at your data using list or browse is often very helpful. In this post, I use the list in x syntax to list the first x observations of different data sets.
Key Idea
The basic of idea of merging data is:
Use the use command to open the first data set in your merge.
e.g. use individual.dta, clear
merge with the second data set using the following syntax:
merge type-of-merge variables-to-merge-on using second-data-set.
e.g. merge 1:1 id using individual2.dta
One to One (1:1) Merges
Here I consider merging individual.dta, with the supplementary data individual2.dta.
I list out some observations in each data file. Notice that each data file has id numbers for the same individuals. There is one individual in individual.dta for each individual in individual2.dta.
Show the code
use individual.dta, clearlistin 1/3use individual2.dtalistin 1/3
| name id age region |
|-----------------------------------|
1. | A. Cesaire 1 40 Caribbean |
2. | K. Nkrumah 2 50 Africa |
3. | L. Senghor 3 60 Africa |
+-----------------------------------+
+----------+
| id x |
|----------|
1. | 1 90 |
2. | 2 100 |
3. | 3 110 |
+----------+
merge-ing these two data sets is straightforward. We first use the first data set, individual.dta, and then merge it with individual2.dta.
The output of merge is the results of the merging process.
Show the code
use individual.dta, clearmerge 1:1 id using individual2.dta
Result Number of obs
-----------------------------------------
Not matched 0
Matched 3 (_merge==3)
-----------------------------------------
Because there are exactly the same id numbers in both data sets, we see that 3 observations have matched.
Show the code
listin 1/3
| name id age region x _merge |
|-------------------------------------------------------|
1. | A. Cesaire 1 40 Caribbean 90 Matched (3) |
2. | K. Nkrumah 2 50 Africa 100 Matched (3) |
3. | L. Senghor 3 60 Africa 110 Matched (3) |
+-------------------------------------------------------+
Many to One (m:1) Merges
Here I consider merging individual.dta, with the regional level data region.dta.
I list out some observations for each data file. Notice that each data file has region information for the same regions. There are multiple individuals in individual.dta for each region in region.dta.
Show the code
use individual.dta, clearlistin 1/3use region.dta, clearlistin 1/2
| name id age region |
|-----------------------------------|
1. | A. Cesaire 1 40 Caribbean |
2. | K. Nkrumah 2 50 Africa |
3. | L. Senghor 3 60 Africa |
+-----------------------------------+
+----------------+
| region z |
|----------------|
1. | Caribbean 50 |
2. | Africa 60 |
+----------------+
merge-ing these two data sets is straightforward. We first use the first data set, individual.dta, and then merge it with region.dta.
The output of merge is the results of the merging process.
Show the code
use individual.dta, clearmergem:1 region using region.dta
Result Number of obs
-----------------------------------------
Not matched 0
Matched 3 (_merge==3)
-----------------------------------------
Because there are exactly the same region’s listed in both data sets, we see that 3 observations have matched.
Show the code
listin 1/3
| name id age region z _merge |
|------------------------------------------------------|
1. | L. Senghor 3 60 Africa 60 Matched (3) |
2. | K. Nkrumah 2 50 Africa 60 Matched (3) |
3. | A. Cesaire 1 40 Caribbean 50 Matched (3) |
+------------------------------------------------------+