Merging Data With Stata

Andy Grogan-Kaylor

Introduction

Merging data is often one of the most simple seeming, but actually complicated tasks in any statistical software program.

In this post, I try to provide a brief introduction to merging data in Stata.

I use three different data sets:

individual.dta, a hypothetical data on individuals
individual2.dta, a hypothetical data set of supplementary individual data
region.dta, a hypothetical data set at the region level.

Quick Look At Each Of The Data Sets

`individual.dta`

Show the code


use individual.dta, clear

list

     |       name   id   age      region |
     |-----------------------------------|
  1. | A. Cesaire    1    40   Caribbean |
  2. | K. Nkrumah    2    50      Africa |
  3. | L. Senghor    3    60      Africa |
     +-----------------------------------+

`individual2.dta`

Show the code


use individual2.dta, clear

list

     | id     x |
     |----------|
  1. |  1    90 |
  2. |  2   100 |
  3. |  3   110 |
     +----------+

`region.dta`

Show the code


  
use region.dta, clear

list

     |    region    z |
     |----------------|
  1. | Caribbean   50 |
  2. |    Africa   60 |
     +----------------+

Types of Merges

In Stata, there are four types of merges 1:1 (one to one), 1:m (one to many), m:1 (many to one), and the much less used m:m (many to many).

In this post, I consider the 1:1 and m:1 merges.

Looking at your data using list or browse is often very helpful. In this post, I use the list in x syntax to list the first x observations of different data sets.

Key Idea

The basic of idea of merging data is:

Use the use command to open the first data set in your merge.

e.g. use individual.dta, clear

merge with the second data set using the following syntax:

merge type-of-merge variables-to-merge-on using second-data-set.

e.g. merge 1:1 id using individual2.dta

One to One (`1:1`) Merges

Here I consider merging individual.dta, with the supplementary data individual2.dta.

I list out some observations in each data file. Notice that each data file has id numbers for the same individuals. There is one individual in individual.dta for each individual in individual2.dta.

Show the code


use individual.dta, clear

list in 1/3

use individual2.dta

list in 1/3

     |       name   id   age      region |
     |-----------------------------------|
  1. | A. Cesaire    1    40   Caribbean |
  2. | K. Nkrumah    2    50      Africa |
  3. | L. Senghor    3    60      Africa |
     +-----------------------------------+

     +----------+
     | id     x |
     |----------|
  1. |  1    90 |
  2. |  2   100 |
  3. |  3   110 |
     +----------+

merge-ing these two data sets is straightforward. We first use the first data set, individual.dta, and then merge it with individual2.dta.

The output of merge is the results of the merging process.

Show the code


use individual.dta, clear

merge 1:1 id using individual2.dta

    Result                      Number of obs
    -----------------------------------------
    Not matched                             0
    Matched                                 3  (_merge==3)
    -----------------------------------------

Because there are exactly the same id numbers in both data sets, we see that 3 observations have matched.

Show the code


list in 1/3

     |       name   id   age      region     x        _merge |
     |-------------------------------------------------------|
  1. | A. Cesaire    1    40   Caribbean    90   Matched (3) |
  2. | K. Nkrumah    2    50      Africa   100   Matched (3) |
  3. | L. Senghor    3    60      Africa   110   Matched (3) |
     +-------------------------------------------------------+

Many to One (`m:1`) Merges

Here I consider merging individual.dta, with the regional level data region.dta.

I list out some observations for each data file. Notice that each data file has region information for the same regions. There are multiple individuals in individual.dta for each region in region.dta.

Show the code


use individual.dta, clear

list in 1/3

use region.dta, clear

list in 1/2

     |       name   id   age      region |
     |-----------------------------------|
  1. | A. Cesaire    1    40   Caribbean |
  2. | K. Nkrumah    2    50      Africa |
  3. | L. Senghor    3    60      Africa |
     +-----------------------------------+

     +----------------+
     |    region    z |
     |----------------|
  1. | Caribbean   50 |
  2. |    Africa   60 |
     +----------------+

merge-ing these two data sets is straightforward. We first use the first data set, individual.dta, and then merge it with region.dta.

The output of merge is the results of the merging process.

Show the code


use individual.dta, clear

merge m:1 region using region.dta

    Result                      Number of obs
    -----------------------------------------
    Not matched                             0
    Matched                                 3  (_merge==3)
    -----------------------------------------

Because there are exactly the same region’s listed in both data sets, we see that 3 observations have matched.

Show the code


list in 1/3

     |       name   id   age      region    z        _merge |
     |------------------------------------------------------|
  1. | L. Senghor    3    60      Africa   60   Matched (3) |
  2. | K. Nkrumah    2    50      Africa   60   Matched (3) |
  3. | A. Cesaire    1    40   Caribbean   50   Matched (3) |
     +------------------------------------------------------+

Introduction

Quick Look At Each Of The Data Sets

individual.dta

individual2.dta

region.dta

Types of Merges

One to One (1:1) Merges

Many to One (m:1) Merges

`individual.dta`

`individual2.dta`

`region.dta`

One to One (`1:1`) Merges

Many to One (`m:1`) Merges