13. Merging¶

We often find the need to work with data from multiple tables, even when that data is about the same topic! For example, both our wildfire data and our climate data had rows about 2019, or in a separate setting a store might keep track of their sales in one table but keep info on each item sold in a separate table. In either case, we’ll likely find it useful to combine the tables and put the data in one place so that we can conduct our analysis. This is called ‘merging’ or ‘joining’.

Let’s entertain the second example in order to see how merging works. Suppose we’re a paint retailer, and we keep a table of transactions and a separate table of the paints we sell.

import babypandas as bpd

orders = bpd.DataFrame().assign(
    order_number=range(111,111+5),
    item=['BEHR-N2', 'BEHR-N2', 'KM-D4', 'SW-K3', 'VALS-N2'],
    sale=[35.98, 31.08, 28.99, 35.98, 39.77]
)
orders

	order_number	item	sale
0	111	BEHR-N2	35.98
1	112	BEHR-N2	31.08
2	113	KM-D4	28.99
3	114	SW-K3	35.98
4	115	VALS-N2	39.77

paints = bpd.DataFrame().assign(
    item_id=['BEHR-N2', 'KM-D4', 'SW-K3', 'SW-Q0', 'VALS-N2'],
    brand=['Behr', 'Kelley-Moore', 'Sherwin-Williams', 'Sherwin-Williams', 'Valspar'],
    color=['blue', 'red', 'green', 'black', 'blue'],
    weight=[11.4, 8.9, 10.0, 10.1, 9.2]
)
paints

	item_id	brand	color	weight
0	BEHR-N2	Behr	blue	11.4
1	KM-D4	Kelley-Moore	red	8.9
2	SW-K3	Sherwin-Williams	green	10.0
3	SW-Q0	Sherwin-Williams	black	10.1
4	VALS-N2	Valspar	blue	9.2

A logical course of inquiry would be to figure out what kind of paint our customers like to buy, so that we can prioritize marketing and stocking that paint. Congrats, you’ve just discovered the field of business analytics! You’ve also discovered the need to merge these two tables.

Currently we don’t know how to write code that would calculate which color of paint sells the most. However, notice that each table has a column with the paints’ serial number. We can utilize this to join the tables such that every time a serial number shows up in orders, we also include the columns from paints.

We accomplish this with the .merge method, and just need to specify the names of the shared column.

merged = orders.merge(paints, left_on='item', right_on='item_id')
merged

	order_number	item	sale	item_id	brand	color	weight
0	111	BEHR-N2	35.98	BEHR-N2	Behr	blue	11.4
1	112	BEHR-N2	31.08	BEHR-N2	Behr	blue	11.4
2	113	KM-D4	28.99	KM-D4	Kelley-Moore	red	8.9
3	114	SW-K3	35.98	SW-K3	Sherwin-Williams	green	10.0
4	115	VALS-N2	39.77	VALS-N2	Valspar	blue	9.2

In calling this method, we’re telling Babypandas to take a ‘left’ and ‘right’ table and merge the two tables based on shared values in a specific column.

Notice that an item only shows up in this merged table if it shows up in both tables. And an item can show up multiple times if either table has multiple entries for it.

Since this table join looks at both tables to decide whether to include a row, it doesn’t actually matter which table we use as the left versus the right. The only thing that will change is the order of our columns.

paints.merge(orders, left_on='item_id', right_on='item')

	item_id	brand	color	weight	order_number	item	sale
0	BEHR-N2	Behr	blue	11.4	111	BEHR-N2	35.98
1	BEHR-N2	Behr	blue	11.4	112	BEHR-N2	31.08
2	KM-D4	Kelley-Moore	red	8.9	113	KM-D4	28.99
3	SW-K3	Sherwin-Williams	green	10.0	114	SW-K3	35.98
4	VALS-N2	Valspar	blue	9.2	115	VALS-N2	39.77

With the table merge complete, we can easily perform calculations which rely on features from both orders and paints, such as the distribution of orders across each color of paint, or the average price-per-pound of the paints in our store.

What code could you write to plot the categorical distribution of orders of each paint color?

merged.groupby('color').count().plot(kind='bar', y='order_number')

If we’re following good table practices, we should really be setting the indices of our tables. So, orders would be indexed by order_number, and paints would be indexed by item_id.

orders = orders.set_index('order_number')
orders

	item	sale
order_number
111	BEHR-N2	35.98
112	BEHR-N2	31.08
113	KM-D4	28.99
114	SW-K3	35.98
115	VALS-N2	39.77

paints = paints.set_index('item_id')
paints

	brand	color	weight
item_id
BEHR-N2	Behr	blue	11.4
KM-D4	Kelley-Moore	red	8.9
SW-K3	Sherwin-Williams	green	10.0
SW-Q0	Sherwin-Williams	black	10.1
VALS-N2	Valspar	blue	9.2

In this case, we’d need to merge using the ‘item’ column from orders, but the index of paints. To merge on an index, instead of specifying left_on or right_on we just specify left_index=True or right_index=True.

orders.merge(paints, left_on='item', right_index=True)

	item	sale	brand	color	weight
order_number
111	BEHR-N2	35.98	Behr	blue	11.4
112	BEHR-N2	31.08	Behr	blue	11.4
113	KM-D4	28.99	Kelley-Moore	red	8.9
114	SW-K3	35.98	Sherwin-Williams	green	10.0
115	VALS-N2	39.77	Valspar	blue	9.2

Merging is an integral part of data management, and serves as an intermediate step in our analysis. By aggregating data from multiple tables into a single table, we can feel confident in our ability to work with organized data, or even pull in and combine data from various sources!

Notes on (Baby)Pandas

Merging

13. Merging¶