diff --git a/notebooks/8_combine_dataframes.ipynb b/notebooks/8_combine_dataframes.ipynb new file mode 100644 index 0000000..5c899ff --- /dev/null +++ b/notebooks/8_combine_dataframes.ipynb @@ -0,0 +1,1728 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Objectives\n", + "\n", + "- Combine two tables by concatenation\n", + "- Combine two tables by left join when tables share the same key column\n", + "- Combine two tables by left join, explicitly specifying the columns to join on\n", + "\n", + "Content to cover\n", + "\n", + "- pd.concat\n", + "- pd.merge\n" + ] + }, + { + "cell_type": "code", + "execution_count": 153, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This tutorial uses air quality data about $NO_2$ and particulate matter less than 2.5 micrometers, made available by [openaq](https://openaq.org) and retrieved using the [py-openaq](http://dhhagan.github.io/py-openaq/index.html) package:\n", + "\n", + "- The `air_quality_no2_long.csv` data set provides $NO_2$ values for the measurement stations _FR04014_, _BETR801_ and _London Westminster_ in Paris, Antwerp and London, respectively. \n", + "- The `air_quality_pm25_long.csv` data set provides $pm25$ values for the measurement stations _FR04014_, _BETR801_ and _London Westminster_ in Paris, Antwerp and London, respectively. \n", + "- The metadata about these stations is stored in a data file `air_quality_stations.csv`\n", + "- The metadata about the measured parameters is stored in a data file `air_quality_parameters.csv`" + ] + }, + { + "cell_type": "code", + "execution_count": 154, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycountrydate.utclocationparametervalueunit
0ParisFR2019-06-21 00:00:00+00:00FR04014no220.0µg/m³
1ParisFR2019-06-20 23:00:00+00:00FR04014no221.8µg/m³
2ParisFR2019-06-20 22:00:00+00:00FR04014no226.5µg/m³
3ParisFR2019-06-20 21:00:00+00:00FR04014no224.9µg/m³
4ParisFR2019-06-20 20:00:00+00:00FR04014no221.4µg/m³
\n", + "
" + ], + "text/plain": [ + " city country date.utc location parameter value unit\n", + "0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³\n", + "1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³\n", + "2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³\n", + "3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³\n", + "4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³" + ] + }, + "execution_count": 154, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality_no2 = pd.read_csv(\"../data/air_quality_no2_long.csv\", \n", + " parse_dates=True)\n", + "air_quality_no2.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 155, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycountrydate.utclocationparametervalueunit
0AntwerpenBE2019-06-18 06:00:00+00:00BETR801pm2518.0µg/m³
1AntwerpenBE2019-06-17 08:00:00+00:00BETR801pm256.5µg/m³
2AntwerpenBE2019-06-17 07:00:00+00:00BETR801pm2518.5µg/m³
3AntwerpenBE2019-06-17 06:00:00+00:00BETR801pm2516.0µg/m³
4AntwerpenBE2019-06-17 05:00:00+00:00BETR801pm257.5µg/m³
\n", + "
" + ], + "text/plain": [ + " city country date.utc location parameter value \\\n", + "0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 \n", + "1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 \n", + "2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 \n", + "3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 \n", + "4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 \n", + "\n", + " unit \n", + "0 µg/m³ \n", + "1 µg/m³ \n", + "2 µg/m³ \n", + "3 µg/m³ \n", + "4 µg/m³ " + ] + }, + "execution_count": 155, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality_pm25 = pd.read_csv(\"../data/air_quality_pm25_long.csv\", \n", + " parse_dates=True)\n", + "air_quality_pm25.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 156, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycoordinates.latitudecoordinates.longitudecountcountrydistancefirstUpdatedlastUpdatedlocationparameterssourceNamesourceNames
0Antwerpen51.2361954.3852244179BE59022932017-09-22 01:00:00+00:002019-08-05 03:00:00+00:00BELAL01['pm10', 'pm25']EEA Belgium['EEA Belgium']
1Antwerpen51.1703004.3410058052BE59024282017-09-22 01:00:00+00:002019-08-05 03:00:00+00:00BELHB23['so2', 'pm10', 'no2', 'pm25']EEA Belgium['EEA Belgium']
2Antwerpen51.1099785.0048641641BE59474802017-09-22 01:00:00+00:002019-01-09 01:00:00+00:00BELLD01['no2']EEA Belgium['EEA Belgium']
3Antwerpen51.1203845.0215461973BE59480672017-09-22 01:00:00+00:002019-08-05 03:00:00+00:00BELLD02['no2']EEA Belgium['EEA Belgium']
4Antwerpen51.3276604.3622611923BE58967362017-09-23 01:00:00+00:002019-08-05 03:00:00+00:00BELR833['no2']EEA Belgium['EEA Belgium']
\n", + "
" + ], + "text/plain": [ + " city coordinates.latitude coordinates.longitude count country \\\n", + "0 Antwerpen 51.236195 4.385224 4179 BE \n", + "1 Antwerpen 51.170300 4.341005 8052 BE \n", + "2 Antwerpen 51.109978 5.004864 1641 BE \n", + "3 Antwerpen 51.120384 5.021546 1973 BE \n", + "4 Antwerpen 51.327660 4.362261 1923 BE \n", + "\n", + " distance firstUpdated lastUpdated location \\\n", + "0 5902293 2017-09-22 01:00:00+00:00 2019-08-05 03:00:00+00:00 BELAL01 \n", + "1 5902428 2017-09-22 01:00:00+00:00 2019-08-05 03:00:00+00:00 BELHB23 \n", + "2 5947480 2017-09-22 01:00:00+00:00 2019-01-09 01:00:00+00:00 BELLD01 \n", + "3 5948067 2017-09-22 01:00:00+00:00 2019-08-05 03:00:00+00:00 BELLD02 \n", + "4 5896736 2017-09-23 01:00:00+00:00 2019-08-05 03:00:00+00:00 BELR833 \n", + "\n", + " parameters sourceName sourceNames \n", + "0 ['pm10', 'pm25'] EEA Belgium ['EEA Belgium'] \n", + "1 ['so2', 'pm10', 'no2', 'pm25'] EEA Belgium ['EEA Belgium'] \n", + "2 ['no2'] EEA Belgium ['EEA Belgium'] \n", + "3 ['no2'] EEA Belgium ['EEA Belgium'] \n", + "4 ['no2'] EEA Belgium ['EEA Belgium'] " + ] + }, + "execution_count": 156, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality_stations = pd.read_csv(\"../data/air_quality_stations.csv\")\n", + "air_quality_stations.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 157, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
descriptionidnamepreferredUnit
0Black CarbonbcBCµg/m³
1Carbon MonoxidecoCOppm
2Nitrogen Dioxideno2NO2ppm
3Ozoneo3O3ppm
4Particulate matter less than 10 micrometers in...pm10PM10µg/m³
5Particulate matter less than 2.5 micrometers i...pm25PM2.5µg/m³
6Sulfur Dioxideso2SO2ppm
\n", + "
" + ], + "text/plain": [ + " description id name \\\n", + "0 Black Carbon bc BC \n", + "1 Carbon Monoxide co CO \n", + "2 Nitrogen Dioxide no2 NO2 \n", + "3 Ozone o3 O3 \n", + "4 Particulate matter less than 10 micrometers in... pm10 PM10 \n", + "5 Particulate matter less than 2.5 micrometers i... pm25 PM2.5 \n", + "6 Sulfur Dioxide so2 SO2 \n", + "\n", + " preferredUnit \n", + "0 µg/m³ \n", + "1 ppm \n", + "2 ppm \n", + "3 ppm \n", + "4 µg/m³ \n", + "5 µg/m³ \n", + "6 ppm " + ] + }, + "execution_count": 157, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality_parameters = pd.read_csv(\"../data/air_quality_parameters.csv\")\n", + "air_quality_parameters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Combine data from multiple tables" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Concatenating objects" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![](../schemas/08_concat_row.svg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I want to combine the measurements of $NO_2$ and $pm25$, two tables with a similar structure, in a single table" + ] + }, + { + "cell_type": "code", + "execution_count": 158, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycountrydate.utclocationparametervalueunit
0AntwerpenBE2019-06-18 06:00:00+00:00BETR801pm2518.0µg/m³
1AntwerpenBE2019-06-17 08:00:00+00:00BETR801pm256.5µg/m³
2AntwerpenBE2019-06-17 07:00:00+00:00BETR801pm2518.5µg/m³
3AntwerpenBE2019-06-17 06:00:00+00:00BETR801pm2516.0µg/m³
4AntwerpenBE2019-06-17 05:00:00+00:00BETR801pm257.5µg/m³
\n", + "
" + ], + "text/plain": [ + " city country date.utc location parameter value \\\n", + "0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 \n", + "1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 \n", + "2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 \n", + "3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 \n", + "4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 \n", + "\n", + " unit \n", + "0 µg/m³ \n", + "1 µg/m³ \n", + "2 µg/m³ \n", + "3 µg/m³ \n", + "4 µg/m³ " + ] + }, + "execution_count": 158, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality = pd.concat([air_quality_pm25, air_quality_no2])\n", + "air_quality.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `concat` function performs concatenation operations of multiple tables along one of the axis (row-wise or column-wise). By default concatenation is along axis 0, so the resulting table combines the rows of the input tables. Let's check the shape of the original and the concatenated tables to verify the operation:" + ] + }, + { + "cell_type": "code", + "execution_count": 159, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "((1110, 7), (2068, 7), (3178, 7))" + ] + }, + "execution_count": 159, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality_pm25.shape, air_quality_no2.shape, air_quality.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Hence, the resulting table has 3178 = 1110 + 2068 rows." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sorting the table on the datetime information also illustrates the combination of both tables, with the `parameter` column identifying the table each row came from (either `no2` from table `air_quality_no2` or `pm25` from table `air_quality_pm25`):" + ] + }, + { + "cell_type": "code", + "execution_count": 160, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycountrydate.utclocationparametervalueunit
2067LondonGB2019-05-07 01:00:00+00:00London Westminsterno223.0µg/m³
1003ParisFR2019-05-07 01:00:00+00:00FR04014no225.0µg/m³
100AntwerpenBE2019-05-07 01:00:00+00:00BETR801pm2512.5µg/m³
1098AntwerpenBE2019-05-07 01:00:00+00:00BETR801no250.5µg/m³
1109LondonGB2019-05-07 01:00:00+00:00London Westminsterpm258.0µg/m³
\n", + "
" + ], + "text/plain": [ + " city country date.utc location \\\n", + "2067 London GB 2019-05-07 01:00:00+00:00 London Westminster \n", + "1003 Paris FR 2019-05-07 01:00:00+00:00 FR04014 \n", + "100 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 \n", + "1098 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 \n", + "1109 London GB 2019-05-07 01:00:00+00:00 London Westminster \n", + "\n", + " parameter value unit \n", + "2067 no2 23.0 µg/m³ \n", + "1003 no2 25.0 µg/m³ \n", + "100 pm25 12.5 µg/m³ \n", + "1098 no2 50.5 µg/m³ \n", + "1109 pm25 8.0 µg/m³ " + ] + }, + "execution_count": 160, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality = air_quality.sort_values(\"date.utc\")\n", + "air_quality.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this specific example, the `parameter` column provided by the data ensures that each of the original tables can be identified. This is not always the case. the `concat` function provides a convenient solution with the `keys` argument, adding an additional (hierarchical) row index. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 161, + "metadata": {}, + "outputs": [], + "source": [ + "air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=[\"PM25\", \"NO2\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 162, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycountrydate.utclocationparametervalueunit
PM250AntwerpenBE2019-06-18 06:00:00+00:00BETR801pm2518.0µg/m³
1AntwerpenBE2019-06-17 08:00:00+00:00BETR801pm256.5µg/m³
2AntwerpenBE2019-06-17 07:00:00+00:00BETR801pm2518.5µg/m³
3AntwerpenBE2019-06-17 06:00:00+00:00BETR801pm2516.0µg/m³
4AntwerpenBE2019-06-17 05:00:00+00:00BETR801pm257.5µg/m³
\n", + "
" + ], + "text/plain": [ + " city country date.utc location parameter \\\n", + "PM25 0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 \n", + " 1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 \n", + " 2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 \n", + " 3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 \n", + " 4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 \n", + "\n", + " value unit \n", + "PM25 0 18.0 µg/m³ \n", + " 1 6.5 µg/m³ \n", + " 2 18.5 µg/m³ \n", + " 3 16.0 µg/m³ \n", + " 4 7.5 µg/m³ " + ] + }, + "execution_count": 162, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality_.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "__Note__: The existence of multiple row/column indices at the same time has not been mentioned within these tutorials. _Hierarchical indexing_ or _MultiIndex_ is an advanced and powerful Pandas feature to analyze higher-dimensional data. \n", + "\n", + "Multi-indexing is out of scope for this Pandas introduction. For the moment, remember that the function `reset_index` can be used to convert any level of an index to a column, e.g. `air_quality_.reset_index(level=0)`\n", + " \n", + "__To user guide:__ Feel free to dive into the world of multi-indexing at :ref:`advanced`\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__To user guide:__ More options on table concatenation (row and column wise) and how `concat` can be used to define the logic (union or intersection) of the indexes on the other axes \n", + "is provided at :ref:`merging.concat`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Join tables using a common identifier" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![](../schemas/08_merge_left.svg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Add the station coordinates, provided by the stations metadata table, to the corresponding rows in the measurements table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, extract the station location identifier and the coordinates from the `air_quality_stations` metadata table:" + ] + }, + { + "cell_type": "code", + "execution_count": 163, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
locationcoordinates.latitudecoordinates.longitude
0BELAL0151.2361954.385224
1BELHB2351.1703004.341005
2BELLD0151.1099785.004864
3BELLD0251.1203845.021546
4BELR83351.3276604.362261
\n", + "
" + ], + "text/plain": [ + " location coordinates.latitude coordinates.longitude\n", + "0 BELAL01 51.236195 4.385224\n", + "1 BELHB23 51.170300 4.341005\n", + "2 BELLD01 51.109978 5.004864\n", + "3 BELLD02 51.120384 5.021546\n", + "4 BELR833 51.327660 4.362261" + ] + }, + "execution_count": 163, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality_stations_coord = air_quality_stations[[\"location\", \"coordinates.latitude\", \"coordinates.longitude\"]]\n", + "air_quality_stations_coord.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__Note:__ The stations used in this example (FR04014, BETR801 and London Westminster) are just three entries enlisted in the metadata table. We only want to add the coordinates of these three to the measurements table, each on the corresponding rows of the `air_quality` table." + ] + }, + { + "cell_type": "code", + "execution_count": 168, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycountrydate.utclocationparametervalueunitcoordinates.latitudecoordinates.longitudeiddescriptionname
0LondonGB2019-05-07 01:00:00+00:00London Westminsterno223.0µg/m³51.494670-0.131931no2Nitrogen DioxideNO2
1ParisFR2019-05-07 01:00:00+00:00FR04014no225.0µg/m³48.8372422.393903no2Nitrogen DioxideNO2
2AntwerpenBE2019-05-07 01:00:00+00:00BETR801pm2512.5µg/m³51.2096634.431821pm25Particulate matter less than 2.5 micrometers i...PM2.5
3AntwerpenBE2019-05-07 01:00:00+00:00BETR801no250.5µg/m³51.2096634.431821no2Nitrogen DioxideNO2
4LondonGB2019-05-07 01:00:00+00:00London Westminsterpm258.0µg/m³51.494670-0.131931pm25Particulate matter less than 2.5 micrometers i...PM2.5
\n", + "
" + ], + "text/plain": [ + " city country date.utc location parameter \\\n", + "0 London GB 2019-05-07 01:00:00+00:00 London Westminster no2 \n", + "1 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 \n", + "2 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 pm25 \n", + "3 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 no2 \n", + "4 London GB 2019-05-07 01:00:00+00:00 London Westminster pm25 \n", + "\n", + " value unit coordinates.latitude coordinates.longitude id \\\n", + "0 23.0 µg/m³ 51.494670 -0.131931 no2 \n", + "1 25.0 µg/m³ 48.837242 2.393903 no2 \n", + "2 12.5 µg/m³ 51.209663 4.431821 pm25 \n", + "3 50.5 µg/m³ 51.209663 4.431821 no2 \n", + "4 8.0 µg/m³ 51.494670 -0.131931 pm25 \n", + "\n", + " description name \n", + "0 Nitrogen Dioxide NO2 \n", + "1 Nitrogen Dioxide NO2 \n", + "2 Particulate matter less than 2.5 micrometers i... PM2.5 \n", + "3 Nitrogen Dioxide NO2 \n", + "4 Particulate matter less than 2.5 micrometers i... PM2.5 " + ] + }, + "execution_count": 168, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 165, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycountrydate.utclocationparametervalueunitcoordinates.latitudecoordinates.longitude
0LondonGB2019-05-07 01:00:00+00:00London Westminsterno223.0µg/m³51.494670-0.131931
1ParisFR2019-05-07 01:00:00+00:00FR04014no225.0µg/m³48.8372422.393903
2AntwerpenBE2019-05-07 01:00:00+00:00BETR801pm2512.5µg/m³51.2096634.431821
3AntwerpenBE2019-05-07 01:00:00+00:00BETR801no250.5µg/m³51.2096634.431821
4LondonGB2019-05-07 01:00:00+00:00London Westminsterpm258.0µg/m³51.494670-0.131931
\n", + "
" + ], + "text/plain": [ + " city country date.utc location parameter \\\n", + "0 London GB 2019-05-07 01:00:00+00:00 London Westminster no2 \n", + "1 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 \n", + "2 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 pm25 \n", + "3 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 no2 \n", + "4 London GB 2019-05-07 01:00:00+00:00 London Westminster pm25 \n", + "\n", + " value unit coordinates.latitude coordinates.longitude \n", + "0 23.0 µg/m³ 51.494670 -0.131931 \n", + "1 25.0 µg/m³ 48.837242 2.393903 \n", + "2 12.5 µg/m³ 51.209663 4.431821 \n", + "3 50.5 µg/m³ 51.209663 4.431821 \n", + "4 8.0 µg/m³ 51.494670 -0.131931 " + ] + }, + "execution_count": 165, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality = pd.merge(air_quality, air_quality_stations_coord, \n", + " how='left', on='location')\n", + "air_quality.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using the `merge` function, for each of the rows in the `air_quality` table, the corresponding coordinates are added from the `air_quality_stations_coord` table. Both tables have the column `location` in common which is used as a key to combine the information. By choosing the `left` join, only the locations available in the `air_quality` (left) table, i.e. FR04014, BETR801 and London Westminster, end up in the resulting table. The `merge` function supports multiple join options similar to database-style operations. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Add the parameter full description and name, provided by the parameters metadata table, to the measurements table" + ] + }, + { + "cell_type": "code", + "execution_count": 169, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
iddescriptionname
0bcBlack CarbonBC
1coCarbon MonoxideCO
2no2Nitrogen DioxideNO2
3o3OzoneO3
4pm10Particulate matter less than 10 micrometers in...PM10
\n", + "
" + ], + "text/plain": [ + " id description name\n", + "0 bc Black Carbon BC\n", + "1 co Carbon Monoxide CO\n", + "2 no2 Nitrogen Dioxide NO2\n", + "3 o3 Ozone O3\n", + "4 pm10 Particulate matter less than 10 micrometers in... PM10" + ] + }, + "execution_count": 169, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality_parameters_name = air_quality_parameters[['id','description', 'name']]\n", + "air_quality_parameters_name.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 170, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycountrydate.utclocationparametervalueunitcoordinates.latitudecoordinates.longitudeid_xdescription_xname_xid_ydescription_yname_y
0LondonGB2019-05-07 01:00:00+00:00London Westminsterno223.0µg/m³51.494670-0.131931no2Nitrogen DioxideNO2no2Nitrogen DioxideNO2
1ParisFR2019-05-07 01:00:00+00:00FR04014no225.0µg/m³48.8372422.393903no2Nitrogen DioxideNO2no2Nitrogen DioxideNO2
2AntwerpenBE2019-05-07 01:00:00+00:00BETR801pm2512.5µg/m³51.2096634.431821pm25Particulate matter less than 2.5 micrometers i...PM2.5pm25Particulate matter less than 2.5 micrometers i...PM2.5
3AntwerpenBE2019-05-07 01:00:00+00:00BETR801no250.5µg/m³51.2096634.431821no2Nitrogen DioxideNO2no2Nitrogen DioxideNO2
4LondonGB2019-05-07 01:00:00+00:00London Westminsterpm258.0µg/m³51.494670-0.131931pm25Particulate matter less than 2.5 micrometers i...PM2.5pm25Particulate matter less than 2.5 micrometers i...PM2.5
\n", + "
" + ], + "text/plain": [ + " city country date.utc location parameter \\\n", + "0 London GB 2019-05-07 01:00:00+00:00 London Westminster no2 \n", + "1 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 \n", + "2 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 pm25 \n", + "3 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 no2 \n", + "4 London GB 2019-05-07 01:00:00+00:00 London Westminster pm25 \n", + "\n", + " value unit coordinates.latitude coordinates.longitude id_x \\\n", + "0 23.0 µg/m³ 51.494670 -0.131931 no2 \n", + "1 25.0 µg/m³ 48.837242 2.393903 no2 \n", + "2 12.5 µg/m³ 51.209663 4.431821 pm25 \n", + "3 50.5 µg/m³ 51.209663 4.431821 no2 \n", + "4 8.0 µg/m³ 51.494670 -0.131931 pm25 \n", + "\n", + " description_x name_x id_y \\\n", + "0 Nitrogen Dioxide NO2 no2 \n", + "1 Nitrogen Dioxide NO2 no2 \n", + "2 Particulate matter less than 2.5 micrometers i... PM2.5 pm25 \n", + "3 Nitrogen Dioxide NO2 no2 \n", + "4 Particulate matter less than 2.5 micrometers i... PM2.5 pm25 \n", + "\n", + " description_y name_y \n", + "0 Nitrogen Dioxide NO2 \n", + "1 Nitrogen Dioxide NO2 \n", + "2 Particulate matter less than 2.5 micrometers i... PM2.5 \n", + "3 Nitrogen Dioxide NO2 \n", + "4 Particulate matter less than 2.5 micrometers i... PM2.5 " + ] + }, + "execution_count": 170, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "air_quality = pd.merge(air_quality, air_quality_parameters_name, \n", + " how='left', left_on='parameter', right_on='id')\n", + "air_quality.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Compared to the previous example, there is no common column name. However, the `parameter` column in the `air_quality` table and the `id` column in the `air_quality_parameters_name` both provide the measured variable in a common format. The `left_on` and `right_on` arguments are used here (instead of just `on`) to make the link between the two tables. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " __To user guide:__ Pandas supports also inner, outer, and right joins. More information on join/merge of tables is provided in :ref:`merging.join`. Or have a look to the :ref:`comparison with SQL`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## REMEMBER\n", + "\n", + "- Mulitple tables can be concatentated both column as row wise using the `concat` function.\n", + "- For database-like merging/joining of tables, use the `merge` function. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__To user guide:__ The user guide provides more information on combining together data tables, see :ref:`merging`." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}