diff --git a/notebooks/10_text_data.ipynb b/notebooks/10_text_data.ipynb new file mode 100644 index 0000000..6df7f80 --- /dev/null +++ b/notebooks/10_text_data.ipynb @@ -0,0 +1,680 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Objectives\n", + "- Use string methods to manipulate textual data columns\n", + "- Use string methods to select rows of interest (boolean indexing)\n", + "- replace as general function for mappings\n", + "\n", + "Content to cover\n", + "- str.upper/…\n", + "- df[df[].str.contains()]\n", + "- replace\n" + ] + }, + { + "cell_type": "code", + "execution_count": 139, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": 140, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "0 1 0 3 \n", + "1 2 1 1 \n", + "2 3 1 3 \n", + "3 4 1 1 \n", + "4 5 0 3 \n", + "\n", + " Name Sex Age SibSp \\\n", + "0 Braund, Mr. Owen Harris male 22.0 1 \n", + "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", + "2 Heikkinen, Miss. Laina female 26.0 0 \n", + "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", + "4 Allen, Mr. William Henry male 35.0 0 \n", + "\n", + " Parch Ticket Fare Cabin Embarked \n", + "0 0 A/5 21171 7.2500 NaN S \n", + "1 0 PC 17599 71.2833 C85 C \n", + "2 0 STON/O2. 3101282 7.9250 NaN S \n", + "3 0 113803 53.1000 C123 S \n", + "4 0 373450 8.0500 NaN S " + ] + }, + "execution_count": 140, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic = pd.read_csv(\"../data/titanic.csv\")\n", + "titanic.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Manipulate data columns with text" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Make all name characters lower case" + ] + }, + { + "cell_type": "code", + "execution_count": 141, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 braund, mr. owen harris\n", + "1 cumings, mrs. john bradley (florence briggs th...\n", + "2 heikkinen, miss. laina\n", + "3 futrelle, mrs. jacques heath (lily may peel)\n", + "4 allen, mr. william henry\n", + " ... \n", + "886 montvila, rev. juozas\n", + "887 graham, miss. margaret edith\n", + "888 johnston, miss. catherine helen \"carrie\"\n", + "889 behr, mr. karl howell\n", + "890 dooley, mr. 
patrick\n", + "Name: Name, Length: 891, dtype: object" + ] + }, + "execution_count": 141, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic[\"Name\"].str.lower()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Just as datetime columns have a `dt` accessor (see the [time series tutorial](9_timeseries.ipynb)), a number of specialized string methods are available through the `str` accessor. These methods generally have the same names as the equivalent built-in string methods for single elements, but are applied element-wise (remember [element-wise calculations from tutorial 5?](5_add_columns.ipynb)) to each of the values of the column. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Create a new column 'Surname' that contains the surname of the passengers by extracting the part before the comma." + ] + }, + { + "cell_type": "code", + "execution_count": 142, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 [Braund, Mr. Owen Harris]\n", + "1 [Cumings, Mrs. John Bradley (Florence Briggs ...\n", + "2 [Heikkinen, Miss. Laina]\n", + "3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]\n", + "4 [Allen, Mr. William Henry]\n", + " ... \n", + "886 [Montvila, Rev. Juozas]\n", + "887 [Graham, Miss. Margaret Edith]\n", + "888 [Johnston, Miss. Catherine Helen \"Carrie\"]\n", + "889 [Behr, Mr. Karl Howell]\n", + "890 [Dooley, Mr. Patrick]\n", + "Name: Name, Length: 891, dtype: object" + ] + }, + "execution_count": 142, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic[\"Name\"].str.split(\",\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using the `split` method, each of the values is returned as a list of 2 elements. The first element is the part before the comma and the second element is the part after the comma. 
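
The two-element result holds for names with exactly one comma. As a minimal sketch on made-up names (these are not taken from the Titanic data, and the variable names are only illustrative), the list length can be checked with `str.len`, and the `n` argument of `split` can cap the number of splits:

```python
import pandas as pd

# Hypothetical names for illustration; the second one contains two commas.
names = pd.Series(['Braund, Mr. Owen Harris', 'Smith, Dr. John, Jr.'])

# Default split: one list element per comma-separated part.
default_lengths = names.str.split(',').str.len().tolist()

# n=1 splits on the first comma only, so at most two parts remain.
limited_lengths = names.str.split(',', n=1).str.len().tolist()

print(default_lengths)  # [2, 3]
print(limited_lengths)  # [2, 2]
```

Whether such a cap is needed depends on the data at hand; here it only illustrates what the list length depends on.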
" + ] + }, + { + "cell_type": "code", + "execution_count": 143, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 Braund\n", + "1 Cumings\n", + "2 Heikkinen\n", + "3 Futrelle\n", + "4 Allen\n", + " ... \n", + "886 Montvila\n", + "887 Graham\n", + "888 Johnston\n", + "889 Behr\n", + "890 Dooley\n", + "Name: Surname, Length: 891, dtype: object" + ] + }, + "execution_count": 143, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic[\"Surname\"] = titanic[\"Name\"].str.split(\",\").str.get(0)\n", + "titanic[\"Surname\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we are only interested in the first part representing the surname (element 0), we can again use the `str` accessor and apply `get` to extract the relevant part. Indeed, these string functions can be concatenated to combine multiple functions at once!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__To user guide:__ More information on extracting parts of strings is available in :ref:`text.split`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Extract the passenger data about the Countess on board of the Titanic. " + ] + }, + { + "cell_type": "code", + "execution_count": 144, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "3 False\n", + "4 False\n", + " ... \n", + "886 False\n", + "887 False\n", + "888 False\n", + "889 False\n", + "890 False\n", + "Name: Name, Length: 891, dtype: bool" + ] + }, + "execution_count": 144, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic[\"Name\"].str.contains(\"Countess\")" + ] + }, + { + "cell_type": "code", + "execution_count": 145, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "759 760 1 1 \n", + "\n", + " Name Sex Age SibSp \\\n", + "759 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.0 0 \n", + "\n", + " Parch Ticket Fare Cabin Embarked Surname \n", + "759 0 110152 86.5 B77 S Rothes " + ] + }, + "execution_count": 145, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic[titanic[\"Name\"].str.contains(\"Countess\")]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(_Interested in her story? See [Wikipedia](https://en.wikipedia.org/wiki/No%C3%ABl_Leslie,_Countess_of_Rothes)!_)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The string method `contains` checks for each of the values in the column `Name` if the string contains the word `Countess` and returns for each of the values `True` (`Countess` is part of the name) of `False` (`Countess` is notpart of the name). This output can be used to subselect the data using conditional (boolean) indexing introduced in the [subsetting of data tutorial](subset_data.ipynb). As there was only 1 Countess on the Titanic, we get one row as a result." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "__Note__: More powerful extractions on strings are supported, as the `contains` and `extract` methods accept [regular expressions](https://docs.python.org/3/library/re.html), but these are out of scope for this tutorial.\n", + "\n", + "
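
For the curious, a minimal sketch of what a regular expression adds. The sample names are made up to mirror the format of the `Name` column (the Countess entry is shortened), and the pattern is just one possible choice, not a recommended recipe:

```python
import pandas as pd

# Hypothetical sample mirroring the format of the Name column.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of',
])

# Alternation: True where either of the two words occurs.
is_titled = names.str.contains('Countess|Miss')

# One capture group: the text between the comma and the first period.
title = names.str.extract(', *([^.]+)[.]', expand=False)

print(is_titled.tolist())  # [False, True, True]
print(title.tolist())      # ['Mr', 'Miss', 'the Countess']
```

With a single capture group and `expand=False`, `extract` returns a Series, so the result can be assigned to a new column directly.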
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__To user guide:__ More information on extracting parts of strings is available in :ref:`text.extract`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Which passenger of the titanic has the longest name?" + ] + }, + { + "cell_type": "code", + "execution_count": 146, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 23\n", + "1 51\n", + "2 22\n", + "3 44\n", + "4 24\n", + " ..\n", + "886 21\n", + "887 28\n", + "888 40\n", + "889 21\n", + "890 19\n", + "Name: Name, Length: 891, dtype: int64" + ] + }, + "execution_count": 146, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic[\"Name\"].str.len()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To get the longest name we first have to get the lenghts of each of the names in the `Name` column. By using Pandas string methods, the `len` function is applied to each of the names individually (element-wise). " + ] + }, + { + "cell_type": "code", + "execution_count": 147, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "307" + ] + }, + "execution_count": 147, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic[\"Name\"].str.len().idxmax()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we need to get the corresponding location, prefereably the index name, in the table for which the name length is the largest. The `idxmax` method does exactly that. It is not a string method and is applied to integers, so no `str` is used." + ] + }, + { + "cell_type": "code", + "execution_count": 148, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Penasco y Castellana, Mrs. 
Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'" + ] + }, + "execution_count": 148, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic.loc[titanic[\"Name\"].str.len().idxmax(), \"Name\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Based on the index name of the row (`307`) and the column (`Name`), we can do a selection using the `loc` operator, introduced in the [tutorial on subsetting](3_subset_data.ipynb)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> In the 'Sex' column, replace values of 'male' by 'M' and all 'female' values by 'F'" + ] + }, + { + "cell_type": "code", + "execution_count": 149, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 M\n", + "1 F\n", + "2 F\n", + "3 F\n", + "4 M\n", + " ..\n", + "886 M\n", + "887 F\n", + "888 F\n", + "889 M\n", + "890 M\n", + "Name: Sex_short, Length: 891, dtype: object" + ] + }, + "execution_count": 149, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic[\"Sex_short\"] = titanic[\"Sex\"].replace({\"male\": \"M\", \n", + " \"female\": \"F\"})\n", + "titanic[\"Sex_short\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Although `replace` is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. Here, a `dictionary` defines the mapping `{from : to}`. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "__Note__: There is also a `str.replace` method available to replace a specific set of characters. However, with a mapping of multiple values, this would become:\n", + "\n", + " titanic[\"Sex_short\"] = titanic[\"Sex\"].str.replace(\"female\", \"F\")\n", + " titanic[\"Sex_short\"] = titanic[\"Sex_short\"].str.replace(\"male\", \"M\")\n", + "\n", + "This becomes cumbersome and easily leads to mistakes. Just think about (or try out) yourself what would happen if those two statements were applied in the opposite order...\n", + "\n", + "
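
For reference, the opposite order can be sketched on a small made-up Series (a stand-in for the `Sex` column, not the actual data), next to the dictionary-based `replace`:

```python
import pandas as pd

# Hypothetical stand-in for the Sex column.
sex = pd.Series(['male', 'female', 'female', 'male'])

# Substring replacement with 'male' first: the 'male' inside 'female' is hit too,
# so the second call no longer finds anything to replace.
wrong = sex.str.replace('male', 'M').str.replace('female', 'F')

# Whole-value mapping: each value is looked up as a whole, so order does not matter.
right = sex.replace({'male': 'M', 'female': 'F'})

print(wrong.tolist())  # ['M', 'feM', 'feM', 'M']
print(right.tolist())  # ['M', 'F', 'F', 'M']
```

The difference is that `str.replace` works on parts of each string, whereas `replace` with a dictionary matches complete values.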
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## REMEMBER\n", + "\n", + "- String methods are available using the `str` accessor.\n", + "- String methods work element wise and can be used for conditional indexing.\n", + "- The `replace` method is a convenient method to convert values according to a given dictionary." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__To user guide:__ More information on string methods is given in :ref:`text`." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}