diff --git a/notebooks/10_text_data.ipynb b/notebooks/10_text_data.ipynb
new file mode 100644
index 0000000..6df7f80
--- /dev/null
+++ b/notebooks/10_text_data.ipynb
@@ -0,0 +1,680 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Objectives\n",
+ "- Use string methods to manipulate textual data columns\n",
+ "- Use string methods to select rows of interest (boolean indexing)\n",
+ "- Use `replace` as a general method to map values\n",
+ "\n",
+ "Content to cover\n",
+ "- str.upper/…\n",
+ "- df[df[].str.contains()]\n",
+ "- replace\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 139,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 140,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " PassengerId Survived Pclass \\\n",
+ "0 1 0 3 \n",
+ "1 2 1 1 \n",
+ "2 3 1 3 \n",
+ "3 4 1 1 \n",
+ "4 5 0 3 \n",
+ "\n",
+ " Name Sex Age SibSp \\\n",
+ "0 Braund, Mr. Owen Harris male 22.0 1 \n",
+ "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
+ "2 Heikkinen, Miss. Laina female 26.0 0 \n",
+ "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
+ "4 Allen, Mr. William Henry male 35.0 0 \n",
+ "\n",
+ " Parch Ticket Fare Cabin Embarked \n",
+ "0 0 A/5 21171 7.2500 NaN S \n",
+ "1 0 PC 17599 71.2833 C85 C \n",
+ "2 0 STON/O2. 3101282 7.9250 NaN S \n",
+ "3 0 113803 53.1000 C123 S \n",
+ "4 0 373450 8.0500 NaN S "
+ ]
+ },
+ "execution_count": 140,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic = pd.read_csv(\"../data/titanic.csv\")\n",
+ "titanic.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Manipulate data columns with text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "> Make all name characters lower case"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 141,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 braund, mr. owen harris\n",
+ "1 cumings, mrs. john bradley (florence briggs th...\n",
+ "2 heikkinen, miss. laina\n",
+ "3 futrelle, mrs. jacques heath (lily may peel)\n",
+ "4 allen, mr. william henry\n",
+ " ... \n",
+ "886 montvila, rev. juozas\n",
+ "887 graham, miss. margaret edith\n",
+ "888 johnston, miss. catherine helen \"carrie\"\n",
+ "889 behr, mr. karl howell\n",
+ "890 dooley, mr. patrick\n",
+ "Name: Name, Length: 891, dtype: object"
+ ]
+ },
+ "execution_count": 141,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic[\"Name\"].str.lower()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Similar to datetime objects in the [time series tutorial](9_timeseries.ipynb) having a `dt` accessor, a number of specialized string methods are available when using the `str` accessor. These methods generally have the same names as the equivalent built-in string methods for single elements, but are applied element-wise (remember [element-wise calculations from tutorial 5?](5_add_columns.ipynb)) on each of the values of the column."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "> Create a new column 'Surname' that contains the surname of the Passengers by extracting the part before the comma."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 142,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 [Braund, Mr. Owen Harris]\n",
+ "1 [Cumings, Mrs. John Bradley (Florence Briggs ...\n",
+ "2 [Heikkinen, Miss. Laina]\n",
+ "3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]\n",
+ "4 [Allen, Mr. William Henry]\n",
+ " ... \n",
+ "886 [Montvila, Rev. Juozas]\n",
+ "887 [Graham, Miss. Margaret Edith]\n",
+ "888 [Johnston, Miss. Catherine Helen \"Carrie\"]\n",
+ "889 [Behr, Mr. Karl Howell]\n",
+ "890 [Dooley, Mr. Patrick]\n",
+ "Name: Name, Length: 891, dtype: object"
+ ]
+ },
+ "execution_count": 142,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic[\"Name\"].str.split(\",\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using the `split` method, each of the values is returned as a list of 2 elements. The first element is the part before the comma and the second element is the part after the comma."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 143,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 Braund\n",
+ "1 Cumings\n",
+ "2 Heikkinen\n",
+ "3 Futrelle\n",
+ "4 Allen\n",
+ " ... \n",
+ "886 Montvila\n",
+ "887 Graham\n",
+ "888 Johnston\n",
+ "889 Behr\n",
+ "890 Dooley\n",
+ "Name: Surname, Length: 891, dtype: object"
+ ]
+ },
+ "execution_count": 143,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic[\"Surname\"] = titanic[\"Name\"].str.split(\",\").str.get(0)\n",
+ "titanic[\"Surname\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As we are only interested in the first part representing the surname (element 0), we can again use the `str` accessor and apply `get` to extract the relevant part. Indeed, these string methods can be chained to combine multiple operations at once!"
+ ]
+ },
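+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a small sketch of such chaining (an illustration added here, assuming the `titanic` table loaded at the top of this notebook), one more string method, `upper`, could be appended to the same expression:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustration only (not part of the original exercise): chain a third\n",
+ "# string method to return the extracted surnames in upper case.\n",
+ "titanic[\"Name\"].str.split(\",\").str.get(0).str.upper()"
+ ]
+ },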
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__To user guide:__ More information on extracting parts of strings is available in :ref:`text.split`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "> Extract the passenger data about the Countess on board of the Titanic. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 144,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 False\n",
+ "1 False\n",
+ "2 False\n",
+ "3 False\n",
+ "4 False\n",
+ " ... \n",
+ "886 False\n",
+ "887 False\n",
+ "888 False\n",
+ "889 False\n",
+ "890 False\n",
+ "Name: Name, Length: 891, dtype: bool"
+ ]
+ },
+ "execution_count": 144,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic[\"Name\"].str.contains(\"Countess\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 145,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " PassengerId Survived Pclass \\\n",
+ "759 760 1 1 \n",
+ "\n",
+ " Name Sex Age SibSp \\\n",
+ "759 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.0 0 \n",
+ "\n",
+ " Parch Ticket Fare Cabin Embarked Surname \n",
+ "759 0 110152 86.5 B77 S Rothes "
+ ]
+ },
+ "execution_count": 145,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic[titanic[\"Name\"].str.contains(\"Countess\")]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "(_Interested in her story? See [Wikipedia](https://en.wikipedia.org/wiki/No%C3%ABl_Leslie,_Countess_of_Rothes)!_)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The string method `contains` checks for each of the values in the column `Name` if the string contains the word `Countess` and returns for each of the values `True` (`Countess` is part of the name) or `False` (`Countess` is not part of the name). This output can be used to subselect the data using conditional (boolean) indexing introduced in the [subsetting of data tutorial](3_subset_data.ipynb). As there was only 1 Countess on the Titanic, we get one row as a result."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__Note__: More powerful extractions on strings are supported, as the `contains` and `extract` methods accept [regular expressions](https://docs.python.org/3/library/re.html), but these are out of scope for this tutorial."
+ ]
+ },
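+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Although regular expressions are not covered in this tutorial, a brief sketch may help to show what the note refers to. The two patterns below are illustrative assumptions chosen for the `Name` column of the `titanic` table, not part of the original exercises. The first selects rows whose name contains either of two titles, the second captures the text between the comma and the next period."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrative pattern: '|' means 'or', '\\.' matches a literal period\n",
+ "titanic[titanic[\"Name\"].str.contains(r\"Countess|Rev\\.\")]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrative pattern: capture everything after the comma up to the\n",
+ "# first period; expand=False returns a Series instead of a DataFrame\n",
+ "titanic[\"Name\"].str.extract(r\",\\s*([^.]+)\\.\", expand=False)"
+ ]
+ },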
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__To user guide:__ More information on extracting parts of strings is available in :ref:`text.extract`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "> Which passenger of the Titanic has the longest name?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 146,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 23\n",
+ "1 51\n",
+ "2 22\n",
+ "3 44\n",
+ "4 24\n",
+ " ..\n",
+ "886 21\n",
+ "887 28\n",
+ "888 40\n",
+ "889 21\n",
+ "890 19\n",
+ "Name: Name, Length: 891, dtype: int64"
+ ]
+ },
+ "execution_count": 146,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic[\"Name\"].str.len()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To get the longest name we first have to get the lengths of each of the names in the `Name` column. By using pandas string methods, the `len` function is applied to each of the names individually (element-wise)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 147,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "307"
+ ]
+ },
+ "execution_count": 147,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic[\"Name\"].str.len().idxmax()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we need to get the corresponding location, preferably the index name, in the table for which the name length is the largest. The `idxmax` method does exactly that. It is not a string method and is applied to integers, so no `str` is used."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 148,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'"
+ ]
+ },
+ "execution_count": 148,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic.loc[titanic[\"Name\"].str.len().idxmax(), \"Name\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Based on the index name of the row (`307`) and the column (`Name`), we can do a selection using the `loc` operator, introduced in the [tutorial on subsetting](3_subset_data.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "> In the 'Sex' column, replace all 'male' values by 'M' and all 'female' values by 'F'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 149,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 M\n",
+ "1 F\n",
+ "2 F\n",
+ "3 F\n",
+ "4 M\n",
+ " ..\n",
+ "886 M\n",
+ "887 F\n",
+ "888 F\n",
+ "889 M\n",
+ "890 M\n",
+ "Name: Sex_short, Length: 891, dtype: object"
+ ]
+ },
+ "execution_count": 149,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "titanic[\"Sex_short\"] = titanic[\"Sex\"].replace({\"male\": \"M\", \n",
+ " \"female\": \"F\"})\n",
+ "titanic[\"Sex_short\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Although `replace` is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. In the form used here, a dictionary defines the mapping `{from : to}`."
+ ]
+ },
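+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of how such a mapping treats values that are not listed in the dictionary. The small Series and its `unknown` entry are hypothetical and only serve as an illustration:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hypothetical example values, not taken from the Titanic data\n",
+ "sex_example = pd.Series([\"male\", \"female\", \"unknown\"])\n",
+ "# Values without an entry in the dictionary ('unknown') are left unchanged\n",
+ "sex_example.replace({\"male\": \"M\", \"female\": \"F\"})"
+ ]
+ },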
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__Note__: There is also a `str.replace` method available to replace a specific set of characters. However, when having a mapping of multiple values, this would become:\n",
+ "\n",
+ "    titanic[\"Sex_short\"] = titanic[\"Sex\"].str.replace(\"female\", \"F\")\n",
+ "    titanic[\"Sex_short\"] = titanic[\"Sex_short\"].str.replace(\"male\", \"M\")\n",
+ "\n",
+ "This would become cumbersome and easily lead to mistakes. Just think (or try out) yourself what would happen if those two statements are applied in the opposite order..."
+ ]
+ },
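+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For readers who want to check their answer, the opposite order can be tried on a small stand-in Series (a hypothetical example, not the actual `Sex` column). Because `str.replace` also matches inside longer strings, the result for `female` is probably not what was intended:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hypothetical two-value Series standing in for the 'Sex' column\n",
+ "sex_example = pd.Series([\"male\", \"female\"])\n",
+ "# Opposite order: 'male' is replaced first, also inside 'female',\n",
+ "# so 'female' becomes 'feM' and the second replacement finds nothing.\n",
+ "sex_example.str.replace(\"male\", \"M\").str.replace(\"female\", \"F\")"
+ ]
+ },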
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## REMEMBER\n",
+ "\n",
+ "- String methods are available using the `str` accessor.\n",
+ "- String methods work element-wise and can be used for conditional indexing.\n",
+ "- The `replace` method is a convenient method to convert values according to a given dictionary."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__To user guide:__ More information on string methods is given in :ref:`text`."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}