 {
  "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "[](https://mybinder.org/v2/gh/treehouse-projects/python-introducing-pandas/master?filepath=s2n5-handling-missing-and-duplicated-data.ipynb)"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},

    "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n",
    "requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)\n",
    "\n",
-   "# Perform the merge from the previous notebook (s2n4-combining-dataframes.ipynb)\n",
+   "# Perform the merge from the previous notebook (s2n6-combining-dataframes.ipynb)\n",
    "successful_requests = requests.merge(\n",
    "    transactions,\n",
    "    left_on=['from_user', 'to_user', 'amount'], \n",

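The hunk above is cut off before the merge call finishes, so here is a hedged, self-contained sketch of the same multi-column inner-merge idea. Only the `left_on` list comes from the notebook; the frames and the `sender`/`receiver` column names on the right-hand side are invented for illustration.

```python
import pandas as pd

# Invented stand-ins for the course's requests and transactions data
requests = pd.DataFrame({
    'from_user': ['ada', 'bob'],
    'to_user': ['bob', 'ada'],
    'amount': [10.0, 5.0],
})
transactions = pd.DataFrame({
    'sender': ['bob', 'eve'],
    'receiver': ['ada', 'ada'],
    'amount': [10.0, 7.0],
})

# The default how='inner' keeps only requests that line up with a transaction
successful_requests = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount'],
)
print(len(successful_requests))  # 1
```

Only the first request has a matching transaction on all three key columns, so one row survives the inner merge.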
   "source": [
    "## Duplicated Data\n",
    "\n",
-   "We realized in our the previous notebook (s2n4-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately this means that our **`successful_requests`** also contains duplicates because we merged those same values with a transaction, even though in actuality, only one of those duplicated requests should be deemed \"successful\".\n",
+   "We realized in the previous notebook (s2n6-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately, this means that our **`successful_requests`** also contains duplicates, because we merged each of those repeated requests with the same transaction, even though only one of them should be deemed \"successful\".\n",
    "\n",
-   "We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that. Like `duplicated` there is a `keep` parameter that works similarly, you tell it which of the duplicates to keep. "
+   "We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is the one that actually triggered the transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that.\n",
+   "\n",
+   "Like `duplicated`, `drop_duplicates` has a `keep` parameter that works similarly: you tell it which of the duplicates to keep."
   ]
  },
  {

   "source": [
    "# Let's get our records sorted chronologically\n",
    "successful_requests.sort_values('request_date', inplace=True) \n",
+   "\n",
    "# And then we'll drop dupes keeping only the last one. Note the call to inplace \n",
    "successful_requests.drop_duplicates(('from_user', 'to_user', 'amount'), keep='last', inplace=True)\n",
+   "\n",
    "# Statement from previous notebook\n",
    "\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n",
    "    successful_requests.amount.sum(),\n",

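The two `inplace` calls in that cell can be tried end to end on a toy frame. The column names mirror the notebook's `successful_requests`, but the rows below are invented:

```python
import pandas as pd

# Invented duplicated requests; same columns as the notebook's frame
successful_requests = pd.DataFrame({
    'from_user': ['ada', 'ada', 'bob'],
    'to_user': ['bob', 'bob', 'ada'],
    'amount': [10.0, 10.0, 5.0],
    'request_date': pd.to_datetime(['2018-02-01', '2018-01-01', '2018-03-01']),
})

# Sort chronologically so that keep='last' retains the most recent duplicate
successful_requests.sort_values('request_date', inplace=True)
successful_requests.drop_duplicates(
    ['from_user', 'to_user', 'amount'], keep='last', inplace=True)

print(len(successful_requests))  # 2
```

Note that `inplace=True` mutates the frame and returns `None`; assigning the result back (`df = df.drop_duplicates(...)`) is the more common modern style.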
   "source": [
    "## Locating Missing Data\n",
    "\n",
-   "As I was looking at these people who hadn't made requests I noticed that a few of them had a Not A Number (`np.nan`) for a **`last_name`**.\n",
+   "As I was looking at these people who hadn't made requests, I noticed that a few of them had a NaN (Not a Number, `np.nan`) for a **`last_name`**.\n",
    "\n",
    "We can get a quick overview of how many blank values we have by using the [`DataFrame.count`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html)\n"
   ]

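`DataFrame.count`, linked above, returns per-column counts of non-missing values, so columns with gaps stand out at a glance. A small invented example (the real `users` frame lives in the course data):

```python
import numpy as np
import pandas as pd

# Invented users frame with one missing last_name
users = pd.DataFrame({
    'first_name': ['Ada', 'Bob', 'Eve'],
    'last_name': ['Lovelace', np.nan, 'Moneypenny'],
})

# count() tallies non-missing values per column
print(users.count())  # first_name 3, last_name 2

# isna() builds a boolean mask we can filter with
print(users[users.last_name.isna()])
```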
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {

       "Index: []"
      ]
     },
-    "execution_count": 7,
+    "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Make a copy of the DataFrame with \"Unknown\" as the last name where it is missing\n",
    "users_with_unknown = users.fillna('Unknown')\n",
+   "\n",
    "# Make sure we got 'em all\n",
    "users_with_unknown[users_with_unknown.last_name.isna()]"
   ]

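`fillna` as used in that cell returns a new `DataFrame` rather than mutating in place, which is why the notebook binds the result to `users_with_unknown`. A toy version with invented names:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the course's users frame
users = pd.DataFrame({
    'first_name': ['Ada', 'Bob'],
    'last_name': ['Lovelace', np.nan],
})

# fillna returns a copy; the original frame keeps its NaN
users_with_unknown = users.fillna('Unknown')
print(users_with_unknown.last_name.tolist())  # ['Lovelace', 'Unknown']
print(int(users.last_name.isna().sum()))      # 1
```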
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {

       "(475, 430)"
      ]
     },
-    "execution_count": 9,
+    "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "users_with_last_names = users.dropna()\n",
+   "\n",
    "# Row counts of the original \n",
    "(len(users), len(users_with_last_names))"
   ]
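The `(475, 430)` output in that cell means 45 users lost a row to a missing value: with its defaults, `dropna` removes every row containing at least one NaN. A toy version (invented rows, so the counts differ from the course data):

```python
import numpy as np
import pandas as pd

# Invented users frame with one missing last_name
users = pd.DataFrame({
    'first_name': ['Ada', 'Bob', 'Eve'],
    'last_name': ['Lovelace', np.nan, 'Moneypenny'],
})

# dropna drops every row with at least one NaN and returns a copy
users_with_last_names = users.dropna()
print((len(users), len(users_with_last_names)))  # (3, 2)
```

`dropna(subset=['last_name'])` would restrict the missing-value check to that one column.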