Commit f87fb1d: "Review pass"

Author: craigsdennis
1 parent 90e58f4

File tree

1 file changed: 14 additions, 15 deletions

s2n7-handling-missing-and-duplicated-data.ipynb

Lines changed: 14 additions & 15 deletions
@@ -1,12 +1,5 @@
 {
  "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/treehouse-projects/python-introducing-pandas/master?filepath=s2n5-handling-missing-and-duplicated-data.ipynb)"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -44,7 +37,7 @@
    "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n",
    "requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)\n",
    "\n",
-   "# Perform the merge from the previous notebook (s2n4-combining-dataframes.ipynb)\n",
+   "# Perform the merge from the previous notebook (s2n6-combining-dataframes.ipynb)\n",
    "successful_requests = requests.merge(\n",
    "    transactions,\n",
    "    left_on=['from_user', 'to_user', 'amount'], \n",
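The merge in the hunk above keeps only the requests that have a matching transaction on all three key columns. A minimal sketch of that pattern with made-up toy data (the column and variable names follow the notebook; the rows are hypothetical):

```python
import pandas as pd

# Hypothetical toy data mirroring the notebook's column names
requests = pd.DataFrame({
    'from_user': ['ann', 'bob', 'ann'],
    'to_user':   ['bob', 'cai', 'cai'],
    'amount':    [10.0, 25.0, 5.0],
})
transactions = pd.DataFrame({
    'from_user': ['ann', 'bob'],
    'to_user':   ['bob', 'cai'],
    'amount':    [10.0, 25.0],
})

# Default inner merge on all three key columns: only requests that have a
# matching transaction survive, i.e. the "successful" requests
successful_requests = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['from_user', 'to_user', 'amount'],
)
print(len(successful_requests))  # 2 of the 3 requests matched a transaction
```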
@@ -64,9 +57,11 @@
   "source": [
    "## Duplicated Data\n",
    "\n",
-   "We realized in our the previous notebook (s2n4-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately this means that our **`successful_requests`** also contains duplicates because we merged those same values with a transaction, even though in actuality, only one of those duplicated requests should be deemed \"successful\".\n",
+   "We realized in the previous notebook (s2n6-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately this means that our **`successful_requests`** also contains duplicates, because we merged those same values with a transaction even though, in actuality, only one of those duplicated requests should be deemed \"successful\".\n",
    "\n",
-   "We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that. Like `duplicated` there is a `keep` parameter that works similarly, you tell it which of the duplicates to keep. "
+   "We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that.\n",
+   "\n",
+   "Like `duplicated`, there is a `keep` parameter that works similarly: you tell it which of the duplicates to keep."
   ]
  },
  {
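The `keep` parameter described in the cell above behaves the same way in `duplicated` and `drop_duplicates`. A quick illustration on a hypothetical toy frame (column names borrowed from the notebook, data invented):

```python
import pandas as pd

df = pd.DataFrame({
    'from_user': ['ann', 'ann', 'bob'],
    'to_user':   ['bob', 'bob', 'cai'],
    'amount':    [10.0, 10.0, 25.0],
})

# duplicated flags rows; with keep='last' every copy EXCEPT the last is True
mask = df.duplicated(['from_user', 'to_user', 'amount'], keep='last')
print(mask.tolist())   # [True, False, False]

# drop_duplicates removes the flagged rows, keeping only the last copy
deduped = df.drop_duplicates(['from_user', 'to_user', 'amount'], keep='last')
print(len(deduped))    # 2
```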
@@ -88,8 +83,10 @@
   "source": [
    "# Let's get our records sorted chronologically\n",
    "successful_requests.sort_values('request_date', inplace=True) \n",
+   "\n",
    "# And then we'll drop dupes keeping only the last one. Note the call to inplace \n",
    "successful_requests.drop_duplicates(('from_user', 'to_user', 'amount'), keep='last', inplace=True)\n",
+   "\n",
    "# Statement from previous notebook\n",
    "\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n",
    "    successful_requests.amount.sum(),\n",
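The cell above sorts chronologically first so that "last" means the most recent request when duplicates are dropped. A sketch of the same sort-then-dedupe pattern, using invented rows with the notebook's column names:

```python
import pandas as pd

# Hypothetical duplicated requests; only the dates differ
successful_requests = pd.DataFrame({
    'from_user':    ['ann', 'ann'],
    'to_user':      ['bob', 'bob'],
    'amount':       [10.0, 10.0],
    'request_date': ['2018-03-02', '2018-03-01'],
})

# Sort chronologically so the last duplicate is the most recent one
successful_requests.sort_values('request_date', inplace=True)

# Drop dupes on the key columns, keeping only the last (most recent) copy;
# inplace=True mutates the frame instead of returning a new one
successful_requests.drop_duplicates(
    ['from_user', 'to_user', 'amount'], keep='last', inplace=True)

print(successful_requests['request_date'].tolist())  # ['2018-03-02']
```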
@@ -363,7 +360,7 @@
   "source": [
    "## Locating Missing Data\n",
    "\n",
-   "As I was looking at these people who hadn't made requests I noticed that a few of them had a Not A Number (`np.nan`) for a **`last_name`**.\n",
+   "As I was looking at these people who hadn't made requests I noticed that a few of them had a NaN (Not A Number) for a **`last_name`**.\n",
    "\n",
    "We can get a quick overview of how many blank values we have by using the [`DataFrame.count`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html)\n"
   ]
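`DataFrame.count` tallies non-null values per column, so a column with missing entries reports a lower number than the row count. A minimal sketch with made-up names:

```python
import pandas as pd
import numpy as np

# Hypothetical users frame with two missing last names
users = pd.DataFrame({
    'first_name': ['Ann', 'Bob', 'Cai'],
    'last_name':  ['Ames', np.nan, np.nan],
})

# count() skips NaN, so the gap versus len(users) is the number of blanks
counts = users.count()
print(counts['first_name'], counts['last_name'])  # 3 1
```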
@@ -529,7 +526,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 7,
+  "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
@@ -573,14 +570,15 @@
      "Index: []"
     ]
    },
-   "execution_count": 7,
+   "execution_count": 9,
    "metadata": {},
    "output_type": "execute_result"
   }
  ],
  "source": [
   "# Make a copy of the DataFrame with \"Unknown\" as the last name where it is missing\n",
   "users_with_unknown = users.fillna('Unknown')\n",
+  "\n",
   "# Make sure we got 'em all\n",
   "users_with_unknown[users_with_unknown.last_name.isna()]"
  ]
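`fillna` returns a new DataFrame by default, which is why the cell above assigns the result rather than relying on mutation. A sketch of the fill-then-verify pattern on invented data:

```python
import pandas as pd
import numpy as np

# Hypothetical users frame with one missing last name
users = pd.DataFrame({
    'first_name': ['Ann', 'Bob'],
    'last_name':  ['Ames', np.nan],
})

# fillna returns a copy with every NaN replaced by the given value
users_with_unknown = users.fillna('Unknown')

# Make sure we got 'em all: filtering on isna() should return no rows
remaining = users_with_unknown[users_with_unknown.last_name.isna()]
print(len(remaining))  # 0
```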
@@ -598,7 +596,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 9,
+  "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
@@ -607,13 +605,14 @@
     "(475, 430)"
    ]
   },
-  "execution_count": 9,
+  "execution_count": 10,
   "metadata": {},
   "output_type": "execute_result"
  }
 ],
 "source": [
  "users_with_last_names = users.dropna()\n",
+ "\n",
  "# Row counts of the original \n",
  "(len(users), len(users_with_last_names))"
 ]
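`dropna` removes any row containing at least one missing value and returns a new frame, leaving the original untouched; comparing lengths shows how many rows were lost. A sketch with toy data (the notebook's actual counts were 475 and 430; these rows are invented):

```python
import pandas as pd
import numpy as np

# Hypothetical users frame with one missing last name
users = pd.DataFrame({
    'first_name': ['Ann', 'Bob', 'Cai'],
    'last_name':  ['Ames', np.nan, 'Cole'],
})

# dropna returns a copy without rows that contain any NaN
users_with_last_names = users.dropna()

# Row counts of the original versus the cleaned copy
print((len(users), len(users_with_last_names)))  # (3, 2)
```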
