Enhancement Add max_level param to json_normalize #26876

Merged
Changes from all commits (77 commits)
cb53be7
ENH add max_level and ignore_keys configuration to nested_to_records
bhavaniravi Nov 22, 2018
0972746
ENH extend max_level and ignore keys to
bhavaniravi Nov 22, 2018
5a5c708
fix pep8 issues
bhavaniravi Nov 22, 2018
be7ec0e
add whatsnew to doc string
bhavaniravi Nov 22, 2018
a79e126
add testcase with large max_level
bhavaniravi Nov 23, 2018
cd12a23
add explation for flatten if condition
bhavaniravi Nov 23, 2018
d3b3503
update doc_string and built documentation
bhavaniravi Nov 23, 2018
4ec60bc
fix json normalize records path issue
bhavaniravi Nov 27, 2018
e001264
Merge branch 'master' into enhanced_json_normalize
bhavaniravi Nov 27, 2018
5c88339
Merge branch 'master' of git://github.com/pandas-dev/pandas into json…
bhavaniravi Dec 30, 2018
55f7b1c
fix merge conflict
bhavaniravi Jan 3, 2019
1af2bfc
fix testcase error
bhavaniravi Jan 3, 2019
882a2ca
add nested flattening to json_normalize
bhavaniravi Jan 3, 2019
caba6db
fixed pep8 issues
bhavaniravi Jan 3, 2019
4e22c69
fix merge conflict
bhavaniravi Jan 3, 2019
c2eff85
fix issues with doc string
bhavaniravi Jan 4, 2019
247124f
modify test case to paramaetized
bhavaniravi Jan 4, 2019
ab15869
fix issues with pep8
bhavaniravi Jan 10, 2019
26bf967
fix pep8 build fail
bhavaniravi Jan 16, 2019
fca2a27
fix testcase failure, inconsistent column order
bhavaniravi Feb 5, 2019
7a58456
fix documentation issues
bhavaniravi Mar 19, 2019
f3d25e3
fix merge conflicts with upstream
bhavaniravi Mar 19, 2019
7a1297d
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 20, 2019
177c750
fix testcase failure np.nan converted into str on line 328
bhavaniravi Apr 20, 2019
cb82bca
remove get_pip file
bhavaniravi Apr 20, 2019
2a7b966
rename test func test_max_level_with_record_prefix
bhavaniravi Apr 20, 2019
4635591
fix pep8 over-intended line
bhavaniravi Apr 21, 2019
22fd84e
fix docstring formatting issues
bhavaniravi Apr 21, 2019
2e407e3
convert to a fixture
bhavaniravi Apr 21, 2019
cf27cae
convert to inline data
bhavaniravi Apr 21, 2019
124fbd9
fix docstring formatting issues
bhavaniravi Apr 21, 2019
7b65999
fix docstring formatting issues
bhavaniravi Apr 21, 2019
03d3d23
add github issue id to test case
bhavaniravi Apr 22, 2019
8e61a04
fix pep8 flake issues
bhavaniravi Apr 22, 2019
b808d5a
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 22, 2019
0eaea30
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 23, 2019
837ba18
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 26, 2019
217d4ae
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 30, 2019
b2fc133
seperate ignore keys and max level implementation
bhavaniravi Jun 16, 2019
acf1137
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Jun 16, 2019
ff30152
fix errors on validating doc strings
bhavaniravi Jun 16, 2019
fa2ecee
fix flake8 issues on test cases
bhavaniravi Jun 16, 2019
aed2db5
remove unwanted file
bhavaniravi Jun 16, 2019
f5dacd6
move fixture to the calling function
bhavaniravi Jun 17, 2019
33e2504
fix pep8 issue
bhavaniravi Jun 17, 2019
699d696
type annotated methods
bhavaniravi Jun 17, 2019
a91f27a
update docs based on review comments
bhavaniravi Jun 17, 2019
0a04cdb
fix flake8 issue E252
bhavaniravi Jun 19, 2019
e4e586d
add issue id to testcases
bhavaniravi Jun 19, 2019
62a35db
fix syntax issue with respect to typing in 3.5
bhavaniravi Jun 19, 2019
d113401
fix docstring and typing issue
bhavaniravi Jun 20, 2019
bfa62cf
fix typing linting issue
bhavaniravi Jun 20, 2019
53b6bcb
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Jun 20, 2019
2bc829b
updated typing from defaults to typing specific structures
bhavaniravi Jun 23, 2019
d6a7cc7
merged max_level testcases into a parametrized tests
bhavaniravi Jun 23, 2019
a69ad2b
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Jun 23, 2019
b0133d2
fix documentation issue
bhavaniravi Jun 23, 2019
1564d49
updated typing from defaults to typing specific structures
bhavaniravi Jun 23, 2019
463adc7
move expected data as parameters of test
bhavaniravi Jun 23, 2019
bbf894a
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Jun 23, 2019
775472e
checks fix and parameteize test case
bhavaniravi Jun 25, 2019
f8f550a
fix pep8 spacing issue
bhavaniravi Jun 25, 2019
4ddf0cc
resolve merge conflict
bhavaniravi Jun 25, 2019
9f2d356
fix import sort issue
bhavaniravi Jun 25, 2019
f3ff665
fix type checking and documentation error
bhavaniravi Jun 26, 2019
20432ad
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Jun 26, 2019
311b898
parametrize none test case, generalized typing
bhavaniravi Jun 27, 2019
7850db7
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Jun 27, 2019
676c7f1
remove test.json
bhavaniravi Jun 27, 2019
e2796d4
split large_max_level test case into a seperate function
bhavaniravi Jun 28, 2019
69a0d43
add doc string
bhavaniravi Jun 28, 2019
3a80a4d
add docs to io.rst and whatsnew
bhavaniravi Jun 28, 2019
c288c2e
add missing import
bhavaniravi Jun 28, 2019
f96d8fb
fix typing issue and docs update
bhavaniravi Jun 29, 2019
3ec85bf
fix typing check issues
bhavaniravi Jun 29, 2019
ba1d983
liniting issues fix in documentation
bhavaniravi Jun 29, 2019
4b754a0
fix liniting issue
bhavaniravi Jun 29, 2019
13 changes: 13 additions & 0 deletions doc/source/user_guide/io.rst
@@ -2186,6 +2186,19 @@ into a flat table.

json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

The ``max_level`` parameter provides more control over the level at which normalization stops.
With ``max_level=1`` the following snippet normalizes only down to the first nesting level of the provided dict.

.. ipython:: python

data = [{'CreatedBy': {'Name': 'User001'},
'Lookup': {'TextField': 'Some text',
'UserField': {'Id': 'ID001',
'Name': 'Name001'}},
'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
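For reference, the behaviour described by the snippet above in plain runnable form, using the public ``pandas.json_normalize`` entry point (exposed as ``pandas.io.json.json_normalize`` at the time of this PR); a sketch against a pandas version that ships this feature:

```python
import pandas as pd

data = [{'CreatedBy': {'Name': 'User001'},
         'Lookup': {'TextField': 'Some text',
                    'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
         'Image': {'a': 'b'}}]

# max_level=1 flattens only the first nesting level: the doubly nested
# 'UserField' dict is kept whole as a single 'Lookup.UserField' column
# instead of being expanded into 'Lookup.UserField.Id' / '...Name'.
df = pd.json_normalize(data, max_level=1)
print(sorted(df.columns))
# ['CreatedBy.Name', 'Image.a', 'Lookup.TextField', 'Lookup.UserField']
```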

.. _io.jsonl:

Line delimited json
23 changes: 23 additions & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -106,6 +106,29 @@ the output will truncate, if it's wider than :attr:`options.display.width`
(default: 80 characters).


.. _whatsnew_0250.enhancements.json_normalize_with_max_level:

Json normalize with max_level param support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`json_normalize` normalizes the provided input dict to all
nested levels. The new ``max_level`` parameter provides more control
over the level at which normalization stops (:issue:`23843`).
For example:

.. ipython:: python

from pandas.io.json import json_normalize
data = [{
'CreatedBy': {'Name': 'User001'},
'Lookup': {'TextField': 'Some text',
'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)


.. _whatsnew_0250.enhancements.other:

Other Enhancements
147 changes: 98 additions & 49 deletions pandas/io/json/normalize.py
@@ -3,6 +3,7 @@

from collections import defaultdict
import copy
from typing import DefaultDict, Dict, List, Optional, Union

import numpy as np

@@ -25,9 +26,11 @@ def _convert_to_line_delimits(s):
return convert_json_to_lines(s)


def nested_to_record(ds, prefix="", sep=".", level=0):
def nested_to_record(ds, prefix: str = "",
sep: str = ".", level: int = 0,
max_level: Optional[int] = None):
"""
A simplified json_normalize.
A simplified json_normalize

Converts a nested dict into a flat dict ("record"), unlike json_normalize,
it does not attempt to extract a subset of the data.
@@ -36,13 +39,19 @@ def nested_to_record(ds, prefix="", sep=".", level=0):
----------
ds : dict or list of dicts
prefix: the prefix, optional, default: ""
sep : string, default '.'
sep : str, default '.'
Nested records will generate names separated by sep,
e.g., for sep='.', { 'foo' : { 'bar' : 0 } } -> foo.bar

.. versionadded:: 0.20.0

level: the number of levels in the jason string, optional, default: 0
level: int, optional, default: 0
The number of levels in the json string.

max_level: int, optional, default: None
The max depth to normalize.

.. versionadded:: 0.25.0

Returns
-------
@@ -65,10 +74,8 @@ def nested_to_record(ds, prefix="", sep=".", level=0):
if isinstance(ds, dict):
ds = [ds]
singleton = True

new_ds = []
for d in ds:

new_d = copy.deepcopy(d)
for k, v in d.items():
# each key gets renamed with prefix
@@ -79,62 +86,79 @@ def nested_to_record(ds, prefix="", sep=".", level=0):
else:
newkey = prefix + sep + k

# flatten if type is dict and
# current dict level < maximum level provided and
# only dicts gets recurse-flattened
# only at level>1 do we rename the rest of the keys
if not isinstance(v, dict):
if (not isinstance(v, dict) or
(max_level is not None and level >= max_level)):
if level != 0: # so we skip copying for top level, common case
v = new_d.pop(k)
new_d[newkey] = v
continue
else:
v = new_d.pop(k)
new_d.update(nested_to_record(v, newkey, sep, level + 1))
new_d.update(nested_to_record(v, newkey, sep, level + 1,
max_level))
new_ds.append(new_d)

if singleton:
return new_ds[0]
return new_ds
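The recursion above can be summarized in a dependency-free sketch (simplified relative to the patch: no list handling, no deep copies, and key names are assumed to already be strings):

```python
from typing import Optional

def flatten(d: dict, prefix: str = "", sep: str = ".",
            level: int = 0, max_level: Optional[int] = None) -> dict:
    """Sketch of nested_to_record's max_level recursion."""
    out = {}
    for k, v in d.items():
        # Top-level keys keep their name; deeper keys get the dotted prefix.
        newkey = k if level == 0 else prefix + sep + k
        # Stop flattening when the value is not a dict, or when the
        # current depth has reached the max_level cap.
        if not isinstance(v, dict) or (max_level is not None
                                       and level >= max_level):
            out[newkey] = v
        else:
            out.update(flatten(v, newkey, sep, level + 1, max_level))
    return out

print(flatten({'a': {'b': {'c': 1}}, 'x': 2}, max_level=1))
# {'a.b': {'c': 1}, 'x': 2}
```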


def json_normalize(data, record_path=None, meta=None,
meta_prefix=None,
record_prefix=None,
errors='raise',
sep='.'):
def json_normalize(data: List[Dict],
record_path: Optional[Union[str, List]] = None,
meta: Optional[Union[str, List]] = None,
meta_prefix: Optional[str] = None,
record_prefix: Optional[str] = None,
errors: Optional[str] = 'raise',
sep: str = '.',
max_level: Optional[int] = None):
"""
Normalize semi-structured JSON data into a flat table.

Parameters
----------
data : dict or list of dicts
Unserialized JSON objects
record_path : string or list of strings, default None
Unserialized JSON objects.
record_path : str or list of str, default None
Path in each object to list of records. If not passed, data will be
assumed to be an array of records
meta : list of paths (string or list of strings), default None
Fields to use as metadata for each record in resulting table
meta_prefix : string, default None
record_prefix : string, default None
assumed to be an array of records.
meta : list of paths (str or list of str), default None
Fields to use as metadata for each record in resulting table.
meta_prefix : str, default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if
path to records is ['foo', 'bar']
meta is ['foo', 'bar'].
record_prefix : str, default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if
path to records is ['foo', 'bar'].
errors : {'raise', 'ignore'}, default 'raise'
Configures error handling.

* 'ignore' : will ignore KeyError if keys listed in meta are not
always present
always present.
* 'raise' : will raise KeyError if keys listed in meta are not
always present
always present.

.. versionadded:: 0.20.0

sep : string, default '.'
Nested records will generate names separated by sep,
e.g., for sep='.', { 'foo' : { 'bar' : 0 } } -> foo.bar
sep : str, default '.'
Nested records will generate names separated by sep.
e.g., for sep='.', {'foo': {'bar': 0}} -> foo.bar.

.. versionadded:: 0.20.0

max_level : int, default None
Max number of levels (depth of dict) to normalize.
If None, normalizes all levels.

.. versionadded:: 0.25.0

Returns
-------
frame : DataFrame
Normalize semi-structured JSON data into a flat table.

Examples
--------
@@ -149,36 +173,62 @@ def json_normalize(data, record_path=None, meta=None,
1 NaN NaN Regner NaN Mose NaN
2 2.0 Faye Raker NaN NaN NaN NaN

>>> data = [{'id': 1,
... 'name': "Cole Volk",
... 'fitness': {'height': 130, 'weight': 60}},
... {'name': "Mose Reg",
... 'fitness': {'height': 130, 'weight': 60}},
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> json_normalize(data, max_level=0)
fitness id name
0 {'height': 130, 'weight': 60} 1.0 Cole Volk
1 {'height': 130, 'weight': 60} NaN Mose Reg
2 {'height': 130, 'weight': 60} 2.0 Faye Raker

Normalizes nested data up to level 1.

>>> data = [{'id': 1,
... 'name': "Cole Volk",
... 'fitness': {'height': 130, 'weight': 60}},
... {'name': "Mose Reg",
... 'fitness': {'height': 130, 'weight': 60}},
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> json_normalize(data, max_level=1)
fitness.height fitness.weight id name
0 130 60 1.0 Cole Volk
1 130 60 NaN Mose Reg
2 130 60 2.0 Faye Raker

>>> data = [{'state': 'Florida',
... 'shortname': 'FL',
... 'info': {
... 'governor': 'Rick Scott'
... },
... 'info': {'governor': 'Rick Scott'},
... 'counties': [{'name': 'Dade', 'population': 12345},
... {'name': 'Broward', 'population': 40000},
... {'name': 'Palm Beach', 'population': 60000}]},
... {'name': 'Broward', 'population': 40000},
... {'name': 'Palm Beach', 'population': 60000}]},
... {'state': 'Ohio',
... 'shortname': 'OH',
... 'info': {
... 'governor': 'John Kasich'
... },
... 'info': {'governor': 'John Kasich'},
... 'counties': [{'name': 'Summit', 'population': 1234},
... {'name': 'Cuyahoga', 'population': 1337}]}]
>>> result = json_normalize(data, 'counties', ['state', 'shortname',
... ['info', 'governor']])
... ['info', 'governor']])
>>> result
name population info.governor state shortname
0 Dade 12345 Rick Scott Florida FL
1 Broward 40000 Rick Scott Florida FL
2 Palm Beach 60000 Rick Scott Florida FL
3 Summit 1234 John Kasich Ohio OH
4 Cuyahoga 1337 John Kasich Ohio OH
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich

>>> data = {'A': [1, 2]}
>>> json_normalize(data, 'A', record_prefix='Prefix.')
Prefix.0
0 1
1 2

Returns normalized data with columns prefixed with the given string.
"""
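The interaction of ``record_path``, ``meta`` and ``record_prefix`` in runnable form; a sketch using the public API, with the ``county.`` prefix chosen purely for illustration:

```python
import pandas as pd

data = [{'state': 'Florida',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000}]}]

# Records come from the 'counties' list; 'state' and the nested
# ['info', 'governor'] path are repeated onto every record as metadata.
df = pd.json_normalize(data, record_path='counties',
                       meta=['state', ['info', 'governor']],
                       record_prefix='county.')
print(list(df.columns))
# ['county.name', 'county.population', 'state', 'info.governor']
```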
def _pull_field(js, spec):
result = js
@@ -206,7 +256,8 @@ def _pull_field(js, spec):
#
# TODO: handle record value which are lists, at least error
# reasonably
data = nested_to_record(data, sep=sep)
data = nested_to_record(data, sep=sep,
max_level=max_level)
return DataFrame(data)
elif not isinstance(record_path, list):
record_path = [record_path]
@@ -219,10 +270,10 @@ def _pull_field(js, spec):
meta = [m if isinstance(m, list) else [m] for m in meta]

# Disastrously inefficient for now
records = []
records = [] # type: List
lengths = []

meta_vals = defaultdict(list)
meta_vals = defaultdict(list) # type: DefaultDict
if not isinstance(sep, str):
sep = str(sep)
meta_keys = [sep.join(val) for val in meta]
@@ -241,10 +292,12 @@ def _recursive_extract(data, path, seen_meta, level=0):
else:
for obj in data:
recs = _pull_field(obj, path[0])
recs = [nested_to_record(r, sep=sep,
max_level=max_level)
if isinstance(r, dict) else r for r in recs]

# For repeating the metadata later
lengths.append(len(recs))

for val, key in zip(meta, meta_keys):
if level + 1 > len(val):
meta_val = seen_meta[key]
@@ -260,7 +313,6 @@ def _recursive_extract(data, path, seen_meta, level=0):
"{err} is not always present"
.format(err=e))
meta_vals[key].append(meta_val)

records.extend(recs)

_recursive_extract(data, record_path, {}, level=0)
@@ -279,8 +331,5 @@ def _recursive_extract(data, path, seen_meta, level=0):
if k in result:
raise ValueError('Conflicting metadata name {name}, '
'need distinguishing prefix '.format(name=k))

# forcing dtype to object to avoid the metadata being casted to string
result[k] = np.array(v, dtype=object).repeat(lengths)

return result
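The ``errors='ignore'`` branch above fills missing ``meta`` keys with ``np.nan`` rather than raising a ``KeyError``; a small sketch against the public API:

```python
import pandas as pd

data = [{'name': 'A', 'kids': [{'n': 1}]},
        {'kids': [{'n': 2}]}]  # second object has no 'name' key

# With errors='raise' (the default) the missing 'name' meta key would
# raise; errors='ignore' fills the missing metadata with NaN instead.
df = pd.json_normalize(data, record_path='kids', meta=['name'],
                       errors='ignore')
```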