
Commit 4f1d213

Beautify spec whitespace

Signed-off-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Parent: cc36a6a

File tree

1 file changed: +31 −0 lines changed


pandas/api/exchange/dataframe_protocol.py

Lines changed: 31 additions & 0 deletions
@@ -2,6 +2,7 @@
 import enum
 from abc import ABC, abstractmethod
 
+
 class DlpackDeviceType(enum.IntEnum):
     """Integer enum for device type codes matching DLPack."""
 
@@ -14,6 +15,7 @@ class DlpackDeviceType(enum.IntEnum):
     VPI = 9
     ROCM = 10
 
+
 class DtypeKind(enum.IntEnum):
     """
     Integer enum for data types.
@@ -44,6 +46,7 @@ class DtypeKind(enum.IntEnum):
     DATETIME = 22
     CATEGORICAL = 23
 
+
 class ColumnNullType(enum.IntEnum):
     """
     Integer enum for null type representation.
@@ -68,6 +71,7 @@ class ColumnNullType(enum.IntEnum):
     USE_BITMASK = 3
     USE_BYTEMASK = 4
 
+
 class ColumnBuffers(TypedDict):
     data: Tuple["Buffer", Any]  # first element is a buffer containing the column data;
                                 # second element is the data buffer's associated dtype
@@ -86,11 +90,13 @@ class ColumnBuffers(TypedDict):
 class Buffer(ABC):
     """
     Data in the buffer is guaranteed to be contiguous in memory.
+
     Note that there is no dtype attribute present, a buffer can be thought of
     as simply a block of memory. However, if the column that the buffer is
     attached to has a dtype that's supported by DLPack and ``__dlpack__`` is
     implemented, then that dtype information will be contained in the return
     value from ``__dlpack__``.
+
     This distinction is useful to support both data exchange via DLPack on a
     buffer and (b) dtypes like variable-length strings which do not have a
     fixed number of bytes per element.
@@ -116,9 +122,12 @@ def ptr(self) -> int:
     def __dlpack__(self):
         """
         Produce DLPack capsule (see array API standard).
+
         Raises:
+
         - TypeError : if the buffer contains unsupported dtypes.
         - NotImplementedError : if DLPack support is not implemented
+
         Useful to have to connect to array libraries. Support optional because
         it's not completely trivial to implement for a Python-only library.
         """
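The optional-support behaviour described in this hunk can be sketched with a toy buffer. `SimpleBuffer` is a hypothetical stand-in, not part of pandas; only `bufsize` and `__dlpack__` mirror names from the spec:

```python
# Hedged sketch of a Buffer without DLPack support, per the hunk above.
class SimpleBuffer:
    def __init__(self, data: bytes):
        self._data = data

    @property
    def bufsize(self) -> int:
        # buffer size in bytes, as the Buffer ABC requires
        return len(self._data)

    def __dlpack__(self):
        # support is optional: a Python-only library may simply raise
        raise NotImplementedError("DLPack support is not implemented")


buf = SimpleBuffer(b"\x00" * 8)
print(buf.bufsize)  # 8
try:
    buf.__dlpack__()
except NotImplementedError as exc:
    print("optional:", exc)
```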
@@ -138,27 +147,33 @@ class Column(ABC):
     """
     A column object, with only the methods and properties required by the
     interchange protocol defined.
+
     A column can contain one or more chunks. Each chunk can contain up to three
     buffers - a data buffer, a mask buffer (depending on null representation),
     and an offsets buffer (if variable-size binary; e.g., variable-length
     strings).
+
     TBD: Arrow has a separate "null" dtype, and has no separate mask concept.
          Instead, it seems to use "children" for both columns with a bit mask,
          and for nested dtypes. Unclear whether this is elegant or confusing.
          This design requires checking the null representation explicitly.
+
          The Arrow design requires checking:
          1. the ARROW_FLAG_NULLABLE (for sentinel values)
          2. if a column has two children, combined with one of those children
             having a null dtype.
+
          Making the mask concept explicit seems useful. One null dtype would
          not be enough to cover both bit and byte masks, so that would mean
          even more checking if we did it the Arrow way.
+
     TBD: there's also the "chunk" concept here, which is implicit in Arrow as
          multiple buffers per array (= column here). Semantically it may make
          sense to have both: chunks were meant for example for lazy evaluation
          of data which doesn't fit in memory, while multiple buffers per column
          could also come from doing a selection operation on a single
          contiguous buffer.
+
     Given these concepts, one would expect chunks to be all of the same
     size (say a 10,000 row dataframe could have 10 chunks of 1,000 rows),
     while multiple buffers could have data-dependent lengths. Not an issue
@@ -167,6 +182,7 @@ class Column(ABC):
     Are multiple chunks *and* multiple buffers per column necessary for
     the purposes of this interchange protocol, or must producers either
     reuse the chunk concept for this or copy the data?
+
     Note: this Column object can only be produced by ``__dataframe__``, so
           doesn't need its own version or ``__column__`` protocol.
     """
@@ -176,6 +192,7 @@ class Column(ABC):
     def size(self) -> Optional[int]:
         """
         Size of the column, in elements.
+
         Corresponds to DataFrame.num_rows() if column is a single chunk;
         equal to size of this current chunk otherwise.
         """
@@ -186,6 +203,7 @@ def size(self) -> Optional[int]:
     def offset(self) -> int:
         """
         Offset of first element.
+
         May be > 0 if using chunks; for example for a column with N chunks of
         equal size M (only the last chunk may be shorter),
         ``offset = n * M``, ``n = 0 .. N-1``.
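The offset rule in this docstring amounts to simple arithmetic, sketched here with illustrative numbers:

```python
# A column split into N equal chunks of size M has chunk n starting at
# offset n * M, per the ``offset = n * M`` rule quoted above.
N, M = 4, 1000                       # e.g. 4 chunks of 1,000 elements each
offsets = [n * M for n in range(N)]
print(offsets)  # [0, 1000, 2000, 3000]
```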
@@ -197,10 +215,12 @@ def offset(self) -> int:
     def dtype(self) -> Tuple[DtypeKind, int, str, str]:
         """
         Dtype description as a tuple ``(kind, bit-width, format string, endianness)``.
+
         Bit-width : the number of bits as an integer
         Format string : data type description format string in Apache Arrow C
                         Data Interface format.
         Endianness : current only native endianness (``=``) is supported
+
         Notes:
         - Kind specifiers are aligned with DLPack where possible (hence the
           jump to 20, leave enough room for future extension)
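A concrete dtype tuple can make this hunk easier to read. The enum values below are copied from the spec; the tuple itself is an illustrative sketch, not pandas code (`"l"` is the Arrow C Data Interface format string for a 64-bit signed integer):

```python
import enum

# Stand-in for the spec's DtypeKind enum (note the jump to 20 discussed above).
class DtypeKind(enum.IntEnum):
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23

# A native-endian int64 column described as
# (kind, bit-width, Arrow C format string, endianness):
int64_dtype = (DtypeKind.INT, 64, "l", "=")

kind, bit_width, fmt, endianness = int64_dtype
print(kind.name, bit_width, fmt, endianness)  # INT 64 l =
```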
@@ -229,6 +249,7 @@ def describe_categorical(self) -> Tuple[bool, bool, Optional[dict]]:
         If the dtype is categorical, there are two options:
         - There are only values in the data buffer.
         - There is a separate dictionary-style encoding for categorical values.
+
         Raises TypeError if the dtype is not categorical
 
         Returns the description on how to interpret the data buffer:
@@ -238,6 +259,7 @@ def describe_categorical(self) -> Tuple[bool, bool, Optional[dict]]:
           categorical values to other objects exists
         - "mapping" : dict, Python-level only (e.g. ``{int: str}``).
           None if not a dictionary-style categorical.
+
         TBD: are there any other in-memory representations that are needed?
         """
         pass
@@ -248,6 +270,7 @@ def describe_null(self) -> Tuple[ColumnNullType, Any]:
         """
         Return the missing value (or "null") representation the column dtype
         uses, as a tuple ``(kind, value)``.
+
         Value : if kind is "sentinel value", the actual value. If kind is a bit
         mask or a byte mask, the value (0 or 1) indicating a missing value. None
         otherwise.
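How a consumer interprets the ``(kind, value)`` tuple can be sketched as follows. The enum values are copied from the spec; `is_missing` is a hypothetical helper, not part of the protocol:

```python
import enum

# Stand-in for the spec's ColumnNullType enum.
class ColumnNullType(enum.IntEnum):
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4

def is_missing(null_kind, null_value, mask_entry):
    """True if a mask entry marks its element as missing (hypothetical helper)."""
    if null_kind in (ColumnNullType.USE_BITMASK, ColumnNullType.USE_BYTEMASK):
        # null_value (0 or 1) says which mask value means "missing"
        return mask_entry == null_value
    # sentinel/NaN kinds are detected in the data buffer itself, not a mask
    return False

# e.g. an Arrow-style validity bitmask where a 0 bit marks a null:
print(is_missing(ColumnNullType.USE_BITMASK, 0, 0))  # True
```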
@@ -259,6 +282,7 @@ def describe_null(self) -> Tuple[ColumnNullType, Any]:
     def null_count(self) -> Optional[int]:
         """
         Number of null elements, if known.
+
         Note: Arrow uses -1 to indicate "unknown", but None seems cleaner.
         """
         pass
@@ -282,6 +306,7 @@ def num_chunks(self) -> int:
     def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable["Column"]:
         """
         Return an iterator yielding the chunks.
+
         See `DataFrame.get_chunks` for details on ``n_chunks``.
         """
         pass
@@ -290,7 +315,9 @@ def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable["Column"]:
     def get_buffers(self) -> ColumnBuffers:
         """
         Return a dictionary containing the underlying buffers.
+
         The returned dictionary has the following contents:
+
         - "data": a two-element tuple whose first element is a buffer
           containing the data and whose second element is the data
           buffer's associated dtype.
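A consumer-side sketch of walking the `get_buffers()` result. The entry names follow the spec's ColumnBuffers layout (`"data"` plus optional mask and offsets entries, as described in the Column docstring); `split_buffers` and the string placeholders are hypothetical:

```python
# Hypothetical helper: keep just the buffer (first tuple element) per entry.
def split_buffers(buffers):
    result = {}
    for name in ("data", "validity", "offsets"):
        entry = buffers.get(name)
        # mask/offsets entries may be None depending on null representation
        # and dtype; "data" is always a (buffer, dtype) two-element tuple
        result[name] = None if entry is None else entry[0]
    return result

# Toy stand-in for a buffers dict (real Buffer objects replaced by strings):
toy = {"data": ("<data buffer>", "<dtype>"), "validity": None, "offsets": None}
print(split_buffers(toy))
```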
@@ -320,14 +347,17 @@ class DataFrame(ABC):
     """
     A data frame class, with only the methods required by the interchange
     protocol defined.
+
     A "data frame" represents an ordered collection of named columns.
     A column's "name" must be a unique string.
     Columns may be accessed by name or by position.
+
     This could be a public data frame class, or an object with the methods and
     attributes defined on this DataFrame class could be returned from the
     ``__dataframe__`` method of a public data frame class in a library adhering
     to the dataframe interchange protocol specification.
     """
+
     version = 0  # version of the protocol
 
     @property
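The "returned from the ``__dataframe__`` method of a public class" pattern in this docstring can be sketched minimally. Both classes here are hypothetical; only `version`, `num_columns`, and `column_names` mirror names from the protocol:

```python
# Interchange object a producer library would hand out (sketch, not pandas).
class InterchangeFrame:
    version = 0  # version of the protocol

    def __init__(self, columns):
        self._columns = columns  # dict: unique column name -> column data

    def num_columns(self):
        return len(self._columns)

    def column_names(self):
        return list(self._columns)


class PublicDataFrame:
    """A library's public dataframe class adhering to the protocol."""

    def __init__(self, columns):
        self._columns = columns

    def __dataframe__(self):
        # the protocol object is produced here, keeping the public API separate
        return InterchangeFrame(self._columns)


df = PublicDataFrame({"a": [1, 2], "b": [3, 4]})
print(df.__dataframe__().column_names())  # ['a', 'b']
```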
@@ -414,6 +444,7 @@ def select_columns_by_name(self, names: Sequence[str]) -> "DataFrame":
     def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable["DataFrame"]:
         """
         Return an iterator yielding the chunks.
+
         By default (None), yields the chunks that the data is stored as by the
         producer. If given, ``n_chunks`` must be a multiple of
         ``self.num_chunks()``, meaning the producer must subdivide each chunk
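The multiple-of-`num_chunks()` contract above means a consumer wanting a given degree of parallelism must round its request up. A hypothetical helper (not part of the protocol) for picking such a value:

```python
def pick_n_chunks(native_chunks: int, workers: int) -> int:
    """Smallest multiple of native_chunks that is >= workers."""
    multiple = -(-workers // native_chunks)  # ceiling division
    return native_chunks * multiple

# e.g. a producer with 3 native chunks, a consumer with 8 workers:
print(pick_n_chunks(3, 8))  # 9
print(pick_n_chunks(4, 4))  # 4
```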
