From a15a9aece0ae1946e46430494559af22011bfb02 Mon Sep 17 00:00:00 2001
From: jl_win_a
Date: Mon, 20 Jan 2025 12:04:46 +0000
Subject: [PATCH 1/5] DOC: add SPSS comparison guide structure

- Create SPSS comparison documentation
- Add header and introduction sections
- Terminology translation table
- Create template for common operations comparison

Part of #60727
---
 .../comparison/comparison_with_spss.rst       | 229 ++++++++++++++++++
 1 file changed, 229 insertions(+)
 create mode 100644 doc/source/getting_started/comparison/comparison_with_spss.rst

diff --git a/doc/source/getting_started/comparison/comparison_with_spss.rst b/doc/source/getting_started/comparison/comparison_with_spss.rst
new file mode 100644
index 0000000000000..b6dbb26e5f71c
--- /dev/null
+++ b/doc/source/getting_started/comparison/comparison_with_spss.rst
@@ -0,0 +1,229 @@
+.. _compare_with_spss:
+
+{{ header }}
+
+Comparison with SPSS
+********************
+For potential users coming from `SPSS `__, this page is meant to demonstrate
+how various SPSS operations would be performed using pandas.
+
+.. include:: includes/introduction.rst
+
+Data structures
+---------------
+
+General terminology translation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. csv-table::
+    :header: "pandas", "SPSS"
+    :widths: 20, 20
+
+    ``DataFrame``, data file
+    column, variable
+    row, case
+    groupby, split file
+    ``NaN``, system-missing
+
+``DataFrame``
+~~~~~~~~~~~~~
+
+A ``DataFrame`` in pandas is analogous to an SPSS data file - a two-dimensional
+data source with labeled columns that can be of different types. As will be shown in this
+document, almost any operation that can be performed in SPSS can also be accomplished in pandas.
+
+``Series``
+~~~~~~~~~~
+
+A ``Series`` is the data structure that represents one column of a ``DataFrame``. SPSS doesn't have a
+separate data structure for a single variable, but in general, working with a ``Series`` is analogous
+to working with a variable in SPSS.
+
+``Index``
+~~~~~~~~~
+
+Every ``DataFrame`` and ``Series`` has an ``Index`` -- labels on the *rows* of the data. SPSS does not
+have an exact analogue, as cases are simply numbered sequentially from 1. In pandas, if no index is
+specified, a ``RangeIndex`` is used by default (first row = 0, second row = 1, and so on).
+
+While using a labeled ``Index`` or ``MultiIndex`` can enable sophisticated analyses and is ultimately an
+important part of pandas to understand, for this comparison we will essentially ignore the ``Index`` and
+just treat the ``DataFrame`` as a collection of columns. Please see the :ref:`indexing documentation`
+for much more on how to use an ``Index`` effectively.
+
+
+Copies vs. in place operations
+-----------------------------
+
+.. include:: includes/copies.rst
+
+
+Data input / output
+------------------
+
+Reading external data
+~~~~~~~~~~~~~~~~~~~~
+
+Like SPSS, pandas provides utilities for reading in data from many formats. The ``tips`` dataset, found within
+the pandas tests (`csv `_)
+will be used in many of the following examples.
+
+In SPSS, you would use File > Open > Data to import a CSV file:
+
+.. code-block:: text
+
+    FILE > OPEN > DATA
+    /TYPE=CSV
+    /FILE='tips.csv'
+    /DELIMITERS=","
+    /FIRSTCASE=2
+    /VARIABLES=col1 col2 col3.
+
+The pandas equivalent would use :func:`read_csv`:
+
+.. ipython:: python
+
+    url = (
+        "https://raw.githubusercontent.com/pandas-dev"
+        "/pandas/main/pandas/tests/io/data/csv/tips.csv"
+    )
+    tips = pd.read_csv(url)
+    tips
+
+Like SPSS's data import wizard, ``read_csv`` can take a number of parameters to specify how the data should be parsed.
+For example, if the data was instead tab delimited, and did not have column names, the pandas command would be:
+
+.. code-block:: python
+
+    tips = pd.read_csv("tips.csv", sep="\t", header=None)
+
+    # alternatively, read_table is an alias to read_csv with tab delimiter
+    tips = pd.read_table("tips.csv", header=None)
+
+
+Data operations
+---------------
+
+Filtering
+~~~~~~~~~
+
+In SPSS, filtering is done through Data > Select Cases:
+
+.. code-block:: text
+
+    SELECT IF (total_bill > 10).
+    EXECUTE.
+
+In pandas, boolean indexing can be used:
+
+.. ipython:: python
+
+    tips[tips["total_bill"] > 10]
+
+
+Sorting
+~~~~~~~
+
+In SPSS, sorting is done through Data > Sort Cases:
+
+.. code-block:: text
+
+    SORT CASES BY sex total_bill.
+    EXECUTE.
+
+In pandas, this would be written as:
+
+.. ipython:: python
+
+    tips.sort_values(["sex", "total_bill"])
+
+
+String processing
+----------------
+
+Finding length of string
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In SPSS:
+
+.. code-block:: text
+
+    COMPUTE length = LENGTH(time).
+    EXECUTE.
+
+.. include:: includes/length.rst
+
+
+Changing case
+~~~~~~~~~~~~
+
+In SPSS:
+
+.. code-block:: text
+
+    COMPUTE upper = UPCASE(time).
+    COMPUTE lower = LOWER(time).
+    EXECUTE.
+
+.. include:: includes/case.rst
+
+
+Merging
+-------
+
+In SPSS, merging data files is done through Data > Merge Files.
+
+.. include:: includes/merge_setup.rst
+.. include:: includes/merge.rst
+
+
+GroupBy operations
+----------------
+
+Split-file processing
+~~~~~~~~~~~~~~~~~~~
+
+In SPSS, split-file analysis is done through Data > Split File:
+
+.. code-block:: text
+
+    SORT CASES BY sex.
+    SPLIT FILE BY sex.
+    DESCRIPTIVES VARIABLES=total_bill tip
+    /STATISTICS=MEAN STDDEV MIN MAX.
+
+The pandas equivalent would be:
+
+.. ipython:: python
+
+    tips.groupby("sex")[["total_bill", "tip"]].agg(["mean", "std", "min", "max"])
+
+
+Missing data
+-----------
+
+SPSS uses the period (``.``) for numeric missing values and blank spaces for string missing values.
+pandas uses ``NaN`` (Not a Number) for numeric missing values and ``None`` or ``NaN`` for string
+missing values.
+
+.. include:: includes/missing.rst
+
+
+Other considerations
+------------------
+
+Output management
+~~~~~~~~~~~~~~~
+
+While pandas does not have a direct equivalent to SPSS's Output Management System (OMS), you can
+capture and export results in various ways:
+
+.. code-block:: python
+
+    # Save summary statistics to CSV
+    tips.groupby('sex')[['total_bill', 'tip']].mean().to_csv('summary.csv')
+
+    # Save multiple results to Excel sheets
+    with pd.ExcelWriter('results.xlsx') as writer:
+        tips.describe().to_excel(writer, sheet_name='Descriptives')
+        tips.groupby('sex').mean().to_excel(writer, sheet_name='Means by Gender')
\ No newline at end of file

From 3bc4267a45117e0ca1202c6bd2f4ba1f09ddbaeb Mon Sep 17 00:00:00 2001
From: jl_win_a
Date: Tue, 21 Jan 2025 18:37:57 +0000
Subject: [PATCH 2/5] DOC: edit SPSS comparison guide to documentation

- Added file to doc/source/getting_started/comparison/index.rst toctree
- Fixed formatting and whitespace issues to meet documentation standards
---
 .../comparison/comparison_with_spss.rst       | 20 +++++++++----------
 .../getting_started/comparison/index.rst      |  1 +
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/doc/source/getting_started/comparison/comparison_with_spss.rst b/doc/source/getting_started/comparison/comparison_with_spss.rst
index b6dbb26e5f71c..0f1ba6d64aebe 100644
--- a/doc/source/getting_started/comparison/comparison_with_spss.rst
+++ b/doc/source/getting_started/comparison/comparison_with_spss.rst
@@ -4,7 +4,7 @@

 Comparison with SPSS
 ********************
-For potential users coming from `SPSS `__, this page is meant to demonstrate
+For potential users coming from `SPSS `__, this page is meant to demonstrate
 how various SPSS operations would be performed using pandas.

 .. include:: includes/introduction.rst
@@ -20,7 +20,7 @@ General terminology translation
     :widths: 20, 20

     ``DataFrame``, data file
-    column, variable
+    column, variable
     row, case
     groupby, split file
     ``NaN``, system-missing
@@ -29,7 +29,7 @@ General terminology translation
 ~~~~~~~~~~~~~

 A ``DataFrame`` in pandas is analogous to an SPSS data file - a two-dimensional
-data source with labeled columns that can be of different types. As will be shown in this
+data source with labeled columns that can be of different types. As will be shown in this
 document, almost any operation that can be performed in SPSS can also be accomplished in pandas.

 ``Series``
@@ -42,13 +42,13 @@ to working with a variable in SPSS.
 ``Index``
 ~~~~~~~~~

-Every ``DataFrame`` and ``Series`` has an ``Index`` -- labels on the *rows* of the data. SPSS does not
-have an exact analogue, as cases are simply numbered sequentially from 1. In pandas, if no index is
+Every ``DataFrame`` and ``Series`` has an ``Index`` -- labels on the *rows* of the data. SPSS does not
+have an exact analogue, as cases are simply numbered sequentially from 1. In pandas, if no index is
 specified, a ``RangeIndex`` is used by default (first row = 0, second row = 1, and so on).

-While using a labeled ``Index`` or ``MultiIndex`` can enable sophisticated analyses and is ultimately an
-important part of pandas to understand, for this comparison we will essentially ignore the ``Index`` and
-just treat the ``DataFrame`` as a collection of columns. Please see the :ref:`indexing documentation`
+While using a labeled ``Index`` or ``MultiIndex`` can enable sophisticated analyses and is ultimately an
+important part of pandas to understand, for this comparison we will essentially ignore the ``Index`` and
+just treat the ``DataFrame`` as a collection of columns. Please see the :ref:`indexing documentation`
 for much more on how to use an ``Index`` effectively.


@@ -64,7 +64,7 @@ Data input / output
 Reading external data
 ~~~~~~~~~~~~~~~~~~~~

-Like SPSS, pandas provides utilities for reading in data from many formats. The ``tips`` dataset, found within
+Like SPSS, pandas provides utilities for reading in data from many formats. The ``tips`` dataset, found within
 the pandas tests (`csv `_)
 will be used in many of the following examples.

@@ -226,4 +226,4 @@ capture and export results in various ways:
     # Save multiple results to Excel sheets
     with pd.ExcelWriter('results.xlsx') as writer:
         tips.describe().to_excel(writer, sheet_name='Descriptives')
-        tips.groupby('sex').mean().to_excel(writer, sheet_name='Means by Gender')
\ No newline at end of file
+        tips.groupby('sex').mean().to_excel(writer, sheet_name='Means by Gender')
diff --git a/doc/source/getting_started/comparison/index.rst b/doc/source/getting_started/comparison/index.rst
index c3f58ce1f3d6d..3133d74afa3db 100644
--- a/doc/source/getting_started/comparison/index.rst
+++ b/doc/source/getting_started/comparison/index.rst
@@ -14,3 +14,4 @@ Comparison with other tools
     comparison_with_spreadsheets
     comparison_with_sas
     comparison_with_stata
+    comparison_with_spss

From 5993f248c61b952a606fe43ed459827b6926e64f Mon Sep 17 00:00:00 2001
From: jl_win_a
Date: Tue, 21 Jan 2025 19:01:40 +0000
Subject: [PATCH 3/5] DOC: edit minor whitespaces in SPSS comparison guide

---
 .../comparison/comparison_with_spss.rst       | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/doc/source/getting_started/comparison/comparison_with_spss.rst b/doc/source/getting_started/comparison/comparison_with_spss.rst
index 0f1ba6d64aebe..a9baa41acfe11 100644
--- a/doc/source/getting_started/comparison/comparison_with_spss.rst
+++ b/doc/source/getting_started/comparison/comparison_with_spss.rst
@@ -53,16 +53,16 @@ for much more on how to use an ``Index`` effectively.


 Copies vs. in place operations
------------------------------
+------------------------------

 .. include:: includes/copies.rst


 Data input / output
-------------------
+-------------------

 Reading external data
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~

 Like SPSS, pandas provides utilities for reading in data from many formats. The ``tips`` dataset, found within
 the pandas tests (`csv `_)
@@ -96,7 +96,7 @@ For example, if the data was instead tab delimited, and did not have column name
 .. code-block:: python

     tips = pd.read_csv("tips.csv", sep="\t", header=None)
-    
+
     # alternatively, read_table is an alias to read_csv with tab delimiter
     tips = pd.read_table("tips.csv", header=None)

@@ -139,10 +139,10 @@ In pandas, this would be written as:


 String processing
-----------------
+-----------------

 Finding length of string
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~

 In SPSS:

@@ -155,7 +155,7 @@ In SPSS:


 Changing case
-~~~~~~~~~~~~
+~~~~~~~~~~~~~

 In SPSS:

@@ -178,10 +178,10 @@ In SPSS, merging data files is done through Data > Merge Files.


 GroupBy operations
-----------------
+------------------

 Split-file processing
-~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~

 In SPSS, split-file analysis is done through Data > Split File:

@@ -200,7 +200,7 @@ The pandas equivalent would be:


 Missing data
------------
+------------

 SPSS uses the period (``.``) for numeric missing values and blank spaces for string missing values.
 pandas uses ``NaN`` (Not a Number) for numeric missing values and ``None`` or ``NaN`` for string
@@ -210,10 +210,10 @@ missing values.


 Other considerations
-------------------
+--------------------

 Output management
-~~~~~~~~~~~~~~~
+-----------------

 While pandas does not have a direct equivalent to SPSS's Output Management System (OMS), you can
 capture and export results in various ways:

From ea477c64a54c9b057080b4e5934caa7609447e41 Mon Sep 17 00:00:00 2001
From: jl_win_a
Date: Tue, 21 Jan 2025 19:22:23 +0000
Subject: [PATCH 4/5] DOC: standardize class references in SPSS guide

---
 .../comparison/comparison_with_spss.rst       | 36 +++++++++----------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/doc/source/getting_started/comparison/comparison_with_spss.rst b/doc/source/getting_started/comparison/comparison_with_spss.rst
index a9baa41acfe11..d15de7643faa3 100644
--- a/doc/source/getting_started/comparison/comparison_with_spss.rst
+++ b/doc/source/getting_started/comparison/comparison_with_spss.rst
@@ -19,37 +19,37 @@ General terminology translation
     :header: "pandas", "SPSS"
     :widths: 20, 20

-    ``DataFrame``, data file
+    :class:`DataFrame`, data file
     column, variable
     row, case
     groupby, split file
-    ``NaN``, system-missing
+    :class:`NaN`, system-missing

-``DataFrame``
+:class:`DataFrame`
 ~~~~~~~~~~~~~

-A ``DataFrame`` in pandas is analogous to an SPSS data file - a two-dimensional
+A :class:`DataFrame` in pandas is analogous to an SPSS data file - a two-dimensional
 data source with labeled columns that can be of different types. As will be shown in this
 document, almost any operation that can be performed in SPSS can also be accomplished in pandas.

-``Series``
+:class:`Series`
 ~~~~~~~~~~

-A ``Series`` is the data structure that represents one column of a ``DataFrame``. SPSS doesn't have a
-separate data structure for a single variable, but in general, working with a ``Series`` is analogous
+A :class:`Series` is the data structure that represents one column of a :class:`DataFrame`. SPSS doesn't have a
+separate data structure for a single variable, but in general, working with a :class:`Series` is analogous
 to working with a variable in SPSS.

-``Index``
+:class:`Index`
 ~~~~~~~~~

-Every ``DataFrame`` and ``Series`` has an ``Index`` -- labels on the *rows* of the data. SPSS does not
+Every :class:`DataFrame` and :class:`Series` has an :class:`Index` -- labels on the *rows* of the data. SPSS does not
 have an exact analogue, as cases are simply numbered sequentially from 1. In pandas, if no index is
-specified, a ``RangeIndex`` is used by default (first row = 0, second row = 1, and so on).
+specified, a :class:`RangeIndex` is used by default (first row = 0, second row = 1, and so on).

-While using a labeled ``Index`` or ``MultiIndex`` can enable sophisticated analyses and is ultimately an
-important part of pandas to understand, for this comparison we will essentially ignore the ``Index`` and
-just treat the ``DataFrame`` as a collection of columns. Please see the :ref:`indexing documentation`
-for much more on how to use an ``Index`` effectively.
+While using a labeled :class:`Index` or :class:`MultiIndex` can enable sophisticated analyses and is ultimately an
+important part of pandas to understand, for this comparison we will essentially ignore the :class:`Index` and
+just treat the :class:`DataFrame` as a collection of columns. Please see the :ref:`indexing documentation`
+for much more on how to use an :class:`Index` effectively.


 Copies vs. in place operations
@@ -81,7 +81,7 @@ In SPSS, you would use File > Open > Data to import a CSV file:

 The pandas equivalent would use :func:`read_csv`:

-.. ipython:: python
+.. code-block:: python

     url = (
         "https://raw.githubusercontent.com/pandas-dev"
@@ -116,7 +116,7 @@ In SPSS, filtering is done through Data > Select Cases:

 In pandas, boolean indexing can be used:

-.. ipython:: python
+.. code-block:: python

     tips[tips["total_bill"] > 10]

@@ -133,7 +133,7 @@ In SPSS, sorting is done through Data > Sort Cases:

 In pandas, this would be written as:

-.. ipython:: python
+.. code-block:: python

     tips.sort_values(["sex", "total_bill"])

@@ -194,7 +194,7 @@ In SPSS, split-file analysis is done through Data > Split File:

 The pandas equivalent would be:

-.. ipython:: python
+.. code-block:: python

     tips.groupby("sex")[["total_bill", "tip"]].agg(["mean", "std", "min", "max"])


From d1cfb4da4ab1a294aaa85dadfa3effcbeec301e2 Mon Sep 17 00:00:00 2001
From: jl_win_a
Date: Tue, 21 Jan 2025 20:00:25 +0000
Subject: [PATCH 5/5] DOC: Fix RST section underline lengths in SPSS comparison

---
 .../getting_started/comparison/comparison_with_spss.rst     | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/source/getting_started/comparison/comparison_with_spss.rst b/doc/source/getting_started/comparison/comparison_with_spss.rst
index d15de7643faa3..12c64bfd180a3 100644
--- a/doc/source/getting_started/comparison/comparison_with_spss.rst
+++ b/doc/source/getting_started/comparison/comparison_with_spss.rst
@@ -26,21 +26,21 @@ General terminology translation
     :class:`NaN`, system-missing

 :class:`DataFrame`
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~

 A :class:`DataFrame` in pandas is analogous to an SPSS data file - a two-dimensional
 data source with labeled columns that can be of different types. As will be shown in this
 document, almost any operation that can be performed in SPSS can also be accomplished in pandas.

 :class:`Series`
-~~~~~~~~~~
+~~~~~~~~~~~~~~~

 A :class:`Series` is the data structure that represents one column of a :class:`DataFrame`. SPSS doesn't have a
 separate data structure for a single variable, but in general, working with a :class:`Series` is analogous
 to working with a variable in SPSS.

 :class:`Index`
-~~~~~~~~~
+~~~~~~~~~~~~~~

 Every :class:`DataFrame` and :class:`Series` has an :class:`Index` -- labels on the *rows* of the data. SPSS does not
 have an exact analogue, as cases are simply numbered sequentially from 1. In pandas, if no index is