Skip to content

Allow different splitting rules #96

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 7, 2023
Merged

Conversation

velochy
Copy link
Contributor

@velochy velochy commented Jul 1, 2023

This PR is inspired by the wish to better handle categorical predictors, and basically implements the paper https://arxiv.org/abs/2211.04459

Not sure how the PR process works for your repo, but happy to comply with any and all standards to get the changes merged :)

@velochy velochy force-pushed the main branch 2 times, most recently from 1441aa5 to ddc2967 Compare July 2, 2023 11:31
@aloctavodia
Copy link
Member

Thanks for this PR!

Not sure we have a contributing guide for this repo, or it may be outdated (sorry about that). But in order for a PR to be accepted, the code should be formatted with black -l 100, pylint should pass, ideally we should have type hints (we use mypy). And tests should be added.

@aloctavodia
Copy link
Member

@juanitorduz as you have some experience with BART and categorical variables you may have comments about this.

class SubsetSplitRule:
"""
Choose a random subset of the categorical values and branch on if the value is within the chosen set.
This is the approach taken in flexBART paper.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a reference to the paper including name of the article, authors and DOI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's this paper: https://arxiv.org/abs/2211.04459
Do you have an example of citation formating somewhere so I can align the style?

Copy link
Contributor

@juanitorduz juanitorduz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very interesting! Whenever I need to work with categorical variables, I actively omit the one-hot-encoding (reasons similar to the ones presented in the paper), and rather use variations of target encoding (see Category Encoders).

The idea behind flexBART seems very interesting, especially the "pooling" between categories through the tree structures. I do not know the full details, and I think @aloctavodia is the expert to review the code in detail. I left some small comments regarding the code. I will be happy to help regarding type-hints and looking into tests :)

@@ -109,6 +109,7 @@ def __new__(
beta: float = 2.0,
response: str = "constant",
split_prior: Optional[List[float]] = None,
split_rules: Optional[List] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we keep the exact expected types (maybe ContinuousSplitRule from what I see below)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where I need a bit of help. If you look at split_rules.py you see there are multiple different split rule options.
They essentially form an abstract class, but as they have staticmethods (for performance reasons) I don't know if it is possible to abstract them into a SplitRule class type in any reasonable way that actually enforces the static methods existing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did my best here: created a class SplitRule with two static abstract methods and made other split rules inherit from it, then added SplitRule as hint here.

Copy link
Contributor

@juanitorduz juanitorduz Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok! It seems mypy does not like abstract classes type=hints (python/mypy#4717). My suggestion is to focus on the core functionalities and test and come back to this minor detail once we are ready to merge :) (From my side, I will investigate how to do it)

@@ -0,0 +1,102 @@
# Copyright 2022 The PyMC Developers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests for this module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try. Don't have much experience with pytest but I can hopefully figure it out :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests in https://github.com/pymc-devs/pymc-bart/blob/main/tests/test_tree.py are probably the simplest ones in PyMC-BART, you can try following a similar pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Co-authored-by: Osvaldo A Martin <aloctavodia@gmail.com>
@velochy
Copy link
Contributor Author

velochy commented Jul 7, 2023

I have now addressed all the comments to the best of my knowledge.
@juanitorduz do you have any more comments or improvement recommendations?

@aloctavodia
Copy link
Member

You need to run black -l 100 pymc_bart/tree.py

@aloctavodia aloctavodia merged commit e5c95f0 into pymc-devs:main Jul 7, 2023
@aloctavodia
Copy link
Member

Thanks @velochy!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants