Finally. Thanks for this OP. Anything but dummy variables
Pretty clickbaity. I'd be more worried about somebody skimming this and concluding that they need to impact code the four-level factor in their linear regression. But overall it's not terrible, and I learned some new techniques. One I've used when a variable has too many levels (e.g. zip codes) is impact coding.
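For anyone who hasn't seen impact coding, here is a minimal sketch in plain Python. It assumes a numeric target and replaces each level with its mean outcome minus the overall mean, shrunk toward zero for rare levels. The function names, the `smoothing` parameter, and the toy zip code data are all made up for illustration; this is not the vtreat API or its exact formula.

```python
from collections import defaultdict


def fit_impact_codes(levels, targets, smoothing=10.0):
    """Map each level to a shrunken (level mean - global mean) value.

    `smoothing` is a hypothetical knob: it acts like that many extra
    pseudo-observations sitting at the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1

    codes = {}
    for level, n in counts.items():
        # Shrink the level mean toward the global mean; rare levels shrink more.
        shrunk_mean = (sums[level] + smoothing * global_mean) / (n + smoothing)
        codes[level] = shrunk_mean - global_mean
    return codes


def apply_impact_codes(levels, codes):
    # Levels never seen during fitting get an impact of 0.0 (the global mean).
    return [codes.get(level, 0.0) for level in levels]


# Toy data, invented for illustration only.
zip_codes = ["10001", "10001", "10001", "94105", "94105", "60601"]
prices = [500.0, 520.0, 510.0, 900.0, 880.0, 300.0]

codes = fit_impact_codes(zip_codes, prices, smoothing=2.0)
encoded = apply_impact_codes(["10001", "60601", "99999"], codes)
print(encoded)
```

The whole column collapses into a single numeric feature no matter how many levels it has. In practice the codes should be fit on training data only (or out of fold), otherwise the target leaks into the feature.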
I’m pretty sure they made a Python port, but the vtreat package for R is excellent, as are all of John Mount’s packages.
D1, I get what you're saying, but it doesn't seem clickbaity at all. It raises a valid point and gives a few examples of alternatives. I think it would be a bit unkind to judge the article by the level of attention paid by the reader. Anyone is free to misinterpret anything.
I think the overall point is made well. Overreliance on dummy variables can be a way to forgo preliminary data analysis and reasonably simple, intuitive decision making prior to model build.
What salt? Unnecessarily sugary. Stop sugarcoating. Why do you feel the need to make excuses for click bait?
It’s good to point out, but yeah, the title is a little clickbaity. I think a lot of techniques that are common across data science aren’t optimal ways of doing analysis. Mixed effects models can work well with categorical variables that have many levels, and can provide a theoretical range for unobserved categories.
Thanks for posting. This was one of the first questions I had when I moved from R to Python and built random forests. But since one-hot encoding is standard practice in Python, I quickly forgot my concerns.