I have personnel data: headcount, leavers, time in grade, performance and rankings, maternity and temporary leave, possibly the hires' universities, promotions, etc. I am building a Diversity Dashboard in Tableau that splits all of that data by gender. I want to go a step further and apply predictive analytics, but I need ideas about how to do it. I'm learning Python, so it wouldn't be a problem if I need to learn to do something in Python. How can I start? I'm confident with Alteryx and SQL.
Get someone who knows what they're doing on the team. Experiment for personal growth, but don't deliver anything beyond the dashboard unless you get it vetted by a statistician and legal. You're working on data that's very touchy from both a legal and PR standpoint. Mistakes here can seriously damage both the firm and the client.
If you are going to experiment, start with any short intro ML or statistics course. Doesn't really matter which if you're at the point where you're not sure where to start.
How big is the data set? Rows? Variables? Can you start with some basic regression in Excel?
The tables are in SQL. For example, we have monthly headcount since 2010, so close to 300,000 rows. But the data needs cleaning and has inconsistencies, so I'm currently working with July 2018 to today. Variables are level, gender, line of business, group, maybe age, university, tier/ranking, commentaries about performance, productivity, and hire and promotion dates.
Not saying I would model that in Excel (you could as a starting point), but that’s a small data set. I would get it into a tool where you can see and work with the data.
Thanks. That’s something I constantly have to remember when working with HR data. I’m going to dig into it.
Seriously, figure out what you’re doing before you start communicating “findings”. I can’t tell you how many times I’ve seen these sorts of findings being discussed with no foundation whatsoever, other than a simple observation that has no statistical backing.
People seem to be aware of the phrase “correlation is not causation”, but then dismiss the idea as soon as they have an observation that fits their hypothesis or world view, more generally.
For example, let’s say you observe that, on average, gender A has a higher salary than gender B. Some might get all excited at this point and start screaming “gender inequality”. However, this observation on its own means nothing. It could be the case, hypothetically, that a disproportionate number of individuals of gender A live in higher cost of living cities or are aligned to higher-value (financially) practices in a firm (like strategy vs operations). These are what you might call confounding variables. Proper statistical analysis will flush these out.
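To make the confounding point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the eight records, the two practices, and the salary figures (in thousands) are assumptions, not real data. It compares the raw average gap between the two groups with the gap inside each practice.

```python
from statistics import mean

# Made-up records for illustration only: (gender group, practice, salary in thousands).
# Pay depends only on practice here, but group A is concentrated in strategy.
records = [
    ("A", "strategy", 100), ("A", "strategy", 100), ("A", "strategy", 100),
    ("A", "operations", 60),
    ("B", "strategy", 100),
    ("B", "operations", 60), ("B", "operations", 60), ("B", "operations", 60),
]

def mean_salary(rows, group, practice=None):
    """Average salary for a group, optionally restricted to one practice."""
    values = [salary for g, p, salary in rows
              if g == group and (practice is None or p == practice)]
    return mean(values)

# Raw comparison: ignores which practice people sit in.
raw_gap = mean_salary(records, "A") - mean_salary(records, "B")

# Like-for-like comparison: same practice, different group.
within_gap = {
    practice: mean_salary(records, "A", practice) - mean_salary(records, "B", practice)
    for practice in ("strategy", "operations")
}

print("raw gap:", raw_gap)
print("gap within each practice:", within_gap)
```

In this toy data the raw gap is entirely explained by the practice mix. Real data will rarely be that clean, which is why the comparison needs a proper statistical treatment rather than a single average.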
Now to attempt to answer your question, try regression, which will help you to understand which independent variables drive variation in whatever output variables you want to explore (and to what degree). Taking my example above, the output variable would be salary. The inputs, or independent variables, might be location, organizational group, years of experience, etc. If you have the “right” data, you will be able to figure this out. But, it could be the case that you don’t have the right data, which will show up as a low R-squared value for your regression model.
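Since you're learning Python, here is a rough sketch of that regression idea using NumPy's least-squares solver. The data is synthetic and generated in the script; the column names (`is_group_a`, `in_strategy`, `years`), the coefficients used to generate salary, and the helper `fit_ols` are all assumptions for illustration, not anything from your tables. Salary is generated so that it depends on practice and experience only, while group A is more likely to be in strategy.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 400

# Synthetic, made-up data: salary depends on practice and experience only.
in_strategy = rng.integers(0, 2, size=n)
# Group A (coded 1.0) is more likely to be in strategy: this is the confounder.
is_group_a = (rng.random(n) < np.where(in_strategy == 1, 0.75, 0.25)).astype(float)
years = rng.uniform(0, 15, size=n)
salary = 50 + 30 * in_strategy + 2 * years + rng.normal(0, 2, size=n)

def fit_ols(columns, y):
    """Ordinary least squares with an intercept; returns (coefficients, R squared)."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r_squared = 1 - residuals.var() / y.var()
    return coef, r_squared

# Model 1: salary explained by group alone (the "simple observation").
naive_coef, naive_r2 = fit_ols([is_group_a], salary)

# Model 2: salary explained by group, practice, and years of experience.
full_coef, full_r2 = fit_ols([is_group_a, in_strategy, years], salary)

print("group coefficient, group only:", round(naive_coef[1], 2), "R2:", round(naive_r2, 2))
print("group coefficient, with controls:", round(full_coef[1], 2), "R2:", round(full_r2, 2))
```

The group-only model should show a sizeable group coefficient with a low R squared, while the model with controls should push the group coefficient toward zero and the R squared up. For real work you would likely want a library such as statsmodels, which also reports standard errors and p-values, and you would still want a statistician to review the model.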
What exactly are you trying to predict here? What are your hypotheses? Figuring out how to code a regression model is easy; what will take more time is understanding the ethics and implications of what you’re doing. Look into confirmation bias, unbalanced datasets, and overall ethics. HR data is not something to play with if you don’t know what you’re doing. I’d be really annoyed to find out some random person in the firm is predicting my tenure at the firm or future salary growth based on a few scarce data points. I can already tell maternity leave data is going to disproportionately affect your prediction results given that many women are let go when taking it.