In statistics, we usually only ever sample randomly. The sample size depends on the distribution of the data and comes from a formula that uses the standard deviation, the population size, and the confidence level you're looking for. The link below has the formula for normal distributions. It gives you a sample size that, when drawn randomly, should produce a representative subset of your data.
https://math.stackexchange.com/questions/3490/optimal-sample-size
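To make the idea concrete, here is a minimal sketch of one common form of that calculation: the sample size needed to estimate a mean under a normal approximation, with an optional finite population correction. The function name `sample_size` and the `margin_of_error` parameter are my own assumptions for illustration (the comment above only mentions standard deviation, population size, and confidence level), so treat this as a rough guide rather than the exact formula from the link.

```python
import math
from statistics import NormalDist


def sample_size(std_dev, margin_of_error, confidence=0.95, population_size=None):
    """Approximate sample size for estimating a mean (normal approximation).

    margin_of_error is an assumed extra input: the half-width of the
    confidence interval you are willing to accept, in the data's units.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    # Sample size for an effectively infinite population.
    n = (z * std_dev / margin_of_error) ** 2

    # Finite population correction shrinks n when the population is small.
    if population_size is not None:
        n = n / (1 + (n - 1) / population_size)

    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(std_dev=15, margin_of_error=2))
    print(sample_size(std_dev=15, margin_of_error=2, population_size=1000))
```

The correction only matters when the uncorrected sample size is a noticeable fraction of the population; for very large populations the two results are nearly the same.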
Given the context of your problem, I'd treat each of the different attributes as its own population that should be sampled. So if you have 3 attributes, you should really sample n from each, or n*3 in total.
Without knowing more context it's hard to say how you should proceed, but it's pretty common to use various clustering/MiniSOM/PCA algos to identify which samples are similar and can be represented by a subset or trend.
I'm a statistician and may be able to help; you can DM me if you would like. One idea that came to mind: if the population of interest can be represented as a categorical variable, then you might consider using stratified random sampling to ensure that you get the highest diversity within the sample despite the small sample size.
Stratified random sampling
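For anyone who wants to see what stratified random sampling looks like in practice, here is a minimal sketch using only the standard library. It assumes proportional allocation (each stratum contributes the same fraction) with at least one record per stratum; the function name `stratified_sample`, the `segment` field, and the example data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=None):
    """Draw a simple random sample of `fraction` from each stratum.

    `key` maps a record to its stratum (the categorical variable).
    Every stratum contributes at least one record, so small groups
    are never dropped from the sample.
    """
    rng = random.Random(seed)

    # Group records by stratum.
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        # Proportional allocation, clamped to the range [1, stratum size].
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


if __name__ == "__main__":
    # Hypothetical data: 60 records in segment A, 30 in B, 10 in C.
    data = (
        [{"id": i, "segment": "A"} for i in range(60)]
        + [{"id": i, "segment": "B"} for i in range(60, 90)]
        + [{"id": i, "segment": "C"} for i in range(90, 100)]
    )
    picked = stratified_sample(data, key=lambda r: r["segment"], fraction=0.1, seed=42)
    print(len(picked), sorted({r["segment"] for r in picked}))
```

With a plain random sample of 10 out of these 100 records, segment C could easily be missed entirely; sampling within each stratum guarantees it shows up.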