Where to start?

Basic data science tools and statistics

Hussain Burhani
5 min read · Sep 6, 2019


Simply getting going with something and improving on it later is, perhaps, a prudent approach to solving data science problems. In the same vein, I’m starting off by scribbling some sparsely edited and researched articles, and publishing them on Medium despite that, to see where to go from here :-)

Note: For my data science projects, I’m currently using a Windows 10 laptop with a 2-core Intel i5 processor, 10 GB of RAM, and a 256 GB SSD.

Standard installations and accounts

  1. Git (which includes Git Bash): Git is a version control system, and Git Bash provides a UNIX-style command line environment on Windows.
  2. A GitHub account: GitHub is a web-based hosting service for Git repositories. In addition to letting you share your code with a wider audience, it also acts as a place to back up your work to the cloud.
  3. Anaconda: Anaconda comes pre-packaged with Python, common data science libraries (including numpy, pandas, scikit-learn, and seaborn), the Conda package manager, Jupyter Lab, and several other data science tools.

Daily workflow

  1. Open Git Bash to start the command line environment.
  2. In Git Bash, enter jupyter lab to start up Jupyter Lab.
  3. Within Jupyter Lab, click File > New > Terminal.
  4. [Optional] In the Terminal command line environment, enter source activate <virtual_environment_name>, assuming you have a virtual environment set up and configured with ipykernel. In addition, run conda install -c conda-forge nb_conda_kernels once so that Jupyter Lab can auto-detect virtual environments.
  5. Click File > New > Notebook and select the appropriate virtual environment.

That’s it. Both Git Bash and Terminal (since you started Jupyter Lab via Git Bash) use UNIX commands. Through Jupyter Lab, you have access to Terminal, a file browser (to access your git repositories), a Python console, as well as Jupyter Lab notebooks, which are ideally run on your chosen virtual environment. Once you have your tabs within Jupyter laid out in a manner you like, you can even save the configuration and work in an environment which feels very much like an all-inclusive, browser-based IDE.

By the way, did you know that Python was created by a Dutch programmer, Guido van Rossum? Well, it was, and there is a nod to him if you simply type import this in your Python console.

Statistics basics

Switching gears, a start to data science is incomplete without at least a basic refresher on statistics. I’ve sprinkled some broad concepts here to dust off the many layers and flavors of this rich field, which constitutes the foundation of data science.

If the reader notices corrections to be made to this section, or, for that matter, anything else which should be corrected, clarified further, or improved upon, I would appreciate you letting me know!

Okay, so let’s just dive right in. A random variable is the chance outcome of an event mapped out numerically, e.g. how many times does one get heads when one flips a coin three times? Let X be the random variable whose list of outcomes for this experiment is 0, 1, 2, 3. These represent all the values X can take, and the list is also known as the sample space of X. Taking it a step further, the distribution of the random variable X represents both all the possible outcomes and the probability of each possible outcome.
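To make this concrete, here’s a minimal sketch in Python (standard library only) that enumerates the coin-flip experiment, recovers the sample space, and builds the distribution of X:

```python
from itertools import product

# Enumerate all 8 equally likely sequences of three fair coin flips
outcomes = list(product("HT", repeat=3))

# X = number of heads in a sequence; its sample space is {0, 1, 2, 3}
sample_space = sorted({seq.count("H") for seq in outcomes})

# The distribution of X pairs each possible value with its probability
distribution = {x: sum(seq.count("H") == x for seq in outcomes) / len(outcomes)
                for x in sample_space}
print(distribution)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```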

Random variables generally fall into two buckets: discrete, where the outcomes are countable, and continuous, where the outcomes are uncountable. In the above example, one can count, or list out, all the outcomes of the experiment by recording how many times one gets heads when one flips a coin three times. Therefore, because the outcomes are countable, X is a discrete random variable. If the experiment entailed finding the weight of a glass of water, the random variable X would be a continuous random variable. This is because weight cannot be counted, per se: X could take on an uncountable number of values, such as 1, 1.01, 1.230002, 5.333339999222229048279499920000000000001, and so on. Note that both discrete and continuous random variables can range from negative to positive infinity.

The expected value of a random variable is its mean, or average, value. For a discrete variable, it is simply the sum of each value in the sample space multiplied by the probability of that value occurring. For a continuous variable, the expected value is the integral of each value weighted by the probability density of the distribution, so we would use integration to evaluate it.
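As a quick check, here’s a sketch computing E[X] for the coin-flip example, plus the continuous case via integration (numpy and scipy ship with Anaconda; the normal density here is just an arbitrary example of mine):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Discrete case: the distribution of X from the coin-flip example
distribution = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}
expected_value = sum(x * p for x, p in distribution.items())
print(expected_value)  # 1.5, i.e. on average 1.5 heads in three flips

# Continuous case: integrate x * f(x) over the support; here f is a
# normal density with mean 2.0, and the integral recovers that mean
e, _ = quad(lambda x: x * norm(loc=2.0, scale=1.0).pdf(x), -np.inf, np.inf)
print(round(e, 4))  # 2.0
```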

Descriptive statistics focus on summarizing, describing, and understanding observable data, whereas inferential statistics generalize results from a sample to a larger population.

The central tendency of the distribution can be measured by the mean, median, and mode. If the mean is larger than the median, then the distribution is right-skewed.
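As an illustration, here’s a tiny sketch on a made-up, right-skewed sample, where the lone outlier drags the mean above the median:

```python
import statistics

data = [1, 2, 2, 2, 3, 4, 20]  # made-up sample with one large outlier

print(statistics.mean(data))    # about 4.86, pulled up by the outlier
print(statistics.median(data))  # 2
print(statistics.mode(data))    # 2
# mean > median, so this sample is right-skewed
```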

The spread of the distribution can be measured by range, variance, and standard deviation. The range is simply the maximum value less the minimum value (or, in the case of the inter-quartile range, the difference between the 75th percentile and 25th percentile values), the variance is the average of the squared differences (errors) of each outcome from the mean, and the standard deviation is the square root of the variance. We take the square root of the variance because we had originally squared the errors (to stop positive and negative errors from cancelling each other out), and the square root brings the measure back to the original units of the data.
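And the spread measures for the same made-up sample, this time using numpy:

```python
import numpy as np

data = np.array([1, 2, 2, 2, 3, 4, 20])  # same made-up sample as above

print(data.max() - data.min())           # range: 19
q75, q25 = np.percentile(data, [75, 25])
print(q75 - q25)                         # inter-quartile range: 1.5
print(np.var(data))                      # variance: average squared deviation from the mean
print(np.std(data))                      # standard deviation: square root of the variance
```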

Moving on to inferential statistics (inferring population characteristics from samples), a sampling distribution is the distribution of the means of every possible sample of size n from the population (a theoretical notion). This is known as the sampling distribution of X̄, i.e. the sampling distribution of the sample mean for sample size n.

The central limit theorem states that the sampling distribution (note this is the distribution of the means of every possible sample of size n) approaches a normal distribution as n gets larger (generally n ≥ 30), irrespective of the shape of the population distribution; a quick simulation after the list below illustrates this. The conditions we need to meet for inferring population parameters from sample statistics are:

  1. Sampling should be random so as to eliminate (or minimize) bias.
  2. The sampling distribution should be approximately normal (generally n ≥ 30).
  3. Individual observations must be independent. In a sample without replacement, technically the observations are not independent (since the population is decreasing by 1 with each observation), but we can treat individual observations as independent if the sample size is less than 10% of the population.
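To see the theorem in action, here’s a small simulation sketch (the exponential population, sample counts, and seed are arbitrary choices of mine): we draw many samples of size n = 30 from a clearly non-normal population, and the means of those samples still cluster normally around the population mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal, right-skewed population
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size n and record each sample's mean
n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

# The sampling distribution of the sample mean is approximately normal:
# centered on the population mean, with spread shrinking as n grows
print(np.mean(sample_means), population.mean())             # both near 2.0
print(np.std(sample_means), population.std() / np.sqrt(n))  # both near 0.37
```

A histogram of sample_means (e.g. with seaborn) shows the familiar bell shape, even though the population itself is heavily skewed.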

In addition, some main types of distributions that discrete variables take on include: Uniform, Bernoulli, Binomial, Poisson, and Geometric; for continuous variables: Uniform, Exponential, and Normal (though there are more types). But I’ll leave more on this, perhaps, for another article, another time.
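As a small teaser in the meantime, numpy’s random generator can draw samples from each of these; a sketch with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete distributions
print(rng.integers(1, 7, size=5))        # uniform: fair die rolls
print(rng.binomial(n=1, p=0.5, size=5))  # Bernoulli: single coin flips
print(rng.binomial(n=3, p=0.5, size=5))  # binomial: heads in three flips
print(rng.poisson(lam=4, size=5))        # Poisson: counts per interval
print(rng.geometric(p=0.5, size=5))      # geometric: flips until first heads

# Continuous distributions
print(rng.uniform(0.0, 1.0, size=5))
print(rng.exponential(scale=2.0, size=5))
print(rng.normal(loc=0.0, scale=1.0, size=5))
```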

