Learning the foundations of data science inevitably means learning the terminology associated with the world of data analysis and statistics.
Understanding the essentials of data science and the analysis of large data sets will see you coming into contact with a lot of concepts and terms from the field and its surrounding disciplines.
This data science glossary aims to be your one-stop resource for finding your way around the ever-growing world of data science.
Data Science and Statistics Glossary
Algorithm
A repeatable process or series of steps that can be used to perform calculations or carry out specific tasks with data.
Artificial Intelligence (AI)
The meaning of intelligence is hotly contested, varied, and open to interpretation, but once a suitable definition is settled on, AI refers to any intelligence which isn’t biological.
The development of artificial intelligence is intrinsically tied to the use of vast amounts of data, with statistical analysis of that data being the primary driver of machine learning and decision making.
Big Data
You’ve almost certainly come across big data (both as a term and in its real-world form), and it sits at the heart of what data science actually is.
Big data ultimately refers to the once unimaginable volumes of data that we now create, have access to, and have the capability to work with.
Previously, working with the massive amounts of data we have on hand would have been both unthinkable and impractical. However, continual improvements in technology have made the storage and processing of big data realistic.
The proliferation of big data has increased the demand for those able to collect, clean, analyze, understand, and interpret this information (data scientists) to an extent never seen before, and that demand grows by the day.
Bias-Variance Tradeoff
In the context of machine learning from a data set, bias refers to the tendency of the learner to learn the same wrong thing consistently.
Similarly, variance refers to the tendency of the learner to learn random things despite the actual signal provided.
Bias is associated with underfitting and variance with overfitting, and it’s easy to run into one of these errors while trying to avoid the other.
The key to avoiding both bias and variance would be to learn what’s known as a perfect classifier. Unfortunately, unless we know it up front, we have to settle for finding the best compromise we can.
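As a rough illustration of the two failure modes, the sketch below compares a model that is too simple (it always predicts the training average) with one that is too flexible (it memorizes every training point). The data, the alternating +1/-1 stand-in for noise, and all function names are assumptions made up for this example.

```python
def true_signal(x):
    """The underlying relationship we would like a learner to recover."""
    return 2 * x

xs = [0, 1, 2, 3, 4, 5, 6, 7]
# Deterministic stand-in for noise: +1/-1 alternating, flipped for the test sample
train_y = [true_signal(x) + (1 if i % 2 == 0 else -1) for i, x in enumerate(xs)]
test_y = [true_signal(x) + (-1 if i % 2 == 0 else 1) for i, x in enumerate(xs)]

# High bias (underfitting): ignore x and always predict the training average
train_mean = sum(train_y) / len(train_y)

def underfit(x):
    return train_mean

# High variance (overfitting): memorize each training point, noise included
# (this lookup only works here because the test sample reuses the same x values)
memory = dict(zip(xs, train_y))

def overfit(x):
    return memory[x]

def mean_squared_error(predict, inputs, targets):
    """Average squared gap between predictions and the observed values."""
    return sum((predict(x) - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)

for name, model in [("underfit", underfit), ("overfit", overfit)]:
    train_error = mean_squared_error(model, xs, train_y)
    test_error = mean_squared_error(model, xs, test_y)
    print(f"{name}: train MSE = {train_error:.1f}, test MSE = {test_error:.1f}")
```

The underfit model is consistently wrong on both samples, while the overfit model looks perfect on the data it memorized and then makes errors on the fresh sample.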
Chi-Square Test
Chi-square (pronounced kai) is a statistical method which tests whether a classification of data can simply be attributed to random chance or whether there is an underlying law which determines it.
Carrying out a chi-square test allows you to estimate whether two variables in a cross tabulation are associated rather than independent.
The test statistic is compared against the chi-square distribution, whose shape depends on what are known as the degrees of freedom used to calculate it. With more degrees of freedom, the distribution looks increasingly like the normal distribution.
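A minimal sketch of the calculation for a cross tabulation is shown here in plain Python. The counts are hypothetical, and the function name `chi_square_statistic` is an assumption for this example rather than part of any library.

```python
# Hypothetical 2x2 cross tabulation (made-up counts):
# rows = studied daily (yes / no), columns = passed the exam (yes / no)
observed = [
    [30, 10],
    [20, 40],
]

def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Count we would expect in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

statistic, dof = chi_square_statistic(observed)
print(f"chi-square = {statistic:.2f}, degrees of freedom = {dof}")
```

For this made-up table the statistic is about 16.67 with 1 degree of freedom. That is well above the 5% critical value of roughly 3.84 for 1 degree of freedom, so random chance alone would be an unlikely explanation for the pattern.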
Coefficient
The coefficient is the number or symbol that multiplies a variable or an as yet unknown quantity, and is written as a prefix to it.
x is the coefficient in the formula x(y + z)
5 is the coefficient in 5ab
If you’re representing an equation as a graph (for example y = 2x + 8), it’s the coefficient of x (here, 2) that determines the slope of the line on the graph.
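The link between the coefficient and the slope can be checked with a few lines of Python. The function name `line` and its default arguments are assumptions chosen to match the example y = 2x + 8.

```python
def line(x, slope=2, intercept=8):
    """Evaluate y = slope * x + intercept (defaults match y = 2x + 8)."""
    return slope * x + intercept

# Moving one unit along the x axis changes y by exactly the coefficient of x
rise_per_unit = line(5) - line(4)
print(rise_per_unit)  # 2
```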
Correlation
Refers to the amount of correspondence (relationship) between two sets of data. If Knowledge of data science increases as Number of hours studied increases, then the two correlate.
The crucial thing to remember as a data scientist or statistician is that correlation does not equal causation. Just because there’s an apparent relationship between two variables, it doesn’t mean that one thing is causing the other.
What’s known as the correlation coefficient measures the degree of correlation between two data sets: whether the correlation is strong or weak, and positive or negative, on a scale from -1 to 1.
A correlation coefficient of 1 is considered a perfect positive correlation, with positive correlation weakening as the value decreases toward 0 (0.9 would be a strong positive correlation, 0.3 a weak positive correlation, etc.). A value of 0 indicates no linear correlation.
Negative correlation works in the same way but goes sub-zero down to a correlation coefficient of -1 which represents a perfect negative correlation. An example of this would be Number of hours procrastinating on social media going down while Number of hours per day studying data science goes up.
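One widely used correlation coefficient is Pearson's, sketched below in plain Python. The glossary does not say which coefficient it has in mind, so treating it as Pearson's is an assumption, and the data lists are invented to mirror the studying and procrastinating examples.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two variables move together around their means
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical observations for five days
hours_studying = [1, 2, 3, 4, 5]
knowledge_score = [2, 4, 6, 8, 10]
hours_procrastinating = [5, 4, 3, 2, 1]

print(f"{pearson_correlation(hours_studying, knowledge_score):.1f}")        # 1.0
print(f"{pearson_correlation(hours_studying, hours_procrastinating):.1f}")  # -1.0
```

Real data rarely lines up this neatly, so values strictly between -1 and 1 are the norm.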
Data
Data can be thought of as any facts and statistics that can be collected for analysis or reference.
In the context of data science, data can be visualized as distinct pieces of information for anything that has existed (or has taken place) or currently exists and that can be quantified and gathered together for the purpose of analysis.
While the term data is often also used to refer specifically to computer information that is transmitted or stored, it is important to remember that any fact or statistic is actually a data point and can therefore be considered data.
One interesting and little known fact is that the word datum actually refers to a single piece of information whereas data is the plural.
Data Science
If data is defined as any facts and statistics that can be collected and quantified, and science is the state of knowing through the acquisition of knowledge, then a commonly agreed upon definition of data science would be the process of gaining insight and knowledge from the analysis of data sets. The era of big data (see above) has taken this field even further. The discipline is now primarily concerned with the collection, organization, analysis, and interpretation of large and complex sets of data, accomplished with the aid of specialized technology and software for handling such significant quantities of information.
MATLAB
A commercial numerical computing environment and a programming language in its own right, MATLAB is popular for both visualization and developing algorithms.
Mean
Most often, the mean refers to the arithmetic mean: the average value of a list of numbers or a numerical data set, found by adding the values together and dividing by how many there are.
Median
The median value is the middle number in a set of sorted values (think median = middle to help remember this). If there is an even number of values in the set, then the average of the two middle numbers is taken as the median value.
Mode
Refers to the most common or most frequent value in a set of numbers or sample of data. Confusingly, in some statistical software, the term mode can also be used to refer to the type of data being worked with (e.g. integer, date, etc.), so don’t get thrown by this if you come across it in your travels through data!
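Python's standard `statistics` module provides all three of these measures. The sample below is hypothetical and chosen so that each result is easy to verify by hand.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # hypothetical sample, already sorted

mean_value = statistics.mean(values)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median_value = statistics.median(values)  # even count, so (3 + 5) / 2 = 4.0
mode_value = statistics.mode(values)      # 3 appears most often

print(mean_value, median_value, mode_value)
```

Note that the three measures disagree here: the single large value (10) pulls the mean above the median.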
Python
Python is a programming language that’s widely used by both beginners and advanced users in fields which work with large quantities of data. The ease of use and significant power of the language make it highly popular in both data science and machine learning applications, both of which benefit from the language’s highly specialized libraries for tasks such as the generation of graphs.
R
R is an open-source programming language and free environment which is used for statistical computing and the creation of visual data graphics. Available across Linux, Windows, and Mac systems, R is considered one of the major programming languages to become familiar with for exploring, analyzing, and utilizing large sets of data.
Variance
This is a measure of how far a list of numbers spreads out from its average value (also known as the mean value, see above), calculated as the average of the squared differences from the mean. The variance is commonly used in data science and statistics to determine how much the values within a set of numbers differ. In fact, the variance is so frequently used that many stats software packages provide an automated way to calculate it right out of the box.
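Variance is usually computed as the average squared difference from the mean. The sketch below works it out by hand on a hypothetical sample, then checks the result against Python's standard `statistics` module.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean_value = sum(values) / len(values)  # 40 / 8 = 5.0
squared_differences = [(v - mean_value) ** 2 for v in values]

# Population variance: average squared difference from the mean
population_variance = sum(squared_differences) / len(values)

# Sample variance divides by n - 1 instead, for data that is a sample of a larger group
sample_variance = sum(squared_differences) / (len(values) - 1)

print(population_variance)  # 4.0
print(population_variance == statistics.pvariance(values))  # True
```

Which of the two versions applies depends on whether the numbers are the whole population or only a sample of it; `statistics.variance` gives the sample version.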
Vector
In the world of math and physics, a vector can be thought of as a mathematical expression of movement which combines magnitude (amount) and direction. In data science the term rests on a similar concept, but data scientists specifically define a vector as an ordered set of real numbers, with each number referring to a distance along an axis in the form of coordinates. This set of numbers can be used to represent information about basically anything that’s being modeled. Representing the values in this way makes it easier for software to run complex math operations on the data, which is great news for analyzing large amounts of data.
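As a small sketch of the idea, the example below stores two data points as vectors (plain Python lists) and measures how far apart they are. The features, the values, and the helper names are all hypothetical.

```python
from math import sqrt

# Each vector holds: (hours studied, practice problems solved, quiz score)
student_a = [2.0, 10.0, 65.0]
student_b = [5.0, 14.0, 65.0]

def subtract(u, v):
    """Component-wise difference between two vectors of equal length."""
    return [a - b for a, b in zip(u, v)]

def magnitude(v):
    """Length of a vector: the square root of the sum of its squared components."""
    return sqrt(sum(component ** 2 for component in v))

# Distance between the two students in this three-dimensional space
distance = magnitude(subtract(student_b, student_a))
print(distance)  # sqrt(3**2 + 4**2 + 0**2) = 5.0
```

In practice, numerical libraries run the same kind of operation on many vectors at once, which is part of what makes the representation convenient for large data sets.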