Data science is one of the most sought-after skill sets out there right now, with demand for people who have these skills far outstripping supply. This is good news for anyone who has gained, is in the process of gaining, or is planning to gain data analysis skills.
But what exactly is data science, what does it entail, and what makes a data scientist?
If you’re looking to get started in the world of data science but don’t know where to begin, read on to find out what data science is, what it really means, what you need to succeed in the field, and why it’s all the rage right now.
What is data science?
The chances are that you’ve probably already heard about data science before arriving here. In fact, I’d be surprised if you hadn’t since the buzz around the field has reached crazy levels in recent years.
If you only listened to the hype you might think that data scientists have nearly superhuman abilities that can solve pretty much every.single.problem that exists in the universe (spoiler alert: they can’t).
Everyone is so excited about data science that Harvard Business Review even called data scientist “the sexiest job of the 21st century.”
Data has become kind of a big deal and, by extension, those who know how to work with large volumes of it are so hot right now. This rush for all things data science is due in part to the massive influx of data (aka “big data”) that now exists in the world and that is being produced every single day.
Understandably, this has led businesses (traditionally in the tech industry) to realize that being able to effectively and accurately analyze large quantities of data has the potential for very real (and very big) gains. To recruit the best and brightest employees who could help them capitalize on this surge of data, companies began offering six-figure starting salaries for job titles like data scientist and data analyst.
It’s not just companies that are interested in big data, though.
High demand for data science has also been fueled by researchers, analysts, and academics who are equally excited by the potential of big data (and small data too!) and want to be able to leverage technology to answer very real questions.
Combined with advances in computing power, this has all led to something exciting. What exactly that is, though, is up for debate.
A quick internet search to help answer the question “what is data science?” yields contradictory answers filled with so many overused buzzwords that they are basically useless and often leave you with more questions than answers. For instance, how big is big data? Is data science even a science? Or, on an even more fundamental level, what is data? And perhaps the most important question of all – why should I care about data science?
So, while data science is now an established career path, the details about what it actually is are still a bit fuzzy.
Part of the reason there is so much confusion around data science is that it isn’t well-defined.
As Cathy O’Neil and Rachel Schutt discuss in their must-read book on the topic, Doing Data Science: Straight Talk from the Frontline, part of the problem is that there “is no well-defined body of knowledge, and there’s no canonical corpus. It’s popularized and celebrated in the press and media, but there’s no ‘authority’ to push back on erroneous or outrageous accounts. There’s a lot of overlap with various other subjects as well; it could become redundant with a machine learning textbook, for example.”
Despite all the hype and the ambiguous, hard-to-pin-down definitions, data science is definitely something, even if it is only an area of interest that is undergoing some sort of paradigm shift.
But what exactly is data science?
On the most basic level, data science is about using scientific methods to solve real problems with data.
More broadly, data science is a multidisciplinary field which brings together practitioners and specialists from a number of different areas, generally sitting at the overlap between statistics and math, computer science, and specific area expertise (this could be anything from business or chemistry to zoology, international relations, theology or history).
You might have already seen Drew Conway’s data science Venn diagram circulating around the interweb. It does a pretty good job of illustrating data science’s multidisciplinary nature and how everything fits together.
Mike Driscoll offers a pretty neat definition of data science IRL:
Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.
But data science is not merely hacking, because when hackers finish debugging their Bash one-liners and Pig scripts, few care about non-Euclidean distance metrics.
And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it.
Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what’s possible.
A simpler and more condensed description of what a data scientist does comes from Mike Loukides:
Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
Breaking the name down, we arrive at a potentially suitable definition for our own use. Merriam-Webster defines the two components of data science as:
Data: factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
Science: the state of knowing : knowledge as distinguished from ignorance or misunderstanding
So, a useful way to define data science could be:
The process of collecting factual information, measurements, or statistics for reference or analysis in order to reach a state of knowledge or understanding.
It’s in the name, right? Data + science = data science.
We could then extend (and modernize) this definition to bring it in line with the technologies shaping the modern world: the vast amounts of information (big data) we now have access to, and the ability to carry out predictive modeling and forecasting using things like machine learning and increasingly smart AI.
Data science: the basics
Although the term ‘data scientist’ was only ‘officially’ coined in 2008, the field as a discipline is actually pretty old, building on the work that statisticians have been doing for years.
It might seem like “machine learning algorithms were just invented last week and data was never ‘big’ until Google came along” but that just isn’t the case.
Everything that is being developed now is building on what came before and based on decades, or even centuries, of work done by all of the statisticians, computer scientists, engineers, mathematicians, and other scientists who have already paved the way.
With this in mind, perhaps a better way to think about data science in today’s context is to think about the basic definition and remit of the subject, alongside the contemporary environment and technologies that now allow us to take it to the next level.
Like data science 2.0.
While there’s nothing new about fundamental statistical methods or the process of carrying out a linear regression, and programming languages have been around for decades now, what is new is the way in which these different disciplines are leveraging technology and coming together in order to allow us to process, analyze, and understand the masses of data that we now have access to.
Since data science is a multidisciplinary field, it’s important to have a well-rounded skill set. If you’re interested in actually becoming a data scientist, experience in the following areas is recommended:
Statistics and math
Most data science projects will go something like this: you have a question to answer or a problem to solve, you get large amounts of data, and you then run statistical analyses on that data.
Because the practice of data science is based on statistics and other mathematical concepts, having a basic understanding of statistics and math, particularly linear algebra, will help you immeasurably.
While it’s theoretically possible that you can do some data science without having a basic mathematical understanding, it’s really not recommended. Understanding why you’re getting the results that you are is essential and it’s nearly impossible to do this without understanding the mathematical basis underpinning whatever algorithm you’re using. Moreover, you’ll have no way of knowing whether your findings are actually correct or what implications they may have for the bigger picture.
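To make that point a little more concrete, here is a minimal sketch of a simple linear regression written out by hand in Python, using only the standard library. The data (hours studied versus exam score) is made up purely for illustration, and the function names fit_line and r_squared are hypothetical helpers, not part of any library. The idea is that once you can see the slope is just the covariance of x and y divided by the variance of x, the number a library hands back stops being a black box.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance(x, y) / variance(x); the shared 1/n factors cancel.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up illustrative data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
fit_quality = r_squared(hours, scores, slope, intercept)
print(f"score is roughly {intercept:.2f} + {slope:.2f} * hours")
print(f"R^2 = {fit_quality:.3f}")
```

In this toy example the slope works out to 4.1 points per extra hour. Knowing where that number comes from is what lets you judge whether it means anything, for instance whether five data points are enough to trust it.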
Programming
When it comes to working with large sets of data, computers just do it better. Plus, the bulk of data that humans produce daily is digital, which means it exists in the realm of computers. Therefore, being able to communicate with computers and make them do what you want is a key part of data science.
When it comes to data science, there are two primary languages you should know about: R and Python. If you’re interested in learning Python, check out this series for beginners that covers installing Python and Anaconda, alongside the essential elements of Python.
Data wrangling
Again, data science is about data, so it’s logical that working with data will be one of the first things to learn. Data wrangling, also known as data munging, involves collecting, cleaning, preparing, manipulating, and moving data around. This is the decidedly un-sexy part of data science and IRL will probably take up about 80% of your time. It also requires accuracy to prevent errors in your analysis.
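As a rough idea of what the cleaning step can look like, here is a small sketch in plain Python using the standard csv module. The raw data is invented for the example, and clean_rows is a hypothetical helper rather than a standard function. In practice many people reach for a dedicated library for this, but the underlying chores are the same: trim whitespace, normalize capitalization, handle missing or unparseable values, and drop duplicates.

```python
import csv
import io

# Made-up raw export with typical problems: stray whitespace, inconsistent
# capitalization, a missing value, a non-numeric entry, and a duplicate row.
RAW = """name,city,age
 Alice ,london,34
Bob,Paris,
alice,London,34
Carla,  berlin ,n/a
Dev,Paris,29
"""


def clean_rows(text):
    """Return de-duplicated rows with normalized text and integer ages."""
    cleaned, seen = [], set()
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        age_text = row["age"].strip()
        # Keep missing or unparseable ages as None rather than guessing.
        age = int(age_text) if age_text.isdigit() else None
        key = (name, city, age)
        if key in seen:
            continue  # drop rows that are duplicates after normalization
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)

known_ages = [r["age"] for r in rows if r["age"] is not None]
print("rows kept:", len(rows), "| rows with a usable age:", len(known_ages))
```

Note the design choice to store a missing age as None instead of filling in 0 or an average. Silently inventing values at this stage is exactly the kind of error that quietly skews the analysis later on.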
Machine learning
Machine learning is a sub-discipline of artificial intelligence, but it overlaps significantly with data science in areas such as deep learning, clustering algorithms, and decision trees. Machine learning certainly has more to offer than data science uses, but its algorithms are rich and powerful tools that should be in a data scientist’s toolbox.
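To give a feel for what a clustering algorithm actually does, here is a bare-bones sketch of k-means (Lloyd's algorithm) in plain Python. The points are made up, the function name k_means is just for illustration, and the initialization (taking the first k points as starting centroids) is a simplifying assumption for this sketch. Real libraries use smarter starting strategies and handle many more edge cases.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Naive initialization (an assumption for this sketch): use the first
    # k points as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Made-up 2D points forming two loose groups.
points = [(1.0, 1.1), (5.0, 5.2), (1.2, 0.9), (0.8, 1.0), (5.1, 4.9), (4.9, 5.0)]

labels, centroids = k_means(points, k=2)
print("labels:", labels)
print("centroids:", centroids)
```

With this data the points near (1, 1) end up in one cluster and the points near (5, 5) in the other. On messier, real-world data the result depends heavily on the starting centroids and on the choice of k, which is where the statistical understanding described earlier comes back in.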
Domain expertise
While you can run every algorithm or statistical analysis out there, the real value is in interpretation. This is where domain expertise, or area knowledge, comes in. Being able to understand the context that the data you’re analyzing exists within, and how your numeric findings apply to real life, is the whole point of being a data scientist. From biologists to art historians to political scientists to botanists, data science can be applied to pretty much any field out there, so the sky’s the limit when it comes to choosing which area to become an expert in.
Communication and presentation skills
Perhaps the “softest” of the areas that a data scientist should be experienced in, communication skills might also be one of the most important. Being able to communicate and present your findings to team members, engineers, decision makers, and others who have never seen your data is key. Doing so in clear language with appropriate, accurate, and compelling data visualizations will bring your game to the next level.