My POV: Data Scientist Skills I Think You Need To Have

Many data scientists have years of higher education invested in becoming one. In their professional work life, they have probably had to go through many rings of fire before being elevated to the title of data scientist in their organizations.

Based on personal experience, reading through college curriculums, blog posts from data scientists, research companies’ papers and blogs, etc., here are some of the key attributes I feel a data scientist needs to have. Sources used have been noted and provided at the end of this article.

Education (No Brick in the Wall Here, You Need an Education!)

A post on KDNuggets [5] suggests that Data Scientists are highly educated, with 88% having at least a Master’s degree and 46% having PhDs. It also states, “and while there are notable exceptions, a very strong educational background is usually required to develop the depth of knowledge necessary to be a data scientist. To become a data scientist, you could earn a Bachelor’s degree in Computer Science, Social Sciences, Physical Sciences, and Statistics. The most common fields of study are Mathematics and Statistics (32%), followed by Computer Science (19%) and Engineering (16%). A degree in any of these courses will give you the skills you need to process and analyze big data.”

Then, KDNuggets suggests various technical skills that are required to truly be a Data Scientist. These include R Programming, Python, Java, Perl, C/C++, Hadoop, MapReduce, SQL, Hive, Pig, Apache Spark, Data Visualization, Machine Learning and AI, Unstructured Data (“Dark Analytics”), etc.

The World of Data Warehousing is Changing. Change with it.

Regarding the realm of data warehousing, Stephen J. Smith, in his Eckerson Group article, The Demise of the Data Warehouse [12], discusses a future where there will be no data warehouse. Instead, we will have a data lake plus master data management (DL+MDM). In DL+MDM, the first principle is that you keep track of the data about the data (metadata) and only move the data when it is absolutely required.

The second important principle is that for the data warehouse to be useful to the business, it must also provide the ‘truth’. This is where MDM comes in.

Thus, the data lake provides access to the data and the MDM provides the truth.

But the third principle of the data warehouse is speed. In the past, the speed had to come from physically moving the data closer together (which is still a good idea if you can, aka ‘data gravity’). But today, with the elasticity of the cloud and MPP (Massively Parallel Processing), speed and size are much less of an issue.

Many new tools are now available to support this paradigm. As a data scientist, you want to make sure you are, at least, aware of these tools, their capabilities, and the part they play in this paradigm.

Self-Service Data Preparation Tools

Tools like Hadoop and R Programming have steep learning curves. However, tools like Alteryx and Tableau Prep help you easily connect to disparate data sources, clean and enhance the data, and then pass it along to other tools, such as Tableau, to perform your analytics and develop a deeper understanding of the data. Their drag-and-drop workflows make these tools easier for non-technical people (such as citizen data scientists) to learn and use for their own self-service data preparation.

SQL

Personally, I think the most important technical skill for a Data Scientist to have is a solid understanding of the Structured Query Language, SQL. SQL allows you to perform the CRUD (Create, Read, Update and Delete) operations you need to manipulate data in a database such as Oracle, SQL Server, or DB2. Also, its core set of built-in functions allows you to perform analytical functions and transform your database structures.

In my experience, most employers want their data science personnel to have a solid foundation in SQL. In fact, in many technical interviews, candidates may be asked to write some pseudo-SQL code on a whiteboard in response to a business question the interviewer poses.

In my work environment, we have many business-facing people with limited technical backgrounds who do their own data analysis using SQL to query our various databases and Tableau Extracts.
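To make the CRUD operations and built-in functions mentioned above concrete, here is a minimal sketch using Python's standard sqlite3 module. SQLite is used only because it needs no server; the sales table, its columns, and the figures are hypothetical and made up for illustration. The final query is the kind of thing a whiteboard question such as "which departments bring in the most revenue?" might call for.

```python
import sqlite3


def run_crud_demo():
    """Run Create, Read, Update and Delete against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    cur = conn.cursor()

    # Hypothetical table, invented for this example.
    cur.execute(
        "CREATE TABLE sales ("
        "id INTEGER PRIMARY KEY, "
        "department TEXT NOT NULL, "
        "revenue REAL NOT NULL)"
    )

    # Create: insert new rows.
    cur.executemany(
        "INSERT INTO sales (department, revenue) VALUES (?, ?)",
        [("Finance", 1200.0), ("Parks", 300.0), ("Finance", 800.0), ("Library", 150.0)],
    )

    # Update: correct an existing row.
    cur.execute("UPDATE sales SET revenue = 350.0 WHERE department = ?", ("Parks",))

    # Delete: remove rows that are no longer needed.
    cur.execute("DELETE FROM sales WHERE department = ?", ("Library",))

    # Read: use the built-in SUM aggregate to total revenue per department.
    cur.execute(
        "SELECT department, SUM(revenue) AS total "
        "FROM sales GROUP BY department ORDER BY total DESC"
    )
    rows = cur.fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for department, total in run_crud_demo():
        print(department, total)
```

The same four statements (INSERT, SELECT, UPDATE, DELETE) carry over to Oracle, SQL Server, and DB2, although data types and some function names differ between products.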

Machine Learning/AI

I personally think, right now, having a solid background in Machine Learning and Artificial Intelligence (AI) will get you the biggest bang for your buck. This is the area where there seems to be the biggest shortage of skilled workers.

So, what does this entail? Per Jeremie Harris [13], a Machine Learning Engineer’s job is to build, optimize and deploy machine learning models to production. Generally, you will be treating machine learning models as APIs or components, which you’ll be plugging into a full-stack app or hardware of some kind, but you may also be called upon to design models yourself.

Some of the requirements to develop your Machine Learning skills include working with Python, JavaScript, scikit-learn, TensorFlow/PyTorch (and/or enterprise deep learning frameworks), and SQL or MongoDB (typically used for app DBs).

Also, you will need to have a solid fundamental understanding of concepts such as supervised and unsupervised machine learning, time series, natural language processing, outlier detection, computer vision, recommendation engines, survival analysis, reinforcement learning, and adversarial learning. [5]

Yes, there is more to learn

I will leave discussing other skill sets you need to be a data scientist for a future blog post. But I at least want to give you a preliminary laundry list of these skills. This list would include:

  • R Programming
  • Python
  • Hadoop
  • Apache Spark
  • Tableau Desktop
  • Data Visualization
  • DataRobot

Human Centric & Investigational Skills You Need

[NOTE: This classification term was provided to me by Bridget Cogley, Senior Consultant at Teknion Data Solutions and Data Ethicist. Neither of us liked the term “soft skills” since these skills are often difficult to achieve and more inherent in some people than others, and not everyone can easily learn them. We also did not like the term “non-technical skills” as these skills have an important cohesion and symmetry with a data scientist’s technical skills.]

#1 – Data Ethics

I probably could have started with interpersonal skills or communication skills, but if you lack data ethics, regardless of the number of degrees you have or the number of years you have worked in the profession, you may have the data scientist title, but you are no data scientist.

Hilary Mason, Data Scientist in Residence at Accel, noted three key challenges facing the data science community.

  • Imprecise Ethics
  • No standards of practice
  • A lack of a consistent vocabulary

We work in a profession with a great deal of uncertainty. Too often, our interactions with our business communities are determined by the algorithms and machine learning developed by data scientists. [6]

Omoju Miller, the senior machine learning data scientist at GitHub, noted in a Harvard Business Review interview by Hugo Bowne-Anderson [6]:

We need to have that ethical understanding, we need to have that training, and we need to have something akin to a Hippocratic oath. And we need to actually have proper licenses so that if you actually do something unethical, perhaps you have some kind of penalty, or disbarment, or some kind of recourse, something to say this is not what we want to do as an industry, and then figure out ways to remediate people who go off the rails and do things because people just aren’t trained and they don’t know.

A recurring theme in the same interview is the serious, harmful, and unethical consequences that data science can have, such as the COMPAS Recidivism Risk Score, which, according to ProPublica, has been “used across the country to predict future criminals” and is “biased against blacks.”

Mr. Bowne-Anderson also notes that we are approaching a consensus that ethical standards need to come from within data science itself, as well as from legislators, grassroots movements, and other stakeholders. Part of this movement involves a reemphasis on interpretability in models, as opposed to black-box models. That is, we need to build models that can explain why they make the predictions they make. Deep learning models are great at a lot of things, but they are infamously uninterpretable. Many dedicated, intelligent researchers, developers, and data scientists are making headway here with work such as LIME, a project aimed at explaining what machine learning models are doing. [6]

#2 – You really need to understand the data

“I am drowning in data, yet I am starving” – Unknown

Data scientists often deal with very large amounts of data. Often, we have a lot of data about a subject area, but it is in a form that is not consumable. As a Data Architect by profession, I have often been surprised when I ask a business partner questions related to their data and they are unsure or do not have an answer. This is not something I am just now experiencing, but something I have experienced over the past 40 years in this profession.

Referring back to Hilary Mason’s three major concerns, I have often walked into situations where no documentation exists, no data dictionary exists, no abbreviation lists (her version of a vocabulary), no data models, no data lineage, etc.

I often use telephone area codes as an example when discussing the importance of being able to understand your data. Back in the early 1980s, if I wanted to validate an area code (I coded in COBOL back then), I could depend on checking that the middle digit was a zero or a one. Arizona only had one area code back then, 602. Now Arizona has five area codes: 480, 520, 602, 623 and 928. My old validation rule no longer works. I remember when we first had to convert full telephone numbers to the new nomenclature, and it was very messy.
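The area code story can be sketched in a few lines. This is a Python illustration rather than the original COBOL, and the function names are hypothetical; it encodes only the rule described above (three digits, middle digit zero or one) and checks it against the five Arizona area codes listed.

```python
# Arizona area codes as listed in the paragraph above.
ARIZONA_AREA_CODES = ["480", "520", "602", "623", "928"]


def old_rule_is_valid(area_code: str) -> bool:
    """Early-1980s rule: exactly three digits, and the middle digit is 0 or 1."""
    return (
        len(area_code) == 3
        and area_code.isdigit()
        and area_code[1] in ("0", "1")
    )


def rejected_by_old_rule(codes):
    """Return the codes that the old validation rule would wrongly reject."""
    return [code for code in codes if not old_rule_is_valid(code)]


if __name__ == "__main__":
    print(rejected_by_old_rule(ARIZONA_AREA_CODES))
```

Only 602 passes the old rule; the other four are rejected even though they are real area codes. A rule that was once a safe assumption about the data quietly became a bug when the data changed, which is why validating against a maintained reference list tends to age better than a pattern baked into code.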

As a data scientist, you must be curious. I would even dare to say, you must be more curious about your business partner’s data than they are! It is O.K. to let your guard down and say, “I don’t understand this. Please explain it to me.” Often, they don’t know the answers to questions about their data either. Your non-functional requirements are just as important as your functional ones.

Curiosity leads to knowledge. I spend most of my time discovering and prepping data. This is starting to change with what I refer to as self-service ETL tools such as Alteryx and Tableau Prep.

#3 – You Have to Read!

I have always been a voracious reader. All the magazine subscriptions I have sometimes drive my wife crazy. I have always tried to be aware of the latest technologies and trends related to data science.

It is difficult to stay on top of everything related to data science. It embarrasses me to say this, but I am a slow reader. When I was young, my father encouraged me to read a paragraph at a time. He told me once I understood what I just read, to go on to the next paragraph. Repeat. And continue. My life as a life-long reader had begun.

Today, we have the Internet at our disposal. There are a lot of excellent sites to find articles about almost every subject imaginable. There is also a lot of “noise” on the Internet. You must train yourself to determine what is valuable to read and what is just noise. I don’t personally have any guidance I can offer here on how to tell the two apart, but I get a kind of gut feel when I find something of great value to read, as opposed to a vendor preening or someone spewing a biased beef.

I recommend you settle on some key focus areas such as data preparation, data visualization, key statistical concepts, machine learning or Tableau. In my experience, the Tableau community is a prolific one. You will find tips and tricks, how-to articles, deep discussions on data visualization philosophy, etc. As you read, visit the sites of some of the products being discussed and read the product information, capabilities, etc. Even if you probably would not buy or use that product soon, you will at least have the knowledge about it in your “bag of tricks.”

#4 – Know Your Business!

In a previous life, I used to do a lot of interviews, to help our HR Department, for hiring application developers and data modelers. I am going to let you know my favorite question to ask in these interviews. It is:

How does your current company make a profit?

Out of, let’s say, 50 of these interviews over the years, I only had one person who was able to answer this question. Typically, they would respond by telling me about an application they just developed or some key reports they had created for senior management. I would tell them I understand that is what you did from the technical side, but I want to know your understanding of how your company makes money!

So, it is always important that you understand the business of your company. I recommend you read your company’s annual reports, their product descriptions, know the features and advantages of these products, competitive intelligence, etc.

When I am working with a department at work, I personally set a goal of knowing as much or more about their department than the people I am working with. Perhaps it is an unrealistic goal, but often I come pretty darn close.

In terms of the data science aspect of the business, you should be able to discern which problems the business considers critical. In addition, you should always be thinking of new ways for your business partners to leverage their data to make actionable decisions.

To be able to do this, you must understand how the problems you solve can impact the business. Therefore, as I mentioned earlier, you need to know how your business operates, what it needs to do to make a profit, your business partner’s “pain points”, and what you can do to steer them in the right direction.

#5 – Communication Skills

First, check your technical jargon at the door. You are talking to your business partners, and if you want to get their attention, you need to talk in their language. You need to take all that fancy tech-speak and translate it to language your business partners understand and care about. Think of the departments within your company and the types of things they are interested in. For example, when I talk to our Finance Department, they are interested in being able to quickly access information about next year’s budget, expenditures by department, how much revenue has come in so far this year, and which departments are bringing in the most revenue. A data scientist must be able to provide quantified insights to their business partners for them to be able to make actionable decisions on a timely basis.

Storytelling is a highly desirable skill for data scientists. Being able to weave a compelling story around the business partner’s data will draw them in and make them want to know more details. It will also prompt questions from them, which may help bring new questions and data needs to the surface. Data visualizations and infographics, used properly, are great tools for conveying large amounts of data in understandable, digestible visuals that your business partners can quickly consume, so they gain the knowledge they need to make actionable decisions.

#6 – Be a Team Player

A data scientist does not work in a vacuum. Nor should they want to. In most cases, you will work with individuals across your organization, from company executives, product managers, and department heads to staff-level employees. Often, you will even have to work with external customers, or in my case, citizens within our City.

When I first entered the IT profession in the late 1970s, I literally worked in the cold, damp basement of the building. A separate group (non-IT) gathered requirements and told us what reports to develop. Once I completed coding the report, I turned it over to the person who gathered the requirements to present to our business partners. If there were any changes, we would have to repeat this entire process (often several times) until they were satisfied with the report. Talking directly to the business partners was a no-no for coders. We have learned over the years that this Waterfall, coder-in-the-basement method of development does not work well, and that more dynamic, iterative methods better serve our business partners.

Fortunately, the methods used in IT have matured, as has the business partners’ perception of the IT professional. Iterative, collaborative processes now have IT personnel talking directly to the business partners, which helps them hear the needs of the business and ask questions in real time. Now, this notion of not talking directly to the business may sound silly or antiquated to some of you, but 40 years ago, programmers were relegated to the basement and just coded. Back then, most coders would have loved to talk to the business to get the requirements right the first time versus using a “middleman” to go back and forth.

Times have changed, IT is more sophisticated, and the data scientist is now front and center with the business partners. Don’t take for granted the evolution of the IT profession that got us where we are today. These relationships with your business partners are critical to your very existence. Everyone around you is part of your team. Treat everyone with respect, and remember that the information they have is important and that their subject matter knowledge is essential to how successful you will be as a data scientist. To borrow an old phrase, there is no “I” in “Team.”

Summary

Regarding what a data scientist really is, I see this a little more simply. A person who can do the following is what I see as a data scientist.

  • Engage with their business partners to better understand the question(s) that need answering and to make their data actionable. They need to be able to speak in business terms and have some level of subject matter expertise in what the business sector does.
  • Be able to determine the best data sources for the information needed to answer the business question. This could be internal data sources, government data sources, etc. They also need to be able to determine if the data source is reliable and unbiased.
  • Based on the toolsets they know, they should be able to consume the data sources and produce the required results in a visual, easily consumable format. They should know how to validate the results to ensure they are accurate and correct.
  • They need to have ethics. If the results do not agree with the outcome the business expects, they should not be altered or tweaked to provide a better narrative. Also, if there is confidential data in the results, they need to ensure they obfuscate or redact the information the business should not see based on internal or other data governance regulations.
  • They need to be collaborative. I like to run my results by trusted coworkers to ensure I did not miss something or that I was fair and objective in producing my results.
  • Finally, they need to be able to go back to the business and convey these results in language they will understand. The data scientist needs to be open to criticism of the results, be able to explain their methodology used to create the results and be willing to stand by and support their results even if it does not fit or satisfy the narrative the business wants.

Data Science should be less about the title and more about the skills of what that title encompasses. Degrees, tenure, and titles do not necessarily make a good data scientist. The quality of the work, attention to detail, ethics, accuracy, and the ability to convey the results to the business are what is most important. If these attributes can be met, I don’t care what you call yourself, you are a data scientist.

Sources

[1] Hayashi C. (1998) What is Data Science? Fundamental Concepts and a Heuristic Example. In: Hayashi C., Yajima K., Bock HH., Ohsumi N., Tanaka Y., Baba Y. (eds) Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Tokyo.

[2] Stewart Tansley; Kristin Michele Tolle (2009). The Fourth Paradigm: Data-intensive Scientific Discovery. Microsoft Research. ISBN 978-0-9825442-0-4.

[3] Bell, G.; Hey, T.; Szalay, A. (2009). “COMPUTER SCIENCE: Beyond the Data Deluge”. Science. 323 (5919): 1297–1298. doi:10.1126/science.1170411. ISSN 0036-8075.

[4] Wikipedia, Data Science, https://en.wikipedia.org/wiki/Data_science.

[5] KDNuggets, 9 Must-have skills you need to become a Data Scientist, updated, KDNuggets.com, May 2018, https://www.kdnuggets.com/2018/05/simplilearn-9-must-have-skills-data-scientist.html.

[6] Bowne-Anderson, Hugo, What Data Scientists Really Do, According to 35 Data Scientists, Harvard Business Review, August 15, 2018, https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists.

[7] Cearley, David W. and Brian Burke, Samantha Searle, Mike J. Walker, Top 10 Strategic Technology Trends for 2018, October 3, 2017, Gartner ID: G00327329

[8] Idoine, Carlie, Citizen Data Scientists and Why They Matter, Gartner Research Blog, May 13, 2018, https://blogs.gartner.com/carlie-idoine/2018/05/13/citizen-data-scientists-and-whythey-matter/.

[9] Loshin, David, Empowering the Citizen Analyst: Agile Techniques for Enhancing Self-Service for Data Science, Knowledge Integrity, TDWI Webinar, October 11, 2018.

[10] –, Gartner Says More Than 40 Percent of Data Science Tasks Will Be Automated by 2020, Sydney, Australia, January 16, 2017, https://www.gartner.com/en/newsroom/pressreleases/2017-01-16-gartner-says-more-than-40-percent-of-data-science-tasks-will-beautomated-by-2020.

[11] Moolayil, Jojo John, What is a data scientist?, Quora, April 27, 2015, https://www.quora.com/What-is-a-data-scientist-3.

[12] Smith, Stephen J., The Demise of the Data Warehouse, Eckerson Group, July 19, 2017, https://www.eckerson.com/articles/the-demise-of-the-data-warehouse.

[13] Harris, Jeremie, Why you shouldn’t be a data science generalist, Towards Data Science, November 1, 2018, https://towardsdatascience.com/why-you-shouldnt-be-a-data-science-generalist-f69ea37cdd2c.

[14] Columbus, Louis, IBM Predicts Demand For Data Scientists Will Soar 28% By 2020, Forbes Magazine, May 13, 2017, https://www.forbes.com/sites/louiscolumbus/2017/05/13/ibm-predicts-demand-for-data-scientists-will-soar-28-by-2020/#137cc51c7e3b.
