Conflicting Terminology

Bad terminology is the enemy of good thinking.

Laymen, economists, statisticians, data scientists and machine learning practitioners tend to use the same terminology. Engineers and mathematicians, for the most part, use a different terminology. The terminology used in machine learning and data science is taken from statistics. Personally, I prefer the terminology used by engineers and mathematicians, but I sometimes switch between the two. To read more about conflicting terminology, please follow this link.

Data Professionals

War is 90% information.

Data professionals are the data scientists, data analysts and data engineers who work professionally in data science. The broad term "data professional", and even "business analyst", is often wrongly treated as a synonym for "data scientist". However, data professionals can be split into the three categories just mentioned.

Data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals. Today, successful data professionals understand that they must advance past the traditional skills of analyzing large amounts of data, data mining, and programming. In order to uncover useful intelligence for their organizations, data scientists must master the full spectrum of the data science life cycle and possess a level of flexibility and understanding to maximize returns at each phase of the process. The term "data scientist" was coined as recently as 2008, when companies realized the need for data professionals who are skilled in organizing and analyzing massive amounts of data. Data professionals are well-rounded, data-driven individuals with high-level technical skills who are capable of building complex quantitative algorithms to organize and synthesize large amounts of information used to answer questions and drive strategy in their organization. They possess a strong quantitative background in statistics and linear algebra, as well as programming knowledge focused on data warehousing, mining, and modeling, which they use to build and analyze algorithms.

Data scientists examine which questions need answering and where to find the related data. Another suitable term for the role is data professional lead. They have business acumen and analytical skills as well as the ability to mine, clean, and present data. Businesses use data scientists to source, manage, and analyze large amounts of unstructured data. Results are then synthesized and communicated to key stakeholders to drive strategic decision-making in the organization.

A business analyst's primary responsibility is to analyze and validate requirements and to communicate with all stakeholders. These are the requirements for changes to business processes, information systems, and policies. A professional business analyst plays a big role in moving an organization toward efficiency, productivity, and profitability.

Data analysts bridge the gap between data scientists and business analysts. They are provided with the questions that need answering and then organize and analyze data to find results that align with high-level business strategy. Data analysts are responsible for translating technical analysis to qualitative action items and effectively communicating their findings to diverse stakeholders.

Data engineers are experts who manage exponentially growing amounts of rapidly changing data. They focus on the development, deployment, management, and optimization of data pipelines and infrastructure to transform and transfer data to other data professionals for querying.

Needed Skills

The zoologist brought her data scientist to the jungle, since he is an expert on python and pandas.

SAS, R and Python are sought-after programming languages for statistical computing. Python is generally regarded as more performant than R, but R has far more statistical packages. Python users are usually also expected to know Pandas, a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language. SAS is popular with banks but is less versatile than coding in R or Python. Python is a general-purpose programming language, while R and SAS are domain-specific languages.

Data wrangling is about cleaning and preparing data before processing and analysis.
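
To make the idea of wrangling concrete, here is a minimal sketch in plain Python. The records, field names and the helper functions `clean_row` and `wrangle` are all hypothetical, invented only for illustration. The sketch assumes a typical wrangling pass consists of normalizing text, converting types, and dropping unusable or duplicate rows. In practice these steps are often done with Pandas instead.

```python
# Hypothetical raw records, as they might arrive from a form or a CSV export.
raw_rows = [
    {"name": " Alice ", "age": "34", "city": "Oslo"},
    {"name": "bob", "age": "41", "city": "lund"},
    {"name": "Alice", "age": "34", "city": "oslo"},      # duplicate once cleaned
    {"name": "Carol", "age": "", "city": "Umea"},        # missing age
    {"name": "Dave", "age": "forty", "city": "Kiruna"},  # unparseable age
]


def clean_row(row):
    """Return a normalized copy of row, or None if it cannot be repaired."""
    name = row["name"].strip().title()
    city = row["city"].strip().title()
    try:
        age = int(row["age"])
    except ValueError:
        return None  # age is missing or not a number
    return {"name": name, "age": age, "city": city}


def wrangle(rows):
    """Clean every row, then drop unusable rows and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        fixed = clean_row(row)
        if fixed is None:
            continue
        key = (fixed["name"], fixed["age"], fixed["city"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(fixed)
    return cleaned


tidy = wrangle(raw_rows)
print(tidy)
```

Only the rows that survive cleaning are passed on to analysis. Whether a bad row should be dropped or repaired is a judgment call that depends on the data set.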

Storytelling is the extraction of useful value from computed data. Once a business has started collecting and combining all kinds of data, the next elusive step is to extract value from it. The data may hold tremendous amounts of potential value, but not an ounce of value can be created unless insights are uncovered and translated into actions or business outcomes.

Skill | Data Scientist | Data Analyst | Data Engineer | Me?
SAS, R, Python | XXX
Data Visualization | XXX
Data Wrangling | XXX
Machine Learning | XX
Storytelling | XX
Salary (USD) | 113,436 | 62,453 | 137,776

Machine Learning

Monkey see, monkey do.

Machine learning is when computers learn by themselves without guidance from humans, although in practice there may be some guidance. Where does machine learning stem from? Few know that AI and machine learning have roots in military research on subversion, brainwashing and mind control. The Canadian psychologist Donald Hebb, father of neuropsychology and neural networks, presented Hebbian learning as early as 1949 (long before computers were commonplace!) and, according to the book Sensory Deprivation: A Symposium Held at Harvard Medical School (1961), said:

"The work that we have done at McGill University began, actually, with the problem of brainwashing. We were not permitted to say so in the first publishing.... The chief impetus, of course, was the dismay at the kind of "confessions" being produced at the Russian Communist trials. "Brainwashing" was a term that came a little later, applied to Chinese procedures. We did not know what the Russian procedures were, but it seemed that they were producing some peculiar changes of attitude. How? One possible factor was perceptual isolation and we concentrated on that."

Big Data

Data is like people – interrogate it hard enough and it will tell you whatever you want to hear.

By big data we refer to the study of data sets too large and too complex for traditional software to deal with. Big data uses distributed computing techniques as its backbone, and one big data application dating from the 1960s is the US surveillance program ECHELON. One could see big data technology as a subset of distributed computing, since distributed computing requires shared data sets and distributed data flows. Another big data example is Facebook: despite having billions of users, and up to a billion of them active, sending a message within Facebook is nearly instantaneous. Facebook has solved the problem of transmitting data quickly across massively distributed data stores. Clearly, dealing with big data is the data engineer's task, but depending on the task it can also be part of the data scientist's role.
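
The split-and-combine idea behind distributed computing can be sketched on a single machine. This is only a toy illustration: threads stand in for separate machines, and the function names and sample lines are hypothetical. It is not how ECHELON or Facebook work. Each worker counts words in its own chunk of the data, and the partial counts are then merged into one result.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """Map step: count words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, n_workers=3):
    """Reduce step: merge the per-chunk partial counts into one total."""
    chunks = split_into_chunks(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total


log_lines = [
    "big data needs distributed computing",
    "distributed computing shares data",
    "data flows between workers",
    "workers combine partial results",
]
totals = distributed_word_count(log_lines)
print(totals["data"], totals["distributed"])
```

The merged result is the same as counting everything in one pass. The point of the split is that no single worker ever needs to hold the whole data set.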


There are two kinds of data scientists. Those who can extrapolate from incomplete data.

In the needed skills table I have added myself. As can be seen, I have positioned myself as a desirable combination of HPC scientist, data scientist and data engineer. The amount of data is growing faster than computing power, and soon there will be strong industrial demand for people with the skills to compute big data with massive performance. A computational engineer has the ability to support a data engineer, and a computational scientist has the ability to support a data scientist. Even when the data is small enough that scaling out with hardware is still possible, a numerical analyst remains exceptionally useful for skills in mathematical modeling.