Becoming a Data Head: How to Think, Speak, and Understand Data Science, Statistics, and Machine Learning
Authors: Alex J. Gutman and Jordan Goldmeier
Table of contents
Introduction
Data may be the most important part of your job, whether or not you want it to be.
The errors of prediction surrounding the subprime mortgage crisis and the 2016 US general election are key examples of bad data practice.
Data problems occur due to:
 tackling hard problems.
 a lack of critical thinking.
 poor communication.
This book wants to help you:
 Think statistically, and understand variation.
 Become data literate. Speak intelligently about data and ask the right questions of statistics.
 Understand what's going on with machine learning, text analytics, deep learning, and artificial intelligence.
 Avoid common pitfalls when working with data.
Part 1: Thinking like a Data Head
Chapter 1: What is the problem?
The first step is to help your organisation work on the data problems that really matter.
5 questions to ask when choosing a data problem:

Why is this problem important? 2 warning signs that you are failing to answer this question properly include:
 focusing on methodology: thinking that using some fancy analysis method will set you apart.
 focusing on deliverables: aiming to deliver a dashboard doesn't say anything about what value the project will bring.

Who does this problem affect? And how will their work change? Bring the affected people into the discussion. Do a solution trial run: assume you will successfully complete the project and ask:
 Can you use the answer?
 Whose work will change?
 What if we don't have the right data?
 When is the project over?
 What if we don't like the results? Meaning that you have the data to answer the question successfully, but the answer isn't the one desired by the stakeholders.
Fundamentally, teams must answer “Is this a real business problem that is worth solving, or are we doing data science for its own sake?”
Chapter 2: What is data?
 Information is derived knowledge.
 Data is encoded information.
Each row of a data table is a measured instance of the entity concerned. Each column is a list of the information you're interested in.
Table rows might be referred to as observations, records, tuples, or trials. Columns might be referred to as features, fields, attributes, predictors, or variables.
A data point is the intersection of an observation and a feature.
Data types can be divided into:
 Numeric, made up of numbers.
 Continuous: can take on any number at all.
 Count or discrete: can only take on whole numbers.
 Categorical, made up of words, symbols, phrases  including potentially numbers in the case of e.g. postcodes.
 Ordered or ordinal: the data has an inherent order, e.g. a survey asking you to rate your experience from 1 to 5.
 Unordered or nominal: the data doesn't have an intrinsic order.
Data collection can be described as:
 Observational: collected by passively observing a process.
 Experimental: collected using a methodology based on the scientific method. You randomly assign a "treatment" to something. In a clinical trial for instance you compare a treatment group to a control group. This lets you measure the effect of the treatment without having to worry about confounding features.
Data can be:
 Structured: usually presented in a spreadsheet-like form of rows and columns.
 Unstructured: things like paragraphs of text, pictures, videos etc. Typically this sort of data may need transforming into structured data to analyse.
Summary statistics allow us to understand information about a set of data.
The 3 most common summary statistics are mean, median and mode. These are all measures of location or central tendency. There are also measures of variation (variance, range, and standard deviation) that measure the spread of data.
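These summary statistics can be computed directly with Python's standard statistics module; a minimal sketch (the data values here are invented for illustration):

```python
import statistics

# A small, made-up sample of daily sales figures.
data = [4, 8, 6, 5, 3, 6, 7, 6, 9, 6]

# Measures of location (central tendency).
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value when sorted
mode = statistics.mode(data)      # most frequent value

# Measures of variation (spread).
variance = statistics.variance(data)  # sample variance
std_dev = statistics.stdev(data)      # sample standard deviation
data_range = max(data) - min(data)    # range
```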
Chapter 3: Prepare to Think Statistically
A key part of statistical thinking is to ask questions  even when they're about data and claims that we personally like.
Probabilistic thinking, statistical literacy and mathematical thinking are terms often used in the same way as statistical thinking. The key is that they're all about evaluating data and evidence.
There is variation in everything. We don't need to explain every peak and trough.
The 2 types of variation:
 Measurement variation: coming from how we measured or collected the data.
 Random variation: coming from any randomness of the process being measured itself. This can't be controlled but it can be measured via the tools of probability.
Asking people to rate things on a scale is problematic. What one person rates as a 5 someone else might rate as a 10.
Uncertainty can be managed via probability and statistics.
Probabilities drill down, statistics drill up.
If you blindly pick a handful of differently coloured marbles from a bag, then probability will inform your guess as to what's in your hand if you know what was in the bag in the first place. Statistics will let you say something about what is in the bag based on what you happened to pull out.
People typically underestimate variation, particularly when dealing with small numbers. Underestimating variation leads to overestimating your confidence in the data.
The law of small numbers:
the lingering belief … small samples are highly representative of the populations from which they are drawn
We must be cognisant of how our intuition can mislead us.
Statistics can be either:
 Descriptive statistics: Numbers that summarise data, e.g. mean, median, standard deviation. They deliberately oversimplify data, condensing a whole spreadsheet into a few numbers summarising the key information.
 Inferential statistics: How we can take the information we have at hand to make a best guess about information in the wider world.
Concluding that 75% of Americans believe UFOs exist by asking 20 tourists to a UFO museum should make you sceptical for several reasons.
 Biased sample: These people are likely particularly interested in UFOs.
 Small sample: Implies large variation.
 Underlying assumptions: It assumes everyone at the museum is an American.
Part 2: Speaking Like a Data Head
Chapter 4: Argue with the Data
Your job should include showing leadership by asking questions about the data. If the raw data is bad then no amount of cleaning, statistics or machine learning can compensate for that: "Garbage in, garbage out".
Key questions that help argue with data:
 Tell me the data origin story
 Who collected the data? Always check this when it comes to third party data in particular.
 How was the data collected? This may reveal how valid any conclusions are or ethical issues. Note the difference between observational and experimental data.
 Is the data representative?
 Is there sampling bias? The data you have must be representative of the universe you care about. Asking "Why was this data collected?" can be useful. Observational data should usually be considered biased.
 What did you do with outliers? Not liking an extreme value doesn't mean it should be deleted  you must justify any removal.
 What data am I not seeing?
 How did you deal with missing values? These include datapoints that weren't collected or outliers that were removed.
 Can the data measure what you want it to measure? Usually the data we use is a proxy for what we really want to measure. How good is that proxy?
Data type matters to your approach. e.g. number of incidents is numeric count data so use binomial regression rather than linear regression.
It's a fallacy that larger samples are always more reliable. If the data is biased then simply getting more of it won't help.
Most businesses have an inappropriate culture of acceptance when it comes to data which leads to repeated failures of data projects.
Chapter 5: Explore the Data
Working with data is not a linear process. We should continuously adapt to what we discover within it.
Exploratory Data Analysis (EDA) is this process of iteration, discovery and scrutiny.
EDA reveals the subjective/artistic side of data work. Different teams provided with the same problem and data may choose very different paths. Sometimes their conclusions may not agree with each other.
Stakeholders, managers, and subject matter experts should make themselves available to the data team. Have an open dialogue. Expect iteration. Ensure their assumptions are correct before the whole work is put at risk.
Data teams who go fishing without the correct context can produce work that makes sense statistically, but not practically.
EDA is a mentality, not a checklist or a tool.
3 questions to ask during exploration:
 Can the data answer the question?
 Did you discover any relationships?
 Did you find new opportunities in the data?
First check if the summary statistics match what you already understand about the problem. Visualise the data to spot anomalies that need looking into.
 Histograms let you see how continuous numeric data is distributed.
 Boxplots let you compare data between many groups.
 Bar charts show counts of categorical data.
 Line charts are useful for time series, e.g. spotting seasonality.
 Scatter plots let you see how one variable varies against another.
Watch out for outliers and missing values.
Outliers shouldn't be removed without justification. Data analysts risk chipping away at the data to simplify it to the point where it doesn't reflect the reality of the situation it's trying to capture.
The relationship between variables you can see in scatterplots can be expressed via the "correlation" summary statistic.
Correlation is suggestive of, but not proof of, a relationship between 2 numeric variables.
The most common measure of correlation is the Pearson correlation coefficient. This ranges between -1 (perfect negative correlation) and +1 (perfect positive correlation) and measures the linear relationship between 2 variables. The tighter the points sit around the linear trend, the higher the correlation.
Correlations can be used to help with prediction or to reduce the redundancies in data you get when two variables contain roughly the same information.
The Pearson correlation coefficient measures linear correlation  but not all trends are linear.
Visualise the correlations you think you found to see the fuller story.
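A small sketch of why visualising matters, with invented data: Pearson's r scores a perfectly linear relationship as 1 but a strong, symmetric U-shaped relationship as 0.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]

linear = [2 * x + 1 for x in xs]  # perfectly linear relationship
quadratic = [x ** 2 for x in xs]  # strong but non-linear relationship

r_linear = pearson_r(xs, linear)        # 1.0: perfect positive correlation
r_quadratic = pearson_r(xs, quadratic)  # 0.0: Pearson misses the U-shape
```

A scatter plot of the quadratic data would reveal the trend instantly, even though the correlation coefficient suggests no relationship at all.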
Correlation does not imply causation  but it certainly doesn't rule it out.
Ongoing EDA will enable:
 Development of a clear path to solve the problem.
 Redefining the original problem given any constraints found in the data.
 Identifying of new problems the data can help with.
 The cancellation of any project if the EDA shows it's a waste of time and money to proceed with.
Chapter 6: Examine the probabilities
Probability lets us quantify the likelihood that an event will occur.
It's measured by a number between 0 and 1.
 0 means impossible.
 1 means certain.
It's often expressed as a fraction or a percentage.
It's abbreviated to P. You can write the probability of flipping heads on a coin as:
P(C==H) = 1/2 or P(H) = 1/2
Notation like P(D < 7) = 1 expresses a cumulative probability, the sum of the probabilities of a range of outcomes. Here, for example, the probability of rolling less than 7 on a die is 1.
If the probability of an event depends on some other event, it's called a conditional probability. In the notation below, read the | as "given that".
e.g. the probability that person A is late to work given that they got a flat tire is 100%:
P(A | F) = 100%
If the probability of an event doesn't depend on another event then the events are independent.
The probability of 2 events both happening can be denoted with a comma, e.g. the probability of both flipping a head and drawing a spade:
P(H,S)
If those 2 events are independent then you can multiply the probabilities of each individual outcome by each other to get to the overall probability.
But sometimes events aren't independent. The full formula for calculating the probability of 2 events both happening is the multiplicative rule:
P(A, J) = P(J) × P(A | J)
The chance of 2 events happening together can't be greater than either event happening by itself.
To calculate the probability of 1 event or another event happening: if the events can't both happen at the same time then you can add the probabilities. Otherwise, we can use the additive rule. It's important to remember to subtract the overlap:
P(A or J) = P(A) + P(J) – P(A, J)
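A minimal sketch of the two rules using the coin-and-spade example, assuming a fair coin and a standard 52-card deck (the two events are independent here, so the multiplicative rule reduces to a simple product):

```python
# Independent events: flipping heads and drawing a spade.
p_heads = 1 / 2
p_spade = 13 / 52  # 13 spades in a standard deck

# Multiplicative rule for independent events: P(H, S) = P(H) * P(S).
p_both = p_heads * p_spade

# Additive rule: P(H or S) = P(H) + P(S) - P(H, S).
# Subtracting the overlap avoids double-counting outcomes where both happen.
p_either = p_heads + p_spade - p_both
```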
It's easy for your intuition to mislead you. Some key guidelines to help avoid that:
 Be careful assuming independence. Don't fall for the gambler's fallacy, but also don't assume that events are dependent when they're actually independent.
 Know that all probabilities are conditional. A coin flip being 50/50 is conditional on the coin being fair. The probability of your project being successful is conditional on its difficulty, the data quality, and whether a pandemic shuts down your company.
 Don't swap dependencies and assume that P(A | B) equals P(B | A). Bayes' theorem gives you the correct relationship: P(A | B) × P(B) = P(B | A) × P(A)
 Ensure the probabilities have meaning.
Judging future success based on past success can be gamed by the person in question only selecting the easiest projects that are most likely to succeed.
Calculating the probabilities you need to use Bayes theorem can be challenging. A tree diagram can be a useful approach.
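A sketch of walking the two branches of a tree diagram to apply Bayes' theorem, with invented numbers for a rare-event screening test (the rates below are assumptions for illustration, not real figures):

```python
# Hypothetical screening test for a rare event.
p_event = 0.01               # P(A): prior probability of the event
p_pos_given_event = 0.95     # P(B | A): test flags the event when present
p_pos_given_no_event = 0.05  # P(B | not A): false positive rate

# Total probability of a positive result: sum over both tree branches.
p_pos = (p_pos_given_event * p_event
         + p_pos_given_no_event * (1 - p_event))

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B).
p_event_given_pos = p_pos_given_event * p_event / p_pos
```

Despite the test being right 95% of the time, a positive result here only means about a 16% chance the event is real, because the event itself is so rare.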
Defined probabilities should mean something real. An event with a probability of 75% should happen around 75% of the time, and more often than an event with a probability of 60%. This is the concept of calibration:
Calibration measures whether, over the long run, events occur about as often as you say they're going to occur
Remember that rare events are not impossible and highly probable events don't always happen. You aren't likely to win the lottery, but most weeks someone does.
Don't keep multiplying the probabilities of past events more than is reasonable, otherwise everything will seem highly improbable.
Chapter 7: Challenge the Statistics
Statistical inference lets us make informed guesses about the world based on a sample of data from the same world.
If you run the same political poll several times you will usually get slightly different answers. Reporting in the context of a "margin of error" helps quantify that uncertainty, which is caused by variation and chance.
The exact value discovered in your sample (e.g. 65% of people say they'd vote for X) is the point estimate. The interval around it based on the margin of error is called the confidence interval, e.g. (62%, 68%). The hope is that the confidence interval contains the true population value.
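A minimal sketch of a 95% confidence interval around a polled proportion, assuming the common normal approximation (the sample size and point estimate are invented to match the example above):

```python
import math

n = 1000      # sample size (hypothetical poll)
p_hat = 0.65  # point estimate: 65% say they'd vote for X

# Normal-approximation margin of error at 95% confidence (z ~ 1.96).
z = 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)

ci_low, ci_high = p_hat - margin, p_hat + margin  # roughly (62%, 68%)
```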
Sampling causes variation which causes uncertainty.
The sample size is referred to as 'N'. Bigger samples provide more evidence.
The question you want to ask of the data should be turned into a hypothesis test.
Define a null hypothesis, H0, usually representing the status quo: "my intervention has no effect". The alternative hypothesis, HA, represents the effect you're looking for: "my intervention changes things".
You start from the assumption that H0 is true, and only reject the null if there's enough evidence against H0.
The "significance level" is the threshold where you feel the data is no longer consistent with H0. It's a threshold you decide on, which tolerates randomness and variation, but set such that until it's met you can still believe H0 is true.
Confidence level = 1 - significance level.
If the p-value is less than the significance level then reject the null: the result is "statistically significant".
Because of variation you can still make two types of unavoidable "decision errors":
 Type 1, false positive: when evidence seems to confirm HA but in reality HA is wrong.
 Type 2, false negative: when evidence leads you to accept H0, but in reality HA is right.
"Power" is the probability of correctly rejecting the null hypothesis when HA is true.
You choose the probability of getting false positives or false negatives by setting the significance level and power of your test. It's a tradeoff; reducing the probability of one increases the probability of the other.
Statistical inference steps:
 Ask a meaningful question.
 Formulate a hypothesis test, setting the status quo as the null hypothesis, and what you hope to be true as the alternative hypothesis.
 Establish a significance level. (5% or 0.05 is an arbitrary but often-used number.)
 Calculate a p-value based on a statistical test.
 Calculate relevant confidence intervals.
 Reject the null hypothesis and accept the alternative hypothesis if the p-value is less than the significance level; otherwise, fail to reject the null.
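The steps above can be sketched with a one-proportion z-test, using the normal approximation and invented data (60 heads in 100 flips, testing whether a coin is fair):

```python
import math

# H0: the coin is fair (p = 0.5). HA: it isn't.
n, heads = 100, 60
p0 = 0.5
p_hat = heads / n

# Test statistic under H0.
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Two-sided p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - normal_cdf(abs(z)))

alpha = 0.05  # significance level chosen in advance
reject_null = p_value < alpha
```

Here z = 2.0 and the p-value is about 0.046, so at a 5% significance level we reject the null and conclude the coin is probably biased.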
The questions you should ask to challenge the statistics:
 What is the context for these statistics? "Sales are up 10%": compared to what?
 What is the sample size? Small Ns indicate a lot of variation. Big Ns can still be prone to data quality and bias issues.
 What are you testing?
 What is the null hypothesis? Failing to reject H0 doesn't mean you proved it's true.
 What is the significance level? Convention in many industries is 5% but others go lower. A 5% significance level means you tolerate false positives 1 time in every 20. Decreasing your significance level decreases false positives but increases false negatives.
 How many tests are you doing? If you conduct 100 different tests with a 5% significance level on an intervention that has no effect at all, then around 5 will show statistically significant effects if you do not perform relevant adjustments.
 Can I see the confidence intervals?
 Is this practically significant? Confidence intervals provide estimates of effect sizes. Trivially small effects can be found with large sample sizes; such differences may have no practical value.
 Are you assuming causality?
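The multiple-testing question above can be illustrated with a small simulation, assuming each test is valid and the intervention truly has no effect, so every p-value is uniform on [0, 1]:

```python
import random

random.seed(0)  # reproducible illustration

alpha = 0.05  # 5% significance level
n_tests = 100

def run_batch():
    # Under a true null, each test is "significant" with probability alpha.
    return sum(random.random() < alpha for _ in range(n_tests))

one_batch = run_batch()  # roughly 5 spurious "discoveries" out of 100

# Averaging over many batches shows the long-run rate settles near 5.
avg_false_positives = sum(run_batch() for _ in range(2000)) / 2000
```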
Part 3: Understanding the Data Scientist's Toolbox
Chapter 8: Search for Hidden Groups
The unsupervised learning toolkit gives us a collection of tools that can be used to discover hidden or unknown patterns and groups within data. Applications include segmenting customers for marketing, organising music or photos.
There are many techniques in this category. Here we look at dimensionality reduction via principal component analysis (PCA) and clustering with k-means clustering.
Dimensionality Reduction
The dimension of a dataset is how many columns or features it has. Dimensionality reduction seeks to reduce many columns into a lower number that keeps as much information as possible about the data. We're looking for hidden groups in the columns of a dataset that mean we can combine several columns into one.
This is useful because datasets with many columns can be hard to understand or visualise, slow to work with and tedious or even impossible to explore.
If we know what combinations would make sense we can create a composite feature. For instance, in a dataset about cars you might create a column that replaces the need for 3 others in the following way:
Efficiency = MPG - (Weight + Horsepower)
This combination gave the authors a good spread across their sample data, retaining lots of information and letting them separate heavy, gas-guzzling cars from light, fuel-efficient ones.
If you don't know which features to combine then you can use principal component analysis.
The PCA algorithm considers all possibilities of combining columns, looking for which linear combinations spread the data out the most, retaining as much information from the original data as possible. These are called "principal components".
Each principal component is calculated such that it doesn't correlate with other ones and hence provides new, nonoverlapping information. In a cars dataset it might for instance discover an efficiency dimension and a performance dimension.
The PCA output shows the weight of each feature that goes into each component. The weights are measures of correlation ranging between -1 and +1, with extreme values showing the strongest correlations. Many of the original features might be correlated because they really measure the same thing. You look for patterns in the weights of the principal components in order to come to a conclusion.
Sometimes you can give the revealed composite features a descriptive name, other times not.
If someone presents PCA to you:
 ask to see the equations behind their grouping.
 ask how they decided how many components to keep.
PCA's implicit assumption is that high variance is a sign of something important within the variables. This is not always true  sometimes a feature can have high variance but little practical importance.
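A minimal PCA sketch on invented two-feature car data, using the closed-form eigendecomposition of a 2x2 covariance matrix rather than a library implementation. Because the two features are strongly correlated, one component captures nearly all the variance:

```python
import math
import statistics

# Made-up car data: weight and horsepower move together, so one
# "size/power" component should capture most of the variance.
weight = [1.0, 2.0, 3.0, 4.0, 5.0]
horsepower = [1.1, 1.9, 3.2, 3.9, 5.1]

def center(xs):
    m = statistics.mean(xs)
    return [x - m for x in xs]

w, h = center(weight), center(horsepower)
n = len(w)

# Sample covariance matrix [[cww, cwh], [cwh, chh]].
cww = sum(a * a for a in w) / (n - 1)
chh = sum(b * b for b in h) / (n - 1)
cwh = sum(a * b for a, b in zip(w, h)) / (n - 1)

# Eigenvalues of a 2x2 symmetric matrix in closed form; the larger one
# is the variance captured by the first principal component.
mean_diag = (cww + chh) / 2
delta = math.sqrt(((cww - chh) / 2) ** 2 + cwh ** 2)
lam1, lam2 = mean_diag + delta, mean_diag - delta

explained = lam1 / (lam1 + lam2)  # share of total variance in PC1
```

With this data PC1 explains over 99% of the variance, so the two columns could be replaced by a single composite feature with little information loss.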
Clustering
Whilst PCA groups columns together, clustering groups rows of a dataset together.
Issues:
 How many clusters should there be?
 How do we consider if 2 observations are close to each other?
 How best to group the observations together?
With k-means clustering, you tell it how many clusters you want (k) and it groups your N rows of data into that number of clusters.
Method:
 To start, the algorithm selects k random locations as candidates for the centre of the clusters.
 Each datapoint is assigned a cluster based on which of the locations it's nearest to.
 All points that are in a given cluster are averaged together to create a new centre point, the centroid.
 Each datapoint is now re-evaluated to see which centroid it's closest to.
 This sequence repeats until the points don't switch clusters any more.
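The steps above can be sketched as a tiny one-dimensional k-means with k = 2. The points and starting centres here are invented; real implementations choose starting centres randomly, which is one reason results can vary run to run:

```python
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centres = [0.0, 5.0]  # initial candidate centres

for _ in range(10):  # iterate until assignments stabilise
    # Assignment step: each point joins its nearest centre's cluster.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centres[i]))
        clusters[nearest].append(p)
    # Update step: each centre moves to the mean (centroid) of its cluster.
    new_centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    if new_centres == centres:
        break  # no point switched clusters: done
    centres = new_centres
```

On this data the algorithm converges after one update, with centroids at 1.5 and 10.0.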
There are several different measures of distance between datapoints that can be used. Ask which formula was used to measure distance and why.
Sometimes data needs to be scaled. If some features are on a scale that's much larger than others then they might dominate the results too much.
Hierarchical clustering is an alternative method that doesn't require you to decide how many clusters there are in advance. Instead you build up groups from the bottom up to form a hierarchy, stopping when you reach the level you desire.
In general, which datapoint ends up in which cluster when clustering depends on:
 which algorithm you use.
 how it's implemented.
 the underlying data quality.
 how much variation there is in the data.
Chapter 9: Understand the Regression Model
In circumstances where you have training data that includes the "correct answers" to learn from you can use supervised learning to find relationships between inputs and known outputs.
A good model will let you make accurate predictions and understand something about the underlying relationship between inputs and outputs.
"Training data" is fed into an algorithm that creates a model.
Regression models output a number. Classification models output a label or category.
Regression models are rooted in an old method called linear regression, specifically least squares regression.
Linear regression computes the line of best fit within data  the line that explains as much of the linear trend and scatter of the data as possible.
You end up with an equation in the form of:
lemonade_sales = (1.03 * temperature) - 71.07
The difference between what a model predicts and what actually happens is called its error.
If you sum up all model errors from the training data then they'll cancel each other out and total zero. As such we square every error and sum up those squares, and adjust the slope and intercept of the line of best fit, looking for the model that produces the smallest sum of squared errors (SSE).
You can use the sum of squares to assess how well the model fits the data.
If predicting the average value for every datapoint gives a sum of squared errors of 34.86, and your final model has an SSE of 7.4, then that's a (34.86 - 7.4) = 27.46 reduction. In percentage terms that's a 27.46/34.86 = 78.8% reduction. This number is the R-squared or R² of the model. In this case you can say that the model has explained, described or predicted 78.8% of the variation in the data.
In real life, expect low R² values; be suspicious if you see high ones.
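A minimal least squares sketch with invented lemonade-stand data, computing the slope and intercept in closed form and R² as the reduction in SSE relative to always predicting the mean:

```python
import statistics

# Made-up data: temperature (input) vs lemonade sales (output).
temps = [60, 65, 70, 75, 80, 85, 90]
sales = [10, 14, 20, 22, 30, 33, 41]

mt, ms = statistics.mean(temps), statistics.mean(sales)

# Closed-form least squares for one input: slope = cov(x, y) / var(x).
slope = (sum((t - mt) * (s - ms) for t, s in zip(temps, sales))
         / sum((t - mt) ** 2 for t in temps))
intercept = ms - slope * mt

predicted = [slope * t + intercept for t in temps]

# R^2: the fraction of the baseline SSE (predicting the mean for every
# point) that the fitted line removes.
sse_baseline = sum((s - ms) ** 2 for s in sales)
sse_model = sum((s - p) ** 2 for s, p in zip(sales, predicted))
r_squared = (sse_baseline - sse_model) / sse_baseline
```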
Linear regression models are popular partially due to how easy they are to interpret. If the slope coefficient for a variable is 1.03 then that means for every 1 increase in that variable the output prediction goes up by 1.03.
If you took a different sample from the same population you'd likely get slightly different coefficients; there is some natural variation. So we test each coefficient against the null hypothesis that it is equal to zero. If no significant difference from zero is detected, you can remove that feature from your model.
Regression with one input is called simple linear regression. If it involves several inputs it is multiple linear regression.
Multiple regression lets you isolate the effects of one variable by controlling for the others, so you can say things like "if all other inputs are held constant, a home built one year sooner adds on average $818.38 to the sales price."
Always take account of the units of the variables.
Some pitfalls of linear regression:
 Omitted variables: Models can't learn how inputs and outputs relate for variables that aren't supplied as inputs. Subject matter expertise is critical in terms of selecting the right data to include. "Time" is a common omitted variable.
 Inappropriate causality: Labelling a variable as an input to a model and another as an output doesn't mean the former causes the latter.
 Multicollinearity: If your goal is interpretability then you want to avoid this. You can only isolate the effect of one input's contribution while holding every other input constant if the underlying data itself is uncorrelated. Multicollinearity is present in most observational datasets. Experimental data is usually designed to prevent it.
 Data leakage: You must not use what is really an output variable as an input variable. Don't use data that is only accessible after whatever you're trying to predict has happened.
 Extrapolation failures: This happens when you try to predict something that's beyond the range of the data you used to build the model. The model will give you an answer for any set of numbers, but it may not be useful. The data you make predictions on should fit within the range of the training data and also be from the same context.
 Many relationships aren't linear: The stock market typically grows exponentially, for instance, which basic linear regression won't handle well. There are tools to transform nonlinear data into a linear form. But sometimes linear regression is just not the right tool for the job.
Linear regression models can explain or predict. If the goal is explanation then be very wary of multicollinearity and omitted variables. If the goal is accurate prediction then those issues might be less important  the dominant concern here should be to avoid overfitting.
An overfit model captures the noise and variation within the specific sample it was trained on rather than the general relationship that underlies it. They do not generalise well to new observations. To avoid this, split the data into a training set that's used to build the model and a test set that's used to validate performance.
The best way to judge how well a model fits the data is to look at an "actual vs predicted" plot.
LASSO and ridge regression are variations of linear regression that might help when there is multicollinearity or you have more input variables than observations. The k-nearest neighbours technique can also be applied to regression problems.
Chapter 10: Understand the Classification Model
We use classification models where the goal is to predict a categorical variable or label.
Predicting which out of 2 outcomes happens is called binary classification. If there are >2 outcomes this is multiclass classification.
Generally, outcomes described as positive vs negative should correspond to "does happen" vs "does not happen".
Logistic Regression
It's often useful to predict the probability of something happening (e.g. an applicant being offered an interview based on their GPA). This means you have to constrain the output of a linear equation (y = mx + b) to lie in the range of 0 to 1. Logistic regression does this in order to give you the predicted probability of something belonging to the positive class.
Logistic loss is minimised, such that the predicted probabilities are close to the actual labels.
Like linear regression, logistic regression gives us a way to explain and a way to predict.
As it provides a probability then if you want to make a decision based on the output of logistic regression you will have to set a cutoff (aka a decision rule). For example: if the predicted probability is >50% then we predict that the event will happen. This should be done with the help of domain experts and will be influenced by the nature of the decision being made.
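A sketch of how logistic regression's output and a cutoff combine into a decision. The coefficients b0 and b1 here are invented for illustration, not fitted to real data:

```python
import math

# Hypothetical fitted coefficients for predicting an interview offer
# from GPA: logit = b0 + b1 * gpa.
b0, b1 = -10.0, 3.0

def predicted_probability(gpa):
    """Squash the linear equation through the logistic (sigmoid) function
    so the output always lands between 0 and 1."""
    z = b0 + b1 * gpa
    return 1 / (1 + math.exp(-z))

# The decision rule turns a probability into a yes/no prediction.
cutoff = 0.5

def predict(gpa):
    return predicted_probability(gpa) >= cutoff

p_low = predicted_probability(2.0)   # low GPA: probability near 0
p_high = predicted_probability(4.0)  # high GPA: probability near 1
```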
Watch out again for omitted variables, multicollinearity and extrapolation.
Decision trees
Decision trees are an easily digestible alternative to logistic regression that doesn't rely on the y = mx + b model. These trees end up giving you a list of rules to guide your predictions, in a similar form to a flowchart.
The decision tree algorithm searches for the input feature and value that best separates out observations based on the outcome you're interested in. It repeats this to provide ever more granular splits until you have enough.
CART is an example of a decision tree algorithm.
They're an easy way to display exploratory data and check that your inputs have a relationship with the output.
However decision trees are prone to overfitting, even when techniques such as pruning are used. Instead we can use multiple trees together.
Ensemble methods
Ensemble methods represent the aggregation of many different results obtained by running an algorithm several times.
Data scientists currently favour random forests and gradient boosted trees.
Random Forests
The algorithm takes a random sample of your data and builds a decision tree. The process is repeated hundreds or thousands of times. The resulting "forest" produces a prediction based on the consensus of running all the trees. Whichever outcome is the one most trees point to is the one the forest shows.
Random forests randomly select both which observations (rows) and which features (columns) to build a tree with.
Gradient Boosted Trees
Gradient boosted trees build trees sequentially.
The first tree is a shallow tree with few branches and nodes, and is thus quite weak at prediction. The next step sees a new tree being built on the errors of the first tree, boosting the observations that had large errors. This is repeated potentially thousands of times.
Ensemble models require a large number of observations, at least hundreds.
They are hard to interpret, essentially being black boxes at a certain scale.
Pitfalls
 Misapplication of the problem: e.g. if you're trying to predict a categorical variable then don't use linear regression.
 Data leakage
 Not splitting your data: into a training and test set, risking overfitting and poor future predictions.
 Choosing the right threshold: Most classification models output a probability of belonging to a given class. The cutoff probability should be a human decision, not always defaulting to 50%. It's heavily reliant on what problem you're trying to solve, including which direction of errors you'd be more comfortable with.
 Misunderstanding accuracy
You need to know how to judge any model.
Typically you test it against a control model which is simply a model that always predicts the result that is most common within your dataset.
Accuracy, as defined by the % of correct predictions, is often a poor indicator of model performance, especially for rare events. Often you care more about performance on predicting true positives and true negatives. You can use a confusion matrix, which helps you visualise the results of both the model and the decision threshold you chose.
Measures it shows include:
 Accuracy: the % of predictions that were correct
 True positive rate (aka sensitivity, recall): of the observations actually in the positive class, the % that the model correctly predicted as positive.
 True negative rate (aka specificity): of the observations actually in the negative class, the % that the model correctly predicted as negative.
The higher the result the better for all the above.
Increasing the cutoff will lower the true positive rate and increase the true negative rate.
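A small sketch of that trade-off, with made-up probabilities and labels. Raising the cutoff from 0.5 to 0.85 happens to leave accuracy unchanged here, but trades true positive rate for true negative rate.

```python
# Turn predicted probabilities into class labels at a chosen cutoff,
# then compute the confusion-matrix measures described above.

def confusion_counts(probs, actuals, cutoff):
    tp = fp = tn = fn = 0
    for p, actual in zip(probs, actuals):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

probs   = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]   # model outputs (invented)
actuals = [1,   1,   0,   1,   0,   0]     # true classes (invented)

for cutoff in (0.5, 0.85):
    tp, fp, tn, fn = confusion_counts(probs, actuals, cutoff)
    accuracy = (tp + tn) / len(probs)
    tpr = tp / (tp + fn)   # sensitivity / recall
    tnr = tn / (tn + fp)   # specificity
    print(cutoff, round(accuracy, 2), round(tpr, 2), round(tnr, 2))
# 0.5  0.67 0.67 0.67
# 0.85 0.67 0.33 1.0
```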
Chapter 11  Understand Text Analytics
Most data you interact with each day is unstructured text  found in emails, news articles, product reviews etc.
For computers to process unstructured data it must first be converted into numbers and more structured datasets. This process can be subjective and time-consuming. Three ways to do that are outlined below.
A Big Bag of Words
The individual words included in sentences of text are extracted and jumbled together in a "bag". The set of words for a given instance of an entity (e.g. a sentence) is called a document. Each word is an identifier and the count of times each word is used is a feature.
Each identifier is called a token. The set of all tokens from all documents is a "dictionary".
A document-term matrix (DTM) is a table which represents one document per row and one term per column, with the intersection being the count of usages. From this it's easy to calculate summary statistics such as which word is most popular or which documents have the most words.
Word clouds are good for marketing but hard to interpret  often it's better to represent word frequency usage in a bar chart.
DTM tables tend to be very sparse because most sentences do not contain most words.
To help alleviate that it's common to:
 remove "stop words": are common filler words like the, of, a is etc.
 remove punctuation.
 remove numbers.
 transform everything to lower case.
 stem words: cutting off their endings, e.g. reading, reads, read all stem to "read".
 conduct lemmatization: advanced form of stemming that maps words e.g. good, better, best to a root word "good".
But this process of treating words in isolation filters out emotion, context and word order.
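The pipeline above (lowercase, strip punctuation, drop stop words, build the DTM) can be sketched as follows. The stop-word list and documents are illustrative only.

```python
# Minimal bag-of-words sketch: clean each document, then build a
# document-term matrix (DTM) of word counts.
import string

STOP_WORDS = {"the", "of", "a", "is", "and"}

def tokenize(document):
    """Lowercase, strip punctuation, split, and drop stop words."""
    cleaned = document.lower().translate(
        str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

documents = [
    "The beef is delicious.",
    "The salad is fresh, and the beef is delicious!",
]

tokens_per_doc = [tokenize(d) for d in documents]
dictionary = sorted(set(t for tokens in tokens_per_doc for t in tokens))
dtm = [[tokens.count(term) for term in dictionary]
       for tokens in tokens_per_doc]

print(dictionary)   # ['beef', 'delicious', 'fresh', 'salad']
print(dtm)          # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

Note how the row for each document is mostly zeros once the dictionary grows: this is the sparsity mentioned above.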
N-Grams
An N-gram is a sequence of N consecutive words. It extends bag-of-words so as to distinguish phrases that use the same words in a different order. The DTM becomes even larger and sparser.
Whether you should remove stop words from the N-grams is debated.
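A minimal sketch of N-gram extraction (here bigrams), showing how two invented sentences with identical bags of words produce different N-grams:

```python
# Extract N-grams: all sequences of N consecutive words.
def ngrams(words, n):
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

a = "the dog bit the man".split()
b = "the man bit the dog".split()

assert sorted(a) == sorted(b)  # identical bags of words...
print(ngrams(a, 2))  # ['the dog', 'dog bit', 'bit the', 'the man']
print(ngrams(b, 2))  # ['the man', 'man bit', 'bit the', 'the dog']
```

The bigrams differ even though the bags of words are identical, which is exactly the word-order information bag-of-words throws away.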
Word Embeddings
 Bag of words and N-grams let you determine whether documents are similar (e.g. if they contain similar sets of words).
 Word embeddings let you establish which words in a dictionary are related to each other.
e.g. If "beef" and "pork" often appear alongside the word "delicious" then the math represents them as being similar as an element of a vector. The vector might then represent something like "food".
If the dictionary consists of {beef, cow, delicious, farm, feed, pig, pork, salad} Then cow is represented like: (0, 1, 0, 0, 0, 0, 0, 0)
A supervised learning algorithm then takes that input and maps it to an associated output vector (one entry per dictionary word) containing the probability that each other word in the dictionary was found near it.
The output might be (0.3, 0, 0, 0.5, 0.1, 0.1, 0, 0), to show cow was paired with beef 30% of the time etc.
The resulting table of numbers shows how each word in the dictionary relates to every other word  a numeric representation of the "meaning" of the word.
Word2vec can be used to do this.
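The one-hot input and co-occurrence-style output described above can be sketched directly. The co-occurrence counts here are invented; real systems such as word2vec learn these representations with a neural network rather than by counting.

```python
# One-hot encoding and a co-occurrence-based vector for "cow",
# mirroring the dictionary example above. Counts are made up.

dictionary = ["beef", "cow", "delicious", "farm", "feed", "pig", "pork", "salad"]

def one_hot(word):
    return [1 if w == word else 0 for w in dictionary]

print(one_hot("cow"))   # [0, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical counts of how often each word appeared near "cow"
# in some training corpus.
near_cow = {"beef": 3, "farm": 5, "feed": 1, "pig": 1}
total = sum(near_cow.values())
embedding_like = [near_cow.get(w, 0) / total for w in dictionary]
print(embedding_like)   # [0.3, 0.0, 0.0, 0.5, 0.1, 0.1, 0.0, 0.0]
```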
Topic modeling
Once text has been turned into data as above we can use variations on standard analysis methods to process it.
Topic modelling is an unsupervised learning algorithm that groups similar observations together to provide probabilities that each document relates to each cluster  how one document spans several topics. It works best when your documents have several disparate topics.
Text classification
Usually we want to predict a categorical variable such as "Is this email spam?".
A common algorithm for this is Naïve Bayes. It calculates the probability that an email is spam based on the words in its subject line.
The training dataset lets us calculate p(words in subject | spam), so we use Bayes' rule to calculate p(spam | words in subject).
A downside of this algorithm for this use-case is that it incorrectly assumes the words are independent of one another.
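A toy sketch of the calculation (the word probabilities and priors are invented; a real spam filter would estimate them from a large labelled corpus). The "naive" part is multiplying per-word probabilities as if the words were independent:

```python
# Toy Naive Bayes for the spam example. Class priors and per-word
# probabilities are made-up training-set estimates.

p_spam = 0.4
p_ham = 0.6   # "ham" = not spam
p_word_given_spam = {"free": 0.5, "meeting": 0.1}
p_word_given_ham  = {"free": 0.1, "meeting": 0.5}

def p_spam_given_words(words):
    # Bayes' rule with the naive independence assumption:
    # p(spam | words) is proportional to p(spam) * product of p(word | spam)
    spam_score = p_spam
    ham_score = p_ham
    for w in words:
        spam_score *= p_word_given_spam[w]
        ham_score *= p_word_given_ham[w]
    return spam_score / (spam_score + ham_score)

print(round(p_spam_given_words(["free"]), 3))      # 0.769
print(round(p_spam_given_words(["meeting"]), 3))   # 0.118
```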
Sentiment analysis
A similar idea only this time we want to classify whether words in e.g. a product review are positive or negative.
Be careful not to extrapolate beyond the context of the training data. Training on e.g. product reviews will not usually make for a good model in other domains.
Tree-based methods can also be used for text classification and may outperform Naive Bayes  but are harder to interpret.
Considerations when working with text
 Read the data. If topic modelling classifies a sentence as belonging to a topic check that it makes sense.
 Ask to see examples where the algorithm misclassified something.
When companies analyse their text data they're often disappointed with the results. The reason big tech companies like Google can do this so well is because they have huge amounts of labeled text and voice data, powerful computers, world-class research teams and lots of money. They've made good progress in:
 Speech to text
 Text to speech
 Text to text (e.g. translating between languages)
 Chatbots
 Generating human-readable text (generative AI)
Your company's data is likely to be smaller and may contain language that's unique to your company.
Chapter 12  Conceptualize Deep Learning
Deep learning helps drive decisions that were once considered to be in the domain of humans, such as facial recognition, autonomous driving, cancer detection and language translation.
It uses a set of models known as artificial neural networks. These algorithms are designed to mimic the way a brain works, but they have many limitations and differences from real brains. In reality they're just math equations. Their success comes from the advent of faster computers, more data and research in fields such as machine learning, statistics and mathematics.
In artificial neural networks, values flow into a computational unit called a "neuron". An activation function converts those inputs into a single numerical output. The system is trained on labelled data. The goal of the network is to find the values for the weight and constant parameters (often represented by w and b) that make the predicted outputs from the network as close as possible, in aggregate, to the actual output values.
The system starts off assigning random values to parameters which makes for terrible predictions. An algorithm called backpropagation changes the values of the parameters based on how close the predictions were to the real answers. This repeats until over time the parameters converge towards their theoretical optimum in terms of producing correct predictions.
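A minimal single-neuron sketch, assuming a sigmoid activation and a hand-rolled gradient update as a stand-in for full backpropagation. The data, learning rate and iteration count are invented for illustration.

```python
# One "neuron": weighted sum w*x + b squashed by a sigmoid, with the
# parameters nudged toward better predictions on labelled data.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(x, w, b):
    return sigmoid(w * x + b)

# Toy labelled data: x values with 0/1 labels.
data = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]

w, b = 0.1, 0.0   # start with arbitrary (near-random) parameters
lr = 0.5
for _ in range(2000):
    for x, y in data:
        p = predict(x, w, b)
        error = p - y
        # Gradient of the cross-entropy loss for a sigmoid output:
        w -= lr * error * x
        b -= lr * error

print(predict(0.5, w, b))  # close to 0
print(predict(3.5, w, b))  # close to 1
```

This single neuron is essentially a logistic regression model, which is why the book describes a network as a series of logistic regressions, one per neuron.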
The main benefit of neural networks comes when you add "hidden layers" to the networks. The neurons in those layers will learn new and different representations of the input data that aid prediction, determining the combination of fields that have the most effect on the output correctness (conceptually similar to PCA). You can think of the network as a series of logistic regression models, one in each neuron.
The hidden layer neurons produce their own outputs which are fed into the next layer of neurons until a final prediction is produced. In a simple network, these interim features might represent understandable concepts like "achievement" or "experience" (if predicting job application success) but usually they're not easily interpretable which makes for a black box model.
The result is a huge equation with many parameters that allow the model to identify complex representations and make nuanced predictions.
This technique can reduce the need for timeconsuming manual feature engineering (the process of transforming raw data into new features using subject matter expertise).
The performance of large and deep neural networks tends to improve with data size  but only if there is some meaningful signal in the data in the first place.
Computers "see" images by converting pixels of a picture into a numerical value (e.g. 0 representing white and 255 black for a monochrome image). The model can be trained to predict e.g. the number that an image of handwriting represents with the above technique.
Colour images are represented by 3D matrices for pixel values corresponding to red, blue and green.
This process is very computationally intensive. Convolutional neural networks help with analysing large or colour images by mathematically doing calculations on localised sets of pixels and pooling the results together, trying to filter out information that isn't relevant. This reduces the number of values that have to go into the neural network and also allows for the ability to search for similar features across images.
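The two operations named here can be sketched on a tiny grayscale "image" (the pixel values and kernel are invented): a 2x2 filter slides over localised sets of pixels, then max pooling shrinks the resulting feature map.

```python
# Minimal convolution and max-pooling sketch on a 4x4 "image".

def convolve(image, kernel):
    """Slide the kernel over the image, summing elementwise products."""
    k = len(kernel)
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(len(image[0]) - k + 1)]
            for i in range(len(image) - k + 1)]

def max_pool(image, size=2):
    """Keep only the largest value in each non-overlapping window."""
    return [[max(image[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(image[0]) - size + 1, size)]
            for i in range(0, len(image) - size + 1, size)]

image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
# A vertical-edge detector: responds where values jump left-to-right.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve(image, kernel)
print(feature_map[0])           # [0, 510, 0]  (edge in the middle)
print(max_pool(feature_map))    # [[510]]
```

The pooled output is far smaller than the original image yet still records that a strong vertical edge was found, which is the information-reduction idea described above.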
Recurrent neural networks have powered advancements in processing language and other sequences.
To predict the next word, such a system can be trained on millions of input-output pairs of overlapping sequences of words. The system can "remember" the earlier words in a sentence and hence can make useful predictions that take word order into account.
In practice most companies may not have enough labelled training data to make for a good model. If this is the case, it might be possible to use "transfer learning" where you take a model that has already been trained for a certain task, remove the final few layers of it and replace them with new layers based on the data you want to train on.
Decisions when setting up a deep neural network include:
 How many layers the network should have.
 How many neurons per layer.
 Which activation functions to use.
There are 2 types of artificial intelligence:
 Artificial General Intelligence: representing complete human cognition. Not much progress has been made on this.
 Artificial Narrow Intelligence: systems that can do one thing well, e.g. facial recognition. Great strides have been made here based on machine learning.
Deep learning is a subset of machine learning, which is a subset of AI.
AI reinforces patterns from data collected in the past; it's not creating consciousness.
This can cause issues when we incorrectly believe that data represents perfect truth and confuse ourselves that algorithms replicate our own decisionmaking abilities.
Generative Adversarial Networks (GANs) can be used to create deep fakes such as images showing someone doing something they never did.
It's hard to explain an equation with millions of parameters, even when they're used for lifechanging realworld decisions e.g. for sentencing criminals.
Data often comes from people, including aspects of their identity. We should not simply assume that society has approved our use of any available data.
That we can collect certain features and run algorithms doesn't always mean we should.
Ask "who does this result affect?".
Part 4  Ensuring Success
Chapter 13  Watch Out for Pitfalls
Understanding data is in many ways about knowing what mistakes can happen.
Biases and weird phenomena in data
Bias here means:
...the lopsided (and sometimes even inconsistent) favorability given to ideas and concepts by individuals and reinforced in groups.
Survivorship bias: The "logical error of concentrating on people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility".
Regression to the Mean: Extreme values of random events are often followed by less extreme values.
Simpson's Paradox: When a trend or association between variables is reversed after a third variable is incorporated. When this happens, not only might you be tempted to mistake correlation for causation, but your correlation is also wrong. The best way to mitigate this risk is to collect experimental data i.e. randomly split observations into each treatment group.
Confirmation bias: Interpreting data in such a way that your existing beliefs are confirmed, whilst ignoring any conflicting evidence.
Effort bias, or the Sunk Cost Fallacy: Once much time and resources have been invested into a project it can be hard to cancel it even if you come to realise that you don't have the right data, technology or scope for the project to produce something useful.
Algorithmic bias: Especially with regards to decisions made via machine learning, there's a kind of embedded prejudice built into the data, usually reflecting the status quo. It may be hard to detect unless you fundamentally challenge how things currently are. All models involve assumptions. All observational data has bakedin bias.
When models make predictions, they perpetuate and reinforce underlying bias and stereotypes already manifest in the data
The Big List of Pitfalls
This list, and all lists, are not exhaustive.
Statistical and Machine Learning Pitfalls:
 Mistaking correlation for causation.
 p-hacking: the process of testing multiple hypotheses in data until you find a statistically significant p-value.
 Using nonrepresentative samples: election polls from a sample that doesn't represent the voting population will be wrong.
 Data leakage: training models on data that isn't available at prediction time.
 Overfitting: where the model performs well on its training data but badly on new observations.
 Nonrepresentative training data.
Project Pitfalls:
 Not asking a sharp question, solving the wrong problem.
 Not adapting the question once it's already failed.
 A culture of "owning" data as opposed to governing it: making it hard for data workers to acquire it.
 Data that doesn't contain the information needed.
 Not considering cheap and opensource technologies.
 Overly optimistic timelines.
 Inflated expectations of value: don't oversell what value your project can provide.
 Expecting to predict the unpredictable: no amount of data on roulette wheel spins will help you predict the next spin.
 Overkill: if someone already knows the business rules to automate a process, have them write them down. Don't bother creating classification algorithms to identify what is already known.
Chapter 14  Know the People and Personalities
The Seven Scenes of Communication Breakdowns:
 The Postmortem: A senior data scientist is brought in to get a project back on track long after the early warning signs appeared. It's too little, too late for the project.
 Storytime: A smart analyst strips his presentation of technical nuance to satisfy the myth that they must explain stuff to higherups like they're children—and the analyst feels like they're betraying their role as a critical data thinker.
 The Telephone Game: A preliminary statistic, an artifact of code and data science work, is taken out of context and then shared so widely it loses whatever little meaning it originally had.
 Into the Weeds: Results are so technical as to be rendered meaningless. The resulting deliverable is more self-indulgence than a true presentation on what happened.
 The Reality Check: The data worker pursues perfecting an impractical solution and does not consider alternatives until challenged by authority.
 The Takeover: A data scientist attempts to solve major underlying business problems without establishing team trust, rapport, or focusing on quick wins.
 The Blowhard: A data scientist finds fault with virtually all work that isn't his. As a result, he is no longer sought to support projects.
The failures that drive each of those scenes come from showing a lack of empathy and respect for everyone's contributions. In each of the above, one of the following happened:
 The business side failed to appreciate the work or challenges of the data side.
 The data side failed to appreciate the work or challenges of the business side.
 The data side refused to move beyond a technicalonly role.
Much advice about getting businesses ready for data concerns investing in technology and training  but many failures occur due to poor communication.
Some personalities you might encounter include:
 Data enthusiasts: These people are too enthused by the hype. They think data can solve every problem and that seeing results based on data and charts mean that a thorough and scientific analysis must have been carried out. Encourage their love of data, but remind them that data can't do the impossible and that they should have a healthy level of data skepticism.
 Data cynics: These people think their personal experience matters more than any data analysis and do not respect the contributions of data workers. Consider why they're cynical  perhaps it's partly justifiable. Show empathy by listening to their concerns and values. Show that you're including their domain expertise and values into your data solution.
 Data heads: Fundamentally these are data skeptics, although the skepticism is based on employing their data criticalthinking skills rather than just to be annoying. They advocate for data where it's useful, but question what ought to be questioned. Their skepticism comes from having technical knowledge and domain expertise. It's delivered with empathy.
Chapter 15  What's Next?
To be an effective Data Head you must use data to drive change.
Some ideas to consider:
 Create a Data Head working group in your institution.
 Hold regular meetups or lunchandlearns that dive into some of the topics in this book and beyond.
 Commit to share your knowledge and help others.
These days much of your learning must happen off the job  books, online learning, certificates etc. We have moved to a cheaper delivery of training, and the onus of staying informed has shifted to you.
Whilst many of the topics above are relevant to new technology, the fundamental problems they present to businesses aren't new:
 Poor quality data.
 Faulty assumptions.
 Unrealistic expectations.