Today, data availability is no longer the problem. We have all figured out ways to collect data, be it from surveys or from other sources. The challenge is to make sense of the data collected and draw insights from it. Quite a few approaches have emerged over the years to deal with this challenge.
One of the models that remains popular even today is the OSEMN framework. Originally published by Hilary Mason and Chris Wiggins in the post A Taxonomy of Data Science, this framework gives you the sequence of activities that you perform as you go about your job as a Data Scientist. Whether you call yourself one or not is up to you, but the fact that you are reading this post already sets you on the path to becoming one.
Let’s start by unpacking OSEMN – the acronym expands to the following:
Obtain data from multiple sources
Scrub it so you can get meaning out of it
Explore using Statistical methods
Model predictions, behaviors
iNterpret (read: interpret) results and present them
Sounds obvious, you need data to even get started, right? However, unless you have an army of Data Engineers ready to take up the load of scrubbing it later, you may want to be mindful of the sources you are collecting the data from, in addition to ensuring the entry of data is smooth and devoid of errors.
Depending on your objectives, there are many possible sources of data. For many a problem, looking up existing public data could be a great start. It helps in formulating a hypothesis that is sharp and practical to prove. Public data also helps you in coming up with useful data points to collect from your own research.
Once you have zeroed in on the hypothesis and have a fair idea of the data points you want to collect, it’s time to pick the method for data collection. Data can be collected by running user interviews but that could be expensive. Conducting surveys, on the other hand, may be quite cost effective. What you choose comes down to the number of respondents you have, the quality of responses you expect, and the type of audience.
Interviews are a great choice if the number of respondents is small, you have easy access to them, or you think the respondents need to be guided (or probed further based on their responses) as they answer questions.
Surveys, besides being cost effective, are a great option if the respondents are capable of answering the questions on their own and are either too busy or not available for interviews. They are also a great option when the number of respondents is high. It generally helps to make it worth the respondent’s time to fill out the survey. You don’t need to pay them or reward them with gifts every time; quite often the respondent either feels strongly about helping the research or is interested in the outcomes. They may be happy to cooperate and fill out the survey when they know they can get access to the research findings.
Despite your efforts to get a set of data that is ready to analyze, you mostly end up with data that needs cleanup, and a clean data set is a prerequisite for any analysis you set out to perform. Before we discuss the techniques for handling unclean data, let’s look at the problems that typically plague data.
A typical problem with survey data is partly filled responses. Survey creators are often tempted to make all survey questions mandatory. However, they know too well that such tactics backfire and lead to respondents turning away. A good way to prevent this problem to some extent is to do a test run of the survey, see which questions cause the most friction, and work on them.
Change the way they are written (play with the tone and the language), change the question type (from typing the answer to picking from options, or choosing from a scale), or if possible trim the number of questions. Any change that reduces the effort required from the respondent to fill out the survey greatly increases the quality of responses.
Incorrect / Junk values:
Imagine seeing a phone number field carrying a zip code, or the respondent name field carrying their city name. These are not uncommon, and there are also instances where wrong values are entered – a value of 230 for age, or 168 for the respondent’s height in feet. These are the more obvious ones, but there could be incorrect values that are harder to spot – for example, a wrong city, or a wrong product picked.
As always, prevention is better than clean up. While creating the survey, be mindful of possible mistakes, and restrict the type of data that can be entered for a question. When collecting age, make sure only a number can be entered. When collecting a phone number, include the country code. If it is an email address, run a small check for the format. EV Surveys enables you to do all of these; sorry, a selfish plug there.
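If you are building the form yourself, the checks above can be sketched in a few lines of Python. This is only a minimal illustration: the field names (`age`, `phone`, `email`), the age limits, and the two patterns are assumptions made for this example, not rules from any particular survey tool.

```python
import re

# Hypothetical validation rules; adjust the patterns to your own survey.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# Requires a leading country code such as +1 or +91.
PHONE_PATTERN = re.compile(r"^\+\d{1,3}[\s-]?\d{6,12}$")


def validate_response(response):
    """Return a list of problems found in one survey response (a dict)."""
    problems = []

    age = str(response.get("age", "")).strip()
    # Only whole numbers in a plausible range are accepted.
    if not age.isdecimal() or not 0 < int(age) < 120:
        problems.append("age must be a whole number between 1 and 119")

    if not PHONE_PATTERN.match(str(response.get("phone", "")).strip()):
        problems.append("phone must start with a country code")

    if not EMAIL_PATTERN.match(str(response.get("email", "")).strip()):
        problems.append("email does not look like an address")

    return problems


good = {"age": "34", "phone": "+1 5551234567", "email": "sam@example.com"}
bad = {"age": "230", "phone": "560001", "email": "sam.example.com"}

print(validate_response(good))
print(validate_response(bad))
```

The format checks are deliberately loose. They catch a zip code typed into the phone field, but they cannot tell you whether a well-formed email address actually exists.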
Free form versus Quantitative response Imbalance:
When a survey has too many questions that invite a free form response, compared to numeric and choice based responses, further analysis becomes extremely hard. For example, when asking for feedback, it’s better to ask for an NPS (Net Promoter Score) rating and then follow up for more comments, rather than asking for feedback as free form text alone.
Knowing that quantitative analysis needs the response data in categorical, interval, or ratio form helps you frame questions in the right form. You can check the little book of Statistics for more on the types of data.
Now that we have covered how to prevent problems with survey data, let’s get to the fixing part. Despite our best intentions, data does come in unexpected forms and we still have to deal with anomalies before we can analyze it.
Dealing with missing values:
Categorical values like names, cities, etc. are a lot harder to fix than numerical values. In the case of categorical values, it’s better to take a call on retaining the response based on how important the missing value is to the overall analysis. It is fine to drop the responses if they are missing a key piece of information.
Data Imputation – For numerical values you could simply use the mean or median of the values you do have. As an example, if age is missing, compute the mean age and replace the missing values with it. In some cases you could adjust the estimate based on other parameters. For example, use the height to narrow down the missing age: if the height is 5 ft 6 in, the respondent is probably older than 15, so you need not go by the overall mean alone.
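Here is a minimal sketch of mean and median imputation using only the Python standard library. The ages are made up for illustration, and `None` is assumed to mark a missing answer.

```python
from statistics import mean, median

# Made-up ages; None marks a missing answer.
ages = [23, 31, None, 45, 28, None, 39]

known = [a for a in ages if a is not None]
mean_age = mean(known)      # sensitive to extreme values
median_age = median(known)  # more robust when the data has outliers

# Replace every missing value with the chosen estimate.
imputed_with_mean = [a if a is not None else mean_age for a in ages]
imputed_with_median = [a if a is not None else median_age for a in ages]

print(imputed_with_mean)
print(imputed_with_median)
```

If junk values such as an age of 230 are still in the data, the median is usually the safer choice, because a single extreme value can drag the mean a long way.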
Dealing with incorrect / Junk values:
The tricky part is in digging these out from the data haystack. It’s not impossible but certainly tests your patience. Having an eye for detail certainly helps, but you could use a few tactics to help yourself.
- Start by looking for form compliance – junk email addresses, phone numbers, unexpected location information are usual suspects, but look at each field and see if you can come up with an expectation on the form of the value.
- Any knowledge you have about your audience comes in handy. Look for unexpected values in the categorical responses by extracting the unique values from the data. A mere glance at these unique values should give you some hints.
- For the numerical values you could start by looking at the outliers. Check the rest of the response for each outlier before you form an opinion.
- Numerical values can also be plotted against each other – age versus height, age versus income, etc. – to throw more light on possible junk values.
- Be merciless in getting rid of responses with junk values because they can skew your analysis in unwanted ways.
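Two of the tactics above, listing unique values and flagging numerical outliers, can be sketched as follows. The data is made up, and the 1.5 × IQR rule used here is just one common convention for outliers, chosen for this example.

```python
from collections import Counter
from statistics import quantiles

# Made-up responses for illustration.
cities = ["Pune", "Delhi", "pune", "Delhi", "560001", "Mumbai"]
ages = [23, 31, 27, 45, 28, 230, 39, 35]

# 1. Unique values of a categorical field: oddities tend to stand out,
#    such as a zip code or an inconsistent spelling in the city field.
city_counts = Counter(cities)
print(city_counts)

# 2. Numerical outliers using the 1.5 * IQR rule.
q1, _, q3 = quantiles(ages, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in ages if a < low or a > high]
print(outliers)
```

A flagged value is only a suspect. Look at the rest of that response before deciding whether to drop it.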
A common technique used to level the data at hand (pardon the bruteness) is encoding. Consider this example to understand this better.
|Year||Season||Day||Time||# birds spotted|
The last column in the table above is a set of clean numbers. It means you can twist it, turn it, and break it as you like – sum, count, mean, median, minimum, maximum, the list goes on.
The first four columns aren’t as malleable. It’s obvious, right? You can’t get sum, average etc on the Year data, or the Season data. Does it mean we leave them like that? No way! Because they are valuable pieces of data and we have to make good use of them.
When analyzing categorical data, it’s a common practice to encode the categorical data into a number format. This way you can apply statistical analysis on the data and draw insights. Here is how you can encode the data in the table above.
Year – 2020 can be encoded into 0, and 2021 as 1.
Season – Winter as 0, Spring as 1, Fall as 2, and Summer as 3.
Day – the Days of the week can be numbered from 0 to 6. Sun (0), Mon (1), Tue (2), … Sat (6)
Time – Morning can be a 0, and Evening can be 1.
The encoded table looks like this:
|Year||Season||Day||Time||# birds spotted|
Now it is possible to establish a relationship between the Year, Season, Day, and Time and the number of birds spotted. This allows you to predict the number for different combinations of Year, Season, etc.
In summary, scrubbing the data is a painfully hard task and you never quite feel you are done. Choose a point at which you convince yourself to move on – if you don’t end up with enough data to analyze, take a call on going back to data collection, of course with changes to prevent the same problems. Otherwise, move on to analysis once you have done your due diligence on the data.
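The encoding described above can be sketched in Python with plain dictionaries. The mappings follow the scheme given earlier; the two observation rows are made up purely to show the transformation.

```python
# Mappings taken from the encoding scheme described above.
YEAR = {2020: 0, 2021: 1}
SEASON = {"Winter": 0, "Spring": 1, "Fall": 2, "Summer": 3}
DAY = {"Sun": 0, "Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6}
TIME = {"Morning": 0, "Evening": 1}

# Made-up observations: (year, season, day, time, birds spotted).
rows = [
    (2020, "Winter", "Sun", "Morning", 12),
    (2021, "Summer", "Wed", "Evening", 7),
]

# Replace each categorical value with its numeric code.
encoded = [
    (YEAR[y], SEASON[s], DAY[d], TIME[t], birds)
    for y, s, d, t, birds in rows
]
print(encoded)
```

One caveat worth keeping in mind: numbering categories like Season implies an order and a distance between them that may not really exist, so treat the codes as labels unless the order genuinely matters.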
Explore for insights
Time for the most exciting part of the process. We will look at EDA (Exploratory Data Analysis) for this. EDA is not a framework or a set of techniques. It is an approach, one that brings to the fore what you did not expect to see. But to see things this way you need visualization. By plotting your data as bar charts, line charts, scatter plots, etc., you can unearth valuable insights.
EDA and Data Visualization are vast areas that experts have covered with many books. We encourage you to pick up any book on the topic and explore further – what you gain out of that could be the single biggest difference you could make to your business.
Start by picking a tool of your choice – it could be spreadsheet software like Microsoft Excel or Google Sheets, or a sophisticated data visualization tool like Tableau or Power BI.
Carefully look at the distribution of all numerical values – ages, sales volumes, quantities, durations, prices, all fall into this category. A distribution gives you an idea of the following:
- what is the range of the data – from the minimum to the maximum
- how concentrated or spread out the data is – is most of the data close to the mean? How much does it extend away from it?
- is the data skewed to one side? Does it have a longer tail on the left or on the right?
Try answering these with a visual like the one below.
Having answers to these helps you start forming a mental picture of your survey response data – who the typical respondent is, what they buy, what they appreciate, etc. – not as an opinion but fully backed by data.
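If you prefer numbers alongside the chart, the three distribution questions (range, spread, skew) can be answered with a short script. The ages below are made up, and comparing the mean with the median is only a rough hint of skew, not a formal test.

```python
from statistics import mean, median, stdev

# Made-up respondent ages.
ages = [22, 25, 27, 29, 30, 31, 33, 36, 41, 58]

summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": mean(ages),
    "median": median(ages),
    "stdev": stdev(ages),  # sample standard deviation: how spread out the data is
}
print(summary)

# A mean above the median hints at a longer tail on the right.
skew_hint = "right" if summary["mean"] > summary["median"] else "left or symmetric"
print("skew hint:", skew_hint)

# A crude text histogram with 10-year bins.
for start in range(20, 60, 10):
    count = sum(start <= a < start + 10 for a in ages)
    print(f"{start}-{start + 9}: {'#' * count}")
```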
As a next step, try pitting one column of data against another – this is typically done in two ways.
Using two numerical columns – like age versus weight, discount versus sale volumes, marketing spend versus leads, etc. This will help you see if there is a relation between the two and how they are related. For example, you may see that more discounts take sales higher only to a point, but after that it flattens out.
Similarly you could see the relation between a categorical column like occupation and a numerical column like sales. This could help you see which segments of the market are more interested in your offering and tailor your communication towards favorable groups and dig into why others are not as keen.
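Both comparisons can be sketched in plain Python. All numbers and the occupation labels are made up, and the `pearson` helper is written out by hand here just to keep the example self-contained.

```python
from collections import defaultdict
from statistics import mean

# Made-up data: discount (%) and units sold for eight campaigns.
discount = [0, 5, 10, 15, 20, 25, 30, 35]
units = [100, 120, 150, 170, 185, 190, 192, 193]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Numerical vs numerical: how strongly do the two move together?
r = pearson(discount, units)
print(f"correlation: {r:.2f}")

# Categorical vs numerical: average sales per occupation.
sales = [("Engineer", 120), ("Teacher", 80), ("Engineer", 140), ("Teacher", 90)]
by_occupation = defaultdict(list)
for occupation, amount in sales:
    by_occupation[occupation].append(amount)
avg_sales = {occ: mean(vals) for occ, vals in by_occupation.items()}
print(avg_sales)
```

A correlation coefficient only measures how close the relation is to a straight line. A pattern like sales flattening out beyond a certain discount is easier to spot on a scatter plot than in a single number.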
Lastly, you could also see the trend over a period of time. This is particularly helpful because you could mark any triggers for possible change and see how they impact the outcomes. For example, when you plot the daily traffic to a website for a month, you can see if a PR campaign had any impact on it.
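As a rough sketch of the website traffic example, you could compare the average daily visits before and after the trigger. The visit counts and the campaign start day are made up for illustration.

```python
from statistics import mean

# Made-up daily visits for 14 days; the campaign is assumed to start on day 8.
daily_visits = [200, 210, 190, 205, 215, 198, 207,
                260, 275, 290, 280, 300, 310, 295]
campaign_start = 7  # zero-based index of day 8

before = mean(daily_visits[:campaign_start])
after = mean(daily_visits[campaign_start:])
lift = (after - before) / before  # relative change in average traffic

print(f"before: {before:.1f}, after: {after:.1f}, change: {lift:.0%}")
```

A jump after the marker is a hint, not proof. Other things may have changed in the same week, so check for them before crediting the campaign.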
Now, this step mostly refers to Machine Learning based models, but let’s continue to keep things simple. It is time to stick your neck out and come up with both recommendations and predictions. You may use the insights you dug out in the previous steps to do this. What is important is that you have spent enough time putting each hypothesis to the test as part of EDA, and that you pick only the top few recommendations.
Let’s look at a few examples.
Based on all the analysis you could recommend a 20% discount to be run during the winter season.
Another example could be a prediction that the sales for the next month would drop by 10% if the status quo is maintained.
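To keep things simple, a prediction like this can start from nothing more than a straight line fitted through past numbers. The sketch below uses made-up monthly sales and an ordinary least squares fit written by hand. It is one possible starting point, not the only way to model a trend.

```python
from statistics import mean

# Made-up monthly sales for the last six months.
months = [1, 2, 3, 4, 5, 6]
sales = [1000, 980, 950, 940, 900, 880]

# Ordinary least squares fit of a straight line: sales = intercept + slope * month.
mx, my = mean(months), mean(sales)
slope = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / sum(
    (x - mx) ** 2 for x in months
)
intercept = my - slope * mx

# Project the line one month ahead, assuming the trend simply continues.
next_month = 7
forecast = intercept + slope * next_month
change = (forecast - sales[-1]) / sales[-1]

print(f"forecast: {forecast:.0f}, change vs last month: {change:.1%}")
```

The forecast assumes nothing else changes, which is exactly the "status quo" condition in the prediction above.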
These will have to pass the test of time but before that they need to be backed by the right authority. This is what gets us to the next step.
iNterpret Results and Present them
We are talking about a special skill here – Storytelling. This could be the single most important skill of the century. That’s for another discussion.
Right now we are talking about presenting your recommendations and predictions to the decision making body that has the authority to sponsor them. This would require a few key ingredients:
- building a context so the audience is on the same page with you
- a brief walk through of the journey so far to give them confidence in the process followed – include some details on the choice of audience, data collection method, challenges faced, how you dealt with them, assumptions, and how they were validated etc.
- then you could give them a taste of your analysis with some hand picked insights that allow you to build the basis for the recommendations and predictions. Include visuals that aid you in reinforcing the message.
- lastly, share the recommendations and predictions and seek feedback with an open mind
- importantly, deal with questions with data backed responses
If you thought this is a linear approach that lets you get through the whole thing and be done with it, you couldn’t be more wrong.
This is an iterative process as you see below.
Back in 2013, Joe Blitzstein, while answering a question on Quora, shared an approach that he and his colleagues used while teaching at Harvard.
As the picture shows, you iterate between the steps, improving your model with each pass, until you feel no further need to refine it.
That was a long one, but hopefully gave you a perspective. Let us know your comments. Thanks!