End-to-End BI Project: Strategy, Steps, Processes, and Tools [Part-02]
In this small-scale end-to-end Business Intelligence (BI) project, we start with data gathering and data exploration to understand how they help us accept or refute the key metrics.
In this article, I will be covering a step-by-step approach to Exploratory Data Analysis (EDA). I have already mentioned WHY I will be skipping the Discovery phase in my previous article (under Project Deliverables).
In a real-world project, we have to engage with stakeholders first and perform some Project Management related tasks such as Project Initiation, Stakeholder Management, Requirement Collection, Requirements Life Cycle Management, Resource Allocation, Budgeting, and other project management activities (which I have decided not to include in this article series).
I’ll be using the Superstore Sales dataset for this small-scale end-to-end Business Intelligence (BI) project.
Exploratory Data Analysis (EDA)
For senior or master-level data people — data analysts, business analysts, data scientists — the following content may not come as a surprise. Yet, for beginners and people transitioning into data roles, I’m confident that this article will provide a framework that can help you.
“We can think of it as the initial introduction or the first handshake, hello, to your data. This is where we get a chance to understand its structure, its quirks, its nuances, and you really get a chance to understand deeply what type of data you’re going to be working with and understanding what potential that data has to answer all of our questions” (Molly, Instructor, Google Data Analytics Professional Certificate).
OK! Let’s dive in. I’ll walk you through a step-by-step process of performing an effective EDA. The following steps cover the HOWs.
- Understanding the context
- Check for missing data
- Basic descriptions of our data
- Uncover the pattern of our data
- Identify significant correlations
- Spot outliers in the dataset
For this article, we will dig into the first 3 steps.
1. Understanding the Context
Before we start exploring the data, we should try to understand it at a high level. As I have mentioned above, we gain context about the datasets we are dealing with during the Discovery phase, where we communicate with our counterparts to gather as much context as possible.
Once you have your data in the desired working environment, it’s usually a good idea to look at the first 5–10 rows, and then scan the entire dataset for initial findings. Here are some initial findings from the dataset:
- The dataset is about sales transactions, shipments, returns, customer information, and product information.
- It consists of historical information collected on a daily basis. Order Dates and Ship Dates range from ‘2014-01-03’ to ‘2017-12-30’.
- The dataset contains 9,994 observations (rows) and 21 attributes (columns).
- Each row of the dataset is uniquely identified by the ‘Order ID’ and ‘Order Date’ columns.
- There are 793 unique Customer IDs and Customer Names in our data.
- The dataset contains 49 states, 531 cities, and 631 unique postal codes across 4 regions — Central, East, South, and West — of the USA.
- There are 8,396 unique Product IDs and 1,850 unique Product Names in our data.
- These products belong to 17 Sub-Categories, which in turn roll up into three Categories: Furniture, Office Supplies, and Technology.
- Within the observation period (2014–2017), Superstore generated $2.3 M in sales with a gross profit of $286 K. Superstore sold 37,873 units in this period.
These initial findings are data-generated insights. In a real-world project, we have to deal with multiple datasets with different degrees of complexity. Generating initial findings and taking notes play a critical role.
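This article series uses Power BI and KNIME, but if you prefer a scripted way to reproduce these kinds of initial findings, here is a minimal pandas sketch. The file name and column names are assumptions based on the standard Superstore CSV; adjust them to your copy of the data.

```python
import pandas as pd

# Load the Superstore Sales dataset (file name/path is an assumption).
df = pd.read_csv("superstore_sales.csv", parse_dates=["Order Date", "Ship Date"])

# First look: a handful of rows plus the overall shape.
print(df.head(10))
print(df.shape)  # expect roughly (9994, 21)

# Date coverage of the snapshot.
print(df["Order Date"].min(), df["Order Date"].max())

# Cardinality of the key business entities.
for col in ["Customer ID", "Customer Name", "Product ID", "Product Name",
            "State", "City", "Postal Code", "Region", "Category", "Sub-Category"]:
    print(col, df[col].nunique())

# Headline figures for the observation period.
print("Sales:", df["Sales"].sum())
print("Profit:", df["Profit"].sum())
print("Units sold:", df["Quantity"].sum())
```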
On the other hand, as data people, we should always find ways to better understand the context by reaching out to our stakeholders and collecting more information from the business side.
One piece of information I found particularly valuable during this step (Understanding the Context) is access to a journey map (or a process map) from our stakeholders. Otherwise, we can co-develop one with our counterparts. (This might also be my personal preference.)
As you might notice in the customer journey below, a data analyst can follow the stages and observe the business goals and KPIs associated with each one.
“I would like to encourage you to develop a journey map based on your imagination of Superstore and initial findings from the dataset.”
“This activity will also enhance your whiteboarding skill. Whiteboarding is a way to communicate our ideas with stakeholders, and whiteboarding exercises are widely used in technical interviews.”
2. Check for Missing Data
We will now check for missing values in our dataset. Checking your data for missing values is always a good start. For this dataset, there is nothing to fix because there are no missing values. This is a big gap between training datasets and real-world datasets: in the real world, you need to extract data from different sources and do a lot of cleaning. So bear with me and bear with this dataset.
For a real-world project, I suggest analyzing attributes (or features, or columns) one at a time and sorting them according to your specific interest. We also need to understand why the data is missing and what that can mean. On the other hand, this isn’t always simple, and an answer might not exist. That’s why an entire area — imputation, which combines an understanding of context with statistics — is devoted to this problem and offers several solutions. Which approach you choose depends entirely on the type of data.
(I will cover treating missing values after I have explained all the EDA steps mentioned above.)
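If you want a scripted check before looking at the Power BI features described next, here is a minimal pandas sketch that reuses the df loaded earlier. For this particular Superstore snapshot the report should come back empty.

```python
# Count and percentage of missing values per column, sorted by severity.
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)

report = (
    pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
    .query("missing > 0")
    .sort_values("missing_pct", ascending=False)
)
print(report)  # expected to be empty for this dataset
```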
As we will mainly be using Power BI and KNIME for this article series, I will highlight some features that can help us check for missing data. Performing this check in Power BI is a bit easier: Power BI can give you a visual profile of your data in the Power Query Editor (aka Transform Data) BEFORE you load the data.
The first data profiling feature is very simple. Note the green bar under each column name (in the image). If a column contains errors or empty values, a proportional share of this bar is shown in a different color: red for errors, grey for empty, green for OK.
The next thing we can do is go to the View tab (1) on the ribbon and turn on Column Distribution (2) and Column Quality (3), as in the image. Then something wonderful happens — look at the data profiling information shown in (4) below. The data profiling tools look at the first 1,000 rows of the loaded preview data and show you the big picture of what the data “looks” like.
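For readers who stay in a notebook, here is a rough pandas analogue of the Column Quality / Column Distribution profile, again reusing the df from earlier. The specific metrics are my approximation of what Power Query shows, not an exact reproduction.

```python
# A rough analogue of Power Query's Column Quality / Column Distribution,
# computed on the first 1,000 rows, mirroring the Power BI preview.
preview = df.head(1000)

profile = pd.DataFrame({
    "dtype": preview.dtypes.astype(str),
    "empty_pct": (preview.isnull().mean() * 100).round(1),
    "valid_pct": ((1 - preview.isnull().mean()) * 100).round(1),
    "distinct": preview.nunique(),
    # Values that appear exactly once, similar to Power Query's "unique" count.
    "unique": pd.Series({c: int((preview[c].value_counts() == 1).sum())
                         for c in preview.columns}),
})
print(profile)
```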
3. Basic Descriptions of Our Data
Qualitative Data Type
Qualitative or Categorical Data describes the object under consideration using a finite set of discrete classes. This type of data can’t easily be counted or measured using numbers and is therefore divided into categories. The gender of a person (male, female, or other) is a good example of this data type.
These are usually extracted from audio, images, or text. Another example is a smartphone listing that provides the current rating, the color of the phone, the category of the phone, and so on. All this information can be categorized as qualitative data. There are two subcategories under this: nominal and ordinal.
Nominal: These are the set of values that don’t possess a natural ordering and have no inherent mathematical meaning. For example, gender, race, state of residence, product category, etc.
You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have mathematical meaning. Let’s understand this with some examples. The color of a smartphone can be considered as a nominal data type as we can’t compare one color with others.
It is not possible to state that ‘Red’ is greater than ‘Blue’. The gender of a person is another example where we can’t rank male, female, or other. Mobile phone category — whether midrange, budget, or premium — is also a nominal data type.
For the nominal data type, where there is no comparison among the categories, one-hot encoding can be applied — similar to binary encoding — provided the number of categories is small. For the ordinal data type, label encoding, a form of integer encoding, can be applied.
Ordinal: These types of values have a natural ordering while maintaining their class of values. If we consider the size of a clothing brand then we can easily sort them according to their name tag in the order of small < medium < large. The grading system while marking candidates in a test can also be considered as an ordinal data type where A+ is definitely better than a B grade.
These categories help us decide which encoding strategy can be applied to which type of data. Encoding qualitative data is important because machine learning models can’t handle these values directly; they need to be converted to numerical types, as the models are mathematical in nature.
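As a minimal illustration using Superstore columns, one could one-hot encode the nominal Region column and integer-encode Ship Mode via an explicit mapping. Treating Ship Mode as ordinal here is purely an assumption for demonstration; the column names again assume the standard Superstore schema.

```python
# One-hot encode a nominal column (no inherent order among categories).
region_dummies = pd.get_dummies(df["Region"], prefix="Region")

# Integer-encode a column via an explicit mapping
# ("Ship Mode" is treated as ordinal here purely for illustration).
ship_mode_order = {"Same Day": 0, "First Class": 1, "Second Class": 2, "Standard Class": 3}
df["Ship Mode Encoded"] = df["Ship Mode"].map(ship_mode_order)

print(region_dummies.head())
print(df[["Ship Mode", "Ship Mode Encoded"]].head())
```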
Quantitative Data Type
This data type tries to quantify things, and it does so by using numerical values that make the data countable in nature. The price of a smartphone, the discount offered, the number of ratings on a product, the frequency of a smartphone’s processor, or the RAM of that particular phone — all of these fall under the category of quantitative data types.
The key thing is that a feature can take an infinite number of values. For instance, the price of a smartphone can vary from some amount x upwards, and it can be further broken down into fractional values. The two subcategories which describe them clearly are:
Discrete: Numerical values that are integers or whole numbers are placed under this category. The number of speakers in a phone, the number of cameras, the number of cores in the processor, and the number of SIMs supported are all examples of the discrete data type.
Continuous: Fractional numbers are considered continuous values. These can take the form of the operating frequency of the processor, the Android version of the phone, the Wi-Fi frequency, the temperature of the cores, and so on.
Key Takeaways
The goal is to classify all your features into one of these three categories. You might be asking yourself, how does classifying features help us? This categorization helps us decide what visualizations to choose in our EDA, and what statistical methods we can apply to this data. Some visualizations won’t work on all continuous, discrete, and categorical features. This means we have to treat groups of each type of feature differently. We will see how this works in later sections.
Let’s focus on continuous features first. Record any characteristics that you think are important, such as the maximum and minimum values for that feature. Do the same thing for discrete features. For categorical features, some things I like to check for are the number of unique values and the number of occurrences of each unique value. Let’s add our findings to our summary notes. There are X continuous features, X discrete features, and X categorical features.
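As a sketch of how these notes could be collected with pandas, the snippet below summarizes the numeric features and counts unique values for the categorical ones. The column lists are assumptions based on the standard Superstore schema.

```python
# Continuous / discrete features: basic descriptive statistics (min, max, mean, quartiles).
numeric_cols = ["Sales", "Profit", "Discount", "Quantity"]
print(df[numeric_cols].describe())

# Categorical features: number of unique values and frequency of each.
categorical_cols = ["Category", "Sub-Category", "Region", "Segment", "Ship Mode"]
for col in categorical_cols:
    print(col, "->", df[col].nunique(), "unique values")
    print(df[col].value_counts().head())
```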