Analyzing the 2020 Stack Overflow survey using the CRISP-DM process
The cross-industry standard process for data mining (CRISP-DM for short) is a widely used open standard process model that describes common approaches used by data mining experts.
To demonstrate how CRISP-DM works in practice, I’m going to apply the approach to a real-world example: the Stack Overflow Annual Developer Survey dataset from 2020. The Stack Overflow Developer Survey is an annual survey of employees, specialists, and enthusiasts in the IT and computer science fields, conducted by Stack Overflow, the Q&A site for professional and enthusiast programmers.
CRISP-DM usually follows a six-step approach consisting of:
· Business Understanding
· Data Understanding
· Prepare Data
· Model Data
· Evaluate the Results
· Deploy
Let’s look at each step in detail.
Business Understanding
The first part is about asking relevant questions about a topic from a business perspective. For the scope of this project, I am particularly interested in the following questions about the 2020 dataset:
1. How is education level relevant for breaking into Data Science?
2. What languages do data professionals work with?
3. How does the average workload vary across countries?
Data Understanding
This stage challenges me the most. My preferred approach is to go through the whole dataset once, holistically, and outline a pathway that maximizes the data’s potential to answer the questions posed.
You might choose to do the same, or start learning the data with a few basic questions:
- How big is the dataset? How many rows and columns does it have?
- How much data is missing?
- How many columns hold text, numeric, or categorical values? (Non-numeric values take more time to transform.)
- Which columns can be removed because they hold irrelevant information?
…and more questions like these.
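These first checks translate directly into a few pandas calls. Here is a minimal sketch using a tiny stand-in DataFrame; in practice you would load the full survey CSV, and the column names here are illustrative (loosely modeled on the survey schema):

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the survey data; in practice you would use
# something like pd.read_csv("survey_results_public.csv")
df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "EdLevel": ["Bachelor's degree", "Master's degree", None, "Bachelor's degree"],
    "WorkWeekHrs": [40.0, np.nan, 45.0, 38.0],
})

# How big is the dataset?
print("Rows and columns:", df.shape)

# How much data is missing, per column?
print(df.isnull().sum())

# Which columns are text/categorical vs. numeric?
print(df.dtypes)
```

The same three calls (`shape`, `isnull().sum()`, `dtypes`) answer the first three questions above on the real dataset as well.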


Prepare Data
Data scientists generally spend about 80% of their time preparing data. I performed the following steps to prepare the data:
- Imputing missing values
- Data wrangling
- Data transformation
There was a lot more I could have done to transform the data, but we have to start somewhere, and we can always iterate. Try to follow the DRY (Don’t Repeat Yourself) principle.
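As a rough sketch of those three steps with pandas — the columns and the imputation strategy below are illustrative assumptions, not the exact ones used in the analysis:

```python
import pandas as pd
import numpy as np

# Illustrative subset of survey-like data
df = pd.DataFrame({
    "WorkWeekHrs": [40.0, np.nan, 45.0, 38.0],
    "Country": ["India", "United States", None, "India"],
})

# Impute missing values: mean for numeric, mode for categorical
df["WorkWeekHrs"] = df["WorkWeekHrs"].fillna(df["WorkWeekHrs"].mean())
df["Country"] = df["Country"].fillna(df["Country"].mode()[0])

# Transform: one-hot encode the categorical column so models can use it
df = pd.get_dummies(df, columns=["Country"])
print(df.head())
```

After these steps the frame has no missing values and only numeric columns, which is the usual goal of the preparation stage.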
Model Data
Finally, we pick an appropriate algorithm from a machine learning library like scikit-learn and train a model, checking its predictions against a validation set until it reaches the desired accuracy.
Once we reach an acceptable prediction accuracy, we can finalize that model for deployment. For this demo, I am not applying predictive analytics to the dataset.
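Even though this demo skips predictive modeling, the train-and-validate loop described above might look like this with scikit-learn. The data here is synthetic and the choice of algorithm is just one reasonable option:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, since this article's analysis is descriptive
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out a validation set to measure prediction accuracy
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {accuracy:.2f}")
```

If the accuracy falls short of the target, you would iterate on the algorithm choice, its hyperparameters, or the data preparation, exactly as CRISP-DM’s cyclical nature suggests.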
Results
Communication is a vital part of the role of a data scientist. At this point, you generally share the outcome of your analysis using visualizations or statistics. Sometimes you might change your modeling algorithm or data preparation techniques based on the outcome. Let’s jump to the results of the EDA I did on the Stack Overflow Developer Survey 2020.
How is education level relevant for breaking into Data Science?
- The column EdLevel captures responses to the question:
“Which of the following best describes the highest level of formal education that you’ve completed?”


- Most respondents hold a Bachelor’s degree.
- Master’s degree holders come second.
- College dropouts take the third position. Whoa, that is a lot of college dropouts!
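Producing a ranking like this comes down to a `value_counts` call on the EdLevel column. A sketch on a hypothetical handful of answers:

```python
import pandas as pd

# Hypothetical sample of EdLevel responses (answer wording abbreviated)
edlevel = pd.Series([
    "Bachelor's degree", "Bachelor's degree", "Master's degree",
    "Some college/university study without earning a degree",
    "Bachelor's degree", "Master's degree",
])

# value_counts sorts by frequency, giving the ranking directly
counts = edlevel.value_counts()
print(counts)
```

On the real dataset, plotting `counts` as a horizontal bar chart gives the figure summarized in the bullets above.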
Which programming languages have developers worked with in the past year?

This one shows several things:
- Most respondents have worked with JavaScript. HTML/CSS in 3rd position makes sense, as people can’t use JS without HTML.
- SQL stands in 4th position among languages worked with.
- Python takes the top position among languages people want to work with.
- Only a few people have worked with, or want to work with, languages like Julia, Perl, and VBA.
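The survey stores multi-select answers like these as semicolon-separated strings, so counting languages requires splitting each response first. A sketch with made-up rows (in the real data the column is LanguageWorkedWith):

```python
import pandas as pd

# Made-up multi-select responses in the survey's semicolon format
lang = pd.Series([
    "JavaScript;HTML/CSS;SQL",
    "Python;JavaScript",
    "JavaScript;HTML/CSS",
])

# Split each response into a list, flatten with explode, then count
counts = lang.str.split(";").explode().value_counts()
print(counts)
```

The same split-explode-count pattern works for any of the survey’s multi-select columns, such as the “want to work with” variant.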
How does work-life balance differ across countries?
Before checking work-life balance, we should look at where respondents come from, to avoid biased results from countries with few responses.
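Filtering out countries with too few responses before averaging can be sketched as follows; the minimum-response threshold and column names here are illustrative assumptions:

```python
import pandas as pd

# Illustrative data: Iceland has only one respondent
df = pd.DataFrame({
    "Country": ["India"] * 5 + ["United States"] * 4 + ["Iceland"],
    "WorkWeekHrs": [40, 42, 45, 38, 41, 44, 40, 39, 43, 50],
})

# Keep only countries with enough respondents to give a stable average
MIN_RESPONSES = 3  # illustrative threshold
counts = df["Country"].value_counts()
kept = counts[counts >= MIN_RESPONSES].index

avg_hours = (
    df[df["Country"].isin(kept)]
    .groupby("Country")["WorkWeekHrs"]
    .mean()
)
print(avg_hours)
```

A single 50-hour respondent from a low-capture country would otherwise dominate that country’s “average”, which is exactly the bias the filtering step avoids.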


Deploy
Deploying can occur by moving your approach into production, or by using your results to persuade others within a company to act on them.
Once you reach this stage, you might assume that your job is already done and all that remains is to put the code in production and get it over with. It is not quite that straightforward, but it is not that complex either.
If you analyzed the data for yourself, then your job is indeed complete. But if you are a data scientist working for a company or team, you need to follow a few more steps to deploy the code in production, for example:
- Convert your code into an executable .py script. The Jupyter notebook was only for analysis purposes. Move your code into a plain text file and save it with the “.py” extension.
- Remove hard-coded values from your code, for example the accuracy threshold, input file name, or file directory/path, and pass these values in at run time as parameters. Keep your parameter list a manageable size.
- Add instructions as comments inside the Python script, following the PEP 8 style guide. This helps other developers understand your code and change it if necessary.
- You can also create separate Python scripts for your functions and import them as modules, just as we import pandas, NumPy, etc.
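For the second step above, passing values in at run time instead of hard-coding them, Python’s standard argparse module is a natural fit. A minimal sketch with hypothetical parameter names:

```python
import argparse

def parse_args(argv=None):
    # Run-time parameters replace hard-coded constants in the script
    parser = argparse.ArgumentParser(description="Survey analysis script")
    parser.add_argument("--input-file", default="survey_results_public.csv",
                        help="path to the survey CSV")
    parser.add_argument("--accuracy-threshold", type=float, default=0.8,
                        help="minimum acceptable model accuracy")
    return parser.parse_args(argv)

# Passing an explicit argv list here just for demonstration; in a real
# script you would call parse_args() and let it read sys.argv
args = parse_args(["--accuracy-threshold", "0.9"])
print(args.input_file, args.accuracy_threshold)
```

Default values keep the script runnable with no arguments while still letting operators override paths and thresholds at deploy time.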
Once your code is ready to deploy, always check whether the destination Python environment has your data science libraries installed. If not, create a virtual environment and install all the necessary libraries before you deploy the code.
If you want your code to run on a schedule, you might need a scheduler like AutoSys, cron, or the Windows Task Scheduler.
Conclusion
In this article, we looked at how CRISP-DM helps maintain the bigger picture while doing data analysis or training a machine learning model. Insights from the Stack Overflow Developer Survey 2020 helped illustrate the CRISP-DM process.
This article is part of a Udacity Data Scientist Nanodegree project. To see more of this analysis, see the link to my GitHub available here.
This is my first article on Medium, and I hope you liked it. Thanks for reading.