That's an SEO-approved headline if I've ever seen one.
I thought it'd be good to share a research project I did about 8 months ago for my Data Analysis class. Maybe someone else can duplicate it in another industry and benefit from it. The basic instructions for the project were to use exploratory data analysis to gain a business insight. It was an introductory class for business professionals, so it didn't involve very heavy lifting, but it was challenging nonetheless. Since I was working for SolarCity at the time (and still am), I thought it'd be good to build the project around my work there, because some benefits might emerge along the way. Of course, I won't disclose any business information.
My goal was to pick a regional warehouse and find the most "attractive" areas on which to focus our sales efforts. First I picked the smallest area that the US Census gave open data for (which was the zip code) and found the top 50 zip codes in which we have current customers. My theory was that our future customers would more likely be similar to our current customers than not, so I decided to focus my data analysis on describing our current customer base. The US Census will give you a shitload of data, as long as you painstakingly gather it in little chunks. So that's what I spent most of my time on.
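For anyone wanting to reproduce the zip code selection, here is a minimal sketch. It assumes "top 50" means ranking zip codes by number of current customers, and the customer list is made up for illustration (no real customer data). The function name `top_zip_codes` is hypothetical.

```python
from collections import Counter

# Hypothetical input: one zip code per existing customer.
# Zip codes are kept as strings so leading zeros are not lost.
customer_zips = ["94010", "94010", "94402", "94010", "94402", "94025"]

def top_zip_codes(zips, n=50):
    """Return the n zip codes with the most customers as (zip, count) pairs."""
    return Counter(zips).most_common(n)

# With this toy list, ask for the top 2 instead of the top 50.
top = top_zip_codes(customer_zips, n=2)
print(top)
```

The resulting list of zip codes is then the set you would go and collect Census data for, one chunk at a time.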
Below is an example of the finished product of collecting the data individually and organizing all of it. I'd say this took about 2/3 of the total project time:
First I just played around with a portion of the data to make sure it made sense before proceeding. To do so, I made an X-Y scatterplot of each Census data column against the total count of customers in each category; a positive correlation would mean that the higher a zip code scores on that Census variable, the more customers you would expect in it. For some of the Census data columns I used only the relative percent, since this paints a better picture than total numbers for those variables. Below is an example of linear regressions for a few of them, showing neutral, positive, and negative correlations:
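The scatterplot-and-regression step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows, the field names (`households`, `owner_occupied`, `customers`), and the helper functions are hypothetical, and the numbers are invented so the arithmetic is easy to follow.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.
    Raises ZeroDivisionError if either list has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least-squares line through the points: (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical per-zip-code rows (invented numbers, not real data).
rows = [
    {"households": 1000, "owner_occupied": 200, "customers": 10},
    {"households": 2000, "owner_occupied": 800, "customers": 20},
    {"households": 4000, "owner_occupied": 2400, "customers": 30},
]

# Use the relative percent instead of the raw total for this column.
pct_owner = [100 * r["owner_occupied"] / r["households"] for r in rows]
customers = [r["customers"] for r in rows]

r_value = pearson_r(pct_owner, customers)
slope, intercept = fit_line(pct_owner, customers)
print(r_value, slope, intercept)
```

Using the percentage rather than the raw count keeps a large zip code from looking "high" on a variable simply because more people live there.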
This yielded some interesting patterns. To analyze them further, I divided the variables (each one a column of a particular statistic from the Census) into groupings of similar variables. Yes, I added some personal bias to this analysis, but it was rarely hard for me to pick a grouping for a variable, so I'm confident I didn't inject too much bias. Below are the groupings of the different categories of Census data:
Then I found the correlations by grouping. It was particularly interesting to see which groupings were the strongest positive and negative indicators of more customers. Below is the graph by grouping:
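One possible way to get a correlation "by grouping" is to average the per-variable correlations inside each grouping and then rank the groupings. The post does not say exactly how the groupings were combined, so treat this as an assumption. The variable names, grouping labels, and correlation values below are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical per-variable correlations with customer count (made-up values).
variable_corr = {
    "median_income": 0.6,
    "pct_bachelors": 0.4,
    "pct_renters": -0.5,
    "pct_owner_occupied": 0.3,
}

# Hypothetical hand-assigned grouping for each variable.
variable_group = {
    "median_income": "income",
    "pct_bachelors": "education",
    "pct_renters": "housing",
    "pct_owner_occupied": "housing",
}

def mean_corr_by_group(corr, groups):
    """Average the correlations within each grouping, strongest positive first."""
    buckets = defaultdict(list)
    for variable, r in corr.items():
        buckets[groups[variable]].append(r)
    means = {g: sum(values) / len(values) for g, values in buckets.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

ranked = mean_corr_by_group(variable_corr, variable_group)
for group, mean_r in ranked:
    print(f"{group}: {mean_r:+.2f}")
```

Note that averaging lets opposing variables in the same grouping cancel out (as the two housing variables partly do here), so it is worth looking at the individual correlations alongside the grouped ones.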