The purpose of this article is to document my process in finishing the capstone project of the Google Data Analytics Certificate.
In case you don’t know, Google Data Analytics Certificate is offered by Google (obviously) on Coursera, in which you can learn basic knowledge and practical skills for a data analyst job.
The capstone project aims at bringing together everything people learn in the course and putting them into practice with a data analytic case study.
You can click the following link to learn more about the certificate:
https://www.coursera.org/professional-certificates/google-data-analytics
Overview of the article
- Cyclistic, a fictional bike-share company wants to understand how casual riders and annual members use their bikes differently
- The analysis breakdowns into 6 major stages, which are: Ask, Prepare, Process, Analyse, Share and Action
- Historical trip data between Nov 2020 and Oct 2021 are used for the analysis, I am using the data from this website, under this license (notice the company name is different as Cyclistic is a fictional company)
- All data are cleaned, manipulated, and analysed using SQL, then visualised using Tableau (I am trying to get familiar with these tools)
- To make the article easier to read, only some of the queries and visuals I used through the process are included here
- For the charts and the slides I used for this project, please click these: Google slides, Tableau Data Viz
Now let’s jump right into the project itself, hope you enjoy :)
Background
In this capstone project, a fictional bike-share company in Chicago, Cyclistic, wants to design a new marketing strategy to convert casual riders into annual members. The company believes its future success depends on maximising the number of annual memberships. Now the marketing analyst team is trying to understand how casual riders and annual members use Cyclistic bikes differently.
*In this article, annual members refer to customers who purchase the company's annual memberships; casual riders refer to customers who purchase single-ride or full-day passes.
Ask
The final goal of this analysis is to design a new marketing strategy for the company, to convert casual riders into annual members, which are proved to be more profitable.
With that in mind, the following analysis attempts to answer this question:
How do annual members and casual riders use Cyclistic bikes differently?
Hopefully, by answering this question, I can provide insights for the marketing team about what may turn casual riders into annual members, where and when is the best time to start a marketing campaign.
Following are the key stakeholders I need to consider in the analysis:
- Director of Marketing: responsible for the development of campaigns and initiatives to promote the bike-share program, including the new marketing strategy
- Marketing Analytics Team: the team I belong to (once again hypothetically), which is responsible for collecting, analysing and reporting the data that helps guide Cyclistic marketing strategy
- Executive Team: decide whether to approve the recommended marketing program as according to the analysis
Prepare
As previously mentioned, the original data are located here, provided and licensed by Motivate International Inc., the operator of the City of Chicago’s Divvy bicycle sharing service (notice the company name is different because Cyclistic is a fictional company).
Original data on the website are stored in different ZIP files according to the year and month in CSV format. Every CSV file contains data of every trip in that particular month, which are separated into columns including the type of rider, type of bike, the time, date and location (including the exact geographical location) that every bike trip started and ended. Furthermore, every trip and station has its unique id. However, due to data-privacy consideration, there is no identification for the rider in the dataset, therefore it is not possible to identify if the same person has purchased multiple services.
The most up-to-date data from the past 12 months (Nov 2020 — Oct 2021) are extracted for analysis.
Process
For this project, I choose to use SQL for cleaning, manipulation and analysis and Tableau for visualisation, to gain some hands-on experience with these tools.
After I download all the relevant data, they were all uploaded to BigQuery for cleaning, then I merged them into one table within BigQuery for manipulation and further analysis.
Cleaning
All monthly datasets were checked and cleaned with a similar process, to check if there were any:
- Duplicated rows
- Missing values in important columns
- Typo in columns with string attributes
- Problems in the time range between the start and end time of each ride
- Unmatched data type between monthly tables
Through the above steps, I discovered some rows in the monthly tables had missing values in the start_station_id, start_station_name, end_station_id and end_station_name columns, also, some rows had values in started_at column which were larger than or equal to that of ended_at. The rows that match these conditions are excluded. Also, two columns (start_station_id and end_station_id) in the 2020_11 table had unmatched data types with the same columns of other tables, which I also amended.
The cleaned monthly datasets were then exported into a separated folder in BigQuery, and then merged together.
Manipulation
In addition, I did some data manipulation to the new merged table:
- Add a new column for ride length, which I later transfer into minute as a measurement
— Step 1
ALTER TABLE
`coursera-case-study-1-trail.Cyclistic_clean_data.12_month_from_2020_11_ver2`
ADD COLUMN ride_length INT
— Step 2
SELECT *
EXCEPT(ride_length),
(ended_at - started_at) AS ride_length
FROM
`coursera-case-study-1-trail.Cyclistic_clean_data.12_month_from_2020_11_ver2`
— Save new table
- Add a new column to extract the day of week in which every ride started
— Step 1
ALTER TABLE
`coursera-case-study-1-trail.Cyclistic_clean_data.12_month_from_2020_11_ver3`
ADD COLUMN start_day_of_week INT
—Step 2
SELECT
*
EXCEPT(start_day_of_week),
EXTRACT(DAYOFWEEK FROM started_at) AS start_day_of_week
FROM
`coursera-case-study-1-trail.Cyclistic_clean_data.12_month_from_2020_11_ver3`
— Save new table
Analyse
After probably processing the data, I started analysing the merged table, in an attempt to answer the questions I mentioned. In this stage, I once again used SQL on BigQuery. In addition, some query results were exported to Tableau to help me better understand the data.
Number of riders
To begin with, I wanted to know how many rides were purchased by the company’s annual members and casual riders, so I ran a simple query to count the number of rides categorised by the type of rider. Turn out although members purchased more rides within the past 12 months, the difference between members and casual riders was not significant.
So I ran another query to divide the number by the type of bike:
While both members and casual riders preferred the company’s classic bike, the docked bike was also popular among casual riders.
Ride duration
Then I look into the distribution of ride duration by checking the maximum, minimum and average value in the ride_length column that I previously created.
Here comes some awkward results. The minimum ride duration was 1 second, and the maximum ride duration was 932 hours. With no way to determine if there were any reasonable explanations for these extreme values, I decided to use median instead of average to measure the ride duration. As the data type of the ride_length column was INTERVAL, I took some extra steps to transform the values there into normal numbers in minutes:
SELECT
*,
EXTRACT(HOUR FROM ride_length) * 60 + EXTRACT(MINUTE FROM ride_length) + EXTRACT(SECOND FROM ride_length) / 60
AS ride_length_in_minute
FROM
`coursera-case-study-1-trail.Cyclistic_clean_data.12_month_from_2020_11_ver4`
Then I saved the query results into a new table and calculate the median ride duration of type of rider and type of bike.
The above visualisations show that members’ ride duration per trip was much shorter. And like the previous section, casual riders’ usage of the docked bike was abnormal compared to other categories, as the median ride time was at around 28 minutes, a lot higher than other categories.
As causal riders include both single-ride or full-day pass purchasers, I suspected full-day pass purchasers may have a longer ride duration and preference on the docked bike. However as there was no way to identify whether a casual rider purchased a single-ride or full-day pass, I decided to move on to the next section.
Popular stations to start and end riding journey
I also wanted to understand if members and casual riders had any differences in their location preferences.
First I ran a simple query to understand which were the top 10 stations to start riding journey by counting the total number of rides by start station. I also broke down the number into members and casual riders to understand their difference.
After discovering the big difference between the number of members and casual riders in these stations, I decided to run some extra queries to check the ranks of these stations in general and among the two types of riders.
It is obvious that members and casual riders have a different location preferences. But the above tables cannot describe what is the difference. Therefore I decided to create some maps to better understand the situation. Through inspection of the columns that store values of geographical location, I learned that under the same station id and station name, there can be a variety of latitude and longitude. Hence, average latitude and longitude were used for every station, to make the map easier to read and process.
The two maps clearly showed that, despite some overlapping, causal riders have a stronger preference towards stations close to the recreational areas, meanwhile members’ location preferences are more diverse.
Ride pattern by hour of day, day of week and month
The casual riders’ preference over recreational areas led me to further dig into the time preference of members and casual riders in starting their ride journey.
While both members and casual riders prefer the summer season (June to September), their weekly and daily preferences were in bigger differences. So I used one more graph to under the influence of day of week and hour of day over the two types of riders’ number of rides.
Evaluating the two factors at the same time, it is obvious that the members prefer to start their ride on weekdays at communal hours, especially morning, and casual riders prefer weekday evenings (especially Friday) and weekend afternoons.
Similarities between casual riders and members
While the above analysis showed members and casual riders are largely different, a proportion of casual riders have similarities with members. So I ran one last query to find out the number of casual riders that ride duration was under 10 minutes and start their rides at communal hours of the weekdays.
SELECT
COUNT(*)
FROM (
SELECT
*,
EXTRACT(HOUR FROM started_at) AS extract_hour
FROM `coursera-case-study-1-trail.Cyclistic_clean_data.12_month_from_2020_11_ver7`
)
WHERE
member_casual = ‘casual’
AND
ride_length_in_minute < 10
AND
start_day_of_week BETWEEN 2 AND 6
AND
extract_hour IN (6, 7, 8, 9, 16, 17, 18)
Query result shows that there were 143,661 casual riders that match these characteristics, which account for 7% of all casual riders.
Summarising the analysis findings
To conclude, annual members mainly use the company's bike for weekday commune, and casual riders tend to use the bike for leisure purposes.
Share and Action
Base on the findings in the previous section, the following are some suggestions for designing a new marketing strategy for the company to convert casual riders into annual members:
- Time of campaign: June to August, weekend afternoon
- Campaign location: recreational area of Chicago
- Campaign message: promote Cyclistic as a method of a weekday commute
Thanks for reading the article, please let me know if you think there are any rooms for improvement in my first project! Any helps are welcomed :)