Coming from a non-tipping country, I am very interested in analyzing tipping in NYC, so that I can understand the tipping convention and know how much I should tip cab drivers if I travel to New York. Analyzing tips might also reveal some interesting information about economics or city development.
My guesses for the factors affecting tipping are:
For this project, I will first do ETL to filter out unwanted data and extract useful features, then do some general analysis of the relationship between tipping and other features, and finally see whether I can construct a model to predict the tip of a trip.
Please check the home page for the initial ETL.
Tip Ratio/Percentage is calculated by `tip_amount/(total_amount - tip_amount)`. The percentage will be used for subsequent visualization.
Tip Range: Based on the distribution of the tip ratio, we further assign each tip to one of several ranges to help analyze the massive data. The tip range index is calculated by `ceil(20*tip_amount/(total_amount - tip_amount))`. Below is the table of range indices and the corresponding tip percentage ranges:
tip_range_index | indicated range |
---|---|
0 | tip = 0 |
i between 1 and 8 | tip > 5(i-1)%, <= 5i% |
9 | tip > 40% |
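The two formulas above can be sketched in plain Python as follows. This is only an illustration, not the ETL code itself: the function names are hypothetical, and capping the index at 9 is an assumption based on the last row of the table (the `ceil` formula alone does not cap). Following the formula, a 20% tip gives index 4.

```python
import math


def tip_percent(tip_amount: float, total_amount: float) -> float:
    """Tip as a percentage of the pre-tip amount (total_amount - tip_amount)."""
    base = total_amount - tip_amount
    if base <= 0:
        raise ValueError("total_amount must be larger than tip_amount")
    return 100.0 * tip_amount / base


def tip_range_index(tip_amount: float, total_amount: float, max_index: int = 9) -> int:
    """Bucket the tip ratio into 5-percentage-point steps via ceil(20 * ratio).

    Assumption: indices above max_index are collapsed into max_index,
    so that index 9 collects every tip above 40%.
    """
    base = total_amount - tip_amount
    if base <= 0:
        raise ValueError("total_amount must be larger than tip_amount")
    return min(max_index, math.ceil(20.0 * tip_amount / base))


if __name__ == "__main__":
    print(tip_percent(2.0, 12.0))           # a 2 dollar tip on a 10 dollar pre-tip amount
    print(tip_range_index(0.0, 10.0))       # no tip
    print(tip_range_index(2.0, 12.0))       # 20% tip
    print(tip_range_index(1001.0, 1010.8))  # a very large tip lands in the last bucket
```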
Other Amount (other_amount) is calculated by `total_amount - tip_amount - fare_amount`. It includes all other charges such as tolls, and should be useful for tip prediction.
After ETL, we have 114,348,011 records to analyze.
According to the New York Official Guide, the tip for a cab driver should be 15–20 percent of the total fare, which is consistent with our results.
year | mean_tip_percent |
---|---|
2017 | 18.752909945721235 |
2018 | 18.636400564789707 |
2019 | 18.146659574580887 |
2020 | 18.43724157920186 |
2021 | 18.73994173994071 |
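As a rough illustration of how such a per-year mean can be aggregated, here is a minimal plain-Python sketch. It assumes the yearly figure is the mean of the per-trip tip percentages (rather than the ratio of summed tips to summed fares), and the sample rows are made up for the example, not taken from the dataset.

```python
from collections import defaultdict


def mean_tip_percent_by_year(trips):
    """trips: iterable of (year, tip_amount, total_amount) tuples.

    Returns {year: mean of per-trip tip percentages}.
    Rows whose pre-tip amount is not positive are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, tip_amount, total_amount in trips:
        base = total_amount - tip_amount
        if base <= 0:
            continue
        sums[year] += 100.0 * tip_amount / base
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}


# Hypothetical sample rows, for illustration only.
sample_trips = [
    (2017, 2.0, 12.0),  # 20% tip
    (2017, 1.0, 11.0),  # 10% tip
    (2018, 0.0, 8.0),   # no tip
]

if __name__ == "__main__":
    print(mean_tip_percent_by_year(sample_trips))
```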
Some trips have extremely large tips that can reach hundreds of dollars. The largest tip in our filtered records is 1,001 dollars, from a trip in the early morning of 2020-07-02.
pickup_datetime | 2020-07-02 05:50:12 |
dropoff_datetime | 2020-07-02 05:55:46 |
from | UN/Turtle Bay South, Manhattan |
to | Gramercy, Manhattan |
fare_amount | 6.5 |
tip_amount | 1001.0 |
total_amount | 1010.8 |
At first I thought this was a failure of the in-car recorder, but similar stories have been reported. Check the articles crazy-taxi-stories-and-data-from-new-york-city and philly-cab-driver-gets-1000-tip.
Daily Max Tip From 2017 to 2021
At first, I thought the economic downturn caused by COVID would lead to a decrease in tipping, but the tip distribution bar charts overturned my guess: the tip distribution remained stable although the number of trips decreased significantly in 2020. The daily mean tip heatmap below shows more detail. The mean tip decreased in February 2019 and rebounded in March 2020, and the mean varies from 17% to 21%, which is not a significant difference. So maybe the economic recession only affects the number of trips: the poor no longer call a taxi, while the rich keep their tipping behavior.
Also notice that the mean is relatively higher from March to June 2020, the lockdown period in New York. So we might infer that the difficulty of travel led passengers to tip more in recognition of the drivers' hard work.
You might have noticed in the previous distribution bar charts that green cab passengers tend to tip less than yellow cab passengers. In 2021, quite a few green cab passengers did not tip at all. In New York, yellow cabs run mainly in the busy commercial areas of Manhattan, while green cabs, introduced in 2011, run in areas not served by yellow cabs. This gives us a new question: does tipping vary by pick-up/drop-off location?
The answer is yes. We consider a trip to occur at a location if that location is its pick-up or drop-off location, then run a series of join queries to get the mean tip ratio by location. The heatmap below shows the mean tip ratio of each location in New York. Red areas have a higher tip_ratio while blue areas have a lower one. We can see that the business zones, e.g. Manhattan and North Queens, have relatively high tipping shown in red, while remote residential areas and parks in the south have low mean percentages shown in blue. Notice that the red area in the southeast is JFK Airport. Check the New York City’s Zoning & Land Use Map.
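The per-location aggregation described above can be sketched in plain Python as follows. This is not the join queries used in the project, only the same idea in miniature: each trip contributes its tip ratio to both its pick-up and its drop-off zone. Counting a trip only once when both zones are the same is an assumption of this sketch, and the sample rows are made up.

```python
from collections import defaultdict


def mean_tip_ratio_by_location(trips):
    """trips: iterable of (pickup_zone, dropoff_zone, tip_ratio) tuples.

    A trip counts for a zone if the zone is its pick-up or drop-off location.
    Assumption: a trip that starts and ends in the same zone is counted once.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pickup_zone, dropoff_zone, tip_ratio in trips:
        for zone in {pickup_zone, dropoff_zone}:  # set removes the duplicate zone
            totals[zone] += tip_ratio
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


# Hypothetical sample rows, for illustration only.
sample_trips = [
    ("Gramercy", "JFK Airport", 0.25),
    ("Gramercy", "Gramercy", 0.125),
    ("JFK Airport", "Midtown", 0.5),
]

if __name__ == "__main__":
    for zone, ratio in sorted(mean_tip_ratio_by_location(sample_trips).items()):
        print(zone, ratio)
```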
A cab driver may want to know how much tip to expect based on the time, location, etc. So let’s try to construct models and see if we can predict the tips.
Regressors
model | Evaluation Score | TOP 2 important features |
---|---|---|
DecisionTree | r2 = 0.0427, rmse = 6.5653 | other_amount: 0.8204, fare_amount: 0.1748 |
RandomForest | r2 = 0.0474, rmse = 6.5468 | other_amount: 0.2284, fare_amount: 0.1713 |
Gradient Boosted Tree | r2 = 0.0427, rmse = 6.5653 | other_amount: 0.8204, fare_amount: 0.3296 |
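To help read the regression scores, here is a small sketch of the standard definitions of r2 and rmse, assuming the reported numbers follow them. A model that always predicts the mean tip gets r2 = 0, so an r2 of about 0.04 means the models explain only about 4% of the variance in tips. The sample values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical tip amounts, for illustration only.
actual_tips = [0.0, 2.0, 4.0]
mean_baseline = [2.0, 2.0, 2.0]  # always predict the mean tip

if __name__ == "__main__":
    print(r2(actual_tips, mean_baseline))    # baseline explains none of the variance
    print(rmse(actual_tips, mean_baseline))
    print(r2(actual_tips, actual_tips))      # perfect predictions
```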
Classifiers
model | Evaluation Score | TOP 2 important features |
---|---|---|
DecisionTree | accuracy=0.5417 | other_amount: 0.5913, fare_amount: 0.2621 |
RandomForest | accuracy=0.5224 | other_amount: 0.6657, fare_amount: 0.3144 |
I tried several models provided by pyspark, and neither the regression nor the classification models performed well. It seems that the data we have is not sufficient to predict tips. However, all models agree that the fare and the other amount have the greatest importance in deciding tips, which we did not consider previously. After plotting the data, we can see that some passengers tip quite generously on low-fare trips.