AirPrice - AirBnB Price Optimization
The app is live at https://airprice.now.sh/. The app was a team effort; see this page for the team members and roles. The source of the data science portion of the app can be seen here.
The app only supports Berlin, DE. This is by design, since we had only one week to complete it.
What is it for?
Many owners of AirBnB listings set the price either too low or too high. If the price is too low, the room will always be booked, but the owner earns less than the listing could bring in. If the price is too high, the room will often sit empty and profits will again be sub-optimal. Our goal was to suggest a middle-ground price at which profits are optimized.
Data Science Methodology
See the primary notebook here.
Data was retrieved from http://insideairbnb.com/get-the-data.html on February 1, 2020. The site freely offers detailed listing information as well as summary listing information in CSV format. We downloaded two years' worth of summary information for Berlin. We chose the summary data over the detailed data because of inaccuracies we found in the detailed data.
For features, we had to decide whether to use "neighborhood", which had 139 distinct values, or "neighborhood group", which had 12 distinct values. We wanted to use "neighborhood" because its higher specificity should yield a better model. The challenge was figuring out how to get the neighborhood from the user. We talked with the web devs about programmatically looking up the neighborhood from an address, and we even built an endpoint for looking up the neighborhood from geographical coordinates. Unfortunately, we ran into difficulties with the Maps API and did not finish this functionality. For the final product we therefore used "neighborhood group", which the user simply selects from a dropdown.
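One common way to map coordinates to a neighborhood is a point-in-polygon test against boundary polygons. The sketch below uses ray casting in plain Python. It is only an illustration of the idea, not the endpoint we built: the function names, the `boundaries` dictionary, and the rectangular "Area A" / "Area B" shapes are all made up for the example.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def lookup_neighborhood(lon, lat, boundaries):
    """Return the first area whose polygon contains the point, else None."""
    for name, polygon in boundaries.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical rectangular areas as (longitude, latitude) vertices.
boundaries = {
    "Area A": [(13.0, 52.4), (13.4, 52.4), (13.4, 52.6), (13.0, 52.6)],
    "Area B": [(13.4, 52.4), (13.8, 52.4), (13.8, 52.6), (13.4, 52.6)],
}

print(lookup_neighborhood(13.2, 52.5, boundaries))  # Area A
print(lookup_neighborhood(13.6, 52.5, boundaries))  # Area B
print(lookup_neighborhood(14.5, 52.5, boundaries))  # None
```

In practice the polygons would come from a boundary file rather than hand-written rectangles, and a geometry library would handle multi-part shapes and holes.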
The categorical variables we decided to use were "neighborhood group" and "room type". The continuous variables we decided to use were "minimum nights", "number of reviews", "reviews per month", "calculated host listings count" (number of listings by the host), "availability 365" (days available per year), and "last review seconds ago" (seconds since the last review). We used a OneHotEncoder to encode categorical variables and a SimpleImputer to fill in missing values.
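The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not a copy of the notebook: the snake_case column names, the median imputation strategy, the February 1, 2020 reference date for "last review seconds ago", and the three sample rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Assumed column names; the real CSV headers may differ.
CATEGORICAL = ["neighbourhood_group", "room_type"]
CONTINUOUS = [
    "minimum_nights",
    "number_of_reviews",
    "reviews_per_month",
    "calculated_host_listings_count",
    "availability_365",
    "last_review_seconds_ago",
]


def build_preprocessor():
    """One-hot encode the categorical columns, impute the continuous ones."""
    return ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("num", SimpleImputer(strategy="median"), CONTINUOUS),  # strategy is an assumption
    ])


# Made-up rows, only to show the shape of the data.
listings = pd.DataFrame({
    "neighbourhood_group": ["Mitte", "Pankow", "Mitte"],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "minimum_nights": [2, 1, 3],
    "number_of_reviews": [10, 0, 4],
    "reviews_per_month": [1.2, np.nan, 0.4],
    "calculated_host_listings_count": [1, 2, 1],
    "availability_365": [120, 365, 30],
    "last_review": ["2020-01-15", None, "2019-06-01"],
})

# Derive "last review seconds ago" relative to an assumed reference date.
reference = pd.Timestamp("2020-02-01")
listings["last_review_seconds_ago"] = (
    reference - pd.to_datetime(listings["last_review"])
).dt.total_seconds()

features = build_preprocessor().fit_transform(listings)
print(features.shape)  # (3, 10): 4 one-hot columns + 6 continuous columns
```

Listings without any review have no last review date, so the derived column is missing for them; the imputer fills those gaps before the model sees the data.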
Feature Importances
We established a baseline by using the mean price for every estimate. The mean absolute error (MAE) for this baseline was €40.38.
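For reference, a mean-price baseline of this kind can be computed in a few lines. The sketch below assumes the mean is taken over a training set and the error is measured on a separate test set; the function name and the prices are invented for illustration.

```python
import numpy as np


def mean_price_baseline_mae(train_prices, test_prices):
    """MAE obtained by predicting the training mean for every test listing."""
    mean_price = np.mean(train_prices)
    errors = np.abs(np.asarray(test_prices, dtype=float) - mean_price)
    return float(errors.mean())


# Made-up prices in euros: the training mean is 60, so the errors are 10 and 40.
print(mean_price_baseline_mae([40.0, 60.0, 80.0], [50.0, 100.0]))  # 25.0
```

Any model worth deploying should beat this number on the same test set.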
One of our primary challenges was finding a model that could run with the limited resources of a Heroku hobby dyno. Our initial model was a RandomForestRegressor with no maximum depth. The model had an impressive MAE of about €5, but it used 1.5 GB of memory and was therefore out of the question for our API. We trained more models with maximum depths of 75, 25, and 20, the last of which had an MAE of about €19 and used about 300 MB of memory. At this point we decided to try a different model.
We tried a LogisticRegression, but the best result with this model was an MAE of €66, worse than the baseline. This is not too surprising, since LogisticRegression is a classifier rather than a regressor. We then moved on to an XGBRegressor. Our best result (using a maximum depth of 20 and a learning rate of 0.22) was an MAE of €15 with 106 MB of memory usage. We chose not to use RandomizedSearchCV to optimize the parameters because of the long training time. This model pushes the Heroku app over the "soft limit" for memory, so it spills into swap, but it is still usable.
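The effect of the depth limit on model size can be illustrated with a small experiment. The sketch below uses the size of the pickled model as a rough proxy for memory footprint, which is an assumption on our part and not necessarily how the figures above were measured. The data is random noise and the helper name is made up.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def pickled_size_bytes(model):
    """Rough proxy for a model's footprint: the length of its pickle."""
    return len(pickle.dumps(model))


# Synthetic data; with a noisy target, unrestricted trees grow very deep.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = rng.normal(size=500)

sizes = {}
for max_depth in (None, 5):
    forest = RandomForestRegressor(
        n_estimators=20, max_depth=max_depth, random_state=0
    )
    forest.fit(X, y)
    sizes[max_depth] = pickled_size_bytes(forest)

for max_depth, size in sizes.items():
    print(f"max_depth={max_depth}: {size / 1024:.0f} KiB")
```

Each extra level of depth can double the number of nodes per tree, which is why capping the depth shrinks the model so sharply, at the cost of accuracy.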
Visualizations