Do you want to move to Milan? Neighborhood Sentiment Analysis using Airbnb data

This project is part of the Udacity Data Scientist Nanodegree Program (Write a Data Science Blog Post). The goal was to choose a dataset, apply the CRISP-DM process (Cross-Industry Standard Process for Data Mining) and effectively communicate the results of the analysis. The process consists of six steps:

Business Understanding
Data Understanding
Prepare Data
Data Modeling
Evaluate the Results
Deploy

Looking at the suggested datasets, I was stuck because there were too many options. Then, since some friends and I were thinking of moving to Milan to be closer to our workplaces, I decided to use Airbnb data to run a sentiment analysis of its neighborhoods.

Business Understanding

The goal of the project was to answer at least three questions related to business or real-world applications of the data, so I chose:

Which are the 5 neighborhoods with the highest score?
Which are the 5 neighborhoods with the lowest score?
How different is the overview of the neighborhood given by the hosts from the one given by the guests?

Data Understanding, Prepare Data and Data Modeling

The dataset is composed of 20626 host listings and 469653 customer reviews. Both listings and reviews are written in different languages: Italian, English, French, Russian, etc.

After understanding which data could be useful for my goal, I had to map the listing neighborhoods to the real ones: Milan is composed of 130 neighborhoods, but only 74 were covered by at least one listing after the mapping.

Where an automatic approach was not possible, the mapping was done manually using Google Maps.

Using the listing_id, it is possible to connect each review to the real_neighbourhood of its listing.
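A minimal sketch of this join, written with plain Python dictionaries so that it stays self-contained. The listing_id and real_neighbourhood names come from the dataset as described here; the id and comments keys and the sample rows are assumptions made for illustration.

```python
def attach_neighbourhood(reviews, listings):
    """Add the listing's real_neighbourhood to every review with a mapped listing."""
    # Lookup table: listing id -> mapped neighbourhood (the "id" key is assumed).
    lookup = {listing["id"]: listing["real_neighbourhood"] for listing in listings}
    # Reviews whose listing has no mapped neighbourhood are dropped.
    return [
        {**review, "real_neighbourhood": lookup[review["listing_id"]]}
        for review in reviews
        if review["listing_id"] in lookup
    ]


# Illustrative rows, not taken from the real dataset.
listings = [
    {"id": 1, "real_neighbourhood": "A"},
    {"id": 2, "real_neighbourhood": "B"},
]
reviews = [
    {"listing_id": 1, "comments": "Great area."},
    {"listing_id": 2, "comments": "Quiet street."},
    {"listing_id": 3, "comments": "Listing without a mapped neighbourhood."},
]

for row in attach_neighbourhood(reviews, listings):
    print(row["listing_id"], row["real_neighbourhood"])
```

With a dataframe library the same step would typically be a single merge on listing_id.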

For both listings and reviews, I then detected the language used and marked each record.

Around 34% of the listings have a neighborhood overview whose language could not be detected, 37% are in English and 26% in Italian.

Around 58% of the reviews are in English and 20% in Italian.
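To give an idea of what marking each record by language involves, here is a toy sketch. It is not the detector used in the project: it is a simple stopword-counting heuristic that stands in for a proper language-detection library. The word lists, the min_hits threshold and the sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny hand-picked stopword lists (assumed, far too small for real use).
STOPWORDS = {
    "en": {"the", "and", "is", "to", "of", "with", "very", "was"},
    "it": {"il", "la", "e", "di", "che", "molto", "con", "per", "un"},
}


def detect_language(text, min_hits=2):
    """Return the language with the most stopword hits, or "unknown"."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    (best_lang, best_hits), (_, second_hits) = ranked[0], ranked[1]
    # Too little evidence, or a tie between languages: leave the record unmarked.
    if best_hits < min_hits or best_hits == second_hits:
        return "unknown"
    return best_lang


# Illustrative texts, not taken from the real dataset.
texts = [
    "The room is very nice and close to the metro",
    "La stanza e il bagno con vista, molto vicino al centro di Milano",
    "Milano Centrale",
]
labels = [detect_language(text) for text in texts]
shares = {lang: count / len(labels) for lang, count in Counter(labels).items()}
print(labels)
print(shares)
```

Very short or name-only texts end up as "unknown", which is one plausible reason why a sizeable share of overviews has no detectable language.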

I then decided to focus on English and use only the listings and reviews marked as such. For listings, the neighborhood_overview field directly gives the sentiment of the neighborhood, but for reviews we have to extract only the neighborhood-related sentences from the comments.

For each listing we now have a neighborhood_sentiment.

The same was done for the reviews, extracting the sentences related to the neighborhood using a list of synonyms: neighborhood, area, block, district, ghetto, parish, precinct, region, section, slum, street, suburb, territory, zone, location

For example, let’s consider the first comment:

Staying at Francesca's and Alberto's place was a pleasure. Just as described, well located for my purposes, an enjoyable walk to the Tortona area. The room is very nice, cleaned daily and has private bathroom.

Francesca is super friendly and very helpful; whilst still respecting privacy.

Overall a great experience!

For our purpose we have to consider only:

Just as described, well located for my purposes, an enjoyable walk to the Tortona area
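The extraction step can be sketched as follows. The keyword list is the one given above; splitting sentences on end punctuation and matching whole words case-insensitively (with an optional plural "s") are my assumptions about how such a filter could work, not necessarily what the project code does.

```python
import re

# Synonyms of "neighborhood" listed in the post.
NEIGHBORHOOD_TERMS = [
    "neighborhood", "area", "block", "district", "ghetto", "parish",
    "precinct", "region", "section", "slum", "street", "suburb",
    "territory", "zone", "location",
]

# Whole-word match, optional plural "s", case-insensitive.
TERM_PATTERN = re.compile(
    r"\b(?:" + "|".join(NEIGHBORHOOD_TERMS) + r")s?\b", re.IGNORECASE
)


def neighborhood_sentences(comment):
    """Keep only the sentences that mention one of the neighborhood terms."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", comment.strip())
    return [sentence for sentence in sentences if TERM_PATTERN.search(sentence)]


comment = (
    "Staying at Francesca's and Alberto's place was a pleasure. "
    "Just as described, well located for my purposes, an enjoyable walk "
    "to the Tortona area. The room is very nice, cleaned daily and has "
    "private bathroom.\n"
    "Francesca is super friendly and very helpful; whilst still respecting "
    "privacy.\n"
    "Overall a great experience!"
)

for sentence in neighborhood_sentences(comment):
    print(sentence)
```

Applied to the example comment, this keeps only the sentence that mentions the Tortona area. Note that "located" does not match "location", so a sentence is kept only when one of the listed words appears as such.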

Now by grouping by neighborhood we obtain the sentiments we were looking for.
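A minimal sketch of the grouping step, which also yields the top and bottom neighborhoods asked for in the first two questions. The real_neighbourhood and neighborhood_sentiment field names follow the post; the helper names, the placeholder neighborhood labels and the scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by_neighbourhood(records):
    """Average the neighborhood_sentiment of all records in each neighbourhood."""
    scores = defaultdict(list)
    for record in records:
        scores[record["real_neighbourhood"]].append(record["neighborhood_sentiment"])
    return {name: mean(values) for name, values in scores.items()}


def top_and_bottom(averages, k=5):
    """Return the k best and the k worst neighbourhoods by average score."""
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    # Bottom list is reversed so that the lowest score comes first.
    return ranked[:k], ranked[-k:][::-1]


# Illustrative records with placeholder names and made-up scores.
records = [
    {"real_neighbourhood": "A", "neighborhood_sentiment": 0.5},
    {"real_neighbourhood": "A", "neighborhood_sentiment": 0.25},
    {"real_neighbourhood": "B", "neighborhood_sentiment": 0.0},
    {"real_neighbourhood": "C", "neighborhood_sentiment": -0.5},
    {"real_neighbourhood": "C", "neighborhood_sentiment": -0.25},
]

averages = mean_sentiment_by_neighbourhood(records)
best, worst = top_and_bottom(averages, k=2)
print(best)
print(worst)
```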

On average, the neighborhood sentiment derived from the contextual sentences of the guests' reviews is much higher than the one derived from the neighborhood_overview written by the hosts. One possible explanation is that the length of the neighborhood_overview text negatively influences the sentiment analysis score, whereas extracting only the relevant sentences cleans up the string used for the analysis and gives an overall higher score.
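One way to make this comparison concrete is to compute, for every neighborhood that has both listings and reviews, the difference between the guest average and the host average. The function name, the placeholder labels and the scores below are assumptions made for illustration, not results from the project.

```python
def sentiment_gap(host_averages, guest_averages):
    """Guest average minus host average, for neighbourhoods present in both."""
    shared = host_averages.keys() & guest_averages.keys()
    return {name: guest_averages[name] - host_averages[name] for name in sorted(shared)}


# Made-up averages with placeholder neighbourhood names.
host_averages = {"A": 0.25, "B": 0.5}
guest_averages = {"A": 0.75, "C": 0.1}

gaps = sentiment_gap(host_averages, guest_averages)
print(gaps)
```

A positive gap means guests describe the neighborhood more favorably than the hosts do.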

Conclusion

To summarize the steps:

Map the dataset neighborhoods to the real neighborhoods
Detect the language used in neighborhood_overview for listings and comments for reviews
For listings: compute the sentiment analysis of neighborhood_overview
For reviews: isolate the sentences related to the neighborhood and compute the sentiment analysis
Compare listings and reviews results

Outro

I hope the post was interesting, and thank you for taking the time to read it. The code for this project can be found in this GitHub repository, on my Medium you can find a more in-depth story, and on my Blogspot you can find the same post in Italian. Let me know if you have any questions, and if you like the content that I create, feel free to buy me a coffee.