Statistical Analysis of Yelp restaurant ratings

Want to chow down in L.A.?

20,000 restaurants analyzed

What do rating and review data reveal about the food scene in Los Angeles, California? Let’s take a look at data on almost 20,000 eateries, retrieved courtesy of the Yelp Developer API.

For those of you not familiar with it, Yelp (yelp.com) is a popular review site that uses a 1-5 rating scale, with 5 being the best.

The images below are screenshots from the statistical and mapping analysis, which was done in Python.

All project code and notes

Fork or clone from the GitHub repository.

Restaurants Mapped

Clicking on this map opens a larger, interactive display with mouse-controlled zoom, rotate and tilt. It shows the average rating for each restaurant on the map. Ratings are on a scale of 1-5, with 5 being the “best”. Red represents ratings under 2.5, yellow represents ratings below 4, and green marks top-notch ratings of 4.5 or 5.

Click to see live map of rated restaurants in Los Angeles | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist
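
The color rule described above can be sketched as a small Python function. This is only an illustration, not the project's actual mapping code, and the name `rating_color` is made up for this example. The text does not say how a rating of exactly 4 is colored, so the sketch leaves that case undecided rather than guessing.

```python
from typing import Optional


def rating_color(rating: float) -> Optional[str]:
    """Map a star rating to the marker colors described in the text."""
    if rating < 2.5:
        return "red"
    if rating < 4.0:
        return "yellow"
    if rating >= 4.5:
        return "green"
    # Exactly 4.0 is not covered by the description, so no color is assigned.
    return None


for stars in (1.5, 3.5, 4.0, 5.0):
    print(stars, rating_color(stars))
```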

Ratings Gap

This histogram reflects the total count of ratings and, in doing so, reveals a void. Note the dip in frequency in the 2.5 star category. For whatever reason, reviewers were inclined to rate restaurants above 3 or below 2.

Histogram of Ratings Gap | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist
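
For anyone who wants to reproduce this kind of count, here is a minimal sketch using only the Python standard library. The ratings in `sample_ratings` are made-up toy values, not the project's data, and `rating_counts` is a hypothetical helper name. The one design point worth noting is that every half-star step is listed explicitly, so an empty category (like a gap at 2.5) shows up as a zero instead of silently disappearing.

```python
from collections import Counter

# Made-up toy ratings for illustration only, NOT the project's data.
sample_ratings = [4.0, 4.5, 3.5, 4.0, 1.5, 5.0, 3.0, 4.0, 2.0, 4.5, 3.5, 1.0]


def rating_counts(ratings):
    """Count restaurants at each half-star step from 1.0 to 5.0."""
    counts = Counter(ratings)
    steps = [s / 2 for s in range(2, 11)]  # 1.0, 1.5, ..., 5.0
    # Include steps with no restaurants so gaps stay visible as zeros.
    return {step: counts.get(step, 0) for step in steps}


counts = rating_counts(sample_ratings)
for step, n in counts.items():
    print(f"{step:.1f} {'#' * n}")
```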

Skew towards 4

It seems like establishments that stay in business long enough are likely to wind up with a rating of about 4. The next series of visualizations shows the correlation between review counts and ratings.

Visualization of Review Counts and Ratings | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist
Visualization of restaurants by zip code rating and review count | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist
Visualization of restaurants by zip code rating and review count | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist
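
One rough way to check the pull towards 4 numerically is to group restaurants by review count and look at the mean rating and the spread of ratings in each group. The sketch below assumes made-up `(review_count, rating)` pairs and arbitrary bucket edges; neither comes from the project, and `summarize_by_review_count` is a hypothetical name.

```python
from statistics import mean

# Made-up (review_count, rating) pairs for illustration only.
sample = [(3, 5.0), (8, 1.5), (40, 4.5), (75, 3.0),
          (300, 4.0), (900, 4.0), (2500, 4.0)]

# Bucket edges are an arbitrary choice, not taken from the project.
BUCKETS = [(0, 10), (10, 100), (100, 1000), (1000, None)]


def summarize_by_review_count(rows, buckets=BUCKETS):
    """Return count, mean rating and rating spread per review-count bucket."""
    summary = {}
    for low, high in buckets:
        ratings = [rating for count, rating in rows
                   if count >= low and (high is None or count < high)]
        if not ratings:
            continue  # skip empty buckets
        label = f"{low}+" if high is None else f"{low}-{high - 1}"
        summary[label] = {
            "n": len(ratings),
            "mean": mean(ratings),
            "spread": max(ratings) - min(ratings),
        }
    return summary


summary = summarize_by_review_count(sample)
for label, stats in summary.items():
    print(label, stats)
```

If the pattern in the charts holds, the spread should shrink and the mean should settle near 4 as the buckets move towards higher review counts.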

Breaking the mold

The outliers reveal a virtual tidal pull towards a mean rating of roughly 3.8 to 4.1 over time. However, there are a few establishments that have managed to defy the odds and either stay in business despite horrible reviews, or maintain top-notch ratings along with high review counts.

Bold and bad

Zipcode 90045 stands out as a hotbed of rotten ratings. In the blue bubble chart above, it appears as a light blue circle towards the higher end of the review counts but on the lower end of the ratings.

Bad zip code for restaurant reviews | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist
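
A zip code like this can be picked out by aggregating per zip code and filtering for many reviews combined with a low mean rating. The sketch below is an assumption about how that might be done, not the project's code. The rows are made up, the zip labels are placeholders, and the thresholds in `busy_but_poor` are arbitrary.

```python
from collections import defaultdict

# Made-up rows of (zip_label, rating, review_count); placeholders only.
rows = [
    ("zip_a", 2.0, 400), ("zip_a", 1.5, 250), ("zip_a", 3.0, 350),
    ("zip_b", 4.5, 120), ("zip_b", 4.0, 600),
    ("zip_c", 4.0, 90), ("zip_c", 2.0, 10),
]


def zip_summary(rows):
    """Return mean rating and total review count for each zip code."""
    totals = defaultdict(lambda: {"n": 0, "rating_sum": 0.0, "reviews": 0})
    for zip_code, rating, review_count in rows:
        entry = totals[zip_code]
        entry["n"] += 1
        entry["rating_sum"] += rating
        entry["reviews"] += review_count
    return {
        zip_code: {
            "mean_rating": t["rating_sum"] / t["n"],
            "reviews": t["reviews"],
        }
        for zip_code, t in totals.items()
    }


def busy_but_poor(summary, max_rating=3.0, min_reviews=500):
    """List zip codes with many reviews but a low mean rating."""
    return sorted(
        zip_code for zip_code, s in summary.items()
        if s["mean_rating"] <= max_rating and s["reviews"] >= min_reviews
    )


summary = zip_summary(rows)
print(busy_but_poor(summary))
```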

The worst of the worst

This zipcode is also home to what might be the worst restaurant in all of Los Angeles, with a whoppingly poor average rating of 1.5. Yet it continues to stay in business, as reflected by its relatively high number of reviews.

Map of Worst Rated Restaurant in LA | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist

Captured audience?

Digging a little deeper into 90045 revealed that this area is also home to the Los Angeles International Airport and numerous hotels. Perhaps the sheer volume of people with little option other than to grab a bite close to their accommodations or flight times enables these businesses to stay open despite poor ratings.

Map of California Dining Airport Café | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist

Heads above the rest

On the positive end of the outlier spectrum are a few places that seem worthy of making a special trip to check out.

Most reviewed

Note that in the light pink and brown scatter plot there is an outlier in the 4-star column with over 17,000 reviews. This is nearly triple the median of review counts, and it turns out to be “Bottega Louie”, a patisserie & café in downtown.

Map of Restaurant with Most Reviews Bottega Louie | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist

The best

Out of 20,000 locations, a few manage to hold on to top ratings even as review counts climb. These locations seem to be spread throughout the city and reflect a variety of cuisines.

Map of best restaurant Pisces Poke and Ramen | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist

Trending

It looks like the highest-rated food categories are types that might not even have been readily available in the not-too-distant past, such as Poke, Vegan and Halal.

The image below is a screenshot of a table and graph reflecting top rated cuisines.

PLOT Table and Graph of top rated cuisines | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist
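
A ranking like this gets slightly tricky because a restaurant can carry more than one category label. The sketch below is one possible approach under stated assumptions, not the project's method. The records are made up, `top_categories` is a hypothetical helper, and the `min_count` cutoff is an arbitrary guard against a category ranking first on the strength of a single restaurant.

```python
from collections import defaultdict

# Made-up records; each restaurant may carry several category labels.
restaurants = [
    {"name": "A", "rating": 4.5, "categories": ["Poke", "Hawaiian"]},
    {"name": "B", "rating": 5.0, "categories": ["Poke"]},
    {"name": "C", "rating": 4.0, "categories": ["Vegan"]},
    {"name": "D", "rating": 4.5, "categories": ["Vegan", "Halal"]},
    {"name": "E", "rating": 3.0, "categories": ["Burgers"]},
    {"name": "F", "rating": 3.5, "categories": ["Burgers"]},
    {"name": "G", "rating": 5.0, "categories": ["Halal"]},
]


def top_categories(records, min_count=2):
    """Rank categories by mean rating, skipping thinly represented ones."""
    ratings_by_category = defaultdict(list)
    for record in records:
        # A restaurant counts once towards each of its categories.
        for category in record["categories"]:
            ratings_by_category[category].append(record["rating"])
    ranked = [
        (category, sum(values) / len(values), len(values))
        for category, values in ratings_by_category.items()
        if len(values) >= min_count
    ]
    # Highest mean first; ties broken alphabetically for a stable order.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked


ranked = top_categories(restaurants)
for category, mean_rating, n in ranked:
    print(f"{category:10s} {mean_rating:.2f} (n={n})")
```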

Most loved

Seems like LA loves Mexican food more than anything else and the data reveals the number one most popular food item: Tacos.

Map of Best Loved Restaurants in LA | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist

Diverse

The diversity of Los Angeles is reflected in the proximity and variety of options. For example, these are “Persian” restaurants:

Map showing diversity of LA based on Proximity and Variety of options  | Project by Sheri Rosalia | Data Engineer | Data Analyst | Data Scientist

The Data

Ratings, reviews and geographical data were accumulated courtesy of the Yelp Developer API.

The Python scripts for the API and the first map are available for download in this GitHub repository: LA_Restaurants_Yelp_API

Clean-up, graphs, Google Maps and CSV files are available in the GitHub repository associated with this webpage. Here is the repository link: Stars of LA

The data and its limitations:

Most, but not all, locations are accounted for in this data set because of the way the Yelp API works. What we have here is a random sampling of almost 20,000 establishments from over 150 zipcodes, but there are likely more than 5,000 restaurants missing.
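
One plausible reason for the missing locations, offered here as an assumption rather than something the project states, is that the Yelp business search endpoint returns results in pages and only lets a single query page through a limited number of them, so dense zip codes get cut off. The sketch below is not the project's script. The helper names are made up, and the `MAX_PER_QUERY` value is an assumed cap that should be checked against the current Yelp documentation.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.yelp.com/v3/businesses/search"
PAGE_SIZE = 50        # largest `limit` the search endpoint accepts per request
MAX_PER_QUERY = 1000  # ASSUMPTION: per-query result cap; verify in Yelp's docs


def page_params(zip_code, total, page_size=PAGE_SIZE, cap=MAX_PER_QUERY):
    """Yield query parameters for each page reachable for one zip code."""
    reachable = min(total, cap)  # results beyond the cap cannot be paged to
    for offset in range(0, reachable, page_size):
        yield {
            "location": zip_code,
            "categories": "restaurants",
            "limit": min(page_size, reachable - offset),
            "offset": offset,
        }


def fetch_page(params, api_key):
    """Fetch one page of results (needs a real API key and network access)."""
    request = urllib.request.Request(
        SEARCH_URL + "?" + urllib.parse.urlencode(params),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def merge_unique(pages):
    """Combine pages from overlapping queries, one record per business id."""
    unique = {}
    for page in pages:
        for business in page.get("businesses", []):
            unique.setdefault(business["id"], business)
    return unique


# A zip code reporting 5,000 matches would still yield only 20 pages here.
print(len(list(page_params("90045", 5000))))
```

Merging by business id matters because neighboring zip code searches can return the same restaurant more than once.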

As is often the case, diving into analysis produces more questions than actual conclusions. Ratings are subjective by nature. It is hard to be certain of the reasons behind them and, in the age of paid social influencers, their veracity is suspect.

Related Projects

Further lines of inquiry, such as applying Machine Learning and statistical analysis or combining the data with health code violations and demographic information, are explored in the following “spinoff” projects.

Sheri Rosalia | Data Engineer