This project performs data processing on movie data and user ratings in R using regular expressions, then loading into a PostgreSQL database. We will perform various queries and do a debiasing of the database.
This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens. Users were selected at random for inclusion. All users selected had rated at least 20 movies. Each user is represented by an id.
MovieID is the real MovieLens id and the title should be one identical to that on IMDB.
Ratings are made on a 5-star scale, with half-star increments.
Tags are user generated metadata about movies. Each tag is typically a single word, or short phrase. The meaning, value and purpose of a particular tag is determined by each user.
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
This is the schema of our database:
Entities
movies(id, title, year)
genres(title)
users(id)
Relationships
ratings(userid, movieid, rating, time)
tags(userid, movieid, tag, time)
has_genre(movieid, title)
In movies.txt, each line of this file represents one movie, and has the following format:
MovieID:Title:Genres
In tags.txt, each line of this file represents one tag applied to one movie by one user, and has the following format:
UserID:MovieID:Tag:Timestamp
In ratings.txt, each line of this file represents one rating of one movie by one user, and has the following format:
UserID:MovieID:Rating:Timestamp
The only file that can read in as is to Postgres is ratings.txt. The other files must be edited as there are delimiter errors.
The following R code parses our files for us:
library(stringr)
movies <- readLines("movies.txt")
parseMovies <- function(movie){
# split by first ":"
temp <- unlist(str_split(movie, ":", 2))
movieid <- temp[1]
# looks for last "):"
temp2 <- unlist(strsplit(temp[2], "\\):(?=[^:]+$)", perl=TRUE))
genres <- temp2[2]
# looks for last "("
temp3 <- unlist(strsplit(temp2[1], " \\((?=[^(]+$)", perl=TRUE))
title <- temp3[1]
year <- temp3[2]
return(data.frame(movieid = movieid, title = title, year = year, genres = genres))
}
df <- data.frame(movieid = character(), title = character(), year = character(), genres = character())
for (i in 1:length(movies)){
df <- rbind(df, parseMovies(movies[i]))
}
movies_csv <- df[,c("movieid", "title", "year")]
write.csv(movies_csv, file ="movies.csv", row.names = F)
has_genres <- df[,c("movieid", "genres")]
genresList <- data.frame(movieid = character(), genre = character())
for (i in 1:nrow(has_genres)){
list <- unlist(strsplit(as.character(has_genres[i,]$genres), split="\\|"))
for (j in 1:length(list))
genresList <- rbind(genresList, data.frame(movieid = has_genres[i,]$movieid, genre = list[j]))
}
write.csv(genresList, file ="hasGenres.csv", row.names = F)
tags <- readLines("tags.txt")
parseTags <- function(tag){
temp <- unlist(str_split(tag, ":", 3))
userid <- temp[1]
movieid <- temp[2]
# looks for the last ":"
temp2 <- unlist(strsplit(temp[3], ":(?=[^:]+$)", perl=TRUE))
tag <- temp2[1]
timestamp <- temp2[2]
return(data.frame(userid = userid, movieid = movieid, tag = tag, timestamp = timestamp))
}
tags_df <- data.frame(userid = character(), movieid = character(), tag = character(), timestamp = character())
for (i in 1:length(tags)){
res <- parseTags(tags[i])
tags_df <- rbind(tags_df, res)
print(i)
}
write.csv(tags_df, file ="tags.csv", row.names = F)
See a sample of how parseMovies() parses movies.txt to create movies.csv, using the following input:
1:Toy Story (1995):Adventure|Animation|Children|Comedy|Fantasy
It first splits the line into two pieces on the first :
.
Array[1]: 1
Array[2]: Toy Story (1995):Adventure|Animation|Children|Comedy|Fantasy
Array[1] contains the movieid. Then, it splits on the last ):
in Array[2] to split the year and genre list. By splitting on ):
and removing the genre list, it makes it easier to extract the year later.
Array[1]: Toy Story (1995
Array[2]: Adventure|Animation|Children|Comedy|Fantasy
Array[2] contains the genre list. Next we should split Array[1] by looking for the last (
. We look for the last (
in case the movie title contains parentheses.
Array[1]: Toy Story
Array[2]: 1995
And we are left with the following variables:
movieid: 1
genres: Adventure|Animation|Children|Comedy|Fantasy
title: Toy Story
year: 1995
Using the dataframe created after processing movies.txt, I then processed the genres to export to hasGenres.csv by converting the rows from:
movieid: 1
genres: Adventure|Animation|Children|Comedy|Fantasy
To:
movieid | genre |
---|---|
1 | Adventure |
1 | Animation |
1 | Children |
1 | Comedy |
1 | Fantasy |
The R script takes a row from the dataframe containing the columns movieid
, title
, year
, genres
and creates a new dataframe with movieid
and genre
. It splits genres
on the |
delimiter, and for each genre in the list, adds to the dataframe.
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="moviedb", user="postgres", password="p")
We are connected!
Let’s create our tables, and then add our data.
SET CLIENT_ENCODING TO 'utf8';
CREATE TABLE Ratings(userid NUMERIC, movieid NUMERIC, rating DOUBLE PRECISION, timestamp NUMERIC);
COPY ratings FROM 'C:/Users/claire/project/ratings.txt' DELIMITER ':';
CREATE TABLE movies(movieid NUMERIC, title TEXT, year NUMERIC);
COPY movies FROM 'C:/Users/claire/project/movies.csv' CSV HEADER;
CREATE TABLE tags(userid NUMERIC, movieid NUMERIC, tag TEXT, time NUMERIC);
COPY tags FROM 'C:/Users/claire/project/tags.csv' CSV HEADER;
CREATE TABLE users(userid NUMERIC PRIMARY KEY);
INSERT INTO users SELECT DISTINCT userid FROM ratings;
INSERT INTO users SELECT DISTINCT userid FROM tags ON CONFLICT DO NOTHING;
CREATE TABLE has_genre(movieid NUMERIC, genre TEXT);
COPY has_genre FROM 'C:/Users/claire/project/hasGenres.csv' CSV HEADER;
CREATE TABLE Genres(title TEXT);
INSERT INTO Genres SELECT DISTINCT genre FROM has_genre ORDER BY genre;
Looking at our tables
Let’s see how many rows we have and take a look at the first 5 rows of each table.
Note that we are casting count(*) as an integer only because RMarkdown has issues with integer-64 encodings.
Movies table
SELECT CAST(count(*) AS INTEGER) FROM movies;
count |
---|
10681 |
SELECT * FROM movies LIMIT 5;
movieid | title | year |
---|---|---|
1 | Toy Story | 1995 |
2 | Jumanji | 1995 |
3 | Grumpier Old Men | 1995 |
4 | Waiting to Exhale | 1995 |
5 | Father of the Bride Part II | 1995 |
Ratings table
SELECT CAST(count(*) AS INTEGER) FROM ratings;
count |
---|
10000054 |
SELECT * FROM ratings LIMIT 5;
userid | movieid | rating | time |
---|---|---|---|
1 | 122 | 5 | 838985046 |
1 | 185 | 5 | 838983525 |
1 | 231 | 5 | 838983392 |
1 | 292 | 5 | 838983421 |
1 | 316 | 5 | 838983392 |
Tags table
SELECT CAST(count(*) AS INTEGER) FROM tags;
count |
---|
95580 |
SELECT * FROM tags LIMIT 5;
userid | movieid | tag | time |
---|---|---|---|
15 | 4973 | excellent! | 1215184630 |
20 | 1747 | politics | 1188263867 |
20 | 1747 | satire | 1188263867 |
20 | 2424 | chick flick 212 | 1188263835 |
20 | 2424 | hanks | 1188263835 |
Users table
SELECT CAST(count(*) AS INTEGER) FROM users;
count |
---|
71567 |
SELECT * FROM users LIMIT 5;
userid |
---|
21424 |
38355 |
47095 |
60438 |
33597 |
Has_genre table
SELECT CAST(count(*) AS INTEGER) FROM has_genre;
count |
---|
21563 |
SELECT * FROM has_genre LIMIT 5;
movieid | genre |
---|---|
1 | Adventure |
1 | Animation |
1 | Children |
1 | Comedy |
1 | Fantasy |
Now let’s try some more advanced queries to find out more about what’s in our database.
Find the distribution of the movies across different decades.
SELECT dist_group, count(*)
FROM
(
SELECT case when year between 1910 and 1919 then '(1910-1919)'
when year between 1920 and 1929 then '(1920-1929)'
when year between 1930 and 1939 then '(1930-1939)'
when year between 1940 and 1949 then '(1940-1949)'
when year between 1950 and 1959 then '(1950-1959)'
when year between 1960 and 1969 then '(1960-1969)'
when year between 1970 and 1979 then '(1970-1979)'
when year between 1980 and 1989 then '(1980-1989)'
when year between 1990 and 1999 then '(1990-1999)'
when year between 2000 and 2009 then '(2000-2009)'
end AS dist_group
FROM movies
) t
GROUP BY dist_group
ORDER BY dist_group ASC;
dist_group | count |
---|---|
(1910-1919) | 11 |
(1920-1929) | 83 |
(1930-1939) | 230 |
(1940-1949) | 379 |
(1950-1959) | 521 |
(1960-1969) | 690 |
(1970-1979) | 784 |
(1980-1989) | 1712 |
(1990-1999) | 3022 |
(2000-2009) | 3249 |
A better approach in the case when we don’t know the distribution of years:
SELECT concat(floor(year/10), 0, '-', floor(year/10), 9) AS decade, count(*)
FROM movies GROUP BY 1 ORDER BY 1 ASC;
decade | count |
---|---|
1910-1919 | 11 |
1920-1929 | 83 |
1930-1939 | 230 |
1940-1949 | 379 |
1950-1959 | 521 |
1960-1969 | 690 |
1970-1979 | 784 |
1980-1989 | 1712 |
1990-1999 | 3022 |
2000-2009 | 3249 |
Find the distribution of the genres across the movies.
SELECT genre, count(genre)
FROM has_genre
GROUP BY genre
ORDER BY genre ASC;
genre | count |
---|---|
Action | 1473 |
Adventure | 1025 |
Animation | 286 |
Children | 528 |
Comedy | 3703 |
Crime | 1118 |
Documentary | 482 |
Drama | 5339 |
Fantasy | 543 |
Film-Noir | 148 |
Horror | 1013 |
IMAX | 29 |
Musical | 436 |
Mystery | 509 |
Romance | 1685 |
Sci-Fi | 754 |
Thriller | 1706 |
War | 511 |
Western | 275 |
Find the distribution of the ratings values.
SELECT rating, count(rating)
FROM ratings
GROUP BY rating
ORDER BY rating ASC;
rating | count |
---|---|
0.5 | 94988 |
1.0 | 384180 |
1.5 | 118278 |
2.0 | 790306 |
2.5 | 370178 |
3.0 | 2356676 |
3.5 | 879764 |
4.0 | 2875850 |
4.5 | 585022 |
5.0 | 1544812 |
Find how many movies have:
i. no tags, but they have ratings
SELECT CAST(count(movieid) AS INTEGER) FROM movies
WHERE movieid
NOT IN (SELECT DISTINCT movieid FROM tags)
AND movieid IN (SELECT DISTINCT movieid FROM ratings);
count |
---|
3080 |
ii. no ratings, but they have tags
SELECT CAST(count(movieid) AS INTEGER) FROM movies
WHERE movieid
IN (SELECT DISTINCT movieid FROM tags WHERE movieid NOT IN (SELECT DISTINCT movieid FROM ratings));
count |
---|
4 |
iii. no tags and no ratings
SELECT CAST(count(movieid) AS INTEGER) FROM movies
WHERE movieid
NOT IN (SELECT DISTINCT movieid FROM ratings) AND movieid NOT IN (SELECT DISTINCT movieid FROM tags);
count |
---|
0 |
iv. both tags and ratings
SELECT CAST(count(movieid) AS INTEGER) FROM movies
WHERE movieid IN ((SELECT DISTINCT movieid FROM ratings) INTERSECT (SELECT DISTINCT movieid FROM tags));
count |
---|
7597 |
We expect that the results from i, ii, iii, and iv will add up to count(movieid) from movies. 3080 + 4 + 0 + 7597 = 10,681 (correct)
Find the most reviewed movie (the movie with the highest number of reviews).
SELECT movies.movieid, title, count_ratings
FROM movies
JOIN (SELECT count(*) AS count_ratings, movieid FROM ratings GROUP BY movieid ORDER BY count_ratings DESC LIMIT 1) t
ON movies.movieid = t.movieid;
movieid | title | count_ratings |
---|---|---|
296 | Pulp Fiction | 34864 |
Find the highest reviewed movie (movie with the most 5-star reviews).
SELECT movies.movieid, title, count
FROM movies
JOIN (SELECT movieid, count(*) AS count FROM ratings WHERE rating = 5 GROUP BY movieid ORDER BY count DESC LIMIT 1) t
ON movies.movieid = t.movieid;
movieid | title | count |
---|---|---|
318 | Shawshank Redemption The | 16460 |
Find the number of movies that are associated with at least 4 different genres.
SELECT CAST(count(*) AS INTEGER)
FROM (SELECT count(genre) AS genre_count FROM has_genre GROUP BY movieid) t
WHERE genre_count >= 4;
count |
---|
968 |
Find the most popular genre across all movies (genre associated with the highest number of movies).
SELECT genre, count(genre)
FROM has_genre
GROUP BY genre
ORDER BY count DESC LIMIT 1;
genre | count |
---|---|
Drama | 5339 |
Find the genres that are associated with the best reviews (genres of movies that have more high ratings than low ratings). Display the genre, the number of high ratings (>=4.0) and the number of low ratings (< 4.0).
SELECT genre, high_ratings, low_ratings FROM
((SELECT genre,count(*) AS high_ratings FROM has_genre NATURAL JOIN ratings WHERE rating >= 4 GROUP BY genre) AS high
NATURAL JOIN
(SELECT genre,count(*) AS low_ratings FROM has_genre NATURAL JOIN ratings WHERE rating < 4 GROUP BY genre) AS low)
WHERE high_ratings > low_ratings;
genre | high_ratings | low_ratings |
---|---|---|
IMAX | 5501 | 3579 |
Drama | 2455297 | 1888901 |
Film-Noir | 94675 | 36917 |
Crime | 826375 | 648582 |
Western | 108365 | 102094 |
War | 348331 | 219732 |
Musical | 250613 | 230561 |
Documentary | 64986 | 38468 |
Romance | 977944 | 923939 |
Animation | 275590 | 243522 |
Mystery | 356799 | 274145 |
Find the genres that are associated with the most recent movies (genres that have more recent movies than old movies). Display the genre, the number of recent movies (>=2000) and the number of old movies (< 2000).
SELECT genre, recent, old FROM
((SELECT genre,count(*) AS recent FROM has_genre NATURAL JOIN movies WHERE year >= 2000 GROUP BY genre) AS recent
NATURAL JOIN
(SELECT genre,count(*) AS old FROM has_genre NATURAL JOIN movies WHERE year < 2000 GROUP BY genre) AS old)
WHERE recent > old;
genre | recent | old |
---|---|---|
Documentary | 252 | 230 |
Goal: De-bias the ratings of the users and find the top 10 movies (ids and titles) that had received the most biased ratings.
What does it mean for ratings to be biased? A user may rate all their movies with 5, or maybe, 1. We want to find these users and debias them. We will take a simple approach and debias those ratings by setting the rating = average rating for that movie. For users with a rating difference of 3 (arbitrary), we will replace the rating with the average.
Step 1: Find the difference between a user’s rating and the average rating of the movie he has rated. We will do this by creating a new table, ‘ratings_with_diff’, that includes all columns from table ratings, plus 2 new columns: avg rating and the difference (rating - avg_rating).
• First, we create a table ratings_with_diff(userid, movieid, rating, time) as a copy of ratings.
• Alter the table to add columns avg_rating and difference. The schema is now ratings_with_diff(userid, movieid, rating, time, avg_rating, difference).
• We create a table avg_ratings(movieid, avg_rating) that will contain the avg_rating for each movie, hence saving time populating ratings_with_diff later on. We then populate this table.
• We then UPDATE the avg_rating in ratings_with_diff with the averages from avg_ratings.
• We then UPDATE then difference column.
DROP TABLE IF EXISTS ratings_with_diff;
CREATE TABLE ratings_with_diff AS TABLE ratings;
ALTER TABLE ratings_with_diff ADD COLUMN avg_rating DOUBLE PRECISION;
ALTER TABLE ratings_with_diff ADD COLUMN difference DOUBLE PRECISION;
DROP TABLE IF EXISTS avg_ratings;
/* Helper table - avg_ratings, populate with averages by movie */
CREATE TABLE avg_ratings(movieid NUMERIC, avg_rating DOUBLE PRECISION);
INSERT INTO avg_ratings SELECT movieid, AVG(rating) AS avg_rating FROM ratings GROUP BY movieid;
/* Insert averages into ratings_with_diff */
UPDATE ratings_with_diff SET avg_rating = avg_ratings.avg_rating
FROM avg_ratings
WHERE ratings_with_diff.movieid = avg_ratings.movieid;
/* Find difference */
UPDATE ratings_with_diff SET difference = rating - avg_rating;
Step 2: Update the rating of users whose rating difference (absolute value) is > 3.
UPDATE ratings_with_diff r
SET rating = (SELECT avg_rating FROM avg_ratings WHERE r.movieid = avg_ratings.movieid)
WHERE @difference > 3;
Step 3: Find the new difference between a user’s rating and the average rating of the movie he has rated.
• We need to first take the new average of each movie. This will information will be stored in avg_ratings2.
• Next, update ratings_with_diff with the new averages.
• Then, find the new difference.
/* Make new table with new averages */
DROP TABLE IF EXISTS avg_ratings2;
CREATE TABLE avg_ratings2(movieid NUMERIC, avg_rating DOUBLE PRECISION);
INSERT INTO avg_ratings2 SELECT movieid, AVG(rating) AS avg_rating FROM ratings_with_diff GROUP BY movieid;
/* Update ratings_with_diff with new averages */
UPDATE ratings_with_diff
SET avg_rating = avg_ratings2.avg_rating
FROM avg_ratings2
WHERE ratings_with_diff.movieid = avg_ratings2.movieid;
/* Find new difference */
UPDATE ratings_with_diff SET difference = rating - avg_rating;
Step 4: Again, update the rating of users whose rating difference (absolute value) is > 3.
UPDATE ratings_with_diff r
SET rating = (SELECT avg_rating FROM avg_ratings2 WHERE r.movieid = avg_ratings2.movieid)
WHERE @difference > 3;
We end up with a table like this:
SELECT * FROM ratings_with_diff LIMIT 5;
userid | movieid | rating | time | avg_rating | difference |
---|---|---|---|---|---|
71523 | 1 | 3.0 | 1083880956 | 3.93888 | -0.9388803 |
20067 | 1 | 5.0 | 860341522 | 3.93888 | 1.0611197 |
71524 | 1 | 5.0 | 981499667 | 3.93888 | 1.0611197 |
53365 | 1 | 3.5 | 1155829307 | 3.93888 | -0.4388803 |
71519 | 1 | 2.5 | 1111629576 | 3.93888 | -1.4388803 |
Step 5: Find the average rating for each movie before the de-biasing (from the ratings table) and the average rating for each movie after the de-biasing (from the ratings_with_diff table). List the top 10 movies that have the biggest difference between these two average ratings. (These are the movies that had the most biased ratings.)
SELECT movieid, title, original, debiased, @original-debiased AS bias FROM
(SELECT movieid, title FROM movies) t1
NATURAL JOIN
(SELECT movieid, avg_rating AS original FROM avg_ratings) t2
NATURAL JOIN
(SELECT movieid, AVG(rating) AS debiased FROM ratings_with_diff GROUP BY movieid) t3
ORDER BY bias DESC LIMIT 10;
movieid | title | original | debiased | bias |
---|---|---|---|---|
8484 | Human Condition I The (Ningen no joken I) | 3.593750 | 4.173828 | 0.5800781 |
5793 | Time Changer | 1.928571 | 1.489796 | 0.4387755 |
6397 | Bizarre Bizarre (Drôle de drame ou L’étrange aventure de Docteur Molyneux) | 3.562500 | 3.945312 | 0.3828125 |
8423 | Kid Brother The | 3.558823 | 3.918685 | 0.3598616 |
26326 | Holy Mountain The (Montaña sagrada La) | 4.050000 | 4.405000 | 0.3550000 |
41627 | Samurai Rebellion (Jôi-uchi: Hairyô tsuma shimatsu) | 4.050000 | 4.405000 | 0.3550000 |
5889 | Cruel Romance A (Zhestokij Romans) | 3.555556 | 3.895062 | 0.3395062 |
6599 | Accattone | 3.642857 | 3.979592 | 0.3367347 |
25766 | Crowd The | 3.716667 | 4.038333 | 0.3216667 |
25930 | Odd Man Out | 3.886364 | 4.194215 | 0.3078512 |
Who is the most biased user?
If we define bias as having the most rows changed during de-biasing:
SELECT userid, count(*) FROM
(SELECT userid, movieid, rating AS original FROM ratings_with_diff) t1
NATURAL JOIN
(SELECT userid, movieid, rating AS debiased FROM ratings) t2
WHERE original <> debiased
GROUP BY userid
ORDER BY count(*) DESC
LIMIT 1;
userid | count |
---|---|
59342 | 223 |
In this query, for each userid we counted the number of rows where the original rating did not equal the debiased rating.
If we define bias as having the highest difference between original average and debiased average:
SELECT userid, original, debiased, @original-debiased AS bias FROM
(SELECT userid, AVG(rating) AS original FROM ratings GROUP BY userid) t2
NATURAL JOIN
(SELECT userid, AVG(rating) AS debiased FROM ratings_with_diff GROUP BY userid) t3
ORDER BY bias DESC LIMIT 1;
userid | original | debiased | bias |
---|---|---|---|
13496 | 0.5 | 3.877919 | 3.377919 |
In this query, for each userid, we calculated the average of their original ratings and the average of their debiased ratings, then took the difference between those two averages.
# Disconnect the database
dbDisconnect(con)