About

This project performs data processing on movie data and user ratings in R using regular expressions, then loading into a PostgreSQL database. We will perform various queries and do a debiasing of the database.

This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens. Users were selected at random for inclusion. All users selected had rated at least 20 movies. Each user is represented by an id.
MovieID is the real MovieLens id and the title should be one identical to that on IMDB.
Ratings are made on a 5-star scale, with half-star increments.
Tags are user generated metadata about movies. Each tag is typically a single word, or short phrase. The meaning, value and purpose of a particular tag is determined by each user.
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

This is the schema of our database:

Entities
movies(id, title, year)
genres(title)
users(id)

Relationships
ratings(userid, movieid, rating, time)
tags(userid, movieid, tag, time)
has_genre(movieid, title)

Data processing

In movies.txt, each line of this file represents one movie, and has the following format:
MovieID:Title:Genres

In tags.txt, each line of this file represents one tag applied to one movie by one user, and has the following format:
UserID:MovieID:Tag:Timestamp

In ratings.txt, each line of this file represents one rating of one movie by one user, and has the following format:
UserID:MovieID:Rating:Timestamp

The only file that can read in as is to Postgres is ratings.txt. The other files must be edited as there are delimiter errors.

The following R code parses our files for us:

library(stringr)
movies <- readLines("movies.txt")

parseMovies <- function(movie){
  # split by first ":"
  temp <- unlist(str_split(movie, ":", 2))
  movieid <- temp[1]
  # looks for last "):"  
  temp2 <- unlist(strsplit(temp[2], "\\):(?=[^:]+$)", perl=TRUE))
  genres <- temp2[2]
  
  # looks for last "("
  temp3 <- unlist(strsplit(temp2[1], " \\((?=[^(]+$)", perl=TRUE))
  title <- temp3[1]
  year <- temp3[2]
  return(data.frame(movieid = movieid, title = title, year = year, genres = genres))
}

df <- data.frame(movieid = character(), title = character(), year = character(), genres = character())

for (i in 1:length(movies)){
  df <- rbind(df, parseMovies(movies[i]))
}

movies_csv <- df[,c("movieid", "title", "year")]
write.csv(movies_csv, file ="movies.csv", row.names = F)

has_genres <- df[,c("movieid", "genres")]
genresList <- data.frame(movieid = character(), genre = character())
for (i in 1:nrow(has_genres)){
  list <- unlist(strsplit(as.character(has_genres[i,]$genres), split="\\|"))
  for (j in 1:length(list))
    genresList <- rbind(genresList, data.frame(movieid = has_genres[i,]$movieid, genre = list[j]))
}

write.csv(genresList, file ="hasGenres.csv", row.names = F)

tags <- readLines("tags.txt")
parseTags <- function(tag){
  temp <- unlist(str_split(tag, ":", 3))
  userid <- temp[1]
  movieid <- temp[2]
  # looks for the last ":"
  temp2 <- unlist(strsplit(temp[3], ":(?=[^:]+$)", perl=TRUE))
  tag <- temp2[1]
  timestamp <- temp2[2]
  return(data.frame(userid = userid, movieid = movieid, tag = tag, timestamp = timestamp))
}

tags_df <- data.frame(userid = character(), movieid = character(), tag = character(), timestamp = character())
for (i in 1:length(tags)){
  res <- parseTags(tags[i])
  tags_df <- rbind(tags_df, res)
  print(i)
}

write.csv(tags_df, file ="tags.csv", row.names = F)

What does this code do?!

parseMovies

See a sample of how parseMovies() parses movies.txt to create movies.csv, using the following input:

1:Toy Story (1995):Adventure|Animation|Children|Comedy|Fantasy

It first splits the line into two pieces on the first :.

Array[1]: 1
Array[2]: Toy Story (1995):Adventure|Animation|Children|Comedy|Fantasy

Array[1] contains the movieid. Then, it splits on the last ): in Array[2] to split the year and genre list. By splitting on ): and removing the genre list, it makes it easier to extract the year later.

Array[1]: Toy Story (1995
Array[2]: Adventure|Animation|Children|Comedy|Fantasy

Array[2] contains the genre list. Next we should split Array[1] by looking for the last (. We look for the last ( in case the movie title contains parentheses.

Array[1]: Toy Story
Array[2]: 1995

And we are left with the following variables:

movieid: 1
genres: Adventure|Animation|Children|Comedy|Fantasy
title: Toy Story
year: 1995

parseTags

See a sample of how parseTags() parses tags.txt to tags.csv, using the following input:

15:4973:excellent!:1215184630

It first splits the line into three pieces on the first and second :.

Array[1]: 15
Array[2]: 4973
Array[3]: excellent!:1215184630

The function then parses Array[3] on the last :. By parsing on the last :, if the tag contains a :, we ignore that.

We are then left with the following variables:

userid: 15
movieid: 4973
tag: excellent!
timestamp: 1215184630

genresList

Using the dataframe created after processing movies.txt, I then processed the genres to export to hasGenres.csv by converting the rows from:

movieid: 1
genres: Adventure|Animation|Children|Comedy|Fantasy

To:

movieid genre
1 Adventure
1 Animation
1 Children
1 Comedy
1 Fantasy

The R script takes a row from the dataframe containing the columns movieid, title, year, genres and creates a new dataframe with movieid and genre. It splits genres on the | delimiter, and for each genre in the list, adds to the dataframe.

Connect to database in RMarkdown

library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="moviedb", user="postgres", password="p")

We are connected!
Let’s create our tables, and then add our data.

SET CLIENT_ENCODING TO 'utf8';

CREATE TABLE Ratings(userid NUMERIC, movieid NUMERIC, rating DOUBLE PRECISION, timestamp NUMERIC);
COPY ratings FROM 'C:/Users/claire/project/ratings.txt' DELIMITER ':';

CREATE TABLE movies(movieid NUMERIC, title TEXT, year NUMERIC);
COPY movies FROM 'C:/Users/claire/project/movies.csv' CSV HEADER;

CREATE TABLE tags(userid NUMERIC, movieid NUMERIC, tag TEXT, time NUMERIC);
COPY tags FROM 'C:/Users/claire/project/tags.csv' CSV HEADER;

CREATE TABLE users(userid NUMERIC PRIMARY KEY);
INSERT INTO users SELECT DISTINCT userid FROM ratings;
INSERT INTO users SELECT DISTINCT userid FROM tags ON CONFLICT DO NOTHING;

CREATE TABLE has_genre(movieid NUMERIC, genre TEXT);
COPY has_genre FROM 'C:/Users/claire/project/hasGenres.csv' CSV HEADER;

CREATE TABLE Genres(title TEXT);
INSERT INTO Genres SELECT DISTINCT genre FROM has_genre ORDER BY genre;

Looking at our tables
Let’s see how many rows we have and take a look at the first 5 rows of each table.
Note that we are casting count(*) as an integer only because RMarkdown has issues with integer-64 encodings.

Movies table

SELECT CAST(count(*) AS INTEGER) FROM movies;
count
10681
SELECT * FROM movies LIMIT 5;
5 records
movieid title year
1 Toy Story 1995
2 Jumanji 1995
3 Grumpier Old Men 1995
4 Waiting to Exhale 1995
5 Father of the Bride Part II 1995

Ratings table

SELECT CAST(count(*) AS INTEGER) FROM ratings;
count
10000054
SELECT * FROM ratings LIMIT 5;
5 records
userid movieid rating time
1 122 5 838985046
1 185 5 838983525
1 231 5 838983392
1 292 5 838983421
1 316 5 838983392

Tags table

SELECT CAST(count(*) AS INTEGER) FROM tags;
count
95580
SELECT * FROM tags LIMIT 5;
5 records
userid movieid tag time
15 4973 excellent! 1215184630
20 1747 politics 1188263867
20 1747 satire 1188263867
20 2424 chick flick 212 1188263835
20 2424 hanks 1188263835

Users table

SELECT CAST(count(*) AS INTEGER) FROM users;
count
71567
SELECT * FROM users LIMIT 5;
userid
21424
38355
47095
60438
33597

Has_genre table

SELECT CAST(count(*) AS INTEGER) FROM has_genre;
count
21563
SELECT * FROM has_genre LIMIT 5;
5 records
movieid genre
1 Adventure
1 Animation
1 Children
1 Comedy
1 Fantasy

Now let’s try some more advanced queries to find out more about what’s in our database.

Find the distribution of the movies across different decades.

SELECT dist_group, count(*)
FROM
(
SELECT case when year between 1910 and 1919 then '(1910-1919)'
when year between 1920 and 1929 then '(1920-1929)'
when year between 1930 and 1939 then '(1930-1939)'
when year between 1940 and 1949 then '(1940-1949)'
when year between 1950 and 1959 then '(1950-1959)'
when year between 1960 and 1969 then '(1960-1969)'
when year between 1970 and 1979 then '(1970-1979)'
when year between 1980 and 1989 then '(1980-1989)'
when year between 1990 and 1999 then '(1990-1999)'
when year between 2000 and 2009 then '(2000-2009)'
end AS dist_group
FROM movies
) t
GROUP BY dist_group
ORDER BY dist_group ASC;
10 records
dist_group count
(1910-1919) 11
(1920-1929) 83
(1930-1939) 230
(1940-1949) 379
(1950-1959) 521
(1960-1969) 690
(1970-1979) 784
(1980-1989) 1712
(1990-1999) 3022
(2000-2009) 3249

A better approach in the case when we don’t know the distribution of years:

SELECT concat(floor(year/10), 0, '-', floor(year/10), 9) AS decade, count(*)
FROM movies GROUP BY 1 ORDER BY 1 ASC;
10 records
decade count
1910-1919 11
1920-1929 83
1930-1939 230
1940-1949 379
1950-1959 521
1960-1969 690
1970-1979 784
1980-1989 1712
1990-1999 3022
2000-2009 3249

Find the distribution of the genres across the movies.

SELECT genre, count(genre)
FROM has_genre
GROUP BY genre
ORDER BY genre ASC;
19 records
genre count
Action 1473
Adventure 1025
Animation 286
Children 528
Comedy 3703
Crime 1118
Documentary 482
Drama 5339
Fantasy 543
Film-Noir 148
Horror 1013
IMAX 29
Musical 436
Mystery 509
Romance 1685
Sci-Fi 754
Thriller 1706
War 511
Western 275

Find the distribution of the ratings values.

SELECT rating, count(rating)
FROM ratings
GROUP BY rating
ORDER BY rating ASC;
10 records
rating count
0.5 94988
1.0 384180
1.5 118278
2.0 790306
2.5 370178
3.0 2356676
3.5 879764
4.0 2875850
4.5 585022
5.0 1544812

Find how many movies have:

i. no tags, but they have ratings

SELECT CAST(count(movieid) AS INTEGER) FROM movies
WHERE movieid
NOT IN (SELECT DISTINCT movieid FROM tags)
AND movieid IN (SELECT DISTINCT movieid FROM ratings);
count
3080

ii. no ratings, but they have tags

SELECT CAST(count(movieid) AS INTEGER) FROM movies
WHERE movieid
IN (SELECT DISTINCT movieid FROM tags WHERE movieid NOT IN (SELECT DISTINCT movieid FROM ratings));
count
4

iii. no tags and no ratings

SELECT CAST(count(movieid) AS INTEGER) FROM movies
WHERE movieid
NOT IN (SELECT DISTINCT movieid FROM ratings) AND movieid NOT IN (SELECT DISTINCT movieid FROM tags);
count
0

iv. both tags and ratings

SELECT CAST(count(movieid) AS INTEGER) FROM movies
WHERE movieid IN ((SELECT DISTINCT movieid FROM ratings) INTERSECT (SELECT DISTINCT movieid FROM tags));
count
7597

We expect that the results from i, ii, iii, and iv will add up to count(movieid) from movies. 3080 + 4 + 0 + 7597 = 10,681 (correct)

Find the most reviewed movie (the movie with the highest number of reviews).

SELECT movies.movieid, title, count_ratings
FROM movies
JOIN (SELECT count(*) AS count_ratings, movieid FROM ratings GROUP BY movieid ORDER BY count_ratings DESC LIMIT 1) t
ON movies.movieid = t.movieid;
1 records
movieid title count_ratings
296 Pulp Fiction 34864

Find the highest reviewed movie (movie with the most 5-star reviews).

SELECT movies.movieid, title, count
FROM movies
JOIN (SELECT movieid, count(*) AS count FROM ratings WHERE rating = 5 GROUP BY movieid ORDER BY count DESC LIMIT 1) t
ON movies.movieid = t.movieid;
1 records
movieid title count
318 Shawshank Redemption The 16460

Find the number of movies that are associated with at least 4 different genres.

SELECT CAST(count(*) AS INTEGER)
FROM (SELECT count(genre) AS genre_count FROM has_genre GROUP BY movieid) t
WHERE genre_count >= 4;
count
968

Find the most popular genre across all movies (genre associated with the highest number of movies).

SELECT genre, count(genre)
FROM has_genre
GROUP BY genre
ORDER BY count DESC LIMIT 1;
1 records
genre count
Drama 5339

Find the genres that are associated with the best reviews (genres of movies that have more high ratings than low ratings). Display the genre, the number of high ratings (>=4.0) and the number of low ratings (< 4.0).

SELECT genre, high_ratings, low_ratings FROM
((SELECT genre,count(*) AS high_ratings FROM has_genre NATURAL JOIN ratings WHERE rating >= 4 GROUP BY genre) AS high
NATURAL JOIN
(SELECT genre,count(*) AS low_ratings FROM has_genre NATURAL JOIN ratings WHERE rating < 4 GROUP BY genre) AS low)
WHERE high_ratings > low_ratings;
11 records
genre high_ratings low_ratings
IMAX 5501 3579
Drama 2455297 1888901
Film-Noir 94675 36917
Crime 826375 648582
Western 108365 102094
War 348331 219732
Musical 250613 230561
Documentary 64986 38468
Romance 977944 923939
Animation 275590 243522
Mystery 356799 274145

Find the genres that are associated with the most recent movies (genres that have more recent movies than old movies). Display the genre, the number of recent movies (>=2000) and the number of old movies (< 2000).

SELECT genre, recent, old FROM
((SELECT genre,count(*) AS recent FROM has_genre NATURAL JOIN movies WHERE year >= 2000 GROUP BY genre) AS recent
NATURAL JOIN
(SELECT genre,count(*) AS old FROM has_genre NATURAL JOIN movies WHERE year < 2000 GROUP BY genre) AS old)
WHERE recent > old;
1 records
genre recent old
Documentary 252 230

Debiasing of the database

Goal: De-bias the ratings of the users and find the top 10 movies (ids and titles) that had received the most biased ratings.
What does it mean for ratings to be biased? A user may rate all their movies with 5, or maybe, 1. We want to find these users and debias them. We will take a simple approach and debias those ratings by setting the rating = average rating for that movie. For users with a rating difference of 3 (arbitrary), we will replace the rating with the average.

Step 1: Find the difference between a user’s rating and the average rating of the movie he has rated. We will do this by creating a new table, ‘ratings_with_diff’, that includes all columns from table ratings, plus 2 new columns: avg rating and the difference (rating - avg_rating).

• First, we create a table ratings_with_diff(userid, movieid, rating, time) as a copy of ratings.
• Alter the table to add columns avg_rating and difference. The schema is now ratings_with_diff(userid, movieid, rating, time, avg_rating, difference).
• We create a table avg_ratings(movieid, avg_rating) that will contain the avg_rating for each movie, hence saving time populating ratings_with_diff later on. We then populate this table.
• We then UPDATE the avg_rating in ratings_with_diff with the averages from avg_ratings.
• We then UPDATE then difference column.

DROP TABLE IF EXISTS ratings_with_diff;
CREATE TABLE ratings_with_diff AS TABLE ratings;
ALTER TABLE ratings_with_diff ADD COLUMN avg_rating DOUBLE PRECISION;
ALTER TABLE ratings_with_diff ADD COLUMN difference DOUBLE PRECISION;
DROP TABLE IF EXISTS avg_ratings;
/* Helper table - avg_ratings, populate with averages by movie */
CREATE TABLE avg_ratings(movieid NUMERIC, avg_rating DOUBLE PRECISION);
INSERT INTO avg_ratings SELECT movieid, AVG(rating) AS avg_rating FROM ratings GROUP BY movieid;
/* Insert averages into ratings_with_diff */
UPDATE ratings_with_diff SET avg_rating = avg_ratings.avg_rating
FROM avg_ratings
WHERE ratings_with_diff.movieid = avg_ratings.movieid;
/* Find difference */
UPDATE ratings_with_diff SET difference = rating - avg_rating;

Step 2: Update the rating of users whose rating difference (absolute value) is > 3.

UPDATE ratings_with_diff r
SET rating = (SELECT avg_rating FROM avg_ratings WHERE r.movieid = avg_ratings.movieid)
WHERE @difference > 3;

Step 3: Find the new difference between a user’s rating and the average rating of the movie he has rated.
• We need to first take the new average of each movie. This will information will be stored in avg_ratings2.
• Next, update ratings_with_diff with the new averages.
• Then, find the new difference.

/* Make new table with new averages */
DROP TABLE IF EXISTS avg_ratings2;
CREATE TABLE avg_ratings2(movieid NUMERIC, avg_rating DOUBLE PRECISION);
INSERT INTO avg_ratings2 SELECT movieid, AVG(rating) AS avg_rating FROM ratings_with_diff GROUP BY movieid;
/* Update ratings_with_diff with new averages */
UPDATE ratings_with_diff
SET avg_rating = avg_ratings2.avg_rating
FROM avg_ratings2
WHERE ratings_with_diff.movieid = avg_ratings2.movieid;
/* Find new difference */
UPDATE ratings_with_diff SET difference = rating - avg_rating;

Step 4: Again, update the rating of users whose rating difference (absolute value) is > 3.

UPDATE ratings_with_diff r
SET rating = (SELECT avg_rating FROM avg_ratings2 WHERE r.movieid = avg_ratings2.movieid)
WHERE @difference > 3;

We end up with a table like this:

SELECT * FROM ratings_with_diff LIMIT 5;
5 records
userid movieid rating time avg_rating difference
71523 1 3.0 1083880956 3.93888 -0.9388803
20067 1 5.0 860341522 3.93888 1.0611197
71524 1 5.0 981499667 3.93888 1.0611197
53365 1 3.5 1155829307 3.93888 -0.4388803
71519 1 2.5 1111629576 3.93888 -1.4388803

Step 5: Find the average rating for each movie before the de-biasing (from the ratings table) and the average rating for each movie after the de-biasing (from the ratings_with_diff table). List the top 10 movies that have the biggest difference between these two average ratings. (These are the movies that had the most biased ratings.)

SELECT movieid, title, original, debiased, @original-debiased AS bias FROM
(SELECT movieid, title FROM movies) t1
NATURAL JOIN
(SELECT movieid, avg_rating AS original FROM avg_ratings) t2
NATURAL JOIN
(SELECT movieid, AVG(rating) AS debiased FROM ratings_with_diff GROUP BY movieid) t3
ORDER BY bias DESC LIMIT 10;
10 records
movieid title original debiased bias
8484 Human Condition I The (Ningen no joken I) 3.593750 4.173828 0.5800781
5793 Time Changer 1.928571 1.489796 0.4387755
6397 Bizarre Bizarre (Drôle de drame ou L’étrange aventure de Docteur Molyneux) 3.562500 3.945312 0.3828125
8423 Kid Brother The 3.558823 3.918685 0.3598616
26326 Holy Mountain The (Montaña sagrada La) 4.050000 4.405000 0.3550000
41627 Samurai Rebellion (Jôi-uchi: Hairyô tsuma shimatsu) 4.050000 4.405000 0.3550000
5889 Cruel Romance A (Zhestokij Romans) 3.555556 3.895062 0.3395062
6599 Accattone 3.642857 3.979592 0.3367347
25766 Crowd The 3.716667 4.038333 0.3216667
25930 Odd Man Out 3.886364 4.194215 0.3078512

Who is the most biased user?
If we define bias as having the most rows changed during de-biasing:

SELECT userid, count(*) FROM
(SELECT userid, movieid, rating AS original FROM ratings_with_diff) t1
NATURAL JOIN
(SELECT userid, movieid, rating AS debiased FROM ratings) t2
WHERE original <> debiased
GROUP BY userid
ORDER BY count(*) DESC
LIMIT 1;
1 records
userid count
59342 223

In this query, for each userid we counted the number of rows where the original rating did not equal the debiased rating.

If we define bias as having the highest difference between original average and debiased average:

SELECT userid, original, debiased, @original-debiased AS bias FROM
(SELECT userid, AVG(rating) AS original FROM ratings GROUP BY userid) t2
NATURAL JOIN
(SELECT userid, AVG(rating) AS debiased FROM ratings_with_diff GROUP BY userid) t3
ORDER BY bias DESC LIMIT 1;
1 records
userid original debiased bias
13496 0.5 3.877919 3.377919

In this query, for each userid, we calculated the average of their original ratings and the average of their debiased ratings, then took the difference between those two averages.

# Disconnect the database
dbDisconnect(con)