Novice SQL query question for a movie ratings database
I have a database with one table, like so:
UserID (int), MovieID (int), Rating (real)
The UserIDs and MovieIDs are large numbers, but my database only has a sample of the many possible values (4000 unique users and 3000 unique movies).
I am going to do a matrix SVD (singular value decomposition) on it, so I want to return this database as an ordered array. Basically, I want to return each user in order, and for each user, return each movie in order, and then return the rating for that user, movie pair, or null if that user did not rate that particular movie. example:
    USERID | MOVIEID | RATING
    -------------------------
    99835    8847874   4
    99835    8994385   3
    99835    9001934   null
    99835    3235524   2
    .
    .
    .
    109834   8847874   null
    109834   8994385   1
    109834   9001934   null
    etc
This way, I can simply read these results into a two dimensional array, suitable for my SVD algorithm. (Any other suggestions for getting a database of info into a simple two dimensional array of floats would be appreciated)
It is important that this be returned in order so that when I get my two dimensional array back, I will be able to re-map the values to the respective users and movies to do my analysis.
    SELECT m.UserID, m.MovieID, r.Rating
      FROM (SELECT a.UserID, b.MovieID
              FROM (SELECT DISTINCT UserID FROM Ratings) AS a,
                   (SELECT DISTINCT MovieID FROM Ratings) AS b
           ) AS m
      LEFT OUTER JOIN Ratings AS r
        ON (m.MovieID = r.MovieID AND m.UserID = r.UserID)
     ORDER BY m.UserID, m.MovieID;
Now tested and it seems to work!
The concept is to create the cartesian product of the list of UserID values in the Ratings table with the list of MovieID values in the Ratings table (ouch!), and then do an outer join of that complete matrix with the Ratings table (again) to collect the ratings values.
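To see the shape of the result, here is a minimal sketch that runs the same query against a tiny in-memory SQLite database from Python. The use of SQLite and the three sample ratings are assumptions for illustration only; the question does not say which database is in use. With 2 users and 2 movies, the query returns all 2 x 2 = 4 pairs, and the missing rating comes back as NULL (None in Python).

```python
import sqlite3

# Tiny in-memory stand-in for the real Ratings table (sample data is made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Ratings (UserID INTEGER, MovieID INTEGER, Rating REAL)")
conn.executemany(
    "INSERT INTO Ratings VALUES (?, ?, ?)",
    [(99835, 8847874, 4.0), (99835, 8994385, 3.0), (109834, 8994385, 1.0)],
)

# Cartesian product of distinct users and distinct movies,
# outer-joined back to Ratings to pick up the ratings that exist.
query = """
SELECT m.UserID, m.MovieID, r.Rating
  FROM (SELECT a.UserID, b.MovieID
          FROM (SELECT DISTINCT UserID FROM Ratings) AS a,
               (SELECT DISTINCT MovieID FROM Ratings) AS b
       ) AS m
  LEFT OUTER JOIN Ratings AS r
    ON (m.MovieID = r.MovieID AND m.UserID = r.UserID)
 ORDER BY m.UserID, m.MovieID
"""
rows = conn.execute(query).fetchall()

for row in rows:
    print(row)  # one row per (user, movie) pair, rated or not
```

Every (user, movie) pair appears exactly once, so the row count is always the number of distinct users times the number of distinct movies.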
This is NOT efficient.
It might be effective.
You might do better, though, to just run a plain, simple select of the data and populate the array as the data arrives. If you have many thousands of users and movies, the full matrix runs to many millions of rows (4000 users times 3000 movies is already 12 million), and most of them are going to hold nulls. You should treat the incoming data as a description of a sparse matrix: first set the matrix in the program to all zeroes (or another default value), then read the stream from the database and set just the cells for the ratings that were actually present.
That query is basically trivial:
SELECT UserID, MovieID, Rating FROM Ratings ORDER BY UserID, MovieID;
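The sparse approach might look like the following sketch in Python. It assumes an SQLite connection and made-up sample data; `load_matrix`, `user_index`, and `movie_index` are hypothetical names introduced here, not part of the original question. The sorted ID lists double as the mapping from row and column positions back to the real users and movies.

```python
import sqlite3


def load_matrix(conn, default=0.0):
    """Build a dense users x movies matrix from the sparse Ratings rows."""
    # Sorted ID lists: position i in each list is row/column i of the matrix.
    user_ids = [r[0] for r in conn.execute(
        "SELECT DISTINCT UserID FROM Ratings ORDER BY UserID")]
    movie_ids = [r[0] for r in conn.execute(
        "SELECT DISTINCT MovieID FROM Ratings ORDER BY MovieID")]
    user_index = {uid: i for i, uid in enumerate(user_ids)}
    movie_index = {mid: j for j, mid in enumerate(movie_ids)}

    # Start with every cell set to the default, then fill in real ratings.
    matrix = [[default] * len(movie_ids) for _ in user_ids]
    for uid, mid, rating in conn.execute(
            "SELECT UserID, MovieID, Rating FROM Ratings "
            "ORDER BY UserID, MovieID"):
        matrix[user_index[uid]][movie_index[mid]] = rating
    return user_ids, movie_ids, matrix


# Demo with a tiny in-memory table (sample data is made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Ratings (UserID INTEGER, MovieID INTEGER, Rating REAL)")
conn.executemany(
    "INSERT INTO Ratings VALUES (?, ?, ?)",
    [(99835, 8847874, 4.0), (99835, 8994385, 3.0), (109834, 8994385, 1.0)],
)
user_ids, movie_ids, matrix = load_matrix(conn)
print(user_ids, movie_ids, matrix)
```

Only the rows that exist in Ratings cross the wire, and `matrix[i][j]` belongs to `user_ids[i]` and `movie_ids[j]`, which covers the re-mapping the question asks for.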