One of the best features of
dplyr is that it allows you to rearrange and summarize data from different sources using the same functions. Data read in from text/csv and data read in from a database can be treated similarly. If a dataset is too large to load into memory putting it into a database and using familiar
dplyr functions is a simple fix. It turns out that the entire process of creating a database, loading data to a database and connecting
dplyr is pretty simple.
To start download the Movie Lens Dataset. The biggest version of this dataset is around 150 Mb – not a large dataset by any means but an excellent example that should be easy to work with.
dplyr can use both sqlite and mysql as well as other database types. Installing
mysql is fairly straight forward on most non-windows operating systems. For OS X,
homebrew is probably the easiest way:
Install MySQL with homebrew on OS X:
Install MySQL with apt on Ubuntu:
On other linux distributions and windows you are on your own.
Setup administrator – in practice it might be better to make a new user rather than just using root. From the terminal:
Next, startup the MySQL Server and login:
Creating databases and tables
Now you should be in the MySQL command line. Create a database where the movielens data will be stored.
List the databases available using:
Create database, select it, list the tables in the database and show all the columns from a table:
For a new database the tables and columns should be empty.
The table must be created correctly to be able to store the incoming datasets. The dataset contains 3 files: movies.dat, ratings.dat, and users.dat.
Here are the first few lines of each dataset:
1::Toy Story (1995)::Animation|Children's|Comedy 2::Jumanji (1995)::Adventure|Children's|Fantasy 3::Grumpier Old Men (1995)::Comedy|Romance 4::Waiting to Exhale (1995)::Comedy|Drama 5::Father of the Bride Part II (1995)::Comedy 6::Heat (1995)::Action|Crime|Thriller 7::Sabrina (1995)::Comedy|Romance 8::Tom and Huck (1995)::Adventure|Children's 9::Sudden Death (1995)::Action 10::GoldenEye (1995)::Action|Adventure|Thriller
From the documentation it appears that the columns are movie id, movie name and genres delimited by “::”.
1::1193::5::978300760 1::661::3::978302109 1::914::3::978301968 1::3408::4::978300275 1::2355::5::978824291 1::1197::3::978302268 1::1287::5::978302039 1::2804::5::978300719 1::594::4::978302268 1::919::4::978301368
The columns are user id, movie id, rating, timestamp (probably in unix time).
1::F::1::10::48067 2::M::56::16::70072 3::M::25::15::55117 4::M::45::7::02460 5::M::25::20::55455 6::F::50::9::55117 7::M::35::1::06810 8::M::25::12::11413 9::M::25::17::61614 10::F::35::1::95370
The columns are user id, gender, age, occupation and zipcode.
The above datasets correspond to the tables created below:
Loading data into database
Once the tables are created then to load the data from the text files:
To make sure that the tables were created correctly and that data was loaded correctly we can look at the top 10 rows from each table:
Loading MySQL into R
Once the databases are created and the data is in the database correctly, from R install
RMySQL. Then enter the following commands to connect
dplyr to the databased just created:
At this point can can treat the three databases as R dataframes.