Logging Data in R Loops: Applied to Twitter.

A problem that many users face in R is storing the output from loop operations. In the case of Twitter, we may be requesting the last specified number of Tweets from a number of Twitter users. Several methods exist for this purpose which include the use of lists, or simply printing the output from the file. This tutorial proposes the use of the sqldf package, making use of the SQL insert statement. The package will access the Twitter API  version 1.  Before we start it will be necessary to install the sqldf package to store the data, and the rjson to handle the data returned from the Twitter API.

library(rjson)
library(sqldf)

Before we start to record any data to the we need to create an empty data frame to log the data.

 df <- data.frame(user_name=character(1),
                 user_id=character(1),
                 lat=character(1),
                 long=character(1),
                 created=character(1),
                 Text=character(1),
                 stringsAsFactors=FALSE)

With the empty data.frame created we can initiate the SQLite database. This is achieved by running the sqldf() function. If the connection is command box will return: <SQLiteConnection: DBI CON (3793, 1)> . The next line as we have seen in previous tutorials loads the data.frame into SQLite database and finally we clear any content using the delete from SQL call.

sqldf() 
sqldf("select * from df", drv="SQLite")
sqldf("delete from main.df", drv="SQLite")

Create a character lists of user names to be searched. user_names could be a much longer list, or a column of usernames in an existing data.frame.

 user_names <- c("jot", "","","")

The loop is fairly self explanatory. based on the number of values in user_names found using the length() statement the query is made to the Twitter API. for each user, all the returned values are recorded into the database. Due to the rate limit imposed in API V1.0 the loop is designed to sleep for an hour after 145 requests are made.

for(i in 1:length(user_names)){
 if(i == 1) c(j <- -1, k <- 1)
 i=1
 #Construct and execute twitter API call.
 json_file <- paste("http://api.twitter.com/1/statuses/user_timeline.json?include_entities=true&include_rts=true&screen_name=", user_names[1] ,"&count=200", sep="") 
 tryCatch(json_data <- fromJSON(paste(readLines(json_file), collapse=""), unexpected.escape="skip"), finally=next)

 # Load data into SQLite Database
 for( j in 1:length(json_data)) {
 sqldf(paste("insert into main.df values ('",user_names[i],"','", json_data[[j]]$user$id,"','", json_data[[j]]$geo$coordinates[1],"','", json_data[[j]]$geo$coordinates[2],"','", json_data[[j]]$created_at,"','", gsub("'", "",json_data[[j]]$text),"')", sep="")) 
 } 

 # Code to deal with Twitter Rate Limit 
 k <- k+1 if(k==145) c(k <- 1, Sys.sleep(3660), print(paste("having a sleep at", Sys.time()))) 
}

Once the loop is completed there are two options for accessing the data. In both cases, these make use of the sqldf queries demonstrated in my last tutorial. The first option queries all messages. The second statement removes any non located tweets.

 # Select all results from scraping including non-located tweets
out <- sqldf("select * from main.df where lat <> ''", drv="SQLite")
write.csv(out, "allTweets.csv")

# Select all results from scraping excluding non-located tweets
outFiltered <- sqldf("select * from main.df where lat <> ''", drv="SQLite")
write.csv(out, "filteredTweets.csv")

That concludes this tutorial. Though it may not be the most elegant solution, it does highlight the versatility of the sqldf package. If you have any questions feel free to post them in the comments below.

 

 

This entry was posted in R, r-tutorial, Tutorial. Bookmark the permalink.