I recently wrote a simple PHP web app for creating Twitter data collection campaigns. It lets you download hundreds of thousands of tweets [tested with up to 1.2 million tweets on an Amazon EC2 instance] matching a specific set of keywords. The collected data is stored in a MySQL database, from which it can be exported to CSV or any other format of your choice.
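For the CSV conversion, one option is to pipe the tab-separated output of the `mysql` client through a small filter. The sketch below is not part of the app: the `tsv_to_csv` helper is hypothetical, and the `tweets` table name and the `DB_USER`/`DB_NAME` variables in the usage comment are assumptions for illustration.

```shell
# Convert tab-separated rows on stdin to quoted CSV on stdout.
# Every field is wrapped in double quotes, and embedded double
# quotes are doubled, as CSV expects.
tsv_to_csv() {
  awk -F'\t' '{
    out = ""
    for (i = 1; i <= NF; i++) {
      field = $i
      gsub(/"/, "\"\"", field)
      out = out (i > 1 ? "," : "") "\"" field "\""
    }
    print out
  }'
}

# Usage (assumed table name and credentials):
#   mysql --batch -u "$DB_USER" -p "$DB_NAME" \
#     -e 'SELECT * FROM tweets' | tsv_to_csv > tweets.csv
```

In `--batch` mode the `mysql` client escapes tabs and newlines inside values as `\t` and `\n`, so a tweet containing a line break stays on one row, but those escape sequences end up in the CSV as literal text.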

The following fields are recorded for each tweet and for each Twitter user:

Fields per Tweet

[Image: scraper-dataset]

Fields per Twitter User

[Image: scraper-dataset2]

The source code/app can be downloaded from here.

The instructions for its usage are detailed in this blog post.

Step 1: Create a database

[Image: step1-scraper]
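If you prefer the command line to a graphical tool for this step, something along these lines should work. This is a minimal sketch, not the app's own tooling: the `create_db_sql` helper and the database name `tfdc` are made up for illustration, and the `utf8mb4` character set is my assumption (it is what MySQL needs to store emoji and other 4-byte characters).

```shell
# Print a CREATE DATABASE statement for the given name.
# Names are restricted to letters, digits and underscores so the
# value can be placed safely between backticks.
create_db_sql() {
  case "$1" in
    ''|*[!A-Za-z0-9_]*)
      echo "invalid database name: $1" >&2
      return 1
      ;;
  esac
  printf 'CREATE DATABASE IF NOT EXISTS `%s` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;\n' "$1"
}

# Usage (hypothetical database name, prompts for the MySQL password):
#   create_db_sql tfdc | mysql -u root -p
```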

Step 2: Create a new Twitter App [link]. Get your app’s credentials from Twitter and add them to db/140dev_config.php.

[Image: step0-scraper]

Step 3: Edit db/db_config.php to match your own MySQL database server settings.

[Image: step2-scraper]

Step 4: Import the database structure [mysql_database_schema.sql], found inside the db folder, into your MySQL database.

[Image: step3-scraper]
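The import can also be done from a terminal. Here is a rough sketch, assuming you run it from the project root; the `import_schema` helper and the `DB_USER` variable are hypothetical names, and the default path simply follows the file location mentioned above.

```shell
# Load the schema file into the given database.
# Fails early if the file is missing so mysql is never started
# with an empty or wrong input.
import_schema() {
  db_name=$1
  schema_file=${2:-db/mysql_database_schema.sql}
  if [ ! -r "$schema_file" ]; then
    echo "schema file not found: $schema_file" >&2
    return 1
  fi
  # -p with no value makes the client prompt for the password.
  mysql -u "${DB_USER:-root}" -p "$db_name" < "$schema_file"
}

# Usage (hypothetical database name):
#   import_schema tfdc
```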

Before proceeding, make sure that libssh2 is installed on your server. If it isn’t, install it now: the script requires the ssh2 library to open a new SSH session to its host machine and run the data collector script in a new screen session.

Ubuntu: sudo apt-get install libssh2-php
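To confirm that PHP actually sees the extension, you can look for `ssh2` in the module list printed by `php -m`. The `has_module` helper below is only an illustrative sketch. As far as I know, newer Ubuntu releases ship the extension as `php-ssh2` rather than `libssh2-php`, so treat the package name as release-dependent.

```shell
# Succeed if the module named in $1 appears as a whole line
# (case-insensitive) in the module list read from stdin.
has_module() {
  grep -qix -- "$1"
}

# Usage:
#   if php -m | has_module ssh2; then
#     echo "ssh2 extension is loaded"
#   else
#     echo "ssh2 extension is missing"
#   fi
```

Keep in mind that `php -m` reports on the command-line PHP, which can be configured separately from the PHP used by your web server.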

OPTIONAL: If you don’t want the script to SSH into your machine, you’ll have to run your campaign yourself by running the following commands in a terminal on your host machine:

<campaign_name>_php gettweets.php &

<campaign_name>_php parsetweets.php

Replace <campaign_name> with the name you chose while creating the campaign.

Important: Don’t forget to restart your web server once the installation is complete.

[Image: step5-scraper]

Update the SSH credentials in NiceSSH.class.php to match those of your server.

[Image: step6-scraper]

Step 5 [the last step!]: Open the TFDC project directory on your web server in a browser, and click on create a new campaign.

[Image: step4-scraper]
