
This project uses live tweets streamed from the Twitter API to estimate the prevailing sentiment about a topic of interest. I started with a Naive Bayes classifier and ended up building my own classifier, which lets several classifiers vote on each prediction. It combines Basic Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, Stochastic Gradient Descent and Linear SVM classifiers. Each classifier casts one vote per input, and the combined classifier returns the label that receives the most votes (the statistical mode of the votes), along with a confidence equal to the share of votes that went to that label.
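As a rough illustration of the voting rule, the sketch below tallies one set of seven votes with statistics.mode. The vote values and the helper name tally_votes are made up for this example and are not part of the project code.

```python
from statistics import mode

def tally_votes(votes):
    """Return (winning label, share of votes that went to that label)."""
    winner = mode(votes)
    confidence = votes.count(winner) / len(votes)
    return winner, confidence

# Hypothetical votes from seven classifiers for a single tweet.
votes = ["pos", "pos", "neg", "pos", "pos", "neg", "pos"]
label, confidence = tally_votes(votes)
print(label, round(confidence, 2))  # pos 0.71
```

With an odd number of classifiers and two labels, a tie cannot occur, so the mode is always well defined.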


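Install Python 3.x, then install these dependencies using pip: tweepy, nltk, scikit-learn, scipy, pandas, numpy, matplotlib. Use a tweepy 3.x release, because the streaming code relies on StreamListener, which was removed in tweepy 4.0. The json and pickle modules ship with Python, StreamListener is part of tweepy, and sklearn is just the import name of scikit-learn, so none of these need a separate install.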

pip3 install dependency_name

For nltk, we also need to download its corpora. This is done as follows. Enter ‘all’ when prompted.

import nltk
nltk.download()

To get live data from Twitter, go to the Twitter apps site. Attach a mobile number to your Twitter account and create an app. You’ll be given a consumer key and a consumer secret. Scroll down and click on authorise app. You’ll be given an access token and an access token secret (the auth key and auth secret). Keep all four values confidential.
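One common way to keep the four values out of the source file is to read them from environment variables. The sketch below is only a suggestion, not part of the original project; the variable names and the helper load_credentials are hypothetical.

```python
import os

# Hypothetical variable names; pick whatever fits your setup.
REQUIRED = ("TWITTER_CKEY", "TWITTER_CSECRET", "TWITTER_ATOKEN", "TWITTER_ASECRET")

def load_credentials(env=os.environ):
    """Return the four credentials as a tuple, or raise if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED)

# Usage: ckey, csecret, atoken, asecret = load_credentials()
```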

Module created for Sentiment Analysis

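import codecs
import random
from statistics import mode

import nltk
from nltk.classify import ClassifierI
from nltk.classify.scikit_learn import SklearnClassifier
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC, NuSVC

# Short positive and negative reviews, one per line.
with codecs.open("positive.txt", "r", encoding="latin2") as f:
    short_pos = f.read()
with codecs.open("negative.txt", "r", encoding="latin2") as f:
    short_neg = f.read()

documents = []
all_words = []
allowed_words_types = ["J"]  # keep adjectives only (POS tags starting with J)

# The loop bodies were missing from the listing. The two appends below are a
# reconstruction, and the "pos"/"neg" label strings are an assumption.
for label, text in (("pos", short_pos), ("neg", short_neg)):
    for r in text.split("\n"):
        documents.append((r, label))
        for word, tag in nltk.pos_tag(word_tokenize(r)):
            if tag[0] in allowed_words_types:
                all_words.append(word)

all_words = nltk.FreqDist(all_words)  # word frequencies
# most_common() keeps the 5000 most frequent words; keys() is not ordered by frequency.
word_features = [w for w, _ in all_words.most_common(5000)]

def find_features(document):
    """Map a text to {feature word: True/False} for every word in word_features."""
    words = set(word_tokenize(document))
    return {w: (w in words) for w in word_features}

featuresets = [(find_features(rev), category) for (rev, category) in documents]
# Shuffle so both splits contain both labels (the positive reviews were loaded first).
random.shuffle(featuresets)
training_set = featuresets[:10000]
testing_set = featuresets[10000:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Accuracy:", nltk.classify.accuracy(classifier, testing_set))

# Train the scikit-learn classifiers under the names the voting step expects.
MNB_classifier = SklearnClassifier(MultinomialNB()).train(training_set)
BernoulliNB_classifier = SklearnClassifier(BernoulliNB()).train(training_set)
LogisticRegression_classifier = SklearnClassifier(LogisticRegression()).train(training_set)
SGDClassifier_classifier = SklearnClassifier(SGDClassifier()).train(training_set)
LinearSVC_classifier = SklearnClassifier(LinearSVC()).train(training_set)
NuSVC_classifier = SklearnClassifier(NuSVC()).train(training_set)

class VoteClassifier(ClassifierI):
    """Combine several classifiers by majority vote."""
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = [c.classify(features) for c in self._classifiers]
        return mode(votes)

    def confidence(self, features):
        # Share of classifiers that agree with the winning label.
        votes = [c.classify(features) for c in self._classifiers]
        return votes.count(mode(votes)) / len(votes)

voted_classifier = VoteClassifier(classifier, MNB_classifier, BernoulliNB_classifier, LogisticRegression_classifier,
                                  SGDClassifier_classifier, LinearSVC_classifier, NuSVC_classifier)
print("voted_classifier accuracy:", nltk.classify.accuracy(voted_classifier, testing_set))

def sentiment(text):
    """Return (label, confidence); the streaming code unpacks both values."""
    featset = find_features(text)
    return voted_classifier.classify(featset), voted_classifier.confidence(featset)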
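To make the feature step concrete, here is a small self-contained sketch of what find_features produces. It is a simplified stand-in, not the module's code: it takes the vocabulary as a parameter, uses str.split in place of nltk's word_tokenize, and the three-word vocabulary is invented for the example.

```python
def find_features(document, word_features):
    """Mark which of the known feature words occur in the document."""
    words = set(document.split())  # stand-in for nltk's word_tokenize
    return {w: (w in words) for w in word_features}

word_features = ["great", "boring", "awful"]  # made-up vocabulary
features = find_features("a great film , never boring", word_features)
print(features)  # {'great': True, 'boring': True, 'awful': False}
```

Note that "never boring" still sets boring to True. The features only record whether a word is present, so negation is invisible to the classifiers.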

Using this model for Twitter streaming data

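import json
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener  # available in tweepy 3.x only
import sentiment_mod as s  # the module above; this file name is an assumption
ckey = csecret = atoken = asecret = ""  # fill in your four credentials

class listener(StreamListener):
    def on_data(self, data):
        try:
            all_data = json.loads(data)
            tweet = all_data["text"]
            sentiment_value, confidence = s.sentiment(tweet)
            print(tweet, sentiment_value, confidence)
            if confidence * 100 >= 80:
                with open("twitter-out.txt", "a") as output:
                    output.write(sentiment_value + "\n")  # assumed: the write was missing from the listing
        except Exception as e:
            print("on_data failed:", e)
        return True  # keep the stream open
    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["topic of interest"])  # replace with your own keywords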
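The per-tweet logic can be tried without a Twitter connection. The sketch below pulls it into a plain function and feeds it hand-written payloads. The names handle_tweet and fake_sentiment are invented for this example, and fake_sentiment only imitates the (label, confidence) pair that s.sentiment is expected to return.

```python
import json

def handle_tweet(raw, sentiment, threshold=0.8):
    """Return the label to log for a raw payload, or None to skip it."""
    try:
        tweet = json.loads(raw)["text"]
    except (ValueError, KeyError, TypeError):
        return None  # not JSON, or a message without a "text" field
    label, confidence = sentiment(tweet)
    return label if confidence >= threshold else None

# Stand-in for s.sentiment, invented for this example.
def fake_sentiment(text):
    return ("pos", 0.86) if "love" in text else ("neg", 0.57)

print(handle_tweet('{"text": "I love this"}', fake_sentiment))  # pos
print(handle_tweet('{"text": "meh"}', fake_sentiment))          # None
print(handle_tweet('{"delete": {}}', fake_sentiment))           # None
```

Only predictions at or above the 80% threshold are kept, which mirrors the confidence check in the listener.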

Results from predictions

Screenshots of plots can be found here.