## Implement machine learning models to predict the personality background of job applicants from Twitter tweets
# Development Tools
The Natural Language Toolkit (NLTK), pandas, NumPy, re, seaborn, matplotlib, and scikit-learn (sklearn) were the other Python libraries used during development.
...
...
@@ -18,4 +20,105 @@ Sklearn library was used to recognize the words appearing in 10% to 70% of the p
The classification task covers 16 classes (the 16 MBTI types) and was further divided into four binary classification tasks, since each MBTI type is made up of four binary classes. Each of these binary classes represents one aspect of personality according to the MBTI personality model. As a result, four different binary classifiers were trained, each specializing in one aspect of personality; that is, a model was built individually for each type indicator. The posts were converted to a Term Frequency–Inverse Document Frequency (TF–IDF) representation and the MBTI type indicators were binarised. Variable X holds the posts in TF–IDF representation and variable Y holds the binarised MBTI type indicators.
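The binarisation step described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the helper name `binarise_mbti`, the `AXES` table, and the choice of which letter maps to 1 are all assumptions made for the example.

```python
# Hypothetical helper: the names and the 1/0 orientation are assumptions.
# Each pair is (letter encoded as 1, letter encoded as 0) for one MBTI axis.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]


def binarise_mbti(mbti_type):
    """Turn a four-letter MBTI type such as 'INTJ' into four 0/1 indicators."""
    letters = mbti_type.strip().upper()
    if len(letters) != 4:
        raise ValueError(f"expected a four-letter MBTI type, got {mbti_type!r}")
    bits = []
    for letter, (one, zero) in zip(letters, AXES):
        if letter == one:
            bits.append(1)
        elif letter == zero:
            bits.append(0)
        else:
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
    return bits


# Toy labels, invented for illustration only.
types = ["INTJ", "ENFP", "ISTP"]
Y = [binarise_mbti(t) for t in types]

# Column k of Y is the target vector for the k-th binary classifier.
y_per_axis = {
    one + zero: [row[k] for row in Y] for k, (one, zero) in enumerate(AXES)
}
```

Each entry of `y_per_axis` can then serve as the target for one of the four binary classifiers, which is why a single 16-class label is never needed during training.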
# Developing Model for the Dataset
SGDClassifier, XGBoost, and AdaBoost were used in this step to create the binary classification models for the four dimensions of personality. A model was trained for each MBTI type indicator individually, with the data split into training and testing datasets using the train_test_split() function from the sklearn library: 70% of the data was used as the training set and 30% as the test set. Each model was fit on the training data, and predictions were made for the testing data. During training, the performance of the models on the testing dataset was evaluated and early stopping was monitored. After training, the performance of all the models on the testing dataset was evaluated again.
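The per-indicator training loop described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's actual code: the posts and labels are invented toy data, `evaluate_axis` and `model_factories` are hypothetical names, the hyperparameters are scikit-learn defaults, and XGBoost and early stopping are left out to keep the example dependent on scikit-learn only.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the real posts; every value here is invented.
posts, targets = [], {"IE": [], "TF": []}
for i in range(20):
    introvert = i % 2 == 0
    thinker = i % 4 < 2
    text = "enjoy reading alone at home" if introvert else "love meeting friends at parties"
    text += " logic and careful plans" if thinker else " feelings and warm hearts"
    posts.append(text)
    targets["IE"].append(1 if introvert else 0)
    targets["TF"].append(1 if thinker else 0)

# X: posts in TF-IDF representation (default vectorizer settings, an assumption).
X = TfidfVectorizer().fit_transform(posts)

# Hypothetical factories so every indicator gets a fresh, untrained model.
model_factories = {
    "SGDClassifier": lambda: SGDClassifier(random_state=42),
    "AdaBoost": lambda: AdaBoostClassifier(random_state=42),
}


def evaluate_axis(X, y, model):
    """Split 70/30, fit on the training part, score on the test part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# One model per MBTI indicator and per algorithm.
results = {}
for axis, y in targets.items():
    for name, make_model in model_factories.items():
        results[(axis, name)] = evaluate_axis(X, y, make_model())

for (axis, name), acc in sorted(results.items()):
    print(f"{axis} {name}: accuracy={acc:.2f}")
```

Splitting inside `evaluate_axis` keeps the 70/30 proportion identical for every indicator, and `stratify=y` is an added assumption that preserves the class balance of each binary target in both parts.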
# AdaBoost classifier best-performance hyperparameter details