Commit 1fd219ba (Update ReadMe.md) authored by Amuthini Kulatheepan
## Implementing machine learning models to predict the personality type of job applicants from Twitter posts
# Development Tools
The Natural Language Toolkit (NLTK), pandas, NumPy, re, seaborn, matplotlib, and scikit-learn were the Python libraries used during development.
The scikit-learn library was used to recognize the words appearing in 10% to 70% of the posts.
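The document-frequency filtering described above can be sketched with scikit-learn's `CountVectorizer`, whose `min_df`/`max_df` parameters map directly onto the 10%–70% range; the toy corpus below is an illustrative assumption, not the project's data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the preprocessed posts (illustrative only).
posts = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "the cat ran home",
    "birds flew south",
]

# Keep only words appearing in 10% to 70% of the posts:
# min_df=0.10 drops very rare words, max_df=0.70 drops near-ubiquitous ones.
vectorizer = CountVectorizer(min_df=0.10, max_df=0.70)
counts = vectorizer.fit_transform(posts)

print(sorted(vectorizer.vocabulary_))  # 'the' (in 3 of 4 posts) is filtered out
```

Here "the" appears in 75% of the posts, so it falls above the `max_df=0.70` cutoff and is dropped from the vocabulary.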
The classification task was divided into 16 classes and further into four binary classification tasks, since each MBTI type is composed of four binary classes. Each of these binary classes represents one aspect of personality according to the MBTI personality model. As a result, four different binary classifiers were trained, each specializing in one aspect of personality; thus a model for each type indicator was built individually. Term Frequency–Inverse Document Frequency (TF–IDF) was applied and the MBTI type indicators were binarised. Variable X held the posts in TF–IDF representation and variable Y held the binarised MBTI type indicators.
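As a sketch of the X/Y preparation above, one possible encoding is shown below; the helper name and the convention that 1 marks I, N, F, and J are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def binarize_mbti(mbti_type):
    """Map a 4-letter MBTI code to four binary indicators (assumed convention)."""
    return {
        "IE": int(mbti_type[0] == "I"),  # Introversion vs Extroversion
        "NS": int(mbti_type[1] == "N"),  # Intuition vs Sensing
        "FT": int(mbti_type[2] == "F"),  # Feeling vs Thinking
        "JP": int(mbti_type[3] == "J"),  # Judging vs Perceiving
    }

# X: posts in TF-IDF representation; Y: binarised type indicators.
posts = ["love quiet evenings with a book", "big parties energize me"]
types = ["INFJ", "ESTP"]

X = TfidfVectorizer().fit_transform(posts)
Y = [binarize_mbti(t) for t in types]
print(Y)  # [{'IE': 1, 'NS': 1, 'FT': 1, 'JP': 1}, {'IE': 0, 'NS': 0, 'FT': 0, 'JP': 0}]
```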
# Developing Model for the Dataset
SGDClassifier, XGBoost, and AdaBoost were used in this step to create the binary classification models for the four personality dimensions. Each MBTI type indicator was trained individually. The data was split into training and testing sets with the train_test_split() function from the scikit-learn library, using 70% of the data for training and 30% for testing. Each model was fit on the training data and predictions were made for the test data. Model performance on the test set was monitored during training, with early stopping applied, and the final performance of all models was then evaluated on the test set.
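The split-train-predict loop for a single dimension can be sketched as follows, using only the SGDClassifier for brevity; the synthetic data is a placeholder for the TF-IDF matrix and one binarised indicator:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))            # stand-in for the TF-IDF matrix
y = (X[:, 0] > 0).astype(int)             # stand-in for one binarised indicator

# 70% of the data for training, 30% for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

clf = SGDClassifier(loss="modified_huber", random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(f"test accuracy: {accuracy_score(y_test, preds):.3f}")
```

The same pattern repeats for the other three indicators, swapping in the XGBoost or AdaBoost estimator as needed.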
# AdaBoost classifier best-performing hyperparameter details

| Dimension | Best AUC score | Accuracy | Confusion matrix | Best parameters |
| --- | --- | --- | --- | --- |
| IE: Introversion (I) – Extroversion (E) | 0.803667 | 0.7285539643730353 | `[[2229, 0], [634, 0]]` | `{'abc__learning_rate': 0.1, 'abc__n_estimators': 500}` |
| NS: Intuition (N) – Sensing (S) | 0.6727666369367796 | 0.8046946929265 | `[[2431, 32], [384, 16]]` | `{'abc__learning_rate': 0.1, 'abc__n_estimators': 300}` |
| FT: Feeling (F) – Thinking (T) | 0.75395340936081 | 0.72895568376202319 | `[[1199, 355], [421, 888]]` | `{'abc__learning_rate': 0.01, 'abc__n_estimators': 500}` |
| JP: Judging (J) – Perceiving (P) | 0.6638994402640133 | 0.6521131680055885 | `[[252, 867], [129, 1615]]` | `{'abc__learning_rate': 0.1, 'abc__n_estimators': 500}` |
# XGBoost classifier best-performing hyperparameter details

| Dimension | Best AUC score | Accuracy | Confusion matrix | Best parameters |
| --- | --- | --- | --- | --- |
| IE: Introversion (I) – Extroversion (E) | 0.677028682166271 | 0.7785539643730353 | `[[2229, 0], [634, 0]]` | `{'xgb__n_estimators': 200, 'xgb__max_depth': 6, 'xgb__learning_rate': 0.01, 'xgb__gamma': 0.2, 'xgb__colsample_bytree': 0.1}` |
| NS: Intuition (N) – Sensing (S) | 0.6527666346929265 | 0.854697869367796 | `[[2431, 32], [384, 16]]` | `{'xgb__n_estimators': 150, 'xgb__max_depth': 3, 'xgb__learning_rate': 0.3, 'xgb__gamma': 0.2, 'xgb__colsample_bytree': 0.2}` |
| FT: Feeling (F) – Thinking (T) | 0.8139538376202319 | 0.728955640936081 | `[[1199, 355], [421, 888]]` | `{'xgb__n_estimators': 150, 'xgb__max_depth': 4, 'xgb__learning_rate': 0.1, 'xgb__gamma': 0.1, 'xgb__colsample_bytree': 0.1}` |
| JP: Judging (J) – Perceiving (P) | 0.6638994402640133 | 0.6521131680055885 | `[[252, 867], [129, 1615]]` | `{'xgb__n_estimators': 50, 'xgb__max_depth': 3, 'xgb__learning_rate': 0.1, 'xgb__gamma': 0.0, 'xgb__colsample_bytree': 0.2}` |
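The accuracy figures can be cross-checked against the confusion matrices: accuracy is the sum of the diagonal (correct predictions) divided by the total count. Worked through for the XGBoost IE matrix above:

```python
# Confusion matrix for the XGBoost IE classifier, from the results above.
cm = [[2229, 0],
      [634, 0]]

correct = cm[0][0] + cm[1][1]            # diagonal: correctly classified posts
total = sum(sum(row) for row in cm)      # 2229 + 0 + 634 + 0 = 2863
accuracy = correct / total
print(accuracy)  # 0.7785539643730353, matching the reported XGBoost IE value
```

Note that with an all-zero second column, the classifier is predicting the majority class for every post, so the accuracy simply reflects the class imbalance.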
# SGDClassifier best-performing hyperparameter details

For each dimension, the search fitted 10 folds for each of 10 candidates, totalling 100 fits.

| Dimension | Best AUC score | Accuracy | Confusion matrix | Best parameters |
| --- | --- | --- | --- | --- |
| IE: Introversion (I) – Extroversion (E) | 0.5292548647534668 | 0.7740132727907789 | `[[2216, 0], [647, 0]]` | `{'sgd__alpha': 0.0009265019438562898, 'sgd__loss': 'modified_huber', 'sgd__penalty': 'l1'}` |
| NS: Intuition (N) – Sensing (S) | 0.5426211797685075 | 0.857492141110723 | `[[2455, 1], [407, 0]]` | `{'sgd__alpha': 0.0011441798336083461, 'sgd__loss': 'modified_huber', 'sgd__penalty': 'l2'}` |
| FT: Feeling (F) – Thinking (T) | 0.5 | 0.5312609151239958 | `[[1521, 0], [1342, 0]]` | `{'sgd__alpha': 0.0019410296620838965, 'sgd__loss': 'hinge', 'sgd__penalty': 'l1'}` |
| JP: Judging (J) – Perceiving (P) | 0.4989554047081358 | 0.5302130632203982 | `[[336, 814], [531, 1182]]` | `{'sgd__alpha': 0.0004231316730058021, 'sgd__loss': 'modified_huber', 'sgd__penalty': 'none'}` |
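The non-round alpha values and the "10 candidates" per search reported above suggest the SGDClassifier was tuned with a randomized search over alpha, loss, and penalty inside a pipeline step named `sgd`; a minimal sketch under those assumptions, with synthetic data and the penalty list trimmed for compatibility with current scikit-learn versions:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline([("sgd", SGDClassifier(random_state=0))])

# 10 candidates scored with 10-fold CV: the "10 folds for each of
# 10 candidates, totalling 100 fits" reported above.
param_dist = {
    "sgd__alpha": loguniform(1e-4, 1e-2),   # yields non-round values like those reported
    "sgd__loss": ["hinge", "modified_huber"],
    "sgd__penalty": ["l1", "l2"],
}
search = RandomizedSearchCV(
    pipe, param_dist, n_iter=10, cv=10, scoring="roc_auc", random_state=0
)
search.fit(X, y)
print(search.best_params_)
```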