Commit a5a093d7 authored by Amuthini Kulatheepan

Update it17173100/Personality_prediction/ReadMe.md

parent bfbfadc6
1. Development Tools
The Natural Language Toolkit (NLTK) and XGBoost, an optimised distributed gradient boosting library for Python, were used for the development process. Pandas, NumPy, re, seaborn, matplotlib and sklearn are other Python libraries that were used.
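As a rough sketch, the imports for this stack might look as follows (the NLTK corpora are downloaded once before use):

```python
import re

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import xgboost as xgb
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# One-time download of the NLTK data used later (stop words, WordNet).
nltk.download('stopwords')
nltk.download('wordnet')
```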
2. Dataset for Training the Model
A dataset containing 8675 rows of data was used in this research. Each row consists of two columns: the first column holds the MBTI personality type of a given person, and the second column includes fifty posts obtained from that individual's social media. The data was collected from the users of an online forum where, in the first step, users take a questionnaire that recognises their MBTI type and, in the second step, communicate with other users.
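Loading the dataset might look like the sketch below; the file name `mbti_1.csv` and the column names `type` and `posts` are assumptions, as the README does not state them:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
data = pd.read_csv('mbti_1.csv')

print(data.shape)               # expected: (8675, 2)
print(data.columns.tolist())    # expected: ['type', 'posts']
print(data['type'].nunique())   # expected: 16 MBTI types
```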
3. Categorising the Type Indicators into Four Dimensions
Four different categories were created for the type indicators in order to understand their distribution in the dataset. The first category was Introversion (I)/Extroversion (E), the second was Intuition (N)/Sensing (S), the third was Thinking (T)/Feeling (F) and the fourth was Judging (J)/Perceiving (P). Each category yields one letter, and together the four letters represent one of the 16 personality types in the MBTI, as sketched below.
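A minimal sketch of this decomposition (the function and variable names are illustrative):

```python
# The four MBTI dimensions; the first letter of each pair is treated
# as the "positive" class.
dimensions = [('I', 'E'), ('N', 'S'), ('T', 'F'), ('J', 'P')]

def type_to_indicators(mbti_type):
    """Split a four-letter type into four 0/1 flags, e.g. 'INTJ' -> [1, 1, 1, 1]."""
    return [1 if letter == positive else 0
            for letter, (positive, _) in zip(mbti_type, dimensions)]

print(type_to_indicators('ENFP'))  # [0, 1, 0, 0]
```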
4. Pre-processing the Dataset
NLTK was used to remove the MBTI types from the dataset. In addition, all URLs and stop words were removed. Finally, in order to make the dataset more meaningful, the text was lemmatised, i.e., inflected forms of words were transformed into their root words.
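A sketch of such a cleaning step, assuming the helper name `clean_post` and the regular expressions used here (the README does not give the exact rules):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# All 16 type names, lower-cased, so that mentions of MBTI types
# can be stripped from the posts.
mbti_types = {'infj', 'entp', 'intp', 'intj', 'entj', 'enfj', 'infp', 'enfp',
              'isfp', 'istp', 'isfj', 'istj', 'estp', 'esfp', 'estj', 'esfj'}

def clean_post(text):
    text = re.sub(r'https?://\S+', ' ', text.lower())   # remove URLs
    text = re.sub(r'[^a-z ]', ' ', text)                # keep letters only
    words = [w for w in text.split()
             if w not in stop_words and w not in mbti_types]
    return ' '.join(lemmatiser.lemmatize(w) for w in words)

print(clean_post('INTJ types love http://example.com and the libraries!'))
# -> 'type love library'
```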
5. Vectorising with Counts and Term Frequency–Inverse Document Frequency (TF–IDF)
The sklearn library was used to recognise the words appearing in 10% to 70% of the posts. In the first step, the posts were placed into a matrix of token counts: the model learns the vocabulary dictionary and returns a term-document matrix. The count matrix is then transformed into a normalised TF–IDF representation, which can be used for the gradient boosting model. In total, 791 words appear in 10% to 70% of the posts.
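A sketch of this step with sklearn's `CountVectorizer` and `TfidfTransformer`; the column `clean_posts` (the pre-processed posts from the previous step) is an assumed name:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Keep only words that appear in 10% to 70% of the posts.
count_vectoriser = CountVectorizer(min_df=0.10, max_df=0.70)

# Learn the vocabulary and build the term-document count matrix ...
counts = count_vectoriser.fit_transform(data['clean_posts'])

# ... then rescale the counts into a normalised TF-IDF representation.
tfidf = TfidfTransformer().fit_transform(counts)

print(len(count_vectoriser.vocabulary_))  # 791 words, per the text above
print(tfidf.shape)                        # (8675, 791)
```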
6. Classification Task
Rather than being treated as a single 16-class problem, the classification task was divided into four binary classification tasks, since each MBTI type is made up of four binary classes. Each of these binary classes represents one aspect of personality according to the MBTI personality model. As a result, four different binary classifiers were trained, each specialising in one aspect of personality; in other words, a model was built individually for each type indicator. TF–IDF was performed and the MBTI type indicators were binarised: variable X holds the posts in TF–IDF representation and variable Y holds the binarised MBTI type indicator, as in the sketch below.
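A sketch of building the binarised targets, one per type indicator (the dictionary keys and variable names are illustrative):

```python
import numpy as np

X = tfidf  # posts in TF-IDF representation, from the previous step

# Position of each indicator in the four-letter type, and the letter
# treated as the positive (1) class.
indicators = {'I-E': (0, 'I'), 'N-S': (1, 'N'),
              'T-F': (2, 'T'), 'J-P': (3, 'J')}

Y = {name: np.array([1 if t[pos] == letter else 0 for t in data['type']])
     for name, (pos, letter) in indicators.items()}

for name, y in Y.items():
    print(name, int(y.sum()), 'positive examples')
```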
7. Developing the Gradient Boosting Model for the Dataset
NumPy, XGBoost and sklearn were used in this step to create the gradient boosting model. Each MBTI type indicator was trained individually. The data was split into training and testing sets using the train_test_split() function from the sklearn library, with 70% of the data used as the training set and 30% as the test set. The model was fit on the training data, predictions were made on the test data, and the performance of the XGBoost model on the test set was evaluated during training with early stopping monitored.

Based on this, the learning rate in XGBoost should be set to 0.1 or lower, with more trees added for smaller values. The depth of trees should be configured in the range of 2 to 8, as there is not much benefit from deeper trees, and row sampling should be configured in the range of 30% to 80% of the training dataset. Accordingly, the tree depth of the XGBoost model was configured and its parameters were set up as follows: `n_estimators=200`, `max_depth=2`, `nthread=8`, `learning_rate=0.2`. With these parameters, each type indicator was trained again on the same 70/30 split and the performance of the XGBoost model on the test set was re-evaluated.
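A sketch of this training loop; the split seed and the early-stopping settings (`eval_metric`, `early_stopping_rounds`) are assumptions, and depending on the XGBoost version those two arguments may need to be passed to the `XGBClassifier` constructor rather than to `fit()`:

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# One binary model per type indicator, using the parameters quoted above.
for name, y in Y.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=7)  # 70/30 split; seed is illustrative

    model = XGBClassifier(n_estimators=200, max_depth=2,
                          nthread=8, learning_rate=0.2)

    # Evaluate on the held-out set during training; stop early if the
    # metric has not improved for 10 consecutive rounds.
    model.fit(X_train, y_train,
              eval_set=[(X_test, y_test)],
              eval_metric='logloss',
              early_stopping_rounds=10,
              verbose=False)

    print(f'{name}: test accuracy {model.score(X_test, y_test):.3f}')
```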