2021-193 User-friendly enhanced machine learning-based railway management system
Commit 9c700478 authored Jul 04, 2021 by Weerasooriya W.K.M-IT18085822
parent c0279b50
Showing 1 changed file with 43 additions and 0 deletions
nltk_utils.py 0 → 100644 (+43 -0)
import numpy as np
import nltk
# nltk.download('punkt')
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def tokenize(sentence):
    """
    split sentence into array of words/tokens
    a token can be a word or punctuation character, or number
    """
    return nltk.word_tokenize(sentence)


def stem(word):
    """
    stemming = find the root form of the word
    examples:
    words = ["organize", "organizes", "organizing"]
    words = [stem(w) for w in words]
    -> ["organ", "organ", "organ"]
    """
    return stemmer.stem(word.lower())


def bag_of_words(tokenized_sentence, words):
    """
    return bag of words array:
    1 for each known word that exists in the sentence, 0 otherwise
    example:
    sentence = ["hello", "how", "are", "you"]
    words = ["hi", "hello", "I", "you", "bye", "thank", "cool"]
    bag   = [  0 ,    1 ,    0 ,   1 ,    0 ,    0 ,      0]
    """
    # stem each word
    sentence_words = [stem(word) for word in tokenized_sentence]
    # initialize bag with 0 for each word
    bag = np.zeros(len(words), dtype=np.float32)
    for idx, w in enumerate(words):
        if w in sentence_words:
            bag[idx] = 1

    return bag
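The matching logic in bag_of_words can be tried without NLTK or NumPy using a minimal sketch like the one below. It is not the committed code: `simple_stem` and `bag_of_words_sketch` are hypothetical names, `simple_stem` only lowercases (a stand-in for the PorterStemmer-based stem), and a plain list replaces the float32 array.

```python
# Illustrative sketch only, not part of nltk_utils.py.
# simple_stem is a hypothetical stand-in for stem(): it lowercases the word
# but does not apply Porter stemming, so the example runs without NLTK.
def simple_stem(word):
    return word.lower()


def bag_of_words_sketch(tokenized_sentence, words):
    """Return a list with 1.0 for each vocabulary word found in the sentence."""
    # Normalize the sentence tokens only; the vocabulary is compared as given,
    # which mirrors how bag_of_words treats its `words` argument.
    sentence_words = {simple_stem(w) for w in tokenized_sentence}
    return [1.0 if w in sentence_words else 0.0 for w in words]


sentence = ["hello", "how", "are", "you"]
words = ["hi", "hello", "I", "you", "bye", "thank", "cool"]
print(bag_of_words_sketch(sentence, words))
```

Only the sentence tokens are normalized before the comparison, so a vocabulary entry such as "I" can never match, even for the sentence ["I"]. This suggests the vocabulary passed in as `words` should already be stemmed and lowercased by the caller.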