Commit 02c695f3 authored by Mendis M.C.P.

Update README.md

parent 17b0e510
# 2021-118
##### Main Objective
Compared with current gazette platforms, the main objective of this system is its search feature. Existing gazette systems provide only a simple search that lets users find a gazette by published date, gazette number, or topic name; this system will also let users search by gazette content. It is intended to be the first system to provide this kind of facility.
##### Main Research Questions
1) Current systems display the gazette in Portable Document Format (PDF), so identifying the relevant content means reading the whole PDF document.
2) Storing gazettes in a repository and managing big data, including privacy and ethics issues in collecting and storing data, the mental strain of dealing with too much data, and technical issues in storing data in a text engine.
3) People seeking information have to read a very long description to find one specific piece of information.
4) Gazette content is not classified according to main topic and sub-topic.
##### • IT18115208 | M.C.P. Mendis
##### Individual research question
Current gazette platforms are limited to reading and downloading. Users can search gazettes only by title and published date. The main problem for gazette users is that finding a particular piece of information means reading the entire gazette. They need a solution that finds the required information quickly without reading the whole document.
##### Individual Objectives
The main objective of implementing the Centralized Gazette Platform is to let users search gazette content that is held in Portable Document Format (PDF). Users will be able to search by keywords, key phrases, and whole paragraphs. We plan to use Apache Solr to implement the search engine, because Solr achieves fast search responses by searching an index rather than the text directly. In Solr, a document is the basic unit of indexing and search.
To reach the main objective, the following sub-objectives need to be attained:
1. Download all the gazette PDFs existing in the Department of Government Printers.
2. Extract data from a large number of PDF gazette documents.
3. Find the preprocessing method that gives the highest accuracy.
4. Index the extracted content.
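The idea of searching an index rather than the raw text can be illustrated with a minimal inverted index in plain Python. This is only a sketch of the concept, not Solr's API or implementation; the function names (`tokenize`, `build_index`, `search`) and the sample gazette snippets are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical extracted gazette text, keyed by a made-up gazette id.
gazettes = {
    "g1": "Posts vacant: Department of Examinations, clerk grade II",
    "g2": "Results of examinations for the banking exam",
    "g3": "Land acquisition notice, Colombo district",
}
index = build_index(gazettes)

print(search(index, "examinations"))   # ids g1 and g2
print(search(index, "banking exam"))   # id g2 only
```

Each query term is resolved with a dictionary lookup, so the cost of a search depends on the number of query terms and matching documents rather than on the total amount of text.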
##### • IT18188264 | M.S.M. Perera
##### Individual research question
Anonymous information is often treated as unwanted information. Because so much information is created by many contradictory sources, in various forms and by various agents, managing a large amount of data requires high-performance, multi-dimensional management tools; otherwise, the results may not be acceptable. On the other hand, the more information we use, the more accurate our decisions can be, which raises a further issue with storage capacity. As storage grows, performance also varies. Final results are then not generated accurately or on time, which makes them of no use.
To overcome these issues, we can build a distributed system using the Hadoop framework, in which tasks are divided and run in parallel. The aim of comparing against default Hadoop systems is to learn more about the parameters involved in managing system performance, understand how those parameters operate, and identify the drawbacks of Hadoop's default configuration. An effective algorithm is important for optimizing the efficiency of Hadoop machines when processing big data. Several researchers have implemented new algorithms that improve Hadoop performance relative to the default configuration.
Since massive data comes in different kinds, such as text, numbers, images, and videos, different types of algorithms can be used to improve performance. The problem is identifying which deep learning text classification model best suits Sri Lankan gazette text.
##### Individual Objectives
With massive data, it is not practical to use traditional data management systems or a single-node architecture. In this domain, the main objective is to handle big data in order to optimize content-based search over Sri Lankan gazette text. To address this issue, this component will build a distributed system able to handle requests, storage, and retrieval of big data. It must store millions of records (raw data) across various machines and therefore keep track of which record resides on which node in the data center. A further question is how to run processes across all these machines.
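The two concerns raised above, tracking which node holds which record and running work on every node, can be sketched in a few lines of plain Python. This is a toy, single-process illustration of hash partitioning and a map/reduce-style word count, not Hadoop's API; the node count, the record ids, and all function names are assumptions made for the example.

```python
import zlib
from collections import Counter


def node_for(record_id, node_count):
    """Pick a node with a stable hash so every lookup gives the same answer."""
    return zlib.crc32(record_id.encode("utf-8")) % node_count


def distribute(records, node_count):
    """Place each record on one node and remember where it went."""
    nodes = [dict() for _ in range(node_count)]
    location = {}
    for record_id, text in records.items():
        n = node_for(record_id, node_count)
        nodes[n][record_id] = text
        location[record_id] = n
    return nodes, location


def map_word_counts(node_records):
    """Map step: a node counts the words in its own records only."""
    counts = Counter()
    for text in node_records.values():
        counts.update(text.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-node counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical raw records keyed by a made-up gazette id.
records = {
    "g1": "gazette notice on land",
    "g2": "gazette notice on examinations",
    "g3": "vacancy notice",
    "g4": "gazette vacancy",
}
nodes, location = distribute(records, node_count=3)

# On a real cluster each map step would run on its own machine in parallel.
partials = [map_word_counts(node_records) for node_records in nodes]
totals = reduce_counts(partials)
print(totals["notice"])  # 3
```

The `location` dictionary plays the role of the record-to-node catalogue, and the totals do not depend on how the records happen to be spread across nodes.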
##### • IT18195194 | Karunarathne J.M.P.D
##### Individual research question
The main problem is that no similar platform exists, one that clusters text and automatically summarizes it for the main departments and their information, for local parties who refer to government-issued gazettes and want to gather information with minimal search time and high satisfaction.
##### Individual Objectives
The main goal of implementing the "CENTRALIZED GAZETTE PLATFORM" is to make it easier for users to meet their needs and to turn the gazette into an interactive document to read. Given that the gazette is a weekly-released document, making it more interactive is a direct way to reach this objective. As mentioned previously, the gazette is used by many people working in government, in the private sector, and in self-employment, who consult it daily or weekly in many ways to identify and search for what they need.
Within the main objective of the “CENTRALIZED GAZETTE PLATFORM”, this part of the research focuses mainly on data optimization and on providing a user-friendly view for day-to-day users. The survey I conducted drew many comments about how difficult the gazette is to read and how much effort it takes to extract anything from it. Respondents also noted that the user interfaces are neither easy to use nor attractive. This objective addresses those issues. By optimizing the data effectively, we can make our gazette platform an attractive place to visit and let users obtain valuable details within seconds. How we solve the issue, and the technologies and methods used to detect, preprocess, and finalize the data, are explained in the methodology in the latter part of our document.
##### Specific Objectives
1. To optimize the stored data according to specific content.
2. To summarize the obtained descriptions.
3. To analyze the summarized descriptions and present them as understandable sentences.
4. To provide user-friendly, attractive dashboards to the user.
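The summarization objective could be prototyped with a simple extractive approach before a trained model is chosen. The sketch below is an assumption, not the project's stated method: it scores each sentence by the average frequency of its non-stopword terms and keeps the highest-scoring sentences in their original order. The stopword list and the sample notice are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for",
             "on", "by", "be", "are", "will"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Return lowercase alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


# Hypothetical gazette notice text.
notice = ("Applications are invited for the post of clerk. "
          "The closing date for applications is 30 June. "
          "Late applications will be rejected.")
summary = summarize(notice, max_sentences=1)
print(summary)
```

Because the output is always made of sentences copied from the source, this kind of summary cannot introduce statements the gazette did not make, which matters for official notices.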
##### • IT18396164 | Silva K.K.S
##### Individual research question
Gazettes are very useful in everyday life. The Department of Government Printers releases gazettes in three languages. The main disadvantage of the gazette systems currently in use is that they have no search feature that filters on keywords within the gazette document; they support only date-wise search. We therefore aim to build a suitable model to solve this problem. Clustering is an unsupervised machine learning technique, and we cannot manually fix the number of clusters or assign each cluster a meaningful name based on its content. This is the main research problem.
How to collect the number of gazettes at the same time and how to store the text?
How to classify the gazette content into main topic and subtopic?
How to divide the gazette content into clusters and subclusters?
How to assign meaningful titles for the sub-clusters and the number of clusters automatically?
What are the most suitable machine learning approaches?
What are the most suitable natural language processing approaches?
##### Individual Objectives
The main objective is to build a desktop application that classifies gazette content into categories and then clusters the gazettes. The application divides content into clusters and sub-clusters and automatically assigns each a meaningful title for convenience. When searching, a user will be able to select a gazette by main category, such as Posts - Vacant, Examinations, or Results of Examinations, and then choose a subcategory, such as Exam (law exam, banking exam, O/L exam) or Posts - Vacant (army vacancies, banking vacancies).
Figure 10: Steps of the system implementation
##### Specific Objectives
• The gazette data is extracted using web scraping.
• OCR is used to convert image-format content into text.
• The extracted data is stored in Amazon S3.
• Natural language processing techniques are applied to classify the main topic of the gazette content (posts vacant, exams, results of examinations).
• The content of each sub-category is then classified (e.g. posts vacant: vacant1, vacant2, vacant3).
• The most suitable machine learning algorithm is applied to create a document clustering and sub-clustering model based on content.
• Because clustering is unsupervised learning, suitable cluster titles and the number of sub-clusters must be assigned automatically; the most suitable machine learning algorithm is applied to solve this problem.
• A desktop application function is created to present the view and display accurate results.
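One possible way to assign cluster titles automatically is to pick the terms that are frequent inside a cluster but rare in the others. The sketch below assumes the clusters have already been produced by some clustering algorithm, which is not shown. It is an illustration under that assumption rather than the project's chosen method, and the stopword list, function names, and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"for", "of", "the", "a", "an", "and", "in", "to"}


def tokens(text):
    """Return lowercase alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cluster_titles(clusters, terms_per_title=2):
    """Build one title per cluster from its most distinctive terms."""
    term_counts = [Counter(t for doc in docs for t in tokens(doc))
                   for docs in clusters]

    # Number of clusters in which each term appears at least once.
    cluster_freq = Counter()
    for counts in term_counts:
        cluster_freq.update(counts.keys())

    titles = []
    for counts in term_counts:
        # Terms shared by every cluster get a weight of log(1) = 0.
        scored = {term: tf * math.log(len(clusters) / cluster_freq[term])
                  for term, tf in counts.items()}
        # Sort by score, then alphabetically so ties are deterministic.
        best = sorted(scored, key=lambda t: (-scored[t], t))[:terms_per_title]
        titles.append(" ".join(best))
    return titles


# Hypothetical clusters of gazette snippets.
clusters = [
    ["gazette vacancy for clerk post", "gazette vacancy for driver post"],
    ["gazette results of law exam", "gazette results of banking exam"],
]
print(cluster_titles(clusters))  # ['post vacancy', 'exam results']
```

The word "gazette" appears in every cluster, so it scores zero and never reaches a title, which is the behaviour wanted from a label that should distinguish one cluster from another.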