Project Overview
We developed an artificial-intelligence application that classifies press articles by category (Politics, Sports, Culture, etc.). GitHub was used for group collaboration.
To achieve this, we applied several algorithmic principles learned at the beginning of the year (binary and sequential search, merge sort). We also had to respect energy-cost constraints to keep the application's performance efficient.
Methodology
To classify articles, a category-specific lexicon is first built in two stages:
- Initialization: iterate over all pre-labeled articles for the target category, count each word's occurrences, then rescan the entire dataset to adjust these counts: increment for occurrences in the same category, decrement for occurrences in other categories.
- Weighting: convert each word's net score into a discrete weight (0 to 3) based on defined thresholds, retaining only words with weight > 0 in the final lexicon (see the sketch below).
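For reference, here is a minimal Python sketch of this two-stage construction. The function name build_lexicon, the (category, words) data layout, and the threshold values are illustrative assumptions, not the project's actual code.

```python
from collections import Counter

# Illustrative thresholds mapping a net score to a discrete weight (0 to 3);
# the values actually used in the project are not reproduced here.
THRESHOLDS = [(50, 3), (20, 2), (5, 1)]

def build_lexicon(articles, target_category):
    """Build a weighted lexicon for one category.

    `articles` is assumed to be a list of (category, words) pairs,
    where `words` is the tokenized text of a pre-labeled article.
    """
    scores = Counter()

    # Initialization: count each word's occurrences in the target category.
    for category, words in articles:
        if category == target_category:
            scores.update(words)

    # Rescan the entire dataset to adjust the counts: increment for
    # same-category occurrences, decrement for other categories.
    for category, words in articles:
        for word in words:
            if word in scores:
                scores[word] += 1 if category == target_category else -1

    # Weighting: convert net scores to weights and keep only weight > 0.
    lexicon = {}
    for word, score in scores.items():
        for threshold, weight in THRESHOLDS:
            if score >= threshold:
                lexicon[word] = weight
                break
    return lexicon
```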
During classification, each Category instance loads its lexicon, iterates through the words of an article, and accumulates their weights to compute a total score. The article is then assigned to the category with the highest score. Merge sort is used to keep each lexicon sorted so that lookups (by binary search) remain fast, ensuring efficient execution even on large text volumes.
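The Python sketch below illustrates this scoring step together with the merge sort and binary search it relies on; the Category interface, function names, and data layout are assumptions made for illustration, not the project's exact code.

```python
def merge_sort(items):
    """Plain merge sort, used to keep the lexicon words in sorted order."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

def binary_search(sorted_words, word):
    """Return the index of `word` in `sorted_words`, or -1 if absent."""
    lo, hi = 0, len(sorted_words) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_words[mid] == word:
            return mid
        if sorted_words[mid] < word:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

class Category:
    def __init__(self, name, lexicon):
        self.name = name
        self.weights = lexicon                          # word -> weight (1 to 3)
        self.sorted_words = merge_sort(list(lexicon))   # sorted for binary search

    def score(self, article_words):
        """Accumulate the weights of the article's words found in the lexicon."""
        total = 0
        for word in article_words:
            if binary_search(self.sorted_words, word) != -1:
                total += self.weights[word]
        return total

def classify(article_words, categories):
    """Assign the article to the category with the highest total score."""
    return max(categories, key=lambda c: c.score(article_words)).name
```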
Project Results
The comparative analysis of search and sorting methods revealed significant differences in performance, emphasizing the importance of algorithm selection based on the constraints of the problem.
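As an illustration of the kind of difference observed, a micro-benchmark along the following lines contrasts a sequential scan with a binary search (here via Python's bisect module) on a sorted lexicon. The lexicon size and query count are placeholders, not the project's actual measurements.

```python
import random
import timeit
from bisect import bisect_left

# Hypothetical sorted lexicon and query set, for illustration only.
lexicon = sorted(f"word{i:06d}" for i in range(50_000))
queries = random.sample(lexicon, 1_000)

def sequential_lookup(word):
    return word in lexicon                     # O(n) scan of the list

def binary_lookup(word):
    i = bisect_left(lexicon, word)             # O(log n) on the sorted lexicon
    return i < len(lexicon) and lexicon[i] == word

print("sequential:", timeit.timeit(lambda: [sequential_lookup(q) for q in queries], number=1))
print("binary:    ", timeit.timeit(lambda: [binary_lookup(q) for q in queries], number=1))
```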
Our system achieved a classification rate above 50% on a sample of 21 news articles, validating the relevance of the lexical approach despite its limitations. Improvements are possible, particularly through the use of more advanced models such as neural networks trained on large datasets using backpropagation or gradient descent algorithms.
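Purely to illustrate that direction for improvement, the sketch below runs plain gradient descent on a logistic-regression text classifier with random placeholder data; it is not part of the delivered system, and the feature matrix, labels, and hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 300))              # hypothetical bag-of-words features
y = rng.integers(0, 2, 100)             # hypothetical binary labels

w = np.zeros(X.shape[1])
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))    # sigmoid predictions
    grad = X.T @ (p - y) / len(y)       # gradient of the log-loss
    w -= lr * grad                      # gradient descent step
```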
The entire work was presented during a live demonstration and documented in an English report submitted to the teaching staff.
Final grade: 17/20