Automated Classifications - AI for Classifying News Articles

Simon Zeru

Last modified: June 10, 2025

📅 Period: January 2024 – January 2024

🛠️ Technologies: Java

👩‍💻 Expertise: Back-end development, AI algorithms, Data processing

🎓 Acquired skills: Developing an application, Optimizing IT applications, Understanding and building algorithms, Working in a team

🔗 GitHub: View repository

📄 Report: See PDF

This article discusses the implementation of my automated classification project using AI and sorting algorithms.

Project Overview

We developed an AI system that classifies press articles by category (Politics, Sports, Culture, etc.). We used GitHub to collaborate as a group.

Image of my code structure

To achieve this, we applied several algorithmic principles learned at the beginning of the year (sequential and binary search, merge sort). We also had to keep the application's energy cost low, which meant optimizing its performance.
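
To illustrate the merge sort mentioned above, here is a minimal sketch in Java that sorts an array of words alphabetically. The class and method names (MergeSortSketch, sort, merge) are hypothetical and are not taken from the project repository; the project's own implementation may differ.

```java
import java.util.Arrays;

// Hypothetical sketch: top-down merge sort over an array of words.
final class MergeSortSketch {

    // Returns a new sorted array; the input array is left unchanged.
    static String[] sort(String[] words) {
        if (words.length <= 1) {
            return Arrays.copyOf(words, words.length);
        }
        int mid = words.length / 2;
        String[] left = sort(Arrays.copyOfRange(words, 0, mid));
        String[] right = sort(Arrays.copyOfRange(words, mid, words.length));
        return merge(left, right);
    }

    // Merges two already-sorted arrays into one sorted array.
    private static String[] merge(String[] left, String[] right) {
        String[] out = new String[left.length + right.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length) {
            // "<=" keeps equal elements in their original order (stable sort).
            if (left[i].compareTo(right[j]) <= 0) {
                out[k++] = left[i++];
            } else {
                out[k++] = right[j++];
            }
        }
        while (i < left.length) out[k++] = left[i++];
        while (j < right.length) out[k++] = right[j++];
        return out;
    }

    public static void main(String[] args) {
        String[] sorted = sort(new String[] {"vote", "goal", "match", "election"});
        System.out.println(Arrays.toString(sorted));
    }
}
```

Running main should print [election, goal, match, vote]. Merge sort needs O(n log n) comparisons even in the worst case, which is why it is a common choice when a predictable cost matters.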

Image of the algorithm's complexity

Methodology

To classify articles, a category-specific lexicon is first built in two stages:

  1. Initialization: iterate over all pre-labeled articles for the target category and count each word’s occurrences, then rescan the entire dataset to adjust these counts: increment for each occurrence in the same category, decrement for each occurrence in another category.
  2. Weighting: convert each word’s net score into a discrete weight (0–3) based on defined thresholds, retaining only words with weight > 0 in the final lexicon.
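
The two stages above can be sketched in Java as follows. This is one possible reading of the description, not the project's actual code: the names (LexiconSketch, Article, initialize, weightOf, buildLexicon) are hypothetical, articles are assumed to be already split into words, and the thresholds in weightOf are placeholder values, since the real ones are not given here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the two-stage lexicon construction.
final class LexiconSketch {

    // A pre-labeled article: its category and its words (already tokenized).
    static final class Article {
        final String category;
        final List<String> words;

        Article(String category, List<String> words) {
            this.category = category;
            this.words = words;
        }
    }

    // Stage 1: count words in the target category, then rescan all articles
    // and adjust: +1 per occurrence in the target category, -1 elsewhere.
    static Map<String, Integer> initialize(List<Article> dataset, String target) {
        Map<String, Integer> scores = new HashMap<>();
        for (Article a : dataset) {
            if (a.category.equals(target)) {
                for (String w : a.words) scores.merge(w, 1, Integer::sum);
            }
        }
        for (Article a : dataset) {
            int delta = a.category.equals(target) ? 1 : -1;
            for (String w : a.words) {
                // Assumption: only words seen in the target category are tracked.
                scores.computeIfPresent(w, (k, v) -> v + delta);
            }
        }
        return scores;
    }

    // Stage 2: map a net score to a discrete weight in 0..3.
    // The thresholds (1, 3, 6) are assumed values for illustration only.
    static int weightOf(int score) {
        if (score >= 6) return 3;
        if (score >= 3) return 2;
        if (score >= 1) return 1;
        return 0;
    }

    // Builds the final lexicon, keeping only words whose weight is > 0.
    // A TreeMap keeps the words in alphabetical order.
    static Map<String, Integer> buildLexicon(List<Article> dataset, String target) {
        Map<String, Integer> lexicon = new TreeMap<>();
        initialize(dataset, target).forEach((word, score) -> {
            int weight = weightOf(score);
            if (weight > 0) lexicon.put(word, weight);
        });
        return lexicon;
    }
}
```

With these assumed thresholds, a word that appears as often in other categories as in the target one ends with a net score near zero and is dropped, which is the filtering effect the weighting stage aims for.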

During classification, each Category instance loads its lexicon, iterates through the words of an article, and accumulates their weights to compute a total score. The article is then assigned to the category with the highest score. The lexicon is sorted with a merge sort, which lets each lookup use binary search instead of a sequential scan and keeps execution efficient even on large text volumes.
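
The scoring step described above could look like the following sketch. It assumes each lexicon is stored as two parallel arrays (sorted words and their weights) so that the standard Arrays.binarySearch can be used for lookups. The names ClassifierSketch, Category.score and classify are hypothetical, and the way ties are broken (the first category in the list wins) is also an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: score an article against each category's lexicon.
final class ClassifierSketch {

    // A category and its lexicon, stored as parallel arrays.
    // `words` must be sorted alphabetically for binary search to be valid.
    static final class Category {
        final String name;
        final String[] words;
        final int[] weights;

        Category(String name, String[] words, int[] weights) {
            this.name = name;
            this.words = words;
            this.weights = weights;
        }

        // Sums the weights of the article's words that appear in the lexicon.
        int score(List<String> article) {
            int total = 0;
            for (String w : article) {
                int idx = Arrays.binarySearch(words, w);
                if (idx >= 0) total += weights[idx];
            }
            return total;
        }
    }

    // Returns the name of the highest-scoring category,
    // or null if the category list is empty.
    static String classify(List<String> article, List<Category> categories) {
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (Category c : categories) {
            int s = c.score(article);
            if (s > bestScore) {
                bestScore = s;
                best = c.name;
            }
        }
        return best;
    }
}
```

For example, with a Sports lexicon {goal: 3, match: 1} and a Politics lexicon {election: 2, vote: 3}, the article "the goal match vote" scores 4 for Sports and 3 for Politics, so it would be labeled Sports.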

Project Results

The comparative analysis of search and sorting methods revealed significant differences in performance, emphasizing the importance of algorithm selection based on the constraints of the problem.
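
To give a sense of why the choice of search method matters, the sketch below counts how many elements each strategy inspects when looking up the last word of a sorted array. It is an illustration on synthetic data, not a reproduction of the project's measurements; the class name SearchComparisonSketch, its methods, and the 1,024-word lexicon are all assumptions.

```java
// Hypothetical sketch: count the elements inspected by each search strategy.
final class SearchComparisonSketch {

    // Sequential search: number of elements compared to the key.
    static int sequentialComparisons(String[] sorted, String key) {
        int comparisons = 0;
        for (String s : sorted) {
            comparisons++;
            if (s.equals(key)) break;
        }
        return comparisons;
    }

    // Binary search on a sorted array: number of probes made.
    static int binaryComparisons(String[] sorted, String key) {
        int comparisons = 0;
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            comparisons++;
            int cmp = sorted[mid].compareTo(key);
            if (cmp == 0) break;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Synthetic lexicon of 1024 zero-padded words, already in sorted order.
        String[] lexicon = new String[1024];
        for (int i = 0; i < lexicon.length; i++) {
            lexicon[i] = String.format("w%04d", i);
        }
        String key = "w1023"; // last element: worst case for sequential search
        System.out.println("sequential: " + sequentialComparisons(lexicon, key));
        System.out.println("binary:     " + binaryComparisons(lexicon, key));
    }
}
```

On this synthetic input, sequential search inspects all 1,024 entries while binary search needs 11 probes. The gap grows with the size of the lexicon, at the price of having to sort it first.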

Our system achieved a correct-classification rate above 50% on a sample of 21 news articles, which supports the relevance of the lexical approach despite its limitations. Improvements are possible, particularly through more advanced models such as neural networks trained on large datasets with backpropagation and gradient descent.

The full project was presented in a live demonstration and documented in an English-language report submitted to the teaching staff.

Final grade: 17/20