OCR System for Financial Records
Problem. Manual processing of financial records and invoices is slow and error-prone, making it hard to extract and manage key financial data accurately.
What I took away. Ongoing; takeaways to come.
A 2nd-year Applied Computer Science Engineering student at the National Engineering School of Sousse. I enjoy building things across the full software spectrum, with a growing focus on machine learning & data science.
A short list, in my own words. Not a mission statement.
Part-time, in the thick of it. Document intelligence and computer vision at Core Techs Solutions.
A few that taught me something worth keeping.
Problem. Assessing the physical condition of used books manually is subjective, time-consuming, and inconsistent, especially at scale for resale platforms or libraries.
What I took away. Applied computer vision to real-world data, handled noisy and imbalanced datasets, tuned hyperparameters, sped up inference with deployment formats (ONNX, TensorRT), and got comfortable with data preprocessing and evaluation.
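One piece of the imbalanced-data work is easy to show in miniature: inverse-frequency class weights, so rare classes count more in the loss. This is a from-scratch sketch with made-up condition labels, not the project's actual classes or code.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: total / (n_classes * count),
    so underrepresented classes weigh more during training."""
    counts = Counter(labels)
    n_classes = len(counts)
    return {c: len(labels) / (n_classes * n) for c, n in counts.items()}

# invented condition labels for a used-book classifier
labels = ["good"] * 80 + ["worn"] * 15 + ["damaged"] * 5
weights = class_weights(labels)
```

The resulting dict can be passed to a training loop that supports per-class weighting; the point is just that the 5-example "damaged" class ends up weighted far above "good".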
Problem. Drafting emails and meeting summaries from rough bullet points, turning them into something readable without losing the writer's tone.
What I took away. Fine-tuning an LLM on business communication data, and the gap between raw model output and something a real person would send.
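For flavor, here is roughly what one training record looks like when a set of bullet points is paired with the finished email. The field names follow the common chat-format convention for instruction tuning; they are illustrative, not the project's actual schema.

```python
def to_chat_example(bullets, final_email):
    """Pack one (bullet notes -> polished email) pair into a chat-style
    fine-tuning record. Field names are the usual convention, assumed here."""
    prompt = "Turn these notes into a ready-to-send email:\n" + "\n".join(
        f"- {b}" for b in bullets
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": final_email},
        ]
    }

example = to_chat_example(
    ["meeting moved to Tuesday", "budget approved"],
    "Hi team, quick update: the meeting has moved to Tuesday, "
    "and the budget is approved. Best, Sam",
)
```

A few thousand records in this shape are what most fine-tuning stacks expect as input.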
Problem. An investment committee evaluating a Tunisian AgriTech startup's Series A needs to query a dense, heterogeneous document corpus — financials, legal statutes, market studies — accurately and without hallucination, citing exact figures and refusing when information is absent.
What I took away. Retrieval quality drives answer accuracy far more than model size: hybrid semantic + BM25 + RRF fusion, semantic prefixes, and a hard confidence threshold are what keep the system grounded. Between the two implementations, the Python/LangChain pipeline is the more serious tool — persistent storage, query rewriting, and hybrid retrieval — while n8n trades depth for ease of deployment.
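The fusion step is small enough to sketch in full. This is Reciprocal Rank Fusion over two ranked lists; the document ids are invented, and k=60 is the constant from the original RRF formulation, not a value tuned for this pipeline.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each retriever casts a 1/(k + rank) vote
    per document; votes are summed and documents re-ranked by total."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# illustrative ids, not real corpus documents
semantic = ["doc3", "doc1", "doc7"]  # embedding retriever, best first
keyword = ["doc1", "doc5", "doc3"]   # BM25 retriever, best first
fused = rrf_fuse([semantic, keyword])
```

Note how doc1 wins: appearing near the top of both lists beats topping only one, which is exactly why fusion keeps a single noisy retriever from dominating.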
Problem. Predicting install counts across a 3M-row dataset of Play Store apps, and prototyping a market-research tool for indie developers.
What I took away. Handling scale with pandas and Keras without the notebook falling over, and how much cleaning matters before the model sees anything.
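The "not falling over" part mostly means never holding all 3M rows in memory at once. A toy sketch of chunked aggregation with pandas; the column names and the tiny inline CSV are stand-ins, not the real dataset.

```python
import io
import pandas as pd

def stream_aggregate(csv_source, chunksize=100_000):
    """Sum installs per category chunk by chunk, so peak memory stays
    bounded by chunksize rather than total row count."""
    totals = {}
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        for cat, n in chunk.groupby("category")["installs"].sum().items():
            totals[cat] = totals.get(cat, 0) + int(n)
    return totals

# tiny stand-in for the real multi-million-row CSV
demo = io.StringIO("category,installs\ngame,100\ngame,200\ntool,50\n")
totals = stream_aggregate(demo, chunksize=2)
```

The same pattern feeds a Keras model via a generator instead of a dict, but the idea is identical: the full file never becomes one DataFrame.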
Problem. Predicting essential soil nutrients to help optimize maize yields on African farms.
What I took away. How to pull signal out of environmental and satellite data (Sentinel, MODIS), and how much a good time-series feature matters.
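A time-series feature in its simplest form: a trailing rolling mean, the kind of smoothing you might apply to a monthly vegetation-index series from Sentinel before feeding it to a model. Pure-Python sketch with invented values, nothing project-specific.

```python
def rolling_mean(series, window):
    """Trailing rolling mean; early points average over whatever
    history is available so the output keeps the input's length."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        win = series[lo:i + 1]
        out.append(sum(win) / len(win))
    return out

# a short NDVI-like monthly series (values invented)
smoothed = rolling_mean([0.2, 0.4, 0.6, 0.5], window=2)
```

Even a feature this simple often beats raw point-in-time readings, because it irons out single-scene noise from clouds and sensor artifacts.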
Latest notes.
If something here sparked a question, reach out. I answer.