This evaluation focused on LiquidAI/LFM2.5-1.2B-Instruct-GGUF:Q8_0, a compact instruction-tuned language model developed by Liquid AI. The explicit goal was to determine whether a small, fully on-device model can function as a credible scientific reasoning tool when assessed on real problems rather than synthetic benchmarks.
Model Deployment and Execution Environment
The model was executed entirely on an Android smartphone (OnePlus 8), representative of older, pre–AI-accelerator mobile hardware.
The deployment used a minimal, fully open toolchain: Termux provided a Linux-like runtime environment on Android, and llama.cpp, compiled for mobile execution, served as the inference backend.
The model was downloaded in GGUF format from Hugging Face and loaded directly into llama.cpp. Inference was launched using llama-server, exposing a local HTTP interface on port 8080. A standard mobile web browser connected to this local endpoint, providing a chat-style interface for extended evaluation.
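The same local endpoint can also be queried programmatically. As a minimal sketch, the snippet below targets llama-server's OpenAI-compatible chat-completions route on port 8080; the prompt, temperature, and token budget are illustrative choices, not settings from the evaluation itself.

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"  # local llama-server endpoint from the setup above

def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a chat-completion request for llama-server's OpenAI-compatible API."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature suits deterministic reasoning tasks
    }
    return urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Requires llama-server to be running locally on port 8080.
    req = build_request("Solve for x: 2x + 3 = 11.")
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
```

Because the interface is plain HTTP on localhost, any client works, which is how the stock mobile browser could serve as the chat front end.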
This setup effectively turned the smartphone into a standalone, offline AI workstation, capable of long-context, multi-turn reasoning sessions without cloud connectivity or external tooling.
Despite the hardware constraints, the model sustained approximately 7–10 tokens per second during interactive use. Combined with its unusually large context window (~120K tokens), this allowed for extended analytical exchanges without truncation or loss of coherence.
Importantly, this performance was achieved without dedicated NPUs, suggesting that newer AI-focused mobile hardware could support substantially higher throughput for the same model class.
Evaluation Methodology: Real Problems Over Benchmarks
Rather than relying on leaderboard metrics, the evaluation deliberately employed manually constructed problems drawn from undergraduate-level mathematics and physics—domains where correctness, internal consistency, and verifiability matter more than stylistic fluency.
The tests were designed to probe not only whether the model could produce an answer, but whether it could:
Correctly identify the governing principles.
Apply appropriate solution methods.
Maintain numerical and conceptual consistency.
Verify results against independent constraints.
The evaluation covered three broad categories:
Mathematical Reasoning
The model was tested on:
Linear systems solved via Gaussian elimination, including explicit row operations and substitution-based verification.
First-order differential equations requiring recognition of the equation type, correct selection of the integrating factor method, handling of non-elementary integrals, and enforcement of initial conditions.
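The substitution-based verification applied to the linear-system tasks can be reproduced mechanically. The sketch below solves a small hypothetical 3×3 system (not one of the actual evaluation prompts) by elimination with partial pivoting, then checks the result by substituting it back into the original equations:

```python
def gaussian_elimination(A, b):
    """Solve Ax = b by forward elimination with partial pivoting and back-substitution."""
    n = len(A)
    # Work on an augmented copy so the inputs stay untouched.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: bring the row with the largest pivot into place.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back-substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

A = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
x = gaussian_elimination(A, b)
# Substitution-based verification: A applied to x must reproduce b.
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
assert all(abs(r) < 1e-9 for r in residual)
```

The residual check is the programmatic analogue of the model's own verification step: a candidate solution is only accepted if it satisfies every original equation.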
In these tasks, the model consistently arrived at correct final solutions and verified them appropriately. While intermediate reasoning sometimes included exploratory or imperfect steps—such as unnecessary trial solutions or momentary algebraic looseness—the final expressions satisfied the equations and constraints exactly. Importantly, the model demonstrated the ability to recover from flawed intermediate paths and converge on valid solutions.
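The integrating-factor method exercised in the differential-equation tasks follows a standard pattern. As a generic sketch (not one of the actual test equations), for a first-order linear equation

```latex
\frac{dy}{dx} + p(x)\,y = q(x), \qquad
\mu(x) = \exp\!\left(\int p(x)\,dx\right),
```

multiplying through by the integrating factor $\mu$ collapses the left-hand side into a total derivative:

```latex
\frac{d}{dx}\bigl(\mu(x)\,y\bigr) = \mu(x)\,q(x)
\quad\Longrightarrow\quad
y(x) = \frac{1}{\mu(x)}\left(\int \mu(x)\,q(x)\,dx + C\right),
```

with the constant $C$ fixed by the initial condition. When $\int \mu q\,dx$ has no elementary antiderivative, the solution is left in definite-integral form, which is precisely the "handling of non-elementary integrals" the tasks probed.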
Physics Reasoning
Classical mechanics problems were used to test physical reasoning across multiple formulations, including:
Newtonian force analysis, with correct treatment of friction and sign conventions.
Kinematic equations.
Independent verification using the work–energy principle.
The model’s solutions were numerically stable and internally consistent, with force-based and energy-based approaches yielding identical results. This cross-law agreement is a strong indicator of genuine physical reasoning rather than surface-level pattern matching.
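A cross-law check of this kind can be made concrete. The numbers below describe a hypothetical friction problem, not one of the evaluation items: a block slides on a horizontal floor with initial speed v0 and comes to rest under kinetic friction, and the stopping distance is computed independently from Newton's second law plus kinematics and from the work–energy principle:

```python
import math

# Hypothetical scenario: block decelerating under kinetic friction.
m, v0, mu, g = 2.0, 6.0, 0.4, 9.81  # kg, m/s, (dimensionless), m/s^2

# Route 1: Newtonian force analysis + kinematics.
# Friction force mu*m*g opposes motion (taking the direction of motion as +x).
a = -mu * g                       # constant deceleration, independent of mass
d_kinematics = -v0**2 / (2 * a)   # from v^2 = v0^2 + 2*a*d with final v = 0

# Route 2: work-energy principle.
# Friction work -mu*m*g*d removes all kinetic energy: (1/2)*m*v0^2 = mu*m*g*d.
d_energy = 0.5 * v0**2 / (mu * g)

# Cross-law agreement: both formulations must give the same distance.
assert math.isclose(d_kinematics, d_energy)
print(f"stopping distance: {d_kinematics:.3f} m")
```

Agreement here is guaranteed analytically, which is exactly why it makes a useful consistency test: a model that merely pattern-matches one formula tends to break this invariant, while the evaluated model's force-based and energy-based answers coincided.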
Conceptual Physics and Narrative Explanation
To probe conceptual understanding, the model was asked to explain physics ideas—such as inertia, reference frames, and gravity—in narrative form. These prompts tested whether the model could convey correct intuition without relying on equations.
The resulting explanations were largely accurate and avoided major misconceptions. While some conceptual boundaries (e.g., between inertia and time) were described with mild poetic looseness, the core physical principles remained intact. This behavior is typical of small models and did not undermine overall correctness.