
A new open-source memory project, MemPalace, launched on April 22, 2026, claiming a "100% score on LoCoMo" and the "first perfect score on LongMemEval." The announcement drew over 1.5 million views on social media, but experts quickly picked apart its methodology.
MemPalace captured attention with its lofty claims, racking up over 7,000 GitHub stars within its first 24 hours. Critics, however, have pointed out inconsistencies in its reported scores, arguing that the project's own benchmark setup has significant flaws.
Inflated LoCoMo Claims: The 100% score comes from a top_k bypass: with top_k=50 and each of LoCoMo's ten conversations containing fewer than 50 sessions, every ground-truth session is always included in the candidate pool. Experts noted the realistic figure is around 60.3% R@10 without reranking, making the claimed perfection misleading.
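The arithmetic behind the critique is easy to verify. The sketch below (illustrative only; function names are mine, not MemPalace's) shows that once k meets or exceeds the number of candidate sessions, recall@k is trivially 100% no matter how bad the retrieval is:

```python
import random

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of ground-truth items that appear in the top-k results."""
    top = set(ranked_ids[:k])
    hits = sum(1 for r in relevant_ids if r in top)
    return hits / len(relevant_ids)

# A LoCoMo-style conversation with 40 sessions: even a random ordering
# "retrieves" every ground-truth session when k=50 > 40.
sessions = list(range(40))
random.shuffle(sessions)  # worst case: no retrieval signal at all
print(recall_at_k(sessions, relevant_ids=[3, 17], k=50))  # -> 1.0, always
print(recall_at_k(sessions, relevant_ids=[3, 17], k=10))  # varies run to run
```

This is why the critics focus on R@10: at a cutoff smaller than the candidate pool, the metric actually measures ranking quality.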
LongMemEval Misrepresentation: The so-called perfect score is achieved through retrieval alone, skipping the answer-generation and judging phase the benchmark requires. As one user observed, "Calling this a perfect score is a metric category error," since it reflects retrieval performance, not the end-to-end evaluation.
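The category error can be made concrete. In a LongMemEval-style pipeline there are two distinct stages, and scoring only the first conflates two different metrics (all names below are hypothetical, for illustration):

```python
def retrieval_hit(retrieved_sessions, gold_sessions):
    # Stage 1: did we fetch the evidence? (what MemPalace reportedly measured)
    return set(gold_sessions) <= set(retrieved_sessions)

def answer_correct(generated_answer, gold_answer, judge):
    # Stage 2: did the system actually answer correctly? (what the benchmark scores)
    return judge(generated_answer, gold_answer)

# A toy judge; real benchmarks typically use an LLM judge here.
def exact_match_judge(generated, gold):
    return generated.strip().lower() == gold.strip().lower()

# Perfect stage-1 recall says nothing about stage 2: a generator can still
# produce a wrong answer from perfectly retrieved context.
print(retrieval_hit([1, 2, 3], gold_sessions=[2]))                    # True
print(answer_correct("London", "Paris", exact_match_judge))           # False
```

Reporting stage-1 recall as the benchmark score is exactly the substitution critics are objecting to.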
Teaching to the Test: Some users criticized how MemPalace achieved its scores, pointing to hard-coded fixes targeting specific benchmark questions. "This feels like gaming the system," one commenter noted.
User responses have ranged from skepticism to outright criticism. One commenter noted, "If I get 100% anywhere, I fucked up," underlining the community sentiment towards inflated metrics in machine learning.
Negative Sentiment: Many assert that MemPalace's claims overstate its capabilities.
Skeptical Engagement: Users are actively debating the methodologies, suggesting that further examination could reveal deeper flaws.
Curiosity for Alternatives: Some are turning their gaze towards other memory systems for comparison.
Experts argue that the MemPalace launch and its subsequent critique illustrate a widespread validation problem in AI benchmarking. As one user summarized, "The field needs standardized evaluation pipelines." Until that happens, sensational headlines will continue to dominate while the nuanced discussions remain buried.
- MemPalace's claim of 100% on LoCoMo traced to a top_k flaw
- Realistic LoCoMo numbers yield roughly 60.3% R@10 without reranking
- The LongMemEval score labeled a "metric category error"
- Community calls for better benchmark validation processes
This incident serves as a reminder of the importance of scrutinizing AI performance claims, ensuring that robust performance metrics become the standard in the rapidly evolving AI landscape.
As the dust settles around MemPalace's controversial claims, there's a strong chance the project will face increased scrutiny from experts and the broader AI community alike. Predictions suggest roughly a 70% probability that MemPalace will need to revise its claims or publish a more concrete methodology to regain trust. Failure to address these issues could dampen interest and slow adoption, and there is perhaps a 50% chance that rivals seize the moment with more reliable memory systems. The outcome will hinge on community engagement and on whether MemPalace takes a more transparent approach to validating its performance metrics.
In reflecting on MemPalace's scenario, one can draw an unusual parallel to the early days of aviation when heated competition led companies to exaggerate performance specs. Just as some early aircraft manufacturers made lofty claims to attract investors and buyers while overlooking safety standards, the tech world sometimes mirrors that eagerness to impress with numbers rather than realism. This historical lens underscores that while ambition is vital, accountability in achievements is essential to ensure that new technologies genuinely fulfill their promises.