Created Feb 11, 2025 by Colette Moreno (@colettemoreno), Maintainer

If there's Intelligent Life out There


Optimizing LLMs to be proficient at specific tests backfires on Meta, Stability.


When you buy through links on our site, we may earn an affiliate commission. Here's how it works.

Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard aims to be a more challenging, consistent standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.

Pumped to announce the brand new open LLM leaderboard. We burned 300 H100s to re-run new evaluations for all major open LLMs! Some learnings: Qwen 72B is the king and Chinese open models are dominating overall; previous evaluations have become too easy for recent ... June 26, 2024

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layperson's terms, and, most daunting of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.
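For a sense of how scores from six heterogeneous benchmarks can be rolled into a single ranking, here is a minimal sketch of one common aggregation scheme: rescale each raw accuracy between its random-guessing baseline and a perfect score, then average. The function names and numbers below are illustrative assumptions, not Hugging Face's published code.

```python
def normalize(raw: float, baseline: float) -> float:
    """Rescale a raw accuracy so the random-guessing baseline maps to 0.0
    and a perfect score maps to 1.0. Scores at or below baseline clamp to 0."""
    if raw <= baseline:
        return 0.0
    return (raw - baseline) / (1.0 - baseline)

def leaderboard_score(results: dict[str, float], baselines: dict[str, float]) -> float:
    """Average the normalized scores across all benchmarks."""
    normed = [normalize(results[b], baselines[b]) for b in results]
    return sum(normed) / len(normed)

# Toy example: two 4-way multiple-choice benchmarks (25% random baseline)
# and one open-ended math benchmark (0% baseline).
results = {"knowledge": 0.70, "long_context": 0.25, "math": 0.40}
baselines = {"knowledge": 0.25, "long_context": 0.25, "math": 0.0}
print(round(leaderboard_score(results, baselines), 3))  # 0.333
```

Normalizing before averaging keeps a benchmark with a high chance-level floor (like multiple choice) from inflating the overall score: the `long_context` result above is exactly chance, so it contributes nothing.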

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes 1st, 3rd, and 10th place with its handful of variants. Also showing up are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.

Tests to qualify for the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue on Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted set of significant models, to avoid a confusing glut of small LLMs.

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a way to compare and reproduce testing results from a number of established LLMs, the board quickly took off in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally more powerful, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second variant.

Some LLMs, including newer versions of Meta's Llama, significantly underperformed in the new leaderboard compared to their high marks in the first. This stems from a trend of over-training LLMs on only the first leaderboard's benchmarks, leading to regression in real-world performance. This performance regression, driven by hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.
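The over-training dynamic described above can be sketched with a toy experiment (all names and numbers here are illustrative, not from the article): a "model" that simply memorizes the public benchmark's answer key scores perfectly on that benchmark while doing no better than chance on a fresh, held-out set.

```python
import random

random.seed(0)

# Hypothetical 4-way multiple-choice items: (question_id, correct_choice).
public_benchmark = [(i, random.randrange(4)) for i in range(100)]
held_out_set = [(i + 1000, random.randrange(4)) for i in range(100)]

# A "model" that has memorized the public answer key and guesses otherwise.
answer_key = dict(public_benchmark)

def memorizing_model(question_id: int) -> int:
    return answer_key.get(question_id, random.randrange(4))

def accuracy(items) -> float:
    return sum(memorizing_model(q) == a for q, a in items) / len(items)

print(accuracy(public_benchmark))  # 1.0: perfect on the benchmark it memorized
print(accuracy(held_out_set))      # roughly 0.25: chance-level on new questions
```

The gap between the two numbers is the point: a leaderboard built only from public, static tests measures memorization as readily as capability, which is one motivation for refreshing the benchmark suite.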


Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin covers all the latest tech news.


- bit_user: "LLM performance is only as good as its training data and that real artificial 'intelligence' is still many, many years away." First, this statement discounts the role of network architecture.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either. Reply

- jp7189: I don't like the click-bait China vs. the world title. The truth is Qwen is open source, open weights, and can be run anywhere. It can (and already has been) fine-tuned to add/remove bias. I applaud Hugging Face's work to develop standardized tests for LLMs, and for putting the focus on open source, open weights first. Reply

- jp7189: bit_user said: "First, this statement discounts the role of network architecture.

Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and capabilities you might be familiar with, if you study child development or animal intelligence.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, or else the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also need not necessarily do so, either." We're creating tools to help humans, therefore I would argue LLMs are more useful if we grade them by human intelligence standards. Reply

