Created Feb 10, 2025 by Rodolfo Weis (@rodolfoweis26), Maintainer

If there's Intelligent Life out There


Optimizing LLMs to excel at specific tests backfires for Meta and Stability.


When you purchase through links on our site, we may earn an affiliate commission. Here's how it works.

Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard aims to be a more challenging, uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.

Pumped to announce the brand new open LLM leaderboard. We burned 300 H100s to re-run new evaluations like MMLU-Pro for all major open LLMs! Some learnings: Qwen 72B is the king and Chinese open models are dominating overall; previous evaluations have become too easy for recent ... June 26, 2024

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning over extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layperson's terms, and, most daunting of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.
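Leaderboards like this typically report a single normalized average across their benchmarks, so that a multiple-choice test where random guessing scores 25% isn't weighted the same as an open-ended one where guessing scores near zero. The sketch below shows one plausible normalization scheme (subtract the random-guess baseline, rescale to 0-100); the benchmark names and baseline values are illustrative assumptions, not Hugging Face's exact method:

```python
def normalize_score(raw: float, random_baseline: float) -> float:
    """Rescale a raw accuracy so random guessing maps to 0 and a
    perfect score maps to 100; scores below baseline clamp to 0."""
    if raw <= random_baseline:
        return 0.0
    return 100.0 * (raw - random_baseline) / (1.0 - random_baseline)

def leaderboard_average(scores: dict[str, float],
                        baselines: dict[str, float]) -> float:
    """Average the normalized scores across all benchmarks."""
    normed = [normalize_score(scores[b], baselines[b]) for b in scores]
    return sum(normed) / len(normed)

# Hypothetical raw accuracies for one model on two benchmarks:
scores = {"mmlu_pro": 0.55, "math_hard": 0.10}
# Assumed random-guess baselines: 10 answer choices for MMLU-Pro,
# effectively zero for open-ended math.
baselines = {"mmlu_pro": 0.10, "math_hard": 0.0}
print(leaderboard_average(scores, baselines))  # 30.0
```

Under this scheme a model that only matches random guessing scores 0 rather than 25, which is one way a harder leaderboard keeps weak models from looking deceptively competent.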

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes first, third, and tenth place with its handful of variants. Also appearing are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.

Tests to qualify for the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted array of significant models, to avoid a confusing glut of small LLMs.

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a way to compare and reproduce testing results from a number of established LLMs, the board quickly exploded in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models became generally more powerful, 'smarter,' and optimized for the specific tests of the first leaderboard, its results became less and less meaningful, hence the creation of a second variant.

Some LLMs, including newer versions of Meta's Llama, severely underperformed on the new leaderboard compared to their high marks on the first. This stemmed from a trend of over-training LLMs on the first leaderboard's benchmarks alone, leading to regressions in real-world performance. This performance regression, thanks to hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again (as Google's AI answers have shown) that LLM performance is only as good as its training data, and that true artificial "intelligence" is still many, many years away.
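The regression described above can be illustrated with a toy sketch (the data here is entirely hypothetical): a "model" that has effectively memorized a public benchmark's answers scores perfectly on that benchmark, yet falls to chance-level accuracy on held-out questions probing the same skill:

```python
import random

random.seed(0)

# Hypothetical public benchmark: question -> correct answer (A-D).
public_benchmark = {f"q{i}": random.choice("ABCD") for i in range(100)}
# Held-out questions the memorizer has never seen.
held_out = {f"h{i}": random.choice("ABCD") for i in range(100)}

def memorizing_model(question: str) -> str:
    """Over-trained on the public benchmark: recalls memorized answers,
    and guesses at random on anything unfamiliar."""
    return public_benchmark.get(question, random.choice("ABCD"))

def accuracy(model, benchmark: dict[str, str]) -> float:
    """Fraction of questions the model answers correctly."""
    return sum(model(q) == a for q, a in benchmark.items()) / len(benchmark)

print(accuracy(memorizing_model, public_benchmark))  # 1.0: looks superhuman
print(accuracy(memorizing_model, held_out))          # roughly 0.25: chance level
```

This is the failure mode a refreshed leaderboard is meant to expose: swapping in new, harder benchmarks is analogous to evaluating on the held-out set instead of the memorized one.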

Stay on the Cutting Edge: Get the Tom's Hardware Newsletter

Get Tom's Hardware's best news and in-depth reviews, straight to your inbox.

Dallin Grimm is a writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin has a handle on all the latest tech news.

Moore Threads GPUs allegedly show 'excellent' inference performance with DeepSeek models

DeepSeek research suggests Huawei's Ascend 910C delivers 60% of Nvidia H100 inference performance

Asus and MSI hike RTX 5090 and RTX 5080 GPU prices by as much as 18%

- bit_user: "LLM performance is only as good as its training data and that true artificial 'intelligence' is still many, many years away." First, this statement discounts the role of network architecture.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently likewise needn't necessarily do so, either. Reply

- jp7189: I don't love the click-bait China vs. the world title. The fact is Qwen is open source, open weights, and can be run anywhere. It can be (and already has been) fine-tuned to add/remove bias. I applaud Hugging Face's work to create standardized tests for LLMs, and for putting the focus on open source, open weights first. Reply

- jp7189: bit_user said: "First, this statement discounts the role of network architecture.

Second, intelligence isn't a binary thing - it's more like a spectrum. There are different classes of cognitive tasks and abilities you might be familiar with, if you study child development or animal intelligence.

The definition of 'intelligence' cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either." We're creating tools to help humans, therefore I would argue LLMs are more useful if we grade them by human intelligence standards. Reply

- View All 3 Comments


Tom's Hardware is part of Future US Inc, an international media group and leading digital publisher. Visit our corporate site.

© Future US, Inc. Full 7th Floor, 130 West 42nd Street, New York, NY 10036.