this post was submitted on 13 May 2024
84 points (81.8% liked)

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.
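
As a rough illustration (not part of the announcement), sending combined text and image input to GPT-4o through the OpenAI Python SDK looks like the sketch below; the prompt and image URL are placeholders, and audio input/output is omitted:

```python
# Minimal sketch of mixed text + image input to GPT-4o via the OpenAI Python SDK.
# The prompt text and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is shown in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```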

Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.
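
For contrast, a minimal sketch of such a three-model pipeline using the OpenAI Python SDK might look like the following; the model and voice names (whisper-1, gpt-4, tts-1, alloy) are illustrative assumptions, not the actual internals of ChatGPT's Voice Mode:

```python
# Rough sketch of the three-model Voice Mode pipeline described above.
# Model and voice names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def voice_mode_turn(audio_path: str) -> bytes:
    # 1. Transcribe the user's audio to plain text (tone, multiple speakers,
    #    and background noise are lost at this step).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. A text-only model writes the reply; it never "hears" the original audio.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # 3. A separate text-to-speech model converts the reply back to audio,
    #    so it can't laugh, sing, or express emotion beyond the text itself.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content
```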

GPT-4o’s text and image capabilities are starting to roll out today in ChatGPT. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. We'll roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.

[–] [email protected] 5 points 5 months ago

"They can't learn anything" is too reductive. Try feeding GPT-4 the specification for a language that didn't exist at the time of its training, then tell it to program in that language using a library you also provide.

It won't do well, but neither would a junior developer writing in raw vim/nano without compiler or linter feedback. It will roughly construct something that resembles the new language you fed it, despite never having been trained on it. This is something LLMs can in principle do well, so GPT-5, GPT-6, and so on will do better, perhaps eventually as well as a professional human programmer.
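
A sketch of how you might run that experiment, with made-up file names, task, and model choice:

```python
# Hypothetical version of the experiment: paste a spec for a language the model
# has never seen, plus docs for a library it must use, and ask for a program.
# File names, task, and model are placeholders for illustration only.
from openai import OpenAI

client = OpenAI()

spec = open("newlang_spec.md").read()      # full specification of the new language
library = open("mylib_docs.md").read()     # documentation for the required library

prompt = (
    "Below is the specification for a programming language you were not trained on, "
    "followed by the documentation for one of its libraries.\n\n"
    f"--- LANGUAGE SPEC ---\n{spec}\n\n"
    f"--- LIBRARY DOCS ---\n{library}\n\n"
    "Task: using only this language and library, write a program that reads a CSV "
    "file and prints the sum of each numeric column."
)

answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```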

Their context windows have grown many times over. We're no longer operating in the 4k/8k-token range but in the 128k to 1M range. That's enough context to, from an observer's perspective, "learn" an entirely new language and framework and then write something almost usable in it. And 2024 isn't the end of context-window growth.
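
One way to sanity-check that claim is to count tokens before stuffing documents into the prompt; this sketch assumes tiktoken's o200k_base encoding (the GPT-4o tokenizer, tiktoken >= 0.7) and placeholder file names:

```python
# Back-of-the-envelope check that a language spec plus framework docs actually
# fit in a 128k-token window. File names are placeholders.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

docs = ""
for path in ["newlang_spec.md", "framework_docs.md", "examples.md"]:
    docs += open(path).read()

tokens = len(enc.encode(docs))
print(f"{tokens} tokens; fits in a 128k window: {tokens < 128_000}")
```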

With the right tools (e.g. feeding compiler errors back in and having the LLM reflect on how to fix them), you'd get even more reliability out of today's LLMs. Make that loop reliable enough and, in effect, it does what we do when we learn.
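
A minimal sketch of that compile-and-reflect loop, assuming gcc as the compiler and a made-up single-file task; a real harness would also strip markdown fences from replies and cap API cost:

```python
# Sketch of a compile-and-reflect loop: generate code, compile it, and if
# compilation fails, feed the errors back to the model and ask for a fix.
import subprocess
from openai import OpenAI

client = OpenAI()

def generate(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

task = "Write a complete C file, main.c, that prints the first 10 prime numbers."
code = generate(task)

for attempt in range(5):
    with open("main.c", "w") as f:
        f.write(code)
    result = subprocess.run(["gcc", "main.c", "-o", "main"],
                            capture_output=True, text=True)
    if result.returncode == 0:
        break  # it compiles, stop iterating
    # Reflection step: show the model its own code plus the compiler's complaints.
    code = generate(
        f"{task}\n\nYour previous attempt:\n{code}\n\n"
        f"The compiler reported:\n{result.stderr}\n\n"
        "Return only the corrected, complete source file."
    )
```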

So much work in programming isn't novel. You're rarely making something truly new; you're piecing together work other people did. Even when you write an entirely new library, you're using a language someone else designed, libraries other people wrote, an editor someone else built, on an OS someone else wrote. We're all standing on the shoulders of giants.