Why GPT-4o is a landmark moment and is Google killing search with SGE?
Plus: Is the future of AI nuclear?
It's been quite a week in AI, and it's only Wednesday. So strap yourselves in; we've got a lot to cover.
On Monday, OpenAI dropped GPT-4o, then on Tuesday Google demonstrated Gemini 1.5 Flash and Pro and announced the roll-out of its AI-generated answers in US search results.
ChatGPT-4omg
GPT-4o (the ‘o’ is for ‘omni’) is here and rolling out now. We take a look at some of its capabilities in our Must Read down below, but first let's look at what makes it different and why it's one of the most significant leaps for AI.
GPT-4o can finally render text in images
With GPT-4o, OpenAI has fixed two major issues that held AI back from the level of usability needed for widespread adoption: multi-modality, the ability to work with text, audio and video, and latency, the speed of its responses.
Previously, engaging with ChatGPT primarily involved typing questions and waiting for text responses. While a voice interface existed, it was often cumbersome and slow.
This week’s demos from OpenAI showcased a different reality. ChatGPT can now understand text, audio and video in near real-time. The speed and smoothness of its responses give interactions a human-like quality, and its ability to use your phone’s camera to understand your surroundings makes it vastly more useful.
Two GPT-4o instances were shown doing live translation, describing surroundings for the visually impaired, and even harmonizing with each other.
GPT-4o now feels like the kind of personal AI science fiction always promised us. With many of these features available in the free version, expect a lot more people to discover its potential.
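For developers, the same multimodal understanding is exposed through OpenAI’s API under the model name gpt-4o. As a rough sketch (not an official example; the image URL is just a placeholder), a single Chat Completions request can mix text and an image:
from openai import OpenAI
client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment
# Ask GPT-4o about a photo and a question in one request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What landmark is shown in this photo?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)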
It’s not all been smooth sailing for OpenAI though, as just this morning Sam Altman announced the departure of Ilya Sutskever, their co-founder and Chief Scientist.
The full details are not yet clear, but Sutskever tweeted: “I am excited for what comes next — a project that is very personally meaningful to me about which I will share details in due time.”
Is Google killing search?
Google’s I/O announcements were a little overshadowed by OpenAI’s, but the company did demonstrate updates to Gemini 1.5 Pro and the new Gemini 1.5 Flash.
Significantly, they unveiled plans to incorporate Search Generative Experience (SGE) directly into Google Search for US users, bringing AI overviews from Search Labs into its general search results pages.
I remember when this was all links.
Anyone who works on the web knows the importance of Google search for discoverability and traffic. These changes could be transformational, and in some cases devastating, for online creators.
Tomasz Rudziki at ZipTie published an incredibly in-depth look at SGE and what it means for search. He found that SGE displays “take over 82% of search pages, leading to fewer clicks and potentially lower revenue, effectively keeping more of the traffic in Google.com.”
Additionally, only 47% of the top 10 traditional web results appear as sources for SGE.
Rudziki adds that “SGE covers 87% of PPC queries over $1 so Google Ads may stop being a cheap source of getting traffic and leads for brands”.
This is concerning, not only because it reduces the visibility of organic search results, but also because it pushes an AI-generated answer that may not be 100% reliable.
Are we nearly there yet?
With all these incredible announcements and the incorporation of AI into so many aspects of our lives, one thing being swept under the carpet is the reliability of AI. In fact, some early users are reporting that GPT-4o is even less reliable and more prone to hallucinations than previous models.
There’s a big ethical question here: as Google’s AI answers dominate search, ChatGPT becomes an invaluable personal assistant, and Llama dominates social media, shouldn’t the accuracy and reliability of information be the number one concern?
Are OpenAI, Google and Meta acting responsibly in pushing AI more and more into people’s lives without resolving a fundamental problem? And are they doing enough to highlight and educate users on its shortcomings?
In Google’s announcements on Tuesday, Gemini even gave factually wrong information in a demonstration video. If Google can’t spot this in a keynote presentation, what hope for the rest of us?
IE+ SUPPORT INTERESTING ENGINEERING
Invest In Science And Engineering
Enjoy exclusive access to the forefront of AI content, highlighting trends and news that shape the future. Join a community passionate about AI, delve into the latest AI breakthroughs, and be informed with our AI-focused weekly premium newsletters. With IE+, AI reporting goes beyond the ordinary - and it is Ad-Free.
NEWS
The ‘great filter’ hypothesis: Is AI behind the lack of alien life?
A chilling theory suggests AI could be the universe’s ‘great filter,’ potentially hindering civilizations’ survival.
Stanford engineers develop ‘killer app’ of holographic heads-up displays
The new headset could revolutionize various industries, from gaming to medical sciences and education.
ZeroEyes’ AI-powered surveillance cameras to soon detect guns in US schools
In an effort to enhance security in schools, officials are considering approving millions of dollars in funding for ZeroEyes’ patented AI camera surveillance technology.
Forget facial recognition, this new AI predicts age by analyzing hands
Haut.AI’s study demonstrated that new AI models trained on hand images achieve accuracy comparable to those using facial images.
US races to develop AI-powered, GPS-free fighter jets, outpacing China
While the gauntlet has not been officially thrown down, officials are convinced the race is on to master military AI.
MUST READ
OpenAI launched its latest flagship generative AI model, GPT-4o, an iteration of GPT-4, with a phased release across its developer and consumer-oriented products in the coming weeks.
During a keynote address at OpenAI’s headquarters, CTO Mira Murati highlighted GPT-4o’s advancements over its predecessor, GPT-4, boasting enhanced capabilities spanning text, vision, and audio processing.
“GPT-4o operates seamlessly across voice, text, and vision,” Murati explained.
While GPT-4 excelled in tasks involving images and text analysis, GPT-4o integrates speech processing, broadening its spectrum of capabilities.
LEXICON PODCAST
In last week’s AI logs we discussed AI’s energy usage and how it was becoming a bottleneck for growth. We got to explore this much more deeply in our latest Lexicon Podcast with Jay Jiang Yu and James Walker of Nano Nuclear Energy, covering the growth of energy demand and how nuclear could be the only viable answer.
AI PICTURE OF THE WEEK
GPT-4o can now render 3D animation from a prompt
TUTORIAL
Exploring GPT-4o
Less of a tutorial this week and more an exploration of what GPT-4o is capable of, especially in comparison to previous models.
We start with a simple prompt, ‘give me a demonstration of something incredible [GPT model] can do that you couldn't do before’.
The examples in the image above show the responses from GPT-3.5 and GPT-4. Sure, they answered the question, but it wasn’t really a demonstration.
Now let’s look at how GPT-4o responded…
Certainly! Here's a demonstration of a powerful new capability in GPT-4.0: its advanced multimodal understanding and generation. This feature allows GPT-4.0 to process and generate content across multiple data types—text, images, and audio—seamlessly. Let's illustrate this with a scenario:
Scenario: Interactive Multimodal Assistant for Virtual Tours
Use Case: Imagine you're planning a trip to a historical city, and you want to learn about key landmarks and their history. With GPT-4.0, you can interact in a multimodal manner, using both text and images.
1. User Query (Text Input):
"Tell me about the Colosseum in Rome. Show me an image and provide an audio description as well."
2. GPT-4.0 Response:
Text Response:
"The Colosseum, also known as the Flavian Amphitheatre, is an ancient amphitheater in the center of Rome, Italy. Built of concrete and sand, it is the largest amphitheater ever built and is considered one of the greatest works of architecture and engineering."
Image Generation:
(An image of the Colosseum is generated and displayed here.)
Audio Description:
(An audio clip is played describing the Colosseum.)
Audio Script: "The Colosseum, an iconic symbol of Imperial Rome, was used for gladiatorial contests and public spectacles such as animal hunts, executions, and dramas based on classical mythology. Constructed between AD 70-80 under the emperors Vespasian and Titus, it could hold up to 80,000 spectators."
Steps to Demonstrate:
1. User Interaction:
Text Input: The user asks about a specific landmark.
Image Upload: The user can also upload an image for the model to analyze and provide information.
2. Model Processing:
GPT-4.0 processes the text input and generates a detailed response.
It generates an accurate image of the landmark if requested.
It produces an audio description using natural-sounding text-to-speech technology.
3. Output Delivery:
The model returns a detailed text response, an image, and an audio clip, all synchronized to provide a comprehensive multimedia experience.
Technical Details:
Multimodal Encoder-Decoder Architecture:
GPT-4.0 uses a unified encoder-decoder architecture that enables it to handle and integrate multiple data types efficiently.
Advanced Text-to-Speech (TTS):
The model includes state-of-the-art TTS capabilities, allowing it to generate high-quality audio outputs that are clear and engaging.
Image Generation:
The model leverages advanced GANs (Generative Adversarial Networks) to create realistic images based on text descriptions.
Example Implementation:
Let's provide a brief code snippet showing how this can be implemented using GPT-4.0's API:
import openai
# Initialize the GPT-4.0 model
openai.api_key = 'your-api-key'
# Define the text query
text_query = "Tell me about the Colosseum in Rome. Show me an image and provide an audio description as well."
# Request for text, image, and audio response
response = openai.Completion.create(
model="gpt-4.0-multimodal",
prompt=text_query,
n=1,
stop=None,
temperature=0.7
)
# Extract and display the responses
text_response = response['choices'][0]['text']
image_response = response['choices'][0]['image']
audio_response = response['choices'][0]['audio']
print("Text Response:", text_response)
# Display image and play audio (implementation will depend on your environment)
This demonstration showcases how GPT-4.0 can seamlessly integrate multiple forms of media to provide a rich, interactive experience that was not possible with previous versions. This capability can be applied in various fields, including education, tourism, customer support, and more, offering an unprecedented level of engagement and utility.
Impressive stuff. The understanding of intent and detail in GPT-4o’s response looks many steps beyond the previous models. We’ll be exploring GPT-4o much more in future editions.
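One caveat: the code snippet GPT-4o produced above is illustrative rather than runnable. openai.Completion.create is the legacy text-completions endpoint, there is no “gpt-4.0-multimodal” model, and completions don’t return image or audio fields. To build something similar with the current Python SDK you would chain the existing endpoints, roughly along these lines (a sketch only, with placeholder prompts and file names):
from openai import OpenAI
client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment
query = "Tell me about the Colosseum in Rome."
# 1. Text: ask GPT-4o for the description
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": query}],
)
text_response = chat.choices[0].message.content
print(text_response)
# 2. Image: generate a picture with DALL·E 3
image = client.images.generate(
    model="dall-e-3",
    prompt="The Colosseum in Rome, photorealistic",
    n=1,
)
print(image.data[0].url)
# 3. Audio: turn the description into speech
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=text_response,
)
speech.stream_to_file("colosseum.mp3")  # save the narrated description
In the API, image generation and text-to-speech still run through separate models (DALL·E 3 and TTS-1); it’s the text and vision understanding that GPT-4o handles natively.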
Additional Reads
🚨 The Blueprint: IE's daily engineering, science & tech bulletin.
⚙️ Mechanical: Explore the wonders of mechanical engineering.
🛩️ Aerospace: The latest on propulsion, satellites, aeronautics, and more.
🧑🔧 Engineer Pros: Expert advice and intelligence on engineering careers.
🎬 IE Originals: Weekly round-up of our best science, tech & engineering videos.
🟩 Sustainability: Uncover green innovations and the latest trends shaping a sustainable future for the tech industry.
⚡ Electrical: From AI to smart grids, our newsletter energizes you on emerging tech.
Want to share your feedback? [email protected]