
You’ve been prompting AI with text. It works. It gets things done.

But it’s a bit like giving a photographer a written brief instead of a mood board and expecting them to nail it first time.

That’s the gap multimodal prompting closes. And for marketing teams trying to produce campaign creative that actually connects — not just exists — it changes what’s possible.

What Is Multimodal Prompting and Why Is Text-Only Holding Your Campaigns Back?

Multimodal prompting means combining more than one type of input – text, images, audio – in the same prompt. Instead of only describing what you want in words, you also show the AI what you mean.

Compare the two.

Text-only: “Write campaign copy for a skincare brand targeting women in their 40s.”

Multimodal: the same brief, plus a reference photo of your actual customer, outtakes from your last brand shoot, and your colour palette. The AI can now see who you’re talking to.

That’s not a subtle upgrade. That’s a different category of output. More specific. More grounded. Usable on first pass, not fourth revision.
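
For teams working with a developer, here is a minimal sketch of what “text plus images in the same prompt” can look like as data. It is an illustration only: the part layout (the "type", "media_type", "data", and "text" keys) and the function names are assumptions made for this example, not any specific provider's format, so check your tool's documentation for the real field names.

```python
import base64


def image_part(image_bytes: bytes, media_type: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as a base64-encoded part (hypothetical schema)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "media_type": media_type, "data": encoded}


def build_multimodal_prompt(brief: str, images: list[bytes]) -> list[dict]:
    """Return the reference images first, followed by the written brief."""
    parts = [image_part(img) for img in images]
    parts.append({"type": "text", "text": brief})
    return parts
```

In practice the bytes would come from your own files, for example Path("customer_reference.jpg").read_bytes() for a reference photo (the file name is a placeholder). The point is simply that the brief and the references travel together in one request.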

How H&M, Nike + Coca-Cola Are Already Using Multimodal AI at Scale

The case for multimodal prompting isn’t theoretical. The evidence is already in market.

In 2024, H&M created digital twins of 30 real-world models – hyper-realistic AI versions capable of posing, moving, and adapting across multiple channels and formats (H&M Group, 2024). They disclosed the use of AI-generated imagery publicly and maintained ongoing commercial relationships with the human models whose likenesses were used.

As a result, campaign variations that previously required dozens of separate shoots could be executed with one initial session and AI-generated extensions. Localisation, format adaptation, seasonal updates – all running from the same creative foundation.

Coca-Cola’s Create Real Magic platform (2023) invited fans to generate campaign visuals using the brand’s own assets and generative AI – turning passive audiences into active creative participants. Both text and visual inputs drove personalised brand imagery at a scale no traditional production model could match (The Drum, 2023).

The lesson for your team: When your audience contributes their own visual references, the output feels personal – because it partly was. That’s the multimodal principle applied at its most human.

Nike uses AI to analyse purchase history, social data, and community signals to generate creative that reflects the distinct visual language of each subculture – not a global template adapted across audiences (Fast Company, 2024). A sneakerhead sees high-contrast, texture-heavy imagery. A runner sees biomechanical data visualised. Same brand. Genuinely different creative.

The lesson for your team: Multimodal prompting lets you brief the AI with visual references drawn directly from the community you’re speaking to. The output reflects their world – not a generic version of it.

How to Write a Multimodal Prompt That Gets Human-Centred Campaign Imagery

One rule underpins every brief: lead with the person. The product comes second. The more human detail you provide, the more human the output.

The most common mistake is leading with what you’re selling. Flip it. Start with the human – their age, their setting, their emotional state – and let the product exist in context.

Instead of: “Generate an image of our skincare product.”

Try: “Generate a lifestyle image of a woman in her late 40s, in natural lighting, in a bathroom that feels lived-in and warm, not aspirational or sterile. She looks calm and unhurried. The product is present but not the focus.”

Don’t describe your audience in words. Show the AI. Input reference photos that represent the real range of people you’re trying to reach – different ages, skin tones, environments.

Think of it like briefing a photographer. You wouldn’t say “make it diverse.” You’d show a casting board. This is exactly the workflow H&M built their production system around (H&M Group, 2024).

Great campaign imagery makes people feel something before they read a word. Include the emotional state you’re after alongside the visual specs.

Try: “The expression should feel like quiet confidence, not performative happiness” or “This should look like a real Tuesday morning, not a Sunday in an ad.”

When reviewing outputs, don’t just check whether the brief was technically met. Ask the harder questions: Does this person look like they exist? Does the setting feel inhabited? Would someone stop scrolling for this? Or does it look like your actual audience’s more polished cousin from a stock photo library?

When an output misses, name the problem and add a corrective phrase to the prompt:

Too polished: “natural lighting with slight shadows, not studio-lit”

Too generic: “a kitchen in a mid-sized city apartment, not a design showroom”

Too posed: “candid, mid-laugh, not composed”

Wrong colour energy: “full saturation – vivid and punchy, nothing muted or beige”
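
If your team keeps prompts in a shared script, the problem-and-fix pairs above can live in a simple lookup so revisions stay consistent from one reviewer to the next. This is a rough sketch, not a prescribed workflow: the names REVISION_PHRASES and revise_prompt and the “Revisions:” label are invented for illustration.

```python
# Corrective phrases taken from the review list above, keyed by the problem.
REVISION_PHRASES = {
    "too polished": "natural lighting with slight shadows, not studio-lit",
    "too generic": "a kitchen in a mid-sized city apartment, not a design showroom",
    "too posed": "candid, mid-laugh, not composed",
    "wrong colour energy": "full saturation – vivid and punchy, nothing muted or beige",
}


def revise_prompt(prompt: str, issues: list[str]) -> str:
    """Append one corrective phrase per issue; raises KeyError for unknown issues."""
    additions = [REVISION_PHRASES[issue] for issue in issues]
    if not additions:
        return prompt
    bullets = "\n".join(f"- {phrase}" for phrase in additions)
    return f"{prompt.rstrip()}\n\nRevisions:\n{bullets}"
```

Calling revise_prompt(original_prompt, ["too posed", "too polished"]) returns the original prompt with both corrections listed underneath, ready to run again.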

The Multimodal Prompt Template Your Team Can Use This Week

Copy this. Fill in the brackets. Run it.

I’m attaching [mood board / reference photo / brand shoot outtake].

Using these as your visual reference, generate a campaign image with the following brief:

PERSON:
– Age range: [e.g. late 20s]
– Energy: confident, self-possessed, a little attitude — they know exactly who they are
– Expression: direct eye contact or caught mid-movement — owning it, not smiling for the camera
– Style: [e.g. urban chic — intentional, effortless not try-hard]

SETTING:
– Environment: [e.g. city street / rooftop / graphic architectural backdrop]
– Lighting: punchy and directional — bright daylight, neon reflection, or dramatic shadow play
– Feel: a city that is alive, not a city that is a backdrop

COLOUR:
– Background: full saturation — [deep cobalt / electric yellow / vivid emerald]
– Palette: high contrast and graphic
– Avoid: pastels / washed-out tones / anything beige

ENERGY:
– Overall mood: fun, magnetic — the kind of image that stops a scroll cold
– Should feel like: editorial advertising meets street photography
– Should NOT feel like: stock / safe / generic brand shoot

PRODUCT:
– Appears: [naturally integrated / held with attitude / part of the scene]
– Prominence: [hero / co-lead with the person / supporting]

TECHNICAL:
– Format: [1:1 / 4:5 / 16:9]
– Style: high-end advertising photography, editorial confidence
– Colour grading: vivid and punchy, fully saturated, high contrast
– Avoid: over-retouched skin / artificial softening / beige

GOAL:
– The person seeing this should think:
[“that brand gets me” / “I want to be in that world” / “I need to stop and look at this”]
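
If you run this template often, a short script can fill in the brackets for you and refuse to run when one is left empty. The sketch below uses an abbreviated version of the template and Python's built-in string.Template. The field names (attachment, age_range, and so on) and the fill_brief helper are assumptions chosen for this example, so rename them to suit your own brief.

```python
from string import Template

# Abbreviated version of the template above; $name marks each bracket to fill.
BRIEF_TEMPLATE = Template("""\
I'm attaching $attachment.

Using these as your visual reference, generate a campaign image with the following brief:

PERSON:
– Age range: $age_range
– Style: $style

SETTING:
– Environment: $environment

PRODUCT:
– Appears: $product_appears
– Prominence: $product_prominence

TECHNICAL:
– Format: $aspect_ratio
""")

# The three formats offered in the template.
ALLOWED_FORMATS = {"1:1", "4:5", "16:9"}


def fill_brief(**fields: str) -> str:
    """Fill the template; a missing field raises KeyError instead of passing silently."""
    if fields.get("aspect_ratio") not in ALLOWED_FORMATS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ALLOWED_FORMATS)}")
    return BRIEF_TEMPLATE.substitute(fields)
```

Because substitute() raises an error for any bracket you forgot, a half-filled brief never reaches the AI.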


Frequently Asked Questions

Can AI generate campaign images that feature real-looking people?

Yes — when you prompt it correctly. Drop in a mood board, a reference photo, or a brand shoot outtake. Brief the AI on the human energy, skin tones, age range, and emotional expression your campaign calls for. The result moves from sterile stock-photo aesthetics to visuals that look like the people you’re actually trying to reach.

How do you turn audience personas into visual campaign direction?

Input your persona documentation alongside reference images. Ask the AI to generate campaign imagery that puts those specific people front and center — their environments, their expressions, their actual lives. Not a generic smiling person on a white background.

How do you write copy that responds to a real human moment?

Drop in a photo from a customer event, a behind-the-scenes shoot, or a community moment. Ask the AI to write campaign copy that responds to the actual emotion in the image. The result is copy that feels lived-in — because it was anchored in something real.

Can you turn customer audio into campaign language?

Yes. A testimonial recording, a customer interview, a community call — add it to your prompt and ask the AI to surface quotes, themes, and emotional language for headlines and social posts. Real words from real people, found faster than any manual transcription process.

What’s the difference between multimodal prompting and regular AI prompting?

Regular prompting gives the AI instructions in text. Multimodal prompting gives it instructions and visual context – mood boards, reference photos, brand assets. The output is categorically more specific and closer to usable on the first pass.

Sources

  • McKinsey Global Institute. The Economic Potential of Generative AI. 2023. mckinsey.com
  • Gartner. Predicts 2024: AI and the Future of Marketing. 2024.
  • H&M Group. H&M Group and AI – Exploring the Future of Fashion. 2024. hmgroup.com
  • Coca-Cola / OpenAI. Create Real Magic Campaign. The Drum, 2023.
  • Fast Company. How Nike Uses AI to Speak to Sneakerheads and Runners Differently. 2024.

Last Updated: February 12, 2026. Updated April 1, 2026.

Author: Nicola Ziady
Title: Chief Marketing Officer
Published: 20 February 2026 | Updated: 4 April 2026
URL: https://nicolaziady.com/multimodal-prompting-marketing-campaigns/

About the Author

Nicola Ziady is a Chief Marketing Officer with 20 years of experience building strategies for leading healthcare and academic institutions including Cleveland Clinic and St. Jude Children’s Research Hospital.

She built The 5 Shifts Framework from two decades of watching which marketers get promoted — and which ones don’t. The difference wasn’t talent or effort. It was how they thought about their work.

On this blog, you’ll find practical frameworks on making those shifts: from tactics to strategy, from reacting to anticipating, from tools to systems, from managing to multiplying, and from data to insight.

Connect with Nicola on LinkedIn.