
You’ve been prompting AI with text. That works. It gets the job done.

But it’s a little like giving someone directions using only words when you could just show them a map.

That’s what multimodal prompting is. Showing the map. And for marketing teams trying to build campaigns that connect with real people — not just produce content — it changes what’s possible.

Multimodal prompting means combining more than one type of input — text, images, audio — in the same prompt. Instead of describing what you want in words alone, you show the AI what you mean.

Regular prompting: “Write campaign copy for a skincare brand targeting women in their 40s.”

Multimodal prompting: Same brief — plus a reference photo of your actual customer, a mood board from your last brand shoot, and your colour palette. The AI can now see who you’re talking to.

The output is a different category of result. More specific. More grounded. More usable on first pass.
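Under the hood, "showing the map" simply means attaching the image to the prompt payload alongside the text. A minimal sketch, assuming an OpenAI-style multimodal chat API (the content-parts format used by models such as GPT-4o); nothing is actually sent here, the payload structure is the point:

```python
import base64

def build_multimodal_prompt(brief_text, image_bytes):
    """Pair campaign brief text with a reference image in the
    content-parts message format used by multimodal chat APIs
    (e.g. GPT-4o). Sketch only -- no request is made."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            # The written brief...
            {"type": "text", "text": brief_text},
            # ...and the visual reference, inlined as a data URL.
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
        ],
    }]
```

The resulting messages list is what you would hand to the model call; a mood board, a customer photo, and a palette swatch can each travel as an additional image part in the same `content` list.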

The strongest case for multimodal prompting isn’t a theory — it’s what major brands are already doing with it at scale. Here is what the evidence looks like in practice.

H&M created digital twins of 30 real-world models — hyper-realistic AI versions that could pose, move, and adapt across multiple channels and formats. The key decision that made it work: they disclosed their use of AI-generated imagery and maintained ongoing relationships with the human models whose likenesses were used.

What H&M did: Created 30 AI digital twins of real models for use across ads, ecommerce, and social. Disclosed AI use openly and kept human model relationships intact.

Result: Delivered far more creative variations for testing and localisation. A campaign previously requiring dozens of photoshoots could be executed with one initial shoot plus AI-generated extensions.

The lesson for marketing teams: AI-generated imagery works at scale when it is anchored in real humans, not used to replace them. Transparency with your audience is not optional — it is the reason the campaign holds.

Coca-Cola built a campaign platform that invited fans to co-create visuals using generative AI — turning the brand’s audience into active creative participants. The campaign used both visual and text inputs to generate personalised brand imagery at a scale no traditional production model could match.

What Coca-Cola did: Built a generative AI platform allowing fans to co-create campaign visuals using their own inputs alongside brand assets.

Result: Delivered personalised campaign content at audience scale, driving engagement through creative participation rather than one-way broadcast.

The lesson for marketing teams: When audiences contribute their own visual references, the output feels personal — because it was partly shaped by them. That is the multimodal principle applied at its most human.

Nike used AI to tailor campaign creative for specific subcultures — moving beyond broad demographic targeting to visuals and messaging that reflected the distinct identity of each community. Rather than one campaign adapted across audiences, they generated genuinely different creative rooted in the visual language of each group.

What Nike did: Used AI to analyse large datasets — including purchase history and social media interactions — to create highly personalised dynamic content, tailoring creative for specific subcultures such as runners and sneaker lovers. For example, sneakerheads might see high-contrast, texture-focused imagery, while runners receive data-rich ads showing heat maps and biomechanical gains.

Result: Campaign creative that felt native to each audience rather than adapted from a single global template — driving stronger cultural resonance and relevance.

The lesson for marketing teams: Multimodal prompting lets you brief the AI with visual references drawn directly from the community you’re speaking to. The output reflects their world, not a generic version of it.

The rule that underpins every brief: person first, product second. The more human detail you provide, the more human the output.

1. Lead with the person, not the product
The most common mistake is opening the prompt with the product. Flip it. Start with the human — their age, their setting, their emotional state — and let the product live in context around them.

Instead of this: “Generate an image of our skincare product.”
Try this: “Generate a lifestyle image of a woman in her late 40s, natural lighting, in a bathroom that feels lived-in and warm — not aspirational or sterile. She looks calm and unhurried. The product is present but not the focus.”

2. Use reference images to anchor diversity and authenticity
Don’t describe your audience in text alone — show the AI. Input reference photos that represent the real range of people you’re trying to reach: different ages, skin tones, body types, environments.

Think of it like briefing a photographer. You wouldn’t just say “make it diverse.” You’d show them a casting board. Visual reference gives the AI far more precision than written description when it comes to human representation — this is exactly the approach H&M built their workflows around.

3. Match the emotional brief, not just the visual brief
Great campaign imagery makes people feel something before they read a word. Include the emotional state you’re after alongside the visual specs.

Try: “The expression should feel like quiet confidence, not performative happiness” or “the scene should feel like a real Tuesday morning, not a Sunday in an ad.”

4. Evaluate for humanity, not just accuracy
When reviewing outputs, don’t just check whether the brief was technically met. Ask the harder questions: Does this person look like they exist? Does the setting feel inhabited? Would someone stop scrolling for this? Does this look like our actual audience or a version of them from a stock photo library?

5. Iterate toward authenticity
If outputs feel too polished or too generic, get specific with your iteration notes:
Too polished: add “natural lighting with slight shadows, not studio-lit”.
Too generic: add “a kitchen in a mid-sized city apartment, not a design showroom”.
Too posed: add “candid, mid-laugh, not composed”.
Wrong colour energy: add “full saturation — vivid and punchy, nothing muted or beige”.

Here is a ready-to-use template built for a human-first, high-saturation, high-attitude brief. Copy it. Fill in the brackets. Run it.

I’m attaching [mood board / reference photo / brand shoot outtake].  

Using these as your visual reference, generate a campaign image with the following brief:  

PERSON:
– Age range: [e.g. late 20s]
– Energy: confident, self-possessed, a little attitude — they know exactly who they are
– Expression: direct eye contact or caught mid-movement — owning it, not smiling for the camera
– Style: [e.g. urban chic — intentional, effortless not try-hard]

SETTING:
– Environment: [e.g. city street / rooftop / graphic architectural backdrop / subway platform with personality]
– Lighting: punchy and directional — bright daylight, neon reflection, or dramatic shadow play
– Feel: a city that is alive, not a city that is a backdrop

COLOUR:
– Background: full saturation — [deep cobalt / electric yellow / vivid emerald / saturated terracotta] — bold, not muted
– Palette: high contrast and graphic — colours that compete for attention and win
– Avoid: pastels / washed-out tones / anything beige

ENERGY:
– Overall mood: fun, magnetic — the kind of image that stops a scroll cold
– Should feel like: editorial advertising meets street photography
– Should NOT feel like: stock / safe / generic brand shoot

PRODUCT / MESSAGE:
– Appears: [naturally integrated / held with attitude / part of the scene]
– Prominence: [hero / co-lead with the person / supporting]

TECHNICAL:
– Format: [1:1 square / 4:5 portrait / 16:9 banner]
– Style: high-end advertising photography, editorial confidence
– Colour grading: vivid and punchy, fully saturated, high contrast
– Avoid: over-retouched skin / artificial softening / beige

GOAL:
– The person seeing this should think: [“that brand gets me” / “I want to be in that world” / “I need to stop and look at this”]
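If your team runs this brief repeatedly, the bracketed fields can be filled programmatically rather than by hand. A hypothetical helper sketch — the short `TEMPLATE` string below is a stand-in for the full brief above, and the field names are illustrative:

```python
import re

# Shortened stand-in for the full brief; in practice paste the whole template.
TEMPLATE = "PERSON: age range [age], energy [energy]. SETTING: [setting]. FORMAT: [format]."

def fill_brief(template, fields):
    """Substitute each [placeholder] with its value; unknown
    placeholders are left intact so missing fields stay visible."""
    return re.sub(
        r"\[([^\]]+)\]",
        lambda m: fields.get(m.group(1), m.group(0)),
        template,
    )
```

Leaving unmatched brackets in place is deliberate: an unfilled `[format]` in the output is an obvious flag that the brief is incomplete before it reaches the model.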

The right tool depends on where you are in the creative process and how much production control you need.

| Tool | Why it works for this brief | Starting cost |
| --- | --- | --- |
| Midjourney | Widely regarded as producing the strongest artistic quality and mood for campaign hero images. Its Style Reference feature lets you upload a mood board and maintain that aesthetic consistently across an entire campaign. | From $10/month |
| Adobe Firefly | Best for teams already in the Adobe ecosystem. Integrates directly with Photoshop and Illustrator. Built on licensed content, making it commercially safe for campaign production. | Included in Creative Cloud |
| Reve | Strong prompt accuracy for complex human briefs. Follows detailed instructions on face, expression, and setting closely. Good when the human detail in your brief is very specific. | Free tier available |
| Canva AI | Best for teams already working in Canva. Upload reference images, generate, and adapt to multiple formats in one workflow. A practical starting point before committing to a specialist tool. | Free tier / Pro from $15/month |
| GPT-4o | Good for fast concept drafts and conversational prompt development. Better used as a starting point or ideation tool than as a final production asset generator. | From $20/month |

H&M is using it to extend campaign production across 30 real models without repeat photoshoots. Zalando has cut eight-week production timelines down to days. Nike is using it to speak authentically to communities that would recognise a generic brief from a mile away.

The question isn’t whether it works. The evidence is there. The question is whether your team is set up to use it well — with real reference material, a human-first brief, and the discipline to evaluate outputs against how real people actually look and live.

Start with one reference image in your next creative prompt this week. Just one. You might be surprised how much more grounded the output becomes when the AI can actually see who you’re trying to reach.

Frequently Asked Questions

Can AI generate campaign images that feature real-looking people?

Yes — when you prompt it correctly. Drop in a mood board, a reference photo, or a brand shoot outtake. Brief the AI on the human energy, skin tones, age range, and emotional expression your campaign calls for. The result moves from sterile stock-photo aesthetics to visuals that look like the people you’re actually trying to reach.

How do you turn audience personas into visual campaign direction?

Input your persona documentation alongside reference images. Ask the AI to generate campaign imagery that puts those specific people front and center — their environments, their expressions, their actual lives. Not a generic smiling person on a white background.

How do you write copy that responds to a real human moment?

Drop in a photo from a customer event, a behind-the-scenes shoot, or a community moment. Ask the AI to write campaign copy that responds to the actual emotion in the image. The result is copy that feels lived-in — because it was anchored in something real.

Can you turn customer audio into campaign language?

Yes. A testimonial recording, a customer interview, a community call — add it to your prompt and ask the AI to surface quotes, themes, and emotional language for headlines and social posts. Real words from real people, found faster than any manual transcription process.
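As a rough sketch of that last step: once the audio is transcribed (any speech-to-text tool works), surfacing quotable lines can start as simply as a length-and-keyword filter before a human pass. The keyword list below is illustrative, not a vetted taxonomy:

```python
import re

def surface_quotes(transcript,
                   keywords=("love", "changed", "finally", "never"),
                   max_words=15):
    """Split a transcript into sentences and keep short ones that
    contain an emotionally loaded keyword -- headline candidates."""
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [
        s.strip() for s in sentences
        if len(s.split()) <= max_words
        and any(k in s.lower() for k in keywords)
    ]
```

In practice you would feed the full transcript into the same multimodal prompt alongside your brand voice notes and let the model propose themes; a filter like this is just a cheap first pass that keeps a human in the loop.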