Imagen Guide

Imagen is an advanced series of image generation AI models developed by Google, capable of creating high-quality, realistic images based on text prompts. This guide will help you understand how to use the Imagen API to generate images, including parameter settings, model selection, and code examples.

Available Models:

  • imagen-4.0-generate-preview-05-20
  • imagen-4.0-ultra-generate-exp-05-20
  • imagen-3.0-generate-002
  1. Currently, Imagen only supports English prompts. When integrating, it’s recommended to add automatic translation to allow users to use it without language barriers.
  2. Performance is unstable when rendering large amounts of text. It’s recommended to only include key keywords.

Model Parameters

Imagen currently only supports English prompts and provides the following parameters:

  • numberOfImages: The number of images to generate, ranging from 1 to 4 (inclusive). The default value is 4.
  • imagen-4.0-ultra-generate-exp-05-20 can only generate 1 image at a time.
  • aspectRatio: Changes the aspect ratio of the generated images. Supported values are “1:1”, “3:4”, “4:3”, “9:16”, and “16:9”. The default value is “1:1”.
  • personGeneration: Allows the model to generate images of people. Supports the following values:
    • “DONT_ALLOW”: Prevents the generation of images containing people.
    • “ALLOW_ADULT”: Generates images of adults but not children. This is the default value.

Usage Pricing

The cost of using the Imagen API to generate images is $0.03 per image. Please note that each API call can generate 1-4 images, and you will be charged based on the actual number of images generated.

API Call Example

Here’s a Python example of generating images using Imagen 3.0:

import os
import time
from google import genai
from google.genai import types
from PIL import Image
from io import BytesIO

client = genai.Client(
    api_key="sk-***", # 🔑 Replace with your key generated on AiHubMix
    http_options={"base_url": "https://aihubmix.com/gemini"},
)

# Currently only supports English prompts, performance is poor with large amounts of text
response = client.models.generate_images(
    model='imagen-3.0-generate-002',
    prompt='A minimalist logo for a LLM router market company on a solid white background. trident in a circle as the main symbol, with ONLY text \'InferEra\' below.',
    config=types.GenerateImagesConfig(
        number_of_images=1,
        aspect_ratio="1:1", # supports "1:1", "9:16", "16:9", "3:4", or "4:3".
    )
)

script_dir = os.path.dirname(os.path.abspath(__file__))
output_dir = os.path.join(script_dir, "output")

os.makedirs(output_dir, exist_ok=True)

# Generate timestamp as filename prefix to avoid filename conflicts
timestamp = int(time.time())

# Save and display the generated images
for i, generated_image in enumerate(response.generated_images):
  image = Image.open(BytesIO(generated_image.image.image_bytes))
  image.show()
  
  file_name = f"imagen3_{timestamp}_{i+1}.png"
  file_path = os.path.join(output_dir, file_name)
  image.save(file_path)
  
  print(f"Image saved to: {file_path}")

Prompt Tips

Creating effective prompts is crucial for obtaining desired images:

  • Use detailed descriptions, including subject, style, lighting, angle, etc.
  • Specify artistic styles (such as cinematic, photorealistic, anime style, etc.).
  • Include technical details (such as DSLR, high-definition, rich in detail, etc.).
  • Avoid negative or prohibited content.
  • Avoid including large amounts of text in prompts, only use key keywords for more stable results.

Gemini Image Generation

Gemini also offers image generation capabilities as an alternative. Compared to Imagen, Gemini’s image generation is better suited for scenarios that require contextual understanding and reasoning, rather than pursuing ultimate artistic expression and visual quality.

Instructions:

  • Model id: gemini-2.0-flash-preview-image-generation
  • Input/Outpt pricing: 0.10.1→0.4/M tokens
  • Need to add parameters to experience new features: "modalities":["text","image"]
  • Images are passed and output in Base64 encoding
  • As an experimental model, it’s recommended to explicitly specify “output image”, otherwise it might only output text
  • Default height for output images is 1024px
  • Python calls require the latest OpenAI SDK, run pip install -U openai first
  • For more information, visit the Gemini official documentation

Input Reference Structure:

"modalities": ["text","image"]
{
    "model": "gemini-2.0-flash-preview-image-generation",
    "messages": [
      {
        "role": "user",
        "content": "Generate a landscape painting and provide a poem to describe it"
      }
    ],
    "modalities":["text","image"], //need to add image
    "temperature": 0.7
  }'

Output Reference Structure:

"choices":
    [
        {
            "index": 0,
            "message":
            {
                "role": "assistant",
                "content": "Hello! How can I assist you today?",
                "refusal": null,
                "multi_mod_content": //📍 New addition
                [
                    {
                        "text": "",
                        "inlineData":
                        {
                          "data":"base64 str",
                          "mimeType":"png"
                        }
                    },
                    {
                        "text": "hello",
                        "inlineData":
                        {
                        }
                    }
                ],
                "annotations":
                []
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],

Text-to-Image Generation

Input: text Output: text + image

IMG_PATH="/your_path/image.jpg"

if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
  B64FLAGS="--input"
else
  B64FLAGS="-w0"
fi

IMG_BASE64=$(base64 "$B64FLAGS" "$IMG_PATH" 2>&1)

curl https://aihubmix.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-***" \
  -d '{
    "model": "gemini-2.0-flash-preview-image-generation",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type":"text",
            "text":"describe the image with a concise and engaging paragraph, then fill color as children's crayon style"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,'$IMG_BASE64'"
            }
          }
        ]
      }
    ],
    "modalities": ["text","image"],
    "temperature": 0.7
}' \
  | grep -o '"data":"[^"]*"' \
  | cut -d'"' -f4 \
  | base64 --decode > /your_path/imageGen.jpg

Output Example:

Edit Image

Input: text + image
Output: text + image

import os
from openai import OpenAI
from PIL import Image
from io import BytesIO
import base64

client = OpenAI(
    api_key="sk-***", # 🔑 Replace with your key generated on AiHubMix
    base_url="https://aihubmix.com/v1",
)

project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

image_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "resources", "filled.jpg")
if not os.path.exists(image_path):
    raise FileNotFoundError(f"image {image_path} not exists")

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gemini-2.0-flash-preview-image-generation",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "describe the image with a concise and engaging paragraph, then fill color as children's crayon style",
                },
                {
                    "type": "image_url", 
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },     
            ],
        },
    ],
    modalities=["text", "image"],
    temperature=0.7,
)
try:
    # Print basic response information without base64 data
    print(f"Creation time: {response.created}")
    print(f"Token usage: {response.usage.total_tokens}")
    
    # Check if multi_mod_content field exists
    if (
        hasattr(response.choices[0].message, "multi_mod_content")
        and response.choices[0].message.multi_mod_content is not None
    ):
        print("\nResponse content:")
        for part in response.choices[0].message.multi_mod_content:
            if "text" in part and part["text"] is not None:
                print(part["text"])
            
            # Process image content
            elif "inline_data" in part and part["inline_data"] is not None:
                print("\n🖼️ [Image content received]")
                image_data = base64.b64decode(part["inline_data"]["data"])
                mime_type = part["inline_data"].get("mime_type", "image/png")
                print(f"Image type: {mime_type}")
                
                image = Image.open(BytesIO(image_data))
                image.show()
                
                # Save image
                output_dir = os.path.join(os.path.dirname(image_path), "output")
                os.makedirs(output_dir, exist_ok=True)
                output_path = os.path.join(output_dir, "edited_image.jpg")
                image.save(output_path)
                print(f"✅ Image saved to: {output_path}")
            
    else:
        print("No valid multimodal response received, check response structure")
except Exception as e:
    print(f"Error processing response: {str(e)}")

Output Example:

Choosing the Right Model

When to Choose Gemini:

  • When you need to leverage world knowledge and reasoning abilities to generate contextually relevant images.
  • When you need seamless integration of text and images.
  • When you want to embed accurate visual content in long text sequences.
  • When you want to edit images conversationally while maintaining context.

When to Choose Imagen:

  • When image quality, photorealism, artistic detail, or specific styles (such as impressionism, anime) are the primary considerations.
  • When performing professional editing tasks, such as product background updates or image enlargement.
  • When injecting branding, style, or generating logos and product designs.

Best Practices

  1. Optimize prompts: Carefully crafting prompts is key to obtaining high-quality output.
  2. Experiment with parameters: Try different aspect ratios and settings to find the configuration that best suits your needs.
  3. Batch generation: Generate multiple images to increase the chance of getting ideal results.
  4. Save metadata: Save prompts and timestamps along with images to track and replicate successful results.
  5. Comply with usage policies: Ensure your usage complies with Google’s content policies and terms of service.

Veo 2.0 Video Generation

VEO 2.0 is an advanced video generation AI model launched by Google, capable of creating high-quality, realistic short videos based on text prompts. This part will help you understand how to use the VEO 2.0 API to generate videos, including parameter settings, model selection, and code examples.

  1. Currently, VEO 2.0 only supports English prompts
  2. Video generation takes approximately 2-3 minutes, please be patient

Model Parameters

VEO 2.0 provides the following parameters:

  • numberOfVideos: The number of videos to generate, options are 1 or 2. Default is 2.
  • aspectRatio: The aspect ratio of the generated videos. Supported values are “16:9” and “9:16”.
  • durationSeconds: Video duration, options are 5 seconds or 8 seconds. Default is 8 seconds.
  • personGeneration: Controls whether to allow videos containing people. Supports the following values:
    • “dont_allow”: Prevents generation of videos containing people.
    • “allow_adult”: Allows generation of videos containing adults, but not children.

Pricing

The cost of the VEO 2.0 API is $0.35/s

Usage Example

Here’s a Python example of using VEO 2.0 to generate videos:

import os
import time
from google import genai
from google.genai import types

client = genai.Client(
    api_key="sk-***", # 🔑 Replace with your key generated on AiHubMix
    http_options={"base_url": "https://aihubmix.com/gemini"},
)

operation = client.models.generate_videos(
    model="veo-2.0-generate-001",
    prompt="Panning wide shot of a calico kitten sleeping in the sunshine",
    config=types.GenerateVideosConfig(
        person_generation="dont_allow",  # "dont_allow" or "allow_adult"
        aspect_ratio="16:9",  # "16:9" or "9:16"
        number_of_videos=1, # Integer, options are 1 or 2, default is 2
        durationSeconds=5, # Integer, options are 5 or 8, default is 8
    ),
)

# Takes 2-3 minutes, video duration is 5-8s
while not operation.done:
    time.sleep(20)
    operation = client.operations.get(operation)

for n, generated_video in enumerate(operation.response.generated_videos):
    client.files.download(file=generated_video.video)
    generated_video.video.save(f"video{n}.mp4")  # Save the video

Prompt Tips

Creating effective prompts is crucial for obtaining desired videos:

  • Describe clear scenes, actions, and atmosphere
  • Specify filming styles (such as panoramic, close-up, tracking shots, etc.)
  • Describe lighting conditions (such as sunny, dusk, indoor lighting, etc.)
  • Specify the main subject and its actions (e.g., “a kitten sleeping in the sunshine”)
  • Avoid overly complex narratives or rapidly changing scenes
  • Avoid negative or prohibited content

Best Practices

  1. Clear and concise prompts: Use clear, specific descriptions to guide video generation.
  2. Patience is key: Video generation takes 2-3 minutes, please wait for completion.
  3. Test different parameters: Try different aspect ratios and durations to find the settings that best suit your needs.
  4. Save generation records: Record prompts along with generated videos to track successful results.
  5. Comply with usage policies: Ensure your usage complies with Google’s content policies and terms of use.