JoyAI-Image-Edit
JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the closed-loop collaboration between understanding, generation, and editing.
JoyAI-Image-Edit supports general image editing as well as spatial editing capabilities including object move, object rotation, and camera control.
| Model | Description | Download |
|---|---|---|
| JoyAI-Image-Edit | Instruction-guided image editing with precise and controllable spatial manipulation | Hugging Face |
import torch
from diffusers import JoyImageEditPipeline
from diffusers.utils import load_image
pipeline = JoyImageEditPipeline.from_pretrained(
"jdopensource/JoyAI-Image-Edit-Diffusers", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg")
prompt = "Add wings to the astronaut."
output = pipeline(
image=image,
prompt=prompt,
num_inference_steps=40,
guidance_scale=4.0,
generator=torch.Generator("cuda").manual_seed(0),
).images[0]
output.save("joyimage_edit_output.png")Spatial editing
JoyAI-Image supports three spatial editing prompt patterns: Object Move, Object Rotation, and Camera Control. For best results, follow the prompt templates below as closely as possible. For more information, refer to SpatialEdit.
Object Move
Move a target object into a specified region marked by a red box in the input image.
Move the <object> into the red box and finally remove the red box.
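For example, the red box can be drawn onto the input image with PIL before calling the pipeline. This is a minimal sketch: the box coordinates are illustrative, and it reuses the pipeline and image loaded in the quickstart above.
from PIL import ImageDraw
boxed = image.copy()
# Mark the target region with a red box (illustrative coordinates).
ImageDraw.Draw(boxed).rectangle((600, 100, 900, 400), outline="red", width=6)
prompt = "Move the astronaut into the red box and finally remove the red box."
output = pipeline(
    image=boxed,
    prompt=prompt,
    num_inference_steps=40,
    guidance_scale=4.0,
).images[0]
output.save("joyimage_object_move.png")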
Object Rotation
Rotate an object to a specific canonical view. Supported <view> values: front, right, left, rear, front right, front left, rear right, rear left.
Rotate the <object> to show the <view> side view.
Camera Control
Change the camera viewpoint while keeping the 3D scene unchanged.
Move the camera.
- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.
- Camera zoom: in/out/unchanged.
- Keep the 3D scene static; only change the viewpoint.
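As a sketch, a camera-control instruction can be assembled by filling this template with concrete values; the yaw and pitch numbers below are illustrative, and the pipeline and image from the quickstart above are assumed.
yaw, pitch = 30, -10  # illustrative rotation values in degrees
prompt = (
    "Move the camera.\n"
    f"- Camera rotation: Yaw {yaw}°, Pitch {pitch}°.\n"
    "- Camera zoom: unchanged.\n"
    "- Keep the 3D scene static; only change the viewpoint."
)
output = pipeline(image=image, prompt=prompt, num_inference_steps=40, guidance_scale=4.0).images[0]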
JoyImageEditPipeline
class diffusers.JoyImageEditPipeline
< source >( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLWan text_encoder: Qwen3VLForConditionalGeneration tokenizer: Qwen2Tokenizer transformer: JoyImageEditTransformer3DModel processor: Qwen3VLProcessor text_token_max_length: int = 2048 )
Diffusion pipeline for image editing using the JoyImage architecture.
The pipeline encodes text and image conditioning via a Qwen3-VL text encoder, denoises latents with a 3-D transformer, and decodes the result with a WAN VAE.
Model offloading order: text_encoder -> transformer -> vae.
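If GPU memory is tight, the pipeline components can be offloaded to the CPU between uses. The snippet below assumes JoyImageEditPipeline inherits the standard enable_model_cpu_offload() helper from DiffusionPipeline.
import torch
from diffusers import JoyImageEditPipeline

pipeline = JoyImageEditPipeline.from_pretrained(
    "jdopensource/JoyAI-Image-Edit-Diffusers", torch_dtype=torch.bfloat16
)
# Keep components on the CPU and move each to the GPU only while it runs,
# following the offloading order above (text_encoder -> transformer -> vae).
pipeline.enable_model_cpu_offload()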
__call__
< source >( image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor] | None = None prompt: str | list[str] = None height: int | None = None width: int | None = None num_inference_steps: int = 40 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 4096 enable_denormalization: bool = True ) → [~pipelines.joyimage.JoyImageEditPipelineOutput] or torch.Tensor
Parameters
- prompt (str or List[str]) — The prompt or prompts to guide generation.
- height (int) — Height of the generated output in pixels.
- width (int) — Width of the generated output in pixels.
- image (PipelineImageInput, optional) — Reference image used for conditioning. When provided, the pipeline operates in image-editing mode with num_items=2.
- num_inference_steps (int, optional, defaults to 40) — Number of denoising steps. More steps generally improve quality at the cost of slower inference.
- timesteps (List[int], optional) — Custom timesteps for the denoising process. When provided, num_inference_steps is inferred from the list length.
- sigmas (List[float], optional) — Custom sigmas for the denoising process. Mutually exclusive with timesteps.
- guidance_scale (float, optional, defaults to 4.0) — Classifier-free guidance scale.
- negative_prompt (str or List[str], optional) — Negative prompt(s) used to suppress undesired content.
- num_images_per_prompt (int, optional, defaults to 1) — Number of generated samples per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — RNG generator(s) for deterministic sampling.
- latents (torch.Tensor, optional) — Pre-generated noisy latents for the target slot. Sampled from a Gaussian distribution when not provided. Can be used to seed generation from a specific starting noise tensor.
- prompt_embeds (torch.Tensor, optional) — Pre-computed prompt embeddings. When provided, prompt can be omitted.
- prompt_embeds_mask (torch.Tensor, optional) — Attention mask for prompt_embeds.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-computed negative prompt embeddings.
- negative_prompt_embeds_mask (torch.Tensor, optional) — Attention mask for negative_prompt_embeds.
- output_type (str, optional, defaults to "pil") — Output format. Pass "latent" to return raw latents.
- return_dict (bool, optional, defaults to True) — Whether to return a JoyImageEditPipelineOutput or a plain tensor.
- callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — Callback invoked at the end of each denoising step with signature (self, step: int, timestep: int, callback_kwargs: Dict).
- callback_on_step_end_tensor_inputs (List[str], optional, defaults to ["latents"]) — Tensor keys included in callback_kwargs for callback_on_step_end.
- max_sequence_length (int, optional, defaults to 4096) — Maximum sequence length for prompt encoding.
- enable_denormalization (bool, optional, defaults to True) — Denormalize latents before VAE decoding.
Returns
[~pipelines.joyimage.JoyImageEditPipelineOutput] or torch.Tensor
If return_dict is True, returns a pipeline output object containing the generated image(s).
Otherwise returns the image tensor directly.
Generate an edited image conditioned on a reference image and a text prompt.
Examples:
>>> import torch
>>> from diffusers import JoyImageEditPipeline
>>> from diffusers.utils import load_image
>>> model_id = "jdopensource/JoyAI-Image-Edit-Diffusers"
>>> pipe = JoyImageEditPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/astronaut.jpg")
>>> output = pipe(
... image=image, # pass an image for editing; omit for text-to-image generation
... prompt="Add wings to the astronaut.",
... num_inference_steps=40,
... guidance_scale=4.0,
... generator=torch.manual_seed(0),
... )
>>> output.images[0].save("joyimage_edit.png")
check_inputs
< source >( prompt height width negative_prompt = None prompt_embeds = None negative_prompt_embeds = None prompt_embeds_mask = None negative_prompt_embeds_mask = None callback_on_step_end_tensor_inputs = None )
Raises
ValueError
ValueError — On any invalid combination of arguments.
Validate pipeline inputs before the forward pass.
Invert normalize_latents to recover the original latent scale.
encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 1024 template_type: str = 'image' )
Parameters
- prompt — Prompt string or list of prompt strings.
- device — Target device.
- num_images_per_prompt — Number of outputs to generate per prompt.
- prompt_embeds — Pre-computed prompt embeddings.
- prompt_embeds_mask — Attention mask for pre-computed embeddings.
- max_sequence_length — Maximum output sequence length.
- template_type — Prompt template key ("image" or "multiple_images").
Encode a text prompt into embeddings (text-only path).
Pre-computed prompt_embeds bypass encoding entirely.
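A minimal sketch of reusing pre-computed embeddings on the text-only path. The return signature is not documented here, so the assumption is that encode_prompt returns the embeddings together with their attention mask.
# Assumption: encode_prompt returns (prompt_embeds, prompt_embeds_mask).
prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(
    prompt="An astronaut with large white wings.",
    device=pipeline.device,
    num_images_per_prompt=1,
)
# Text-to-image path: no reference image is passed, so the prompt string can be omitted.
output = pipeline(
    prompt_embeds=prompt_embeds,
    prompt_embeds_mask=prompt_embeds_mask,
    num_inference_steps=40,
    guidance_scale=4.0,
).images[0]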
encode_prompt_multiple_images
< source >( prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 images: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None template_type: typing.Optional[str] = 'multiple_images' max_sequence_length: typing.Optional[int] = None )
Parameters
- prompt — Prompt string(s), optionally containing <image>\n tokens.
- device — Target device.
- num_images_per_prompt — Number of outputs to generate per prompt.
- images — Pixel tensors corresponding to the inline image tokens.
- prompt_embeds — Pre-computed prompt embeddings.
- prompt_embeds_mask — Attention mask for pre-computed embeddings.
- template_type — Must be "multiple_images".
- max_sequence_length — If set, truncate the output to this length (keeping the last max_sequence_length tokens).
Encode prompts that contain inline image tokens via the Qwen processor.
<image>\n placeholders in each prompt string are replaced by the Qwen vision special tokens before being
fed to the multimodal encoder.
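A hypothetical sketch of the multi-image path. The images argument is assumed to hold pixel tensors already preprocessed for the Qwen3-VL processor, and the return signature mirrors the assumption made for encode_prompt above.
prompt = "<image>\n<image>\nBlend the lighting of the second image into the first."
# Assumption: image_tensors holds two preprocessed pixel tensors, one per <image>\n token.
prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt_multiple_images(
    prompt=prompt,
    device=pipeline.device,
    images=image_tensors,
)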
normalize_latents
< source >( latent: Tensor )
Normalize latents using per-channel statistics from the VAE config.
Uses (latent - mean) / std when the VAE exposes latents_mean and latents_std; otherwise falls back to
scaling by scaling_factor.
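Conceptually, the normalization amounts to the sketch below. The 5-D broadcast shape assumes the video-style latent layout used elsewhere in this pipeline, and the config attribute names are taken from the description above.
import torch

def normalize_latents_sketch(latent: torch.Tensor, vae_config) -> torch.Tensor:
    mean = getattr(vae_config, "latents_mean", None)
    std = getattr(vae_config, "latents_std", None)
    if mean is not None and std is not None:
        # Per-channel statistics, broadcast over (batch, channel, frame, height, width).
        mean = torch.tensor(mean).view(1, -1, 1, 1, 1).to(latent)
        std = torch.tensor(std).view(1, -1, 1, 1, 1).to(latent)
        return (latent - mean) / std
    # Fallback: scale by the scalar VAE scaling factor.
    return latent * vae_config.scaling_factor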
prepare_latents
< source >( batch_size: int num_channels_latents: int height: int width: int video_length: int dtype: dtype device: device generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] latents: typing.Optional[torch.Tensor] = None image: typing.Optional[typing.List[PIL.Image.Image]] = None enable_denormalization: bool = True )
Parameters
- batch_size — Number of samples in the batch.
- num_channels_latents — Latent channel dimension from the transformer config.
- height — Spatial height in pixels.
- width — Spatial width in pixels.
- video_length — Number of frames (1 for image inference).
- dtype — Floating-point dtype for the latent tensor.
- device — Target device.
- generator — RNG generator(s) for reproducible sampling.
- latents — Optional user-provided initial noise for the target slot. When None, random noise is sampled.
- image — Optional list of PIL reference images to VAE-encode as conditioning slots.
- enable_denormalization — Whether to normalize encoded reference latents.
Raises
ValueError
ValueError — If generator is a list whose length differs from batch_size.
Prepare the initial noisy latent tensor for the denoising loop.
JoyImageEditPipelineOutput
class diffusers.JoyImageEditPipelineOutput
< source >( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )
Output class for JoyImageEdit generation pipelines.