JoyAI-Image-Edit
JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the closed-loop collaboration between understanding, generation, and editing.
JoyAI-Image-Edit supports general image editing as well as spatial editing capabilities including object move, object rotation, and camera control.
| Model | Description | Download |
|---|---|---|
| JoyAI-Image-Edit | Instruction-guided image editing with precise and controllable spatial manipulation | Hugging Face |
import torch
from diffusers import JoyImageEditPipeline
from diffusers.utils import load_image
pipeline = JoyImageEditPipeline.from_pretrained(
"jdopensource/JoyAI-Image-Edit-Diffusers", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg")
prompt = "Add wings to the astronaut."
output = pipeline(
image=image,
prompt=prompt,
num_inference_steps=40,
guidance_scale=4.0,
generator=torch.Generator("cuda").manual_seed(0),
).images[0]
output.save("joyimage_edit_output.png")Spatial editing
JoyAI-Image supports three spatial editing prompt patterns: Object Move, Object Rotation, and Camera Control. For best results, follow the prompt templates below as closely as possible. For more information, refer to SpatialEdit.
Object Move
Move a target object into a specified region marked by a red box in the input image.
Move the <object> into the red box and finally remove the red box.
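For example, the red box can be drawn onto the input image with PIL before calling the pipeline. This is a minimal sketch: the box coordinates are illustrative, and it reuses the pipeline and image loaded in the quickstart above.
from PIL import ImageDraw
boxed = image.copy()
# Mark the target region with a red box (illustrative coordinates).
ImageDraw.Draw(boxed).rectangle((600, 100, 900, 400), outline="red", width=6)
prompt = "Move the astronaut into the red box and finally remove the red box."
output = pipeline(
    image=boxed,
    prompt=prompt,
    num_inference_steps=40,
    guidance_scale=4.0,
).images[0]
output.save("joyimage_object_move.png")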
Object Rotation
Rotate an object to a specific canonical view. Supported <view> values: front, right, left, rear, front right, front left, rear right, rear left.
Rotate the <object> to show the <view> side view.
Camera Control
Change the camera viewpoint while keeping the 3D scene unchanged.
Move the camera.
- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.
- Camera zoom: in/out/unchanged.
- Keep the 3D scene static; only change the viewpoint.
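As a sketch, a camera-control instruction can be assembled by filling this template with concrete values; the yaw and pitch numbers below are illustrative, and the pipeline and image from the quickstart above are assumed.
yaw, pitch = 30, -10  # illustrative rotation values in degrees
prompt = (
    "Move the camera.\n"
    f"- Camera rotation: Yaw {yaw}°, Pitch {pitch}°.\n"
    "- Camera zoom: unchanged.\n"
    "- Keep the 3D scene static; only change the viewpoint."
)
output = pipeline(image=image, prompt=prompt, num_inference_steps=40, guidance_scale=4.0).images[0]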
JoyImageEditPipeline
class diffusers.JoyImageEditPipeline
< source >( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLWan text_encoder: Qwen3VLForConditionalGeneration tokenizer: Qwen2Tokenizer transformer: JoyImageEditTransformer3DModel processor: Qwen3VLProcessor text_token_max_length: int = 2048 )
Diffusion pipeline for image editing using the JoyImage architecture.
The pipeline encodes text and image conditioning via a Qwen3-VL text encoder, denoises latents with a 3-D transformer, and decodes the result with a WAN VAE.
Model offloading order: text_encoder -> transformer -> vae.
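If GPU memory is tight, the pipeline components can be offloaded to the CPU between uses. The snippet below assumes JoyImageEditPipeline inherits the standard enable_model_cpu_offload() helper from DiffusionPipeline.
import torch
from diffusers import JoyImageEditPipeline

pipeline = JoyImageEditPipeline.from_pretrained(
    "jdopensource/JoyAI-Image-Edit-Diffusers", torch_dtype=torch.bfloat16
)
# Keep components on the CPU and move each to the GPU only while it runs,
# following the offloading order above (text_encoder -> transformer -> vae).
pipeline.enable_model_cpu_offload()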
__call__
< source >( image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor] | None = None prompt: str | list[str] = None height: int | None = None width: int | None = None num_inference_steps: int = 40 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 4096 enable_denormalization: bool = True ) → [~pipelines.joyimage.JoyImageEditPipelineOutput] or torch.Tensor
Parameters
- prompt (str or List[str]) — The prompt or prompts to guide generation.
- height (int) — Height of the generated output in pixels.
- width (int) — Width of the generated output in pixels.
- image (PipelineImageInput, optional) — Reference image used for conditioning. When provided, the pipeline operates in image-editing mode with num_items=2.
- num_inference_steps (int, optional, defaults to 40) — Number of denoising steps. More steps generally improve quality at the cost of slower inference.
- timesteps (List[int], optional) — Custom timesteps for the denoising process. When provided, num_inference_steps is inferred from the list length.
- sigmas (List[float], optional) — Custom sigmas for the denoising process. Mutually exclusive with timesteps.
- guidance_scale (float, optional, defaults to 4.0) — Classifier-free guidance scale.
- negative_prompt (str or List[str], optional) — Negative prompt(s) used to suppress undesired content.
- num_images_per_prompt (int, optional, defaults to 1) — Number of generated samples per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — RNG generator(s) for deterministic sampling.
- latents (torch.Tensor, optional) — Pre-generated noisy latents for the target slot. Sampled from a Gaussian distribution when not provided. Can be used to seed generation from a specific starting noise tensor.
- prompt_embeds (torch.Tensor, optional) — Pre-computed prompt embeddings. When provided, prompt can be omitted.
- prompt_embeds_mask (torch.Tensor, optional) — Attention mask for prompt_embeds.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-computed negative prompt embeddings.
- negative_prompt_embeds_mask (torch.Tensor, optional) — Attention mask for negative_prompt_embeds.
- output_type (str, optional, defaults to "pil") — Output format. Pass "latent" to return raw latents.
- return_dict (bool, optional, defaults to True) — Whether to return a JoyImageEditPipelineOutput or a plain tensor.
- callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — Callback invoked at the end of each denoising step with signature (self, step: int, timestep: int, callback_kwargs: Dict).
- callback_on_step_end_tensor_inputs (List[str], optional, defaults to ["latents"]) — Tensor keys included in callback_kwargs for callback_on_step_end.
- max_sequence_length (int, optional, defaults to 4096) — Maximum sequence length for prompt encoding.
- enable_denormalization (bool, optional, defaults to True) — Denormalize latents before VAE decoding.
Returns
[~pipelines.joyimage.JoyImageEditPipelineOutput] or torch.Tensor
If return_dict is True, returns a pipeline output object containing the generated image(s).
Otherwise returns the image tensor directly.
Generate an edited image conditioned on a reference image and a text prompt.
Examples:
>>> import torch
>>> from diffusers import JoyImageEditPipeline
>>> from diffusers.utils import load_image
>>> model_id = "jdopensource/JoyAI-Image-Edit-Diffusers"
>>> pipe = JoyImageEditPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/astronaut.jpg")
>>> output = pipe(
... image=image, # pass an image for editing; omit for text-to-image generation
... prompt="Add wings to the astronaut.",
... num_inference_steps=40,
... guidance_scale=4.0,
... generator=torch.manual_seed(0),
... )
>>> output.images[0].save("joyimage_edit.png")
check_inputs
< source >( prompt height width negative_prompt = None prompt_embeds = None negative_prompt_embeds = None prompt_embeds_mask = None negative_prompt_embeds_mask = None callback_on_step_end_tensor_inputs = None )
Raises
ValueError
ValueError — On any invalid combination of arguments.
Validate pipeline inputs before the forward pass.
Invert normalize_latents to recover the original latent scale.
encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 1024 template_type: str = 'image' )
Parameters
- prompt — Prompt string or list of prompt strings.
- device — Target device.
- num_images_per_prompt — Number of outputs to generate per prompt.
- prompt_embeds — Pre-computed prompt embeddings.
- prompt_embeds_mask — Attention mask for pre-computed embeddings.
- max_sequence_length — Maximum output sequence length.
- template_type — Prompt template key ("image" or "multiple_images").
Encode a text prompt into embeddings (text-only path).
Pre-computed prompt_embeds bypass encoding entirely.
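A minimal sketch of reusing pre-computed embeddings on the text-only path. The return signature is not documented here, so the assumption is that encode_prompt returns the embeddings together with their attention mask.
# Assumption: encode_prompt returns (prompt_embeds, prompt_embeds_mask).
prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(
    prompt="An astronaut with large white wings.",
    device=pipeline.device,
    num_images_per_prompt=1,
)
# Text-to-image path: no reference image is passed, so the prompt string can be omitted.
output = pipeline(
    prompt_embeds=prompt_embeds,
    prompt_embeds_mask=prompt_embeds_mask,
    num_inference_steps=40,
    guidance_scale=4.0,
).images[0]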
encode_prompt_multiple_images
< source >( prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 images: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None template_type: typing.Optional[str] = 'multiple_images' max_sequence_length: typing.Optional[int] = None )
Parameters
- prompt — Prompt string(s), optionally containing <image>\n tokens.
- device — Target device.
- num_images_per_prompt — Number of outputs to generate per prompt.
- images — Pixel tensors corresponding to the inline image tokens.
- prompt_embeds — Pre-computed prompt embeddings.
- prompt_embeds_mask — Attention mask for pre-computed embeddings.
- template_type — Must be "multiple_images".
- max_sequence_length — If set, truncate the output to this length (keeping the last max_sequence_length tokens).
Encode prompts that contain inline image tokens via the Qwen processor.
<image>\n placeholders in each prompt string are replaced by the Qwen vision special tokens before being
fed to the multimodal encoder.
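A hypothetical sketch of the multi-image path. The images argument is assumed to hold pixel tensors already preprocessed for the Qwen3-VL processor, and the return signature mirrors the assumption made for encode_prompt above.
prompt = "<image>\n<image>\nBlend the lighting of the second image into the first."
# Assumption: image_tensors holds two preprocessed pixel tensors, one per <image>\n token.
prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt_multiple_images(
    prompt=prompt,
    device=pipeline.device,
    images=image_tensors,
)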
normalize_latents
< source >( latent: Tensor )
Normalize latents using per-channel statistics from the VAE config.
Uses (latent - mean) / std when the VAE exposes latents_mean and latents_std; otherwise falls back to
scaling by scaling_factor.
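Conceptually, the normalization amounts to the sketch below. The 5-D broadcast shape assumes the video-style latent layout used elsewhere in this pipeline, and the config attribute names are taken from the description above.
import torch

def normalize_latents_sketch(latent: torch.Tensor, vae_config) -> torch.Tensor:
    mean = getattr(vae_config, "latents_mean", None)
    std = getattr(vae_config, "latents_std", None)
    if mean is not None and std is not None:
        # Per-channel statistics, broadcast over (batch, channel, frame, height, width).
        mean = torch.tensor(mean).view(1, -1, 1, 1, 1).to(latent)
        std = torch.tensor(std).view(1, -1, 1, 1, 1).to(latent)
        return (latent - mean) / std
    # Fallback: scale by the scalar VAE scaling factor.
    return latent * vae_config.scaling_factor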
prepare_latents
< source >( batch_size: int num_channels_latents: int height: int width: int video_length: int dtype: dtype device: device generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] latents: typing.Optional[torch.Tensor] = None image: typing.Optional[typing.List[PIL.Image.Image]] = None enable_denormalization: bool = True )
Parameters
- batch_size — Number of samples in the batch.
- num_channels_latents — Latent channel dimension from the transformer config.
- height — Spatial height in pixels.
- width — Spatial width in pixels.
- video_length — Number of frames (1 for image inference).
- dtype — Floating-point dtype for the latent tensor.
- device — Target device.
- generator — RNG generator(s) for reproducible sampling.
- latents — Optional user-provided initial noise for the target slot. When None, random noise is sampled.
- image — Optional list of PIL reference images to VAE-encode as conditioning slots.
- enable_denormalization — Whether to normalize encoded reference latents.
Raises
ValueError
ValueError — If generator is a list whose length differs from batch_size.
Prepare the initial noisy latent tensor for the denoising loop.
JoyImageEditPipelineOutput
class diffusers.JoyImageEditPipelineOutput
< source >( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )
Output class for JoyImageEdit generation pipelines.