Diffusers documentation
LongCatImageTransformer2DModel
The model can be loaded with the following code snippet.
import torch
from diffusers import LongCatImageTransformer2DModel
transformer = LongCatImageTransformer2DModel.from_pretrained("meituan-longcat/LongCat-Image", subfolder="transformer", torch_dtype=torch.bfloat16)
LongCatImageTransformer2DModel
class diffusers.LongCatImageTransformer2DModel
( patch_size: int = 1, in_channels: int = 64, num_layers: int = 19, num_single_layers: int = 38, attention_head_dim: int = 128, num_attention_heads: int = 24, joint_attention_dim: int = 3584, pooled_projection_dim: int = 3584, axes_dims_rope: list = [16, 56, 56] )
The Transformer model introduced in LongCat-Image.
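The constructor defaults imply a few derived sizes. The following is a minimal sketch in plain Python. It assumes the Flux-style convention that the hidden width equals num_attention_heads * attention_head_dim and that the entries of axes_dims_rope together span one attention head. Neither assumption is stated on this page. `LongCatConfigSketch` is a hypothetical stand-in for illustration, not a diffusers class.

```python
from dataclasses import dataclass, field


@dataclass
class LongCatConfigSketch:
    """Hypothetical stand-in mirroring the documented constructor defaults."""

    patch_size: int = 1
    in_channels: int = 64
    num_layers: int = 19
    num_single_layers: int = 38
    attention_head_dim: int = 128
    num_attention_heads: int = 24
    joint_attention_dim: int = 3584
    pooled_projection_dim: int = 3584
    axes_dims_rope: list = field(default_factory=lambda: [16, 56, 56])

    @property
    def inner_dim(self) -> int:
        # Assumption: hidden width = heads * head_dim (Flux-style convention).
        return self.num_attention_heads * self.attention_head_dim

    @property
    def rope_dim(self) -> int:
        # Total number of channels covered by the rotary embedding axes.
        return sum(self.axes_dims_rope)


cfg = LongCatConfigSketch()
print(cfg.inner_dim)  # 3072
print(cfg.rope_dim)   # 128
print(cfg.num_layers + cfg.num_single_layers)  # 57
```

Under these assumptions, the three rotary axes (16 + 56 + 56) add up to the 128-dimensional attention head, which is a useful sanity check when overriding the defaults.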
forward
( hidden_states: Tensor, encoder_hidden_states: Tensor = None, timestep: LongTensor = None, img_ids: Tensor = None, txt_ids: Tensor = None, guidance: Tensor = None, return_dict: bool = True )
Parameters
- hidden_states (torch.FloatTensor of shape (batch size, channel, height, width)): Input hidden_states.
- encoder_hidden_states (torch.FloatTensor of shape (batch size, sequence_len, embed_dims)): Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
- timestep (torch.LongTensor): Used to indicate the denoising step.
- img_ids (torch.Tensor): Image position ids used to compute the rotary positional embeddings.
- txt_ids (torch.Tensor): Text position ids used to compute the rotary positional embeddings.
- guidance (torch.Tensor, optional): Guidance scale embedding used for guidance-distilled variants of the model.
- return_dict (bool, optional, defaults to True): Whether or not to return a ~models.transformer_2d.Transformer2DModelOutput instead of a plain tuple.
The forward method.
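Because return_dict switches the return type between an output object and a plain tuple, calling code may want to handle both forms. The sketch below assumes the output object exposes the prediction as a `sample` attribute (as Transformer2DModelOutput does) and that the tuple form carries it as its first element. `unwrap_sample` is a hypothetical helper, and the stand-in objects replace a real model call so the example runs without downloading weights.

```python
from types import SimpleNamespace


def unwrap_sample(output):
    """Return the predicted tensor from either return form of forward()."""
    if isinstance(output, tuple):
        # return_dict=False: assumed to be a one-element tuple (sample,).
        return output[0]
    # return_dict=True: an output object with a `sample` attribute.
    return output.sample


# Stand-ins for what a real forward() call would return.
fake_prediction = [0.1, 0.2]
as_object = SimpleNamespace(sample=fake_prediction)
as_tuple = (fake_prediction,)

same = unwrap_sample(as_object) is unwrap_sample(as_tuple)
print(same)  # True
```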