Multi-modal capability has expanded significantly:
Native fusion: Models like Gemini 3 and Llama 4 are trained with text and visual content together, not bolted on separately. This produces deeper understanding.
Computer use: Claude Opus 4.5 and Sonnet 4.5 can interact with graphical interfaces, navigating websites and applications to complete tasks. Claude achieves 66.3% on OSWorld benchmarks for real-world computer tasks.
Video understanding: Models can process video content, understanding temporal sequences and extracting information across frames.
Document intelligence: Reading PDFs, forms, tables, and complex document layouts with high accuracy.
Audio processing: Native speech understanding and generation integrated with other modalities.