Meta is open-sourcing an AI tool called ImageBind that predicts connections between types of data similarly to how humans perceive or imagine an environment. While image generators like Midjourney, Stable Diffusion and DALL-E 2 pair words with images, allowing you to generate visual scenes based only on a text description, ImageBind casts a broader net. It can link text, images/videos, audio, 3D measurements (depth), temperature data (thermal) and motion data (from inertial measurement units), and it does this without having to first train on every possibility. It's an early stage of a framework that could eventually generate complex environments from an input as simple as a text prompt, image or audio recording (or some combination of the three).

You could view ImageBind as moving machine learning closer to human learning. For example, if you're standing in a stimulating environment like a busy city street, your brain (largely unconsciously) absorbs the sights, sounds and other sensory experiences to infer information about passing cars and pedestrians, tall buildings, weather and much more. Humans and other animals evolved to process this data for our genetic advantage: survival and passing on our DNA. (The more aware you are of your surroundings, the more you can avoid danger and adapt to your environment for better survival and prosperity.) As computers get closer to mimicking animals' multi-sensory connections, they can use those links to generate fully realized scenes based only on limited chunks of data.

So, while you can use Midjourney to prompt "a basset hound wearing a Gandalf outfit while balancing on a beach ball" and get a relatively realistic photo of this bizarre scene, a multimodal AI tool like ImageBind may eventually create a video of the dog with corresponding sounds, including a detailed suburban living room, the room's temperature and the precise locations of the dog and anyone else in the scene.

"This creates distinctive opportunities to create animations out of static images by combining them with audio prompts," Meta researchers said today in a developer-focused blog post. "For example, a creator could couple an image with an alarm clock and a rooster crowing, and use a crowing audio prompt to segment the rooster or the sound of an alarm to segment the clock and animate both into a video sequence."

Meta's graph showing ImageBind's accuracy outperforming single-mode models.

"In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each respective modality," said Meta. "ImageBind shows that it's possible to create a joint embedding space across multiple modalities without needing to train on data with every different combination of modalities. This is important because it's not feasible for researchers to create datasets with samples that contain, for example, audio data and thermal data from a busy city street, or depth data and a text description of a seaside cliff."

Meta views the tech as eventually expanding beyond its current six "senses," so to speak.
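For developers, the open-source release makes the joint-embedding idea easy to see in code: each modality passes through its own encoder, but the resulting vectors land in one shared space, so an image can be compared directly against text or audio. Below is a minimal sketch modeled on the usage example in Meta's ImageBind repository; the sample file paths are placeholders, and the exact module layout may differ from what's shown here.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

# Placeholder inputs: three concepts expressed in three different modalities.
# The file names are hypothetical local samples, not files shipped with the repo.
text_list = ["a dog", "a car", "a bird"]
image_paths = ["dog.jpg", "car.jpg", "bird.jpg"]
audio_paths = ["dog.wav", "car.wav", "bird.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model (downloads weights on first run).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Each modality gets its own preprocessing, but every embedding ends up
# in the same joint space.
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Because the space is shared, a plain dot product compares across modalities:
# rows are images, columns are text prompts (or audio clips).
vision_x_text = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1
)
vision_x_audio = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1
)
print("Vision x Text:\n", vision_x_text)
print("Vision x Audio:\n", vision_x_audio)
```

Note that the model never needs paired training data for every combination; images act as the binding modality, which is why cross-modal pairings it was never explicitly trained on, such as audio matched against depth or thermal data, still line up in the shared space.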