Hacking your not-so-smart doorbell - Home Assistant and Gemini AI

At DEF CON 32, my colleague Andra Lezza and I presented a talk on building and securing LLM applications, particularly chatbots, drawing on our work at Sage. One of the highlights was a practical proof of concept: a smart home setup built on Home Assistant, which we used to demonstrate the safety implications and security considerations of AI-integrated applications. In this tutorial, I’ll guide you through implementing the same setup so you can experiment with AI-generated notifications and image analysis at your own front door.

Step 1: Setting Up the Smart Doorbell with Gemini AI for Image Recognition

In this setup, I integrated Gemini AI into Home Assistant for powerful image recognition. By subscribing to Gemini AI’s service, I was able to get accurate image analysis on a minimal budget – less than 50 cents per month – even when visitors ring multiple times. This setup allows you to achieve detailed notifications without depending on the smart capabilities of an advanced doorbell. A basic IPC (Internet Protocol Camera) and a standard doorbell are enough to get started, making it far more cost-effective than opting for commercial systems like Ring, which charge monthly subscriptions.

Enhanced Notifications with AI

The AI-powered notifications go beyond typical alerts by analyzing who’s at the door and describing the person’s appearance. If the system recognizes a familiar face, such as one of the homeowners, it includes that in the push notification.

Ability to describe the static image if someone pressed the doorbell.

With this approach, you get accurate image analysis without paying for ongoing monthly subscriptions, and you have more control over customization than with vendor systems.

Push Notification with the added ability to recognize the faces of the homeowners.

Step 2: Script for Image Analysis

To detect the doorbell button press, I used an ESP32 microcontroller running ESPHome, which integrates directly with Home Assistant. When the button is pressed, the ESP32 detects the voltage change and reports the press event, which starts the AI image analysis sequence. This gives reliable event detection and ensures the image snapshots are analyzed immediately.
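
As a rough sketch of the ESPHome side, a GPIO binary sensor is one way to expose the button press to Home Assistant. The pin number, the sensor name and the debounce time below are placeholders for illustration, and the sketch assumes the doorbell circuit has already been brought down to a safe 3.3 V logic level (for example with an optocoupler) before it reaches the ESP32.

```yaml
# ESPHome device configuration (sketch; pin and names are placeholders)
binary_sensor:
  - platform: gpio
    name: "Doorbell Button"
    pin:
      number: GPIO27        # assumed input pin, adjust to your wiring
      mode: INPUT_PULLUP
      inverted: true        # assumes a press pulls the pin low
    filters:
      - delayed_on: 50ms    # simple debounce so one press yields one event
```

Once the device is adopted, the sensor shows up in Home Assistant as a binary sensor whose state changes to "on" while the button is held.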

The following script runs each time the doorbell button is pressed. Gemini analyzes the images, and the script parses the response into structured variables for notifications. Here’s how to set it up in Home Assistant:

alias: Analyze Image Sequence using Gemini AI and store output
sequence:
  - data:
      image_filename:
        - /config/www/images443/doorbell-live/doorbell_3.jpg
        - /config/www/images443/doorbell-live/doorbell_2.jpg
        - /config/www/images443/doorbell-live/doorbell_1.jpg
      prompt: REPLACE_WITH_PROMPT
    response_variable: generated_content
    action: google_generative_ai_conversation.generate_content
  - sequence:
      - variables:
          content_text: "{{ generated_content.text }}"
      - variables:
          # The label must match the "Title:" line requested in the prompt
          titel: |
            {{ content_text | regex_findall_index('Title: (.*?)\n', 0) }}
          message: |
            {{ content_text | regex_findall_index('Message: (.*?)\n', 0) }}
          code: |
            {{ content_text | regex_findall_index('Code: (.*)', 0) }}
      - data:
          entity_id: input_text.kamera_sequenz_ai_analysiert_titel
          value: "{{ titel }}"
        action: input_text.set_value
      - data:
          entity_id: input_text.kamera_sequenz_ai_analysiert_inhalt
          value: "{{ message }}"
        action: input_text.set_value
      - data:
          entity_id: input_text.kamera_sequenz_ai_analysiert_code
          value: "{{ code }}"
        action: input_text.set_value
  - metadata: {}
    data:
      value: "{{generated_content.text}}"
    target:
      entity_id: input_text.kamera_sequenz_ai_analysiert
    action: input_text.set_value
    continue_on_error: true
    enabled: false
  - stop: All Done
    response_variable: generated_content
description: AI Gemini
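
The script has no trigger of its own, so something has to call it when the button is pressed. A minimal automation sketch could look like the following; both entity IDs are hypothetical and depend on how your ESPHome sensor and the script are named.

```yaml
alias: Run AI analysis when the doorbell is pressed
mode: single
triggers:
  - trigger: state
    entity_id: binary_sensor.doorbell_button        # hypothetical ESPHome entity
    to: "on"
actions:
  - action: script.analyze_image_sequence_gemini    # hypothetical script entity ID
```

With `mode: single`, a second press while the analysis is still running does not start a parallel run.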

To store these outputs for later use, set up input_text helpers for each variable, enabling you to use the data in notifications or even text-to-speech announcements on your home speakers.
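
If you prefer YAML over the Helpers UI, the three helpers can be sketched as below. The entity keys match the ones used in the script; the friendly names are placeholders. Entity states in Home Assistant are limited to 255 characters and `input_text` defaults to a maximum of 100, so the message helper needs a higher `max` to hold the 250-character message requested in the prompt.

```yaml
# configuration.yaml (sketch; friendly names are placeholders)
input_text:
  kamera_sequenz_ai_analysiert_titel:
    name: AI doorbell title
    max: 100
  kamera_sequenz_ai_analysiert_inhalt:
    name: AI doorbell message
    max: 255    # upper limit for any entity state
  kamera_sequenz_ai_analysiert_code:
    name: AI doorbell code
    max: 100
```

The same limit means the complete raw response usually will not fit into a single helper.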

Step 3: Crafting the Prompt for Detailed Image Analysis

The prompt I use with Gemini AI has been refined over time and now consistently produces detailed descriptions. Here’s the current prompt setup:

Describe precisely and in detail what is visible in the image sequence, which consists of three photos taken by my surveillance camera at the front door. The camera was triggered when the doorbell was pressed.
• No static objects/buildings.
• Nothing moving in all images? Answer solely with “Nothing detected.”
• No obvious contexts without details (e.g., “Person at the front door, ringing the bell, making a movement”).
• Do not list analysis criteria or mention what was not done or detected.
• No time or date.
• Person: clothing (uniforms, logos, colors), gender, identifiable expression, emotion, gestures.
• Movement direction: to the left toward the garage, to the right from the driveway, through the door into the house.
• Interactions: knocking, ringing, leaving a package/delivery, tools/flyers in hand, conversations, official actions.
• Posture: upright, bent, searching, delivering, repairing.
• Other moving objects: vehicles (delivery vans, cars with company logos), animals, people.
• Answer with “Secret code” if a person shows the “OK” sign (👌🏼).

Always respond in the following format:
Title: Maximum of 60 characters, a short title for the push notification.
Message: Message for the push notification, maximum 250 characters.
Code: Respond only with “Nothing detected,” “Secret code,” “Delivery service,” “Person,” or “Other.”

This structured prompt ensures the AI provides only relevant details, skipping unnecessary context and making notifications concise and accurate.
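
When editing the script in YAML mode, a multi-line prompt containing colons and quotation marks cannot simply be pasted after `prompt:`. One option, shown here as a sketch, is a literal block scalar (`|`), which passes every indented line through unchanged. The prompt is abbreviated in this example; paste the full text at the same indentation.

```yaml
- action: google_generative_ai_conversation.generate_content
  response_variable: generated_content
  data:
    image_filename:
      - /config/www/images443/doorbell-live/doorbell_3.jpg
      - /config/www/images443/doorbell-live/doorbell_2.jpg
      - /config/www/images443/doorbell-live/doorbell_1.jpg
    # "|" starts a literal block: the indented lines below are kept verbatim
    prompt: |
      Describe precisely and in detail what is visible in the image sequence.
      • No static objects/buildings.
      • No time or date.

      Always respond in the following format:
      Title: Maximum of 60 characters, a short title for the push notification.
      Message: Message for the push notification, maximum 250 characters.
      Code: Respond only with “Nothing detected,” “Secret code,” “Delivery service,” “Person,” or “Other.”
```

The block ends at the first line that is indented less than the prompt text, so keep the indentation consistent.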

Step 4: Voice Notifications and Mobile Push Alerts

With this automation, you can receive mobile push notifications and speaker announcements when someone rings the doorbell. Here’s the configuration:

alias: Speak at Wallpanel and send push notification with AI Output
description: ""
mode: single
triggers:
  - entity_id:
      - input_text.kamera_sequenz_ai_analysiert_inhalt
    from: null
    to: null
    trigger: state
conditions:
  - condition: not
    conditions:
      - condition: state
        entity_id: input_text.kamera_sequenz_ai_analysiert_code
        state: Nothing detected
actions:
  - parallel:
      - metadata: {}
        data:
          message: >-
            {{states('input_text.kamera_sequenz_ai_analysiert_titel')}}.
            {{states('input_text.kamera_sequenz_ai_analysiert_inhalt')}}
        action: rest_command.wallpanel_speak
      - metadata: {}
        data:
          title: "{{states('input_text.kamera_sequenz_ai_analysiert_titel')}}"
          message: "{{states('input_text.kamera_sequenz_ai_analysiert_inhalt')}}"
          data:
            actions:
              - action: open_door
                title: Open door
                destructive: true
                icon: sfsymbols:bell
                authenticationRequired: true
                activationMode: background
              - action: URI
                title: Show camera
                uri: /my-smarthome/alarmanlage
                icon: sfsymbols:bell
                destructive: false
                authenticationRequired: false
            tag: doorbell_ai
            sticky: true
            channel: Doorbell AI
            priority: high
            ttl: 0
            color: blue
            importance: high
            vibrationPattern: 100, 1000, 100, 1000, 100
            ledColor: blue
            persistent: false
            visibility: public
            alert_once: false
            notification_icon: mdi:bell
            push:
              category: camera
              interruption-level: active
              sound:
                name: default
                critical: 0
                volume: 1
        action: notify.all_apps
    enabled: true

Each notification includes interactive options, like opening the door or viewing the live camera feed, giving you full control over your home’s entry point.
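
Tapping a notification button does nothing by itself: the companion app fires a `mobile_app_notification_action` event that a second automation has to handle. The following is a sketch of such a handler for the `open_door` action; the lock entity is hypothetical, and your door opener may be a switch or a different action entirely.

```yaml
alias: Handle "Open door" tap from the doorbell notification
mode: single
triggers:
  - trigger: event
    event_type: mobile_app_notification_action
    event_data:
      action: open_door            # matches the action ID in the notification
actions:
  - action: lock.unlock
    target:
      entity_id: lock.front_door   # hypothetical lock entity
```

Because the notification sets `authenticationRequired: true` for this action, the phone has to be unlocked before the event is sent.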

Step 5: Capturing Live Camera Sequences

For immediate analysis, I set up a sequence of images that the AI can review as soon as the doorbell rings. Here’s the shell command that runs every 5 seconds:

shell_command:
  ipc_create_live_sequence: '/bin/bash /config/custom_scripts/ipc_create_live_sequence.sh'
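
The `shell_command` entry only defines the command; the 5-second schedule needs an automation. A minimal sketch using a `time_pattern` trigger (the alias is a placeholder):

```yaml
alias: Refresh doorbell image sequence
mode: single
triggers:
  - trigger: time_pattern
    seconds: "/5"     # fires whenever the seconds value is divisible by 5
actions:
  - action: shell_command.ipc_create_live_sequence
```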

And here’s the shell script:

#!/bin/bash

# Directory that holds the rolling three-image sequence
DIR="/config/www/images443/doorbell-live"

# Download the latest snapshot with a timeout of 3 seconds. wget -O creates
# the output file even when the request fails, so check the exit status and
# that the file is non-empty before rotating the sequence.
if timeout 3 wget -q "http://192.168.178.XXX/cgi-bin/api.cgi?cmd=Snap&user=XXX&password=XXX&width=854&height=480" -O "$DIR/doorbell_new.jpg" \
    && [ -s "$DIR/doorbell_new.jpg" ]; then
  # Shift the older images: doorbell_2 -> doorbell_3, then doorbell_1 -> doorbell_2
  [ -f "$DIR/doorbell_2.jpg" ] && mv "$DIR/doorbell_2.jpg" "$DIR/doorbell_3.jpg"
  [ -f "$DIR/doorbell_1.jpg" ] && mv "$DIR/doorbell_1.jpg" "$DIR/doorbell_2.jpg"

  # The new snapshot becomes doorbell_1.jpg, the most recent image
  mv "$DIR/doorbell_new.jpg" "$DIR/doorbell_1.jpg"
else
  # Discard an empty or partial download so it is never rotated in
  rm -f "$DIR/doorbell_new.jpg"
fi

Security Insights: Avoiding Prompt Injection Risks

Initially, I added a feature to unlock the door automatically if the camera detected one of our faces. However, this presented a serious security risk: someone could hold up a printed photo of a face, or use indirect prompt injection, for example a written note held up to the camera that the model reads as an instruction. To avoid such vulnerabilities, I removed this feature and recommend requiring manual confirmation for critical actions.

Note used by an intruder to unlock my front door with an indirect prompt injection.
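
A related precaution is to treat the model’s output as untrusted text. As a sketch, and not something that replaces manual confirmation, an automation can refuse to continue unless the stored code is exactly one of the values the prompt allows. The list below is copied from the prompt’s Code line; which values you accept is your own decision.

```yaml
conditions:
  - condition: template
    # Continue only if the code is one of the expected values, character for character
    value_template: >-
      {{ states('input_text.kamera_sequenz_ai_analysiert_code') in
         ['Secret code', 'Delivery service', 'Person', 'Other'] }}
```

This only constrains the code field. The title and message remain free text chosen by the model and should never drive an action.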

Watch the Full Talk

2 Comments

ColtonYYZ 20. November 2024

Where you say to “REPLACE_WITH_PROMPT”, am I to paste your whole prompt you posted in that spot? If I do, the yaml barks at me due to errors. Please advise. Thank you!

- Javan Rasokat 21. November 2024
  
  Before replacing the text, switch to UI-mode instead of yaml and then paste the text of the prompt into the input field.