
Hacking your not-so-smart doorbell – Home Assistant and Gemini AI

At DEFCON32, my colleague Andra Lezza and I presented a talk on building and securing LLM applications – particularly chatbots – drawing from our work at Sage. One of the highlights of our talk was a practical proof of concept: a smart home setup using Home-Assistant.io, which we showcased to demonstrate safety implications and security considerations of AI-integrated applications. In this tutorial, I’ll guide you through implementing the same setup so you can experiment with advanced AI notifications and image analysis at your own front door.

Step 1: Setting Up the Smart Doorbell with Gemini AI for Image Recognition

In this setup, I integrated Gemini AI into Home Assistant for powerful image recognition. By subscribing to Gemini AI’s service, I was able to get accurate image analysis on a minimal budget – less than 50 cents per month – even when visitors ring multiple times. This setup allows you to achieve detailed notifications without depending on the smart capabilities of an advanced doorbell. A basic IPC (Internet Protocol Camera) and a standard doorbell are enough to get started, making it far more cost-effective than opting for commercial systems like Ring, which charge monthly subscriptions.

Enhanced Notifications with AI

The AI-powered notifications go beyond typical alerts by analyzing who’s at the door and describing the person’s appearance. If the system detects a familiar face, it includes that detail in the push notification.

Example notification: a description of the static image captured when someone presses the doorbell.

With this approach, you get accurate image analysis without paying for ongoing monthly subscriptions, and you have more control over customization than with vendor systems.

Example notification: a push notification with the added ability to recognize the faces of the homeowners.

Step 2: Script for Image Analysis

To detect presses of the doorbell button, I used an ESP32 microcontroller with ESPHome installed, which integrates seamlessly with Home Assistant. When the button is pressed, the ESP32 detects the voltage change and fires a push event, which starts the AI image analysis sequence. This gives reliable event detection and ensures the AI processes the image snapshots immediately.
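The ESPHome configuration itself is not shown in this post. A minimal sketch of a button sensor might look like the following. The pin number, the sensor name, and the pull-up wiring are assumptions for illustration, and it presumes the doorbell signal has already been brought down to a safe 3.3 V logic level (for example with an optocoupler) before it reaches the ESP32.

```yaml
# ESPHome device config fragment (hypothetical pin and name)
binary_sensor:
  - platform: gpio
    name: "Doorbell Button"
    pin:
      number: GPIO13      # assumption: use the pin wired to your bell circuit
      mode:
        input: true
        pullup: true
      inverted: true      # pressed = pin pulled low
    filters:
      - delayed_on: 50ms  # simple debounce against contact bounce
```

Once the device is adopted in Home Assistant, the sensor appears as a binary_sensor entity that automations can use as a trigger.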

This script is triggered each time the doorbell button is pressed. The AI then processes the images and parses the output into structured variables for notifications. Here’s how to set it up in Home Assistant:

alias: Analyze Image Sequence using Gemini AI and store output
sequence:
  # Send the three most recent snapshots to Gemini together with the prompt
  - data:
      image_filename:
        - /config/www/images443/doorbell-live/doorbell_3.jpg
        - /config/www/images443/doorbell-live/doorbell_2.jpg
        - /config/www/images443/doorbell-live/doorbell_1.jpg
      prompt: REPLACE_WITH_PROMPT
    response_variable: generated_content
    action: google_generative_ai_conversation.generate_content
  - sequence:
      - variables:
          content_text: "{{ generated_content.text }}"
      # The labels must match the response format requested in the prompt
      # (Title / Message / Code)
      - variables:
          titel: |
            {{ content_text | regex_findall_index('Title: (.*?)\n', 0) }}
          message: |
            {{ content_text | regex_findall_index('Message: (.*?)\n', 0) }}
          code: |
            {{ content_text | regex_findall_index('Code: (.*)', 0) }}
      # Store each parsed field in its own input_text helper
      - data:
          entity_id: input_text.kamera_sequenz_ai_analysiert_titel
          value: "{{ titel }}"
        action: input_text.set_value
      - data:
          entity_id: input_text.kamera_sequenz_ai_analysiert_inhalt
          value: "{{ message }}"
        action: input_text.set_value
      - data:
          entity_id: input_text.kamera_sequenz_ai_analysiert_code
          value: "{{ code }}"
        action: input_text.set_value
  # Optional: store the raw response (disabled)
  - metadata: {}
    data:
      value: "{{generated_content.text}}"
    target:
      entity_id: input_text.kamera_sequenz_ai_analysiert
    action: input_text.set_value
    continue_on_error: true
    enabled: false
  - stop: All Done
    response_variable: generated_content
description: AI Gemini
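The post does not show how the button press starts this script. One possible way is a small automation like the sketch below. Both the entity ID binary_sensor.doorbell_button and the script ID script.analyze_image_sequence are hypothetical names, so substitute the ones from your own setup.

```yaml
alias: Doorbell pressed - run AI image analysis
mode: single
triggers:
  - trigger: state
    entity_id: binary_sensor.doorbell_button   # hypothetical ESPHome entity
    to: "on"
actions:
  - action: script.analyze_image_sequence      # hypothetical ID of the script above
```

With mode: single, a second press that arrives while the analysis is still running does not start a parallel run.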

To store these outputs for later use, set up input_text helpers for each variable, enabling you to use the data in notifications or even text-to-speech announcements on your home speakers.
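The helpers can be created in the UI or in configuration.yaml. A minimal sketch of the YAML variant follows. The object IDs match the entity IDs used in the script, while the friendly names and length limits are assumptions. Note that input_text defaults to a maximum of 100 characters and cannot exceed 255, so the message helper needs its limit raised to hold the 250-character message the prompt allows.

```yaml
input_text:
  kamera_sequenz_ai_analysiert_titel:
    name: AI doorbell title       # assumed friendly name
    max: 100
  kamera_sequenz_ai_analysiert_inhalt:
    name: AI doorbell message     # assumed friendly name
    max: 255                      # upper limit for an entity state
  kamera_sequenz_ai_analysiert_code:
    name: AI doorbell code        # assumed friendly name
    max: 50
```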

Step 3: Crafting the Prompt for Detailed Image Analysis

The prompt I use with Gemini AI has been refined over time and now consistently produces detailed descriptions. Here’s the current prompt setup:

Describe precisely and in detail what is visible in the image sequence, which consists of three photos taken by my surveillance camera at the front door. The camera was triggered when the doorbell was pressed.
• No static objects/buildings.
• Nothing moving in all images? Answer solely with “Nothing detected.”
• No obvious contexts without details (e.g., “Person at the front door, ringing the bell, making a movement”).
• Do not list analysis criteria or mention what was not done or detected.
• No time or date.
• Person: clothing (uniforms, logos, colors), gender, identifiable expression, emotion, gestures.
• Movement direction: to the left toward the garage, to the right from the driveway, through the door into the house.
• Interactions: knocking, ringing, leaving a package/delivery, tools/flyers in hand, conversations, official actions.
• Posture: upright, bent, searching, delivering, repairing.
• Other moving objects: vehicles (delivery vans, cars with company logos), animals, people.
• Answer with “Secret code” if a person shows the “OK” sign (👌🏼).

Always respond in the following format:
Title: Maximum of 60 characters, a short title for the push notification.
Message: Message for the push notification, maximum 250 characters.
Code: Respond only with “Nothing detected,” “Secret code,” “Delivery service,” “Person,” or “Other.”

This structured prompt ensures the AI provides only relevant details, skipping unnecessary context and making notifications concise and accurate.
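If you edit the script in YAML mode, pasting this multi-line prompt directly in place of REPLACE_WITH_PROMPT is likely to produce parse errors, because lines such as "Title: ..." contain a colon followed by a space. One way around this, sketched below with a shortened version of the prompt, is a YAML block scalar (|-), which keeps line breaks and treats colons and quotes as plain text. Every prompt line must be indented deeper than the prompt key.

```yaml
- action: google_generative_ai_conversation.generate_content
  data:
    # image_filename: same list of three snapshots as in the script
    prompt: |-
      Describe precisely and in detail what is visible in the image sequence, which consists of three photos taken by my surveillance camera at the front door.
      • No static objects/buildings.
      • Nothing moving in all images? Answer solely with “Nothing detected.”

      Always respond in the following format:
      Title: Maximum of 60 characters, a short title for the push notification.
      Message: Message for the push notification, maximum 250 characters.
      Code: Respond only with “Nothing detected,” “Secret code,” “Delivery service,” “Person,” or “Other.”
  response_variable: generated_content
```

Alternatively, switch the script editor to UI mode and paste the prompt into the input field, which handles the quoting for you.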

Step 4: Voice Notifications and Mobile Push Alerts

With this automation, you can receive mobile push notifications and speaker announcements when someone rings the doorbell. Here’s the configuration:

alias: Speak at Wallpanel and send push notification with AI Output
description: ""
mode: single
triggers:
  # Fire whenever the stored AI message changes
  - entity_id:
      - input_text.kamera_sequenz_ai_analysiert_inhalt
    from: null
    to: null
    trigger: state
conditions:
  # Skip the notification when the AI reported nothing.
  # The state must match the "Code" value requested in the prompt.
  - condition: not
    conditions:
      - condition: state
        entity_id: input_text.kamera_sequenz_ai_analysiert_code
        state: Nothing detected
actions:
  - parallel:
      # Text-to-speech announcement on the wall panel
      - metadata: {}
        data:
          message: >-
            {{states('input_text.kamera_sequenz_ai_analysiert_titel')}}.
            {{states('input_text.kamera_sequenz_ai_analysiert_inhalt')}}
        action: rest_command.wallpanel_speak
      # Actionable push notification to all mobile apps
      - metadata: {}
        data:
          title: "{{states('input_text.kamera_sequenz_ai_analysiert_titel')}}"
          message: "{{states('input_text.kamera_sequenz_ai_analysiert_inhalt')}}"
          data:
            actions:
              - action: open_door
                title: Open door
                destructive: true
                icon: sfsymbols:bell
                authenticationRequired: true
                activationMode: background
              - action: URI
                title: Show camera
                uri: /my-smarthome/alarmanlage
                icon: sfsymbols:bell
                destructive: false
                authenticationRequired: false
            tag: doorbell_ai
            sticky: true
            channel: Klingel AI
            priority: high
            ttl: 0
            color: blue
            importance: high
            vibrationPattern: 100, 1000, 100, 1000, 100
            ledColor: blue
            persistent: false
            visibility: public
            alert_once: false
            notification_icon: mdi:bell
            push:
              category: camera
              interruption-level: active
              sound:
                name: default
                critical: 0
                volume: 1
        action: notify.all_apps
        enabled: true

Each notification includes interactive options, like opening the door or viewing the live camera feed, giving you full control over your home’s entry point.
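Tapping the open_door button does nothing by itself. The Companion app fires a mobile_app_notification_action event, and a separate automation has to react to it. A minimal sketch of such a handler is shown below. The lock entity lock.front_door is a hypothetical name, and your door may instead be driven by a switch or a script.

```yaml
alias: Open door from notification action
mode: single
triggers:
  - trigger: event
    event_type: mobile_app_notification_action
    event_data:
      action: open_door          # matches the action ID in the notification
actions:
  - action: lock.unlock
    target:
      entity_id: lock.front_door # hypothetical entity, replace with your own
```

Because the door only opens after a person taps the button (and, with authenticationRequired, unlocks the phone first), the AI output never triggers the unlock on its own.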

Step 5: Capturing Live Camera Sequences

For immediate analysis, I set up a sequence of images that the AI can review as soon as the doorbell rings. Here’s the shell command that runs every 5 seconds:

shell_command:
ipc_create_live_sequence: '/bin/bash /config/custom_scripts/ipc_create_live_sequence.sh'

And here’s the shell script:

#!/bin/bash

DIR="/config/www/images443/doorbell-live"

# Attempt to download the latest image with a timeout of 3 seconds.
# wget -O creates the output file even when the download fails, so check
# the exit status and that the file is non-empty, not just that it exists.
if timeout 3 wget "http://192.168.178.XXX/cgi-bin/api.cgi?cmd=Snap&user=XXX&password=XXX&width=854&height=480" -O "$DIR/doorbell_new.jpg" \
    && [ -s "$DIR/doorbell_new.jpg" ]; then
    # Move doorbell_2.jpg to doorbell_3.jpg if it exists
    if [ -f "$DIR/doorbell_2.jpg" ]; then
        mv "$DIR/doorbell_2.jpg" "$DIR/doorbell_3.jpg" 2>/dev/null || true
    fi

    # Move doorbell_1.jpg to doorbell_2.jpg if it exists
    if [ -f "$DIR/doorbell_1.jpg" ]; then
        mv "$DIR/doorbell_1.jpg" "$DIR/doorbell_2.jpg" 2>/dev/null || true
    fi

    # Rename the new image to doorbell_1.jpg
    mv "$DIR/doorbell_new.jpg" "$DIR/doorbell_1.jpg" 2>/dev/null
else
    # Discard an empty or partial download so it never enters the sequence
    rm -f "$DIR/doorbell_new.jpg"
fi
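The scheduling that runs this command every 5 seconds is not shown in the post. One way to do it, as a sketch, is a time_pattern automation that calls the shell command defined earlier. The alias is an assumption.

```yaml
alias: Refresh doorbell image sequence
mode: single
triggers:
  - trigger: time_pattern
    seconds: "/5"    # every 5 seconds
actions:
  - action: shell_command.ipc_create_live_sequence
```

Since the download is capped at 3 seconds, each run should finish before the next trigger fires.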

Security Insights: Avoiding Prompt Injection Risks

Initially, I added a feature to unlock the door automatically if the camera detected one of our faces. However, this presented a serious security risk, as someone could hold up a printed image or use indirect prompt injection to bypass the system. To avoid such vulnerabilities, I removed this feature and recommend using manual confirmation for critical actions.

Note used by an intruder to unlock my front door with an indirect prompt injection.

Watch the Full Talk

2 Comments

  1. ColtonYYZ, 20 November 2024

    Where you say to “REPLACE_WITH_PROMPT”, am I to paste your whole prompt you posted in that spot? If I do, the yaml barks at me due to errors. Please advise. Thank you!

    • Javan Rasokat, 21 November 2024

      Before replacing the text, switch to UI mode instead of YAML and then paste the text of the prompt into the input field.
