Added Check For Unicode to BigBearScripts

:loudspeaker: Announcing the New big-bear-scripts/check-for-unicode: Your Shield Against Invisible Unicode Threats! :rocket:

I’m thrilled to introduce a new addition to the big-bear-scripts collection: the check-for-unicode script. It scans your files for subtle yet dangerous Unicode characters that can be exploited in various attacks.

The Problem: Hidden Dangers in Plain Sight

Unicode is essential for global communication, but it also allows complex character behaviors that can be misused. Invisible characters, bidirectional text overrides, and visually similar “homoglyphs” can hide malicious code, manipulate text rendering, and enable sophisticated phishing or “Trojan Source” attacks (such as CVE-2021-42574). These characters are often imperceptible to the human eye during code review or content moderation, which puts supply chain security and the integrity of your code at risk. AI systems are especially exposed, since they process hidden characters that a human reviewer never sees.
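To make the problem concrete, here is a small shell sketch. It is not part of check-for-unicode, and the variable and function names are made up for illustration. It builds two strings that can look the same on screen, one of which contains a zero-width space (U+200B), and dumps their bytes to show the difference.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names here are hypothetical, not from check-for-unicode.

# Print the bytes of a string as one continuous lowercase hex string.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

plain='admin'
# Same visible text with U+200B ZERO WIDTH SPACE (UTF-8 bytes E2 80 8B,
# written as octal escapes for portability) between "ad" and "min".
hidden="$(printf 'ad\342\200\213min')"

if [ "$plain" = "$hidden" ]; then
  echo "strings are identical"
else
  echo "strings differ even though they can look the same"
fi

echo "plain : $(hex_bytes "$plain")"
echo "hidden: $(hex_bytes "$hidden")"
```

The hex dump of the second string contains the extra bytes e2 80 8b, which most editors and terminals do not render at all.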

My Solution: Comprehensive Unicode Threat Detection

The new check-for-unicode/run.sh script offers a foundational scan to detect these hidden dangers.

What Does It Do?

This script detects over 50 types of potentially dangerous Unicode characters in the following categories:

  • Comprehensive Bidirectional Text Controls: Identifies characters used in “Trojan Source” attacks (CVE-2021-42574) that can visually reorder source code, making it appear different to humans than to compilers.
  • Expanded Invisible and Zero-Width Characters: Beyond common zero-width spaces, it targets a wide array of characters designed to be imperceptible.
  • Annotation and Formatting Characters: Detects characters that can hide information or alter how text is interpreted and presented.
  • Line and Paragraph Separators: Pinpoints characters that can be abused to break parsing logic or manipulate document structure.
  • Variation Selectors: Uncovers characters that can subtly change the visual representation of other characters, facilitating visual spoofing.
  • Script-Specific Fillers and Other Formats: Covers additional characters used in various language scripts that could be leveraged maliciously.

Detecting these characters helps defend against:

  • Trojan Source Attacks (CVE-2021-42574): Protect your codebase from hidden logic flaws.
  • Visual Spoofing and Homograph Attacks: Prevent deceptive visual representations of text.
  • Invisible Character Injection: Uncover hidden data or control flow manipulations.
  • Text Direction Manipulation: Detect attempts to alter text flow for obfuscation or attack.
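The character categories above can be illustrated with a much smaller scan. The sketch below is not the actual run.sh; it only checks a handful of well-known code points using grep with fixed-string patterns, and the function names suspicious_patterns and scan_file are hypothetical.

```shell
#!/usr/bin/env bash
# Minimal sketch, NOT the real check-for-unicode/run.sh.
# It covers only 8 code points; function names are hypothetical.

# Emit the UTF-8 bytes of each code point (as octal escapes), one per line.
suspicious_patterns() {
  printf '\342\200\213\n'  # U+200B ZERO WIDTH SPACE
  printf '\342\200\214\n'  # U+200C ZERO WIDTH NON-JOINER
  printf '\342\200\215\n'  # U+200D ZERO WIDTH JOINER
  printf '\342\200\250\n'  # U+2028 LINE SEPARATOR
  printf '\342\200\252\n'  # U+202A LEFT-TO-RIGHT EMBEDDING
  printf '\342\200\256\n'  # U+202E RIGHT-TO-LEFT OVERRIDE
  printf '\342\201\246\n'  # U+2066 LEFT-TO-RIGHT ISOLATE
  printf '\342\201\251\n'  # U+2069 POP DIRECTIONAL ISOLATE
}

# Print matching lines with line numbers.
# Exit status 0 means at least one suspicious character was found.
scan_file() {
  # grep treats a newline-separated list as alternative patterns;
  # -F matches them as fixed strings, -n adds line numbers.
  # LC_ALL=C makes grep compare raw bytes regardless of the user's locale.
  LC_ALL=C grep -nF -e "$(suspicious_patterns)" -- "$1"
}

# Demo: line 2 hides a zero-width space inside "username".
demo="$(mktemp)"
printf 'safe line\nuser\342\200\213name = "admin"\n' > "$demo"
if scan_file "$demo"; then
  echo "suspicious characters found in $demo"
fi
rm -f "$demo"
```

A real scanner needs a far longer list and has to handle whole ranges such as variation selectors, so treat this only as a way to see the idea in action.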

How it Helps You:

Whether you’re a developer, a security professional, or simply concerned about the integrity of your digital content, this script is an invaluable addition to your toolkit. It’s perfect for:

  • Code Review: Catching hidden vulnerabilities before they make it into production.
  • Content Moderation: Identifying potentially malicious text in user-generated content.
  • AI System Security: Protecting against Unicode-based prompt injection and data poisoning.
  • Data Validation: Ensuring clean and secure text data across your systems.

Get Started Now!

You can run the script with a quick command or by cloning the repository for local usage:

Quick Run (Remote)

bash -c "$(wget -qLO - https://raw.githubusercontent.com/bigbeartechworld/big-bear-scripts/master/check-for-unicode/run.sh)" -- .

Local Usage

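# Clone the repository, then scan a file or directory
git clone https://github.com/bigbeartechworld/big-bear-scripts.git
cd big-bear-scripts/check-for-unicode
./run.sh /path/to/your/code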

Find out more in the script’s README.

I’m excited to offer this new tool to help you secure your projects against advanced Unicode-based threats.

Credits

This tool grew out of a conversation about Unicode vulnerabilities in AI code.