
Astrophysics & AI with Python: Hunting for Earth 2.0 with Kepler Data and Vision Transformers

The search for another Earth isn't just a job for astronomers with telescopes; it's a massive data challenge that calls for the skills of a data scientist. Since the Kepler Space Telescope began its mission in 2009, we have moved from the era of rare, manual discoveries to a genuine data deluge: Kepler monitored roughly 150,000 stars simultaneously, generating terabytes of photometric data.

The challenge is no longer finding data; it's classifying it. How do we sift through billions of signals to distinguish a genuine, habitable-zone planet from instrumental noise, stellar flares, or complex astrophysical false positives?

In this chapter of our capstone project, we will build an AI Exoplanet Hunter. We will leverage Python, time-series analysis, and cutting-edge 1D Vision Transformers to build a robust classifier capable of identifying planets in Kepler data.

The Challenge: Finding a Needle in a Cosmic Haystack

To understand the AI solution, we first need to understand the physics and the noise.

The Transit Method & The Security Camera Analogy

Kepler uses the Transit Photometry method. Imagine a massive security camera system pointed at 150,000 storefront windows (stars) for four years.

  1. The Star: The window, constantly illuminated.
  2. The Light Curve: The video feed recording brightness over time.
  3. The Planet Transit: A person briefly walking past the window, causing a tiny dip in brightness (often just 0.01% to 1%).
  4. The Noise: Shadows (starspots), reflections (instrumental drift), and birds (stellar flares).

The raw data is a Time Series of flux measurements. It is inherently noisy. The true signal is buried deep beneath the noise floor.
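To put numbers on that dip: during a transit, the planet blocks a fraction of the stellar disk, so the fractional drop in flux is simply the ratio of the disk areas, \((R_p / R_\star)^2\). A quick back-of-the-envelope check (rounded nominal radii in km; the helper name is mine, not from any library) reproduces the 0.01%-to-1% range quoted above:

```python
# Transit depth: the fractional brightness drop is (R_planet / R_star)^2.
R_SUN_KM = 695_700
R_EARTH_KM = 6_371
R_JUPITER_KM = 69_911

def transit_depth(r_planet_km: float, r_star_km: float = R_SUN_KM) -> float:
    """Fractional flux dip caused by a planet crossing its star's disk."""
    return (r_planet_km / r_star_km) ** 2

print(f"Earth-Sun depth:   {transit_depth(R_EARTH_KM):.6%}")    # ~84 ppm
print(f"Jupiter-Sun depth: {transit_depth(R_JUPITER_KM):.4%}")  # ~1%
```

An Earth analog produces a dip of roughly 84 parts per million, which is why the detrending and phase-folding steps below matter so much: the signal is far below the per-point noise of a single measurement.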

Phase Folding: The Signal Amplifier

Raw light curves are chaotic. To make the subtle, periodic transit signal visible, we use a technique called Phase Folding.

A planet orbits periodically. If a transit occurs today, it happens again after exactly one orbital period (\(P\)). Phase folding mathematically wraps the entire multi-year light curve onto itself using \(P\) as the "wrapping length."

\[ \text{Phase} = \frac{(t - T_0) \bmod P}{P} \]

If we guess the period correctly, all individual transit dips align perfectly, reinforcing the signal and averaging out random noise. The result is a single, normalized curve where the transit appears as a sharp, U-shaped trough centered at phase 0.
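The folding equation above takes only a few lines of NumPy. This is a minimal sketch on synthetic data; the function and variable names (`phase_fold`, `t0`) are my own, and a real pipeline would use lightkurve's built-in `LightCurve.fold()` instead:

```python
import numpy as np

def phase_fold(time: np.ndarray, flux: np.ndarray, period: float, t0: float = 0.0):
    """Fold a light curve on a trial period.

    Phase = ((t - t0) mod P) / P, then shifted into [-0.5, 0.5) so that
    the transit (if t0 marks mid-transit) is centered at phase 0.
    """
    phase = ((time - t0) % period) / period
    phase[phase >= 0.5] -= 1.0           # wrap to center the transit
    order = np.argsort(phase)            # sort so the folded curve is plottable
    return phase[order], flux[order]

# Synthetic example: 10 orbits of a 3-day period, 0.1%-deep box transit
rng = np.random.default_rng(42)
t = np.linspace(0, 30, 3000)
flux = np.ones_like(t) + rng.normal(0, 5e-4, t.size)
in_transit = ((t % 3.0) / 3.0) < 0.02    # transit occupies 2% of the orbit
flux[in_transit] -= 1e-3

phase, folded_flux = phase_fold(t, flux, period=3.0)
```

With the correct period, the ~60 in-transit points stack on top of each other near phase 0, and averaging them beats the noise down by roughly \(\sqrt{N}\).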

The False Positive Conundrum

Even with phase folding, the AI must solve the "False Positive" problem. Three main classes of signals confuse models:

  1. True Exoplanets (TPE): Shallow, U-shaped dips.
  2. Eclipsing Binary Stars (EB): Two stars orbiting each other. These create deep dips and a distinct secondary eclipse (when the smaller star passes behind the larger one).
  3. Background Eclipsing Binaries (BEB): The most dangerous false positive. A distant binary star mimics a shallow planetary transit because its light is diluted by the target star.

The AI must look at the morphology (shape) of the dip to succeed.
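To make that morphology distinction concrete, here is a toy sketch of the two dip profiles. These are idealized geometric shapes, not a physical transit model (packages like batman exist for that), and the function names are my own:

```python
import numpy as np

def u_transit(phase: np.ndarray, depth: float = 0.001, duration: float = 0.1) -> np.ndarray:
    """Planet-like dip: steep ingress/egress and a flat bottom (U-shape)."""
    flux = np.ones_like(phase)
    flux[np.abs(phase) < duration / 2] -= depth
    return flux

def v_transit(phase: np.ndarray, depth: float = 0.02, duration: float = 0.1) -> np.ndarray:
    """Grazing-binary-like dip: flux falls linearly to a single point (V-shape)."""
    dip = np.clip(1.0 - np.abs(phase) / (duration / 2), 0.0, 1.0) * depth
    return 1.0 - dip

phase = np.linspace(-0.25, 0.25, 501)
u, v = u_transit(phase), v_transit(phase)

# A crude discriminator: what fraction of in-dip points sit at maximum depth?
flat_fraction_u = np.mean(np.isclose(u[u < 1.0], u.min()))  # ~1.0 (flat bottom)
flat_fraction_v = np.mean(np.isclose(v[v < 1.0], v.min()))  # ~0.0 (pointed bottom)
```

The flat-bottom fraction is exactly the kind of shape feature a classifier must learn, and that a Transformer can pick up directly from the folded curve.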

The AI Solution: 1D Vision Transformers

Historically, we used simple statistics or CNNs. However, CNNs struggle with long-range dependencies—like seeing a faint secondary eclipse far away from the primary dip.

This is where Vision Transformers (ViTs) come in. While we explored Transformers in NLP (Book 12), the Self-Attention mechanism is perfectly transferable to 1D time-series data.

We treat the phase-folded light curve as a "sentence." The Transformer allows every point in the sequence to interact with every other point. It learns the global structure:

  * "Is this a U-shape (Planet) or a V-shape (Binary)?"
  * "Is there a secondary dip at phase 0.5?"

This provides not just higher accuracy, but interpretability via attention maps, showing us exactly which parts of the light curve the AI is "looking" at.
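The core of that mechanism fits in a few lines. Below is a stripped-down, single-head NumPy sketch with untrained random weights, just to show how every point attends to every other point and where the attention map comes from. A real 1D ViT adds patch embedding, positional encoding, multiple heads, and stacked layers:

```python
import numpy as np

def self_attention(x: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray):
    """Single-head scaled dot-product self-attention.

    x has shape (seq_len, d_model), e.g. patch embeddings of a
    phase-folded light curve. Returns the attended sequence and the
    (seq_len, seq_len) attention map used for interpretability.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every point vs. every point
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 32, 16            # e.g. 32 patches of the folded curve
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
```

Each row of `attn` sums to 1 and tells you how much one patch of the light curve "looks at" every other patch, including a secondary eclipse half an orbit away, which is exactly the long-range dependency a CNN's local kernels struggle with.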

Python Implementation: Data Acquisition & Cleaning

Before we can train a Transformer, we need to acquire and clean the data. We will use the lightkurve library, the standard toolkit for Kepler/TESS data.

Part 1: Data Loading and Normalization

This script demonstrates how to load raw data, remove bad quality flags, and detrend the light curve to isolate the transit signal.

import numpy as np
import pandas as pd
import lightkurve as lk
from scipy.signal import savgol_filter

def load_and_clean_kepler_data(kic_id: int, quarter: int) -> pd.DataFrame:
    """
    Loads a raw Kepler light curve, removes flagged quality issues,
    and performs detrending/normalization. Falls back to simulated
    data if the MAST download fails or returns no results.
    """
    print(f"Loading Kepler data for KIC {kic_id}, Quarter {quarter}...")

    try:
        # Fetch data from the MAST archive using lightkurve
        lc_collection = lk.search_lightcurve(
            f'KIC {kic_id}', quarter=quarter, cadence='long'
        ).download_all()
        if not lc_collection:
            raise ValueError("No light curves found for this target/quarter.")
        lc = lc_collection.stitch()
    except Exception as e:
        print(f"Download failed: {e}. Using simulated data.")
        # Simulated data for demonstration
        rng = np.random.default_rng(42)
        time = np.linspace(0, 90, 1000)
        flux = 1.0 + 0.005 * np.sin(time / 10) + rng.normal(0, 0.001, 1000)
        lc = lk.LightCurve(time=time, flux=flux)

    # 1. Clean: remove NaNs and 5-sigma outliers
    lc_clean = lc.remove_nans().remove_outliers(sigma=5)

    # 2. Normalize: center flux around 1.0
    lc_norm = lc_clean.normalize()

    # 3. Detrend: remove long-term stellar trends (rotation/drift)
    #    with a Savitzky-Golay filter. The window must be odd and
    #    shorter than the light curve itself.
    window_length = min(201, len(lc_norm.flux) // 2 * 2 - 1)
    polyorder = 3

    trend = savgol_filter(lc_norm.flux.value, window_length, polyorder)
    detrended_flux = lc_norm.flux.value / trend

    return pd.DataFrame({'time': lc_norm.time.value, 'flux': detrended_flux})

# Example usage:
# data_df = load_and_clean_kepler_data(kic_id=8462852, quarter=1)
# print(data_df.head())

Part 2: Secure Data Access in Production

When moving from a Jupyter Notebook to a production pipeline, we cannot hardcode API keys. Instead, we use environment variables to manage credentials securely. The script below simulates that workflow: it retrieves a token from the environment and logs its presence without ever printing it. (Kepler data on MAST is public, so the token is not strictly required here; the pattern matters for proprietary or rate-limited archives.)

import os
import lightkurve as lk
import logging
from typing import Optional

# Setup logging for production visibility
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# --- Configuration Management ---
# Securely retrieve API token from environment variables
MAST_API_KEY_NAME = "KEPLER_MAST_TOKEN"
mast_token: Optional[str] = os.environ.get(MAST_API_KEY_NAME)

def fetch_kepler_data(kic_id: str, mission: str, cadence: str, token: Optional[str]) -> Optional[lk.LightCurve]:
    """
    Searches and downloads Kepler data using secure token handling.
    """
    logger.info(f"Starting data fetch for {kic_id}.")

    # Security check: Log status without exposing the key
    if token:
        logger.info("MAST Token found. Configuring secure access.")
    else:
        logger.warning("No Token found. Proceeding with public access.")

    try:
        # Search MAST archive
        search_results = lk.search_lightcurve(target=kic_id, mission=mission, cadence=cadence)

        if not search_results:
            logger.error(f"No data found for {kic_id}.")
            return None

        # Download data
        lc_collection = search_results.download_all()

        if lc_collection:
            logger.info(f"Successfully downloaded {len(lc_collection)} files.")
            return lc_collection[0] # Return first light curve
        else:
            logger.error("Download failed.")
            return None

    except Exception as e:
        logger.critical(f"Critical error: {e}")
        return None

# --- Execution ---
# In a real environment, set the variable in your terminal:
# export KEPLER_MAST_TOKEN="your_secret_token"

target_star = "KIC 11904151" # Kepler-10
lc_data = fetch_kepler_data(target_star, 'Kepler', 'long', mast_token)

if lc_data:
    print(f"Data ready for phase folding. Points: {len(lc_data.time)}")

Conclusion

By combining robust data pipelines with advanced Deep Learning architectures like 1D Vision Transformers, we can automate the hunt for exoplanets. This approach moves beyond simple signal detection to understanding the complex morphology of stellar events, effectively training the AI to distinguish between a new Earth and a binary star shadow.

Let's Discuss

  1. Morphology vs. Metrics: Do you think a Transformer's ability to understand the "shape" of a signal is more valuable than traditional statistical metrics (like transit depth) for detecting complex false positives?
  2. Data Security: In your own data science projects, how have you implemented environment variables or configuration management to handle sensitive API keys? Did you find os.environ.get() sufficient, or do you use other tools like python-dotenv?

The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook Astrophysics & AI: Building Research Agents for Astronomy, Cosmology, and SETI. You can find it here: Leanpub.com or here: Amazon.com. Check out the other programming ebooks on Python, TypeScript, and C#: Leanpub.com or Amazon.com.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.