This project automates fetching trending repositories from GitHub and storing them in a PostgreSQL database on Google Cloud SQL. It uses the GitHub API, the Google Cloud SQL Connector, and SQLAlchemy.

## Features
- Automated Data Collection: Fetches trending repositories from GitHub based on specified criteria (e.g., date range, number of stars).
- Google Cloud SQL Integration: Stores repository data in a PostgreSQL database on Google Cloud SQL for persistent storage and scalability.
- Key Metrics Tracking: Tracks key metrics such as stars, forks, open issues, and pull requests over time.
- Modular Design: Separates concerns into distinct modules for API interaction, database handling, and main application logic.
- Environment Variable Configuration: Uses environment variables for sensitive information such as API tokens and database credentials.
- GitHub Actions Automation: Automates the data collection and storage process using GitHub Actions for scheduled execution.
## Prerequisites

- Python 3.11 or higher: Ensure you have Python 3.11 or higher installed.
- GitHub Personal Access Token: Create a personal access token with the `public_repo` scope.
- Google Cloud Account: You'll need a Google Cloud account with billing enabled.
- Google Cloud SDK (gcloud): Install and configure the Google Cloud SDK.
- Google Cloud SQL Instance: Create a PostgreSQL instance on Google Cloud SQL.
- Environment Variables: Set the following environment variables:
  - `GIT_TOKEN`: Your GitHub Personal Access Token.
  - `DB_CONNECTION_NAME`: The connection name of your Google Cloud SQL instance (e.g., `project:region:instance`).
  - `DB_NAME`: The name of the PostgreSQL database.
  - `DB_USER`: The PostgreSQL username.
  - `DB_PASSWORD`: The PostgreSQL password.
  - `GCP_SA_KEY`: The contents of your Google Cloud Service Account key file.
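A fail-fast check for these variables at startup can be sketched as follows (the `load_config` helper is illustrative, not part of the project's code):

```python
import os

REQUIRED_VARS = [
    "GIT_TOKEN", "DB_CONNECTION_NAME", "DB_NAME",
    "DB_USER", "DB_PASSWORD", "GCP_SA_KEY",
]

def load_config() -> dict:
    """Read the required environment variables, failing fast if any is missing."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Failing at startup gives a clear error message instead of a connection failure halfway through a pipeline run.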
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/github-pipeline.git
   cd github-pipeline
   ```

2. Create a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate   # On Linux/macOS
   .venv\Scripts\activate      # On Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure the environment variables listed under Prerequisites.

5. Set up Google Cloud authentication:

   - Create a Google Cloud Service Account with the necessary permissions to access Cloud SQL.
   - Download the Service Account key file in JSON format.
   - Store the contents of the JSON key file as a GitHub secret named `GCP_SA_KEY`.

6. Run the main script:

   ```bash
   python main.py
   ```

   This script will:

   - Fetch trending repositories from GitHub.
   - Store the repository information in the `repositories` table.
   - Store the trending metrics (stars, forks) in the `trends` table.
   - Store the issues and pull requests data in the `issues_prs` table.
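GitHub's REST API has no official "trending" endpoint; a common approximation is to query the search API for recently created repositories ordered by stars. The sketch below follows that approach with the standard library only; the helper names (`build_search_url`, `fetch_trending`) are illustrative, not the project's actual functions:

```python
import json
import urllib.parse
import urllib.request
from datetime import date, timedelta

def build_search_url(days: int = 7, per_page: int = 10) -> str:
    """Build a GitHub search URL for repos created in the last `days` days,
    sorted by stars (a common approximation of 'trending')."""
    since = (date.today() - timedelta(days=days)).isoformat()
    params = urllib.parse.urlencode({
        "q": f"created:>{since}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    })
    return f"https://api.github.com/search/repositories?{params}"

def fetch_trending(token: str, days: int = 7) -> list[dict]:
    """Call the GitHub search API and return the matching repository objects."""
    req = urllib.request.Request(
        build_search_url(days),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["items"]

if __name__ == "__main__":
    import os
    for repo in fetch_trending(os.environ["GIT_TOKEN"]):
        print(repo["full_name"], repo["stargazers_count"])
```

Each returned item carries the fields the schema below stores (`id`, `full_name`, `stargazers_count`, `forks_count`, and so on), so the insert step is mostly a field mapping.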
## Dashboard

Run the dashboard locally:

```bash
streamlit run dashboard.py
```

The dashboard will:

- Fetch the top 10 repositories with the most stars from the Google Cloud SQL instance.
- Display them in a formatted table.
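The top-10 query can be sketched independently of Streamlit. The example below runs a plausible version of that query against an in-memory SQLite database seeded with toy data; in the real dashboard the connection would point at the Cloud SQL instance instead (table and column names follow the schema section of this README):

```python
import sqlite3

# Latest star count per repository, joined back to repository names.
TOP_REPOS_SQL = """
SELECT r.full_name, t.stars
FROM repositories AS r
JOIN trends AS t ON t.repo_id = r.repo_id
WHERE t.recorded_at = (
    SELECT MAX(recorded_at) FROM trends WHERE repo_id = t.repo_id
)
ORDER BY t.stars DESC
LIMIT 10
"""

def top_repos(conn):
    return conn.execute(TOP_REPOS_SQL).fetchall()

# Demo on an in-memory database with toy data (real code would connect to Cloud SQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repositories (repo_id INTEGER, full_name TEXT);
CREATE TABLE trends (repo_id INTEGER, stars INTEGER, recorded_at TEXT);
INSERT INTO repositories VALUES (1, 'octocat/hello'), (2, 'torvalds/linux');
INSERT INTO trends VALUES (1, 50, '2024-01-01'), (1, 80, '2024-01-02'),
                          (2, 500, '2024-01-02');
""")
print(top_repos(conn))  # → [('torvalds/linux', 500), ('octocat/hello', 80)]
```

The correlated subquery keeps only the most recent `trends` snapshot per repository, so a repo with many recorded snapshots is not counted more than once.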
## Database Schema

The project uses the following PostgreSQL schema:
- `repositories` table:
  - `id` (SERIAL PRIMARY KEY): Unique identifier for the repository.
  - `repo_id` (INTEGER UNIQUE): GitHub repository ID.
  - `name` (VARCHAR(255) NOT NULL): Repository name.
  - `full_name` (VARCHAR(255) UNIQUE NOT NULL): Full repository name (owner/repo).
  - `description` (TEXT): Repository description.
  - `language` (VARCHAR(255)): Primary programming language.
  - `owner` (VARCHAR(255)): Repository owner's username.
  - `owner_url` (VARCHAR(255)): URL of the repository owner's profile.
  - `html_url` (VARCHAR(255) UNIQUE): Repository URL.
  - `created_at` (TIMESTAMP): Repository creation timestamp.
  - `updated_at` (TIMESTAMP): Repository last-updated timestamp.
  - `collected_at` (TIMESTAMP DEFAULT CURRENT_TIMESTAMP): Timestamp when the repository data was collected.
- `trends` table:
  - `id` (SERIAL PRIMARY KEY): Unique identifier for the trend record.
  - `repo_id` (INTEGER, FOREIGN KEY referencing `repositories.repo_id`): GitHub repository ID.
  - `stars` (INTEGER): Number of stars.
  - `forks` (INTEGER): Number of forks.
  - `recorded_at` (TIMESTAMP DEFAULT CURRENT_TIMESTAMP): Timestamp when the trend data was recorded.
- `issues_prs` table:
  - `id` (SERIAL PRIMARY KEY): Unique identifier for the issues/PRs record.
  - `repo_id` (INTEGER, FOREIGN KEY referencing `repositories.repo_id`): GitHub repository ID.
  - `open_issues` (INTEGER): Number of open issues.
  - `closed_issues` (INTEGER): Number of closed issues.
  - `open_prs` (INTEGER): Number of open pull requests.
  - `closed_prs` (INTEGER): Number of closed pull requests.
  - `recorded_at` (TIMESTAMP DEFAULT CURRENT_TIMESTAMP): Timestamp when the issues/PRs data was recorded.
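The schema above corresponds to DDL along these lines (a sketch assembled from the column descriptions; the project's actual table-creation code may differ):

```sql
CREATE TABLE repositories (
    id           SERIAL PRIMARY KEY,
    repo_id      INTEGER UNIQUE,
    name         VARCHAR(255) NOT NULL,
    full_name    VARCHAR(255) UNIQUE NOT NULL,
    description  TEXT,
    language     VARCHAR(255),
    owner        VARCHAR(255),
    owner_url    VARCHAR(255),
    html_url     VARCHAR(255) UNIQUE,
    created_at   TIMESTAMP,
    updated_at   TIMESTAMP,
    collected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE trends (
    id          SERIAL PRIMARY KEY,
    repo_id     INTEGER REFERENCES repositories (repo_id),
    stars       INTEGER,
    forks       INTEGER,
    recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE issues_prs (
    id            SERIAL PRIMARY KEY,
    repo_id       INTEGER REFERENCES repositories (repo_id),
    open_issues   INTEGER,
    closed_issues INTEGER,
    open_prs      INTEGER,
    closed_prs    INTEGER,
    recorded_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

Because `trends` and `issues_prs` reference the `UNIQUE` column `repositories.repo_id` rather than the surrogate `id`, rows can be inserted using the GitHub repository ID directly, without first looking up the local primary key.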
## GitHub Actions Workflow

The project includes a GitHub Actions workflow (`.github/workflows/run_main.yml`) that automates the data collection and storage process. The workflow is triggered on a schedule (e.g., daily) and performs the following steps:

- Checks out the repository.
- Sets up Python 3.11.
- Installs the project dependencies.
- Authenticates with Google Cloud using a Service Account key.
- Executes the `main.py` script.
To configure the GitHub Actions workflow:

- Enable GitHub Actions in your repository settings.
- Add the following secrets to your repository settings:
  - `GIT_TOKEN`: Your GitHub Personal Access Token.
  - `DB_CONNECTION_NAME`: The connection name of your Google Cloud SQL instance.
  - `DB_NAME`: The name of your PostgreSQL database.
  - `DB_USER`: The username for your PostgreSQL database.
  - `DB_PASSWORD`: The password for your PostgreSQL database.
  - `GCP_SA_KEY`: The contents of your Google Cloud Service Account key file.
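A minimal workflow matching the steps above might look like this (a sketch only; the actual `run_main.yml` may use different action versions and a different cron schedule):

```yaml
name: Run data pipeline

on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC (example schedule)
  workflow_dispatch:       # allow manual runs

jobs:
  run-main:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - run: python main.py
        env:
          GIT_TOKEN: ${{ secrets.GIT_TOKEN }}
          DB_CONNECTION_NAME: ${{ secrets.DB_CONNECTION_NAME }}
          DB_NAME: ${{ secrets.DB_NAME }}
          DB_USER: ${{ secrets.DB_USER }}
          DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
```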
## License

This project is licensed under the MIT License.