-
Notifications
You must be signed in to change notification settings - Fork 81
feat: Adding multi modal support for PGVectorStore #207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
gcs_uri = re.match("gs://(.*?)/(.*)", uri) | ||
if gcs_uri: | ||
bucket_name, object_name = gcs_uri.groups() | ||
storage_client = storage.Client() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to wrap this in a try except block to provide a more clear error or do you think the error is clear if they are not running in a Google Cloud environment or have set up credentials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other langchain packages don't have running integrations tests for 3P providers. We could mock this test or just test this functionality in our package downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this is the error
google.auth.exceptions.DefaultCredentialsError: Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc for more information.
I think this is pretty descriptive, let me know what you think.
The options for the tests are:
- If we want we can try making the images publicly accessible on a GCP project (which claims that we would not need credentials to fetch it).
- We could also store the image directly and skip testing the pathway of GCP.
- Not test the add_images at all.
What do you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If GCS storage call can be easily mock, let's go ahead and do that. If it can't let's keep the test but skip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the mock solutions may need more debugging, I've removed the gcs uri from being tested, the other images being created locally are still under the test.
I will recreate the GCS path testing in our libraries.
|
||
web_uri = re.match(r"^(https?://).*", uri) | ||
if web_uri: | ||
response = requests.get(uri, stream=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an SSRF attack
|
||
async def aadd_images( | ||
self, | ||
uris: list[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accepting URIs without safe guards is an SSRF attack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that SSRF attacks are generally dealt with by the application layer. Is that correct, or is it more of a framework responsibility?
import uuid | ||
from typing import Any, Callable, Iterable, Optional, Sequence | ||
|
||
import numpy as np | ||
import requests | ||
from google.cloud import storage # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The dependency should be optional not required. So the import cannot appear in the global namespace.
- The actual implementation should be against a key-value store interface not specifically google cloud storage. You can use the LangChain key-value store abstraction to support cloud storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the import from global namespace.
I'm not sure how I should use the key-value store in this case. Could you please point me to the right usage?
blob = bucket.blob(object_name) | ||
return base64.b64encode(blob.download_as_bytes()).decode("utf-8") | ||
|
||
web_uri = re.match(r"^(https?://).*", uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regular expressions are not a good solution here. Simple prefix matching is more robust and probably will also be faster in terms of actual run time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used urlparse instead. If prefix matching is preferred, I can make that change
feat: Adding multi modal support for PGVectorStore
similarity_search_image()
andasimilarity_search_image()
.add_images()
andaadd_images()
endpoints to add images to vector store.