"In this tutorial, we use the TransCheX model with 2 layers for each of vision, language mixed modality encoders respectively. As an input to the TransCheX, we use the patient **report** and corresponding **chest X-ray image**. The image itself will be divided into non-overlapping patches with a specified patch resolution and projected into an embedding space. Similarly the reports are tokenized and projected into their respective embedding space. The language and vision encoders seperately encode their respective features from the projected embeddings in each modality. Furthmore, the output of vision and language encoders are fed into a mixed modality encoder which extraxts mutual information. The output of the mixed modality encoder is then utilized for the classification application. \n",