Enable fexec-charset option #138895

Open
wants to merge 1 commit into
base: main
3 changes: 1 addition & 2 deletions clang/docs/LanguageExtensions.rst
@@ -416,8 +416,7 @@ Builtin Macros
``__clang_literal_encoding__``
Defined to a narrow string literal that represents the current encoding of
narrow string literals, e.g., ``"hello"``. This macro typically expands to
"UTF-8" (but may change in the future if the
``-fexec-charset="Encoding-Name"`` option is implemented.)
the text encoding specified by -fexec-charset if specified, or the system charset.
Comment on lines 417 to +419
Collaborator

The "typically" here seems wrong -- specifying -fexec-charset is atypical, and if it's specified then the macro always (not only typically) expands to that. Also referring to "the system charset" doesn't really seem right, given that for non-z/OS we use UTF-8 regardless of what the operating system would consider to be its character set. How about:

Suggested change
Defined to a narrow string literal that represents the current encoding of
narrow string literals, e.g., ``"hello"``. This macro typically expands to
"UTF-8" (but may change in the future if the
``-fexec-charset="Encoding-Name"`` option is implemented.)
the text encoding specified by -fexec-charset if specified, or the system charset.
Defined to a narrow string literal that represents the current encoding of
narrow string literals, e.g., ``"hello"``. This macro expands to the text
encoding specified by ``-fexec-charset`` if any, or a system-specific default
otherwise: ``"IBM-1047"`` on z/OS and ``"UTF-8"`` on all other systems.


``__clang_wide_literal_encoding__``
Defined to a narrow string literal that represents the current encoding of
3 changes: 3 additions & 0 deletions clang/include/clang/Basic/LangOptions.h
@@ -633,6 +633,9 @@ class LangOptions : public LangOptionsBase {
bool AtomicFineGrainedMemory = false;
bool AtomicIgnoreDenormalMode = false;

/// Name of the exec charset to convert the internal charset to.
std::string ExecCharset;
Contributor

Let's call that a TextEncoding consistently (replacing all instances of Codepage and Charset).

Contributor Author

I'm not sure we should replace all these uses of Charset when the option name -fexec-charset already has charset in it. Or do we want only the option to keep that name, and use encoding internally?

Contributor

The option should stay -fexec-charset for GCC compatibility, but internally we would talk about text encoding consistently.


LangOptions();

/// Set language defaults for the given input language and
7 changes: 7 additions & 0 deletions clang/include/clang/Basic/TokenKinds.h
@@ -101,6 +101,13 @@ inline bool isLiteral(TokenKind K) {
isStringLiteral(K) || K == tok::header_name || K == tok::binary_data;
}

/// Return true if this is a utf literal kind.
inline bool isUTFLiteral(TokenKind K) {
return K == tok::utf8_char_constant || K == tok::utf8_string_literal ||
K == tok::utf16_char_constant || K == tok::utf16_string_literal ||
K == tok::utf32_char_constant || K == tok::utf32_string_literal;
}

/// Return true if this is any of tok::annot_* kinds.
bool isAnnotation(TokenKind K);

5 changes: 5 additions & 0 deletions clang/include/clang/Driver/Options.td
@@ -7247,6 +7247,11 @@ let Visibility = [CC1Option, CC1AsOption, FC1Option] in {
def tune_cpu : Separate<["-"], "tune-cpu">,
HelpText<"Tune for a specific cpu type">,
MarshallingInfoString<TargetOpts<"TuneCPU">>;
def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
HelpText<"Set the execution <charset> for string and character literals. "
"Supported character encodings include ISO8859-1, UTF-8, IBM-1047 "
Collaborator

I think this is too long for HelpText, which is displayed by clang --help. Also, aren't ISO8859-1, UTF-8, and IBM-1047 all covered by "those supported by the host icu or iconv library"? Do we really need to call them out separately?

Maybe something like "Use <charset> for string and character literals" could work here as help text? You can put longer information, with examples of character set names and references to icu / iconv, into a separate DocBrief argument.

"and those supported by the host icu or iconv library.">,
MarshallingInfoString<LangOpts<"ExecCharset">>;
def target_cpu : Separate<["-"], "target-cpu">,
HelpText<"Target a specific cpu type">,
MarshallingInfoString<TargetOpts<"CPU">>;
36 changes: 36 additions & 0 deletions clang/include/clang/Lex/LiteralConverter.h
@@ -0,0 +1,36 @@
//===--- clang/Lex/LiteralConverter.h - Translator for Literals -*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CLANG_LEX_LITERALCONVERTER_H
#define LLVM_CLANG_LEX_LITERALCONVERTER_H

#include "clang/Basic/Diagnostic.h"
#include "clang/Basic/LangOptions.h"
#include "clang/Basic/TargetInfo.h"
#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/TextEncoding.h"

enum ConversionAction { NoConversion, ToSystemCharset, ToExecCharset };
Contributor

We should have FromOrdinaryLiteralEncoding and ToOrdinaryLiteralEncoding instead of System/Exec

Contributor Author

The names here were chosen to represent the different encodings that need to be used depending on the context of the string literal, which I described in my RFC; I have also copied the table into the description of this PR. I'm not sure the proposed names make that context clear.


class LiteralConverter {
llvm::StringRef InternalCharset;
llvm::StringRef SystemCharset;
llvm::StringRef ExecCharset;
llvm::StringMap<llvm::TextEncodingConverter> TextEncodingConverters;

public:
llvm::TextEncodingConverter *getConverter(const char *Codepage);
llvm::TextEncodingConverter *getConverter(ConversionAction Action);
llvm::TextEncodingConverter *createAndInsertCharConverter(const char *To);
void setConvertersFromOptions(const clang::LangOptions &Opts,
const clang::TargetInfo &TInfo,
clang::DiagnosticsEngine &Diags);
Comment on lines +31 to +33
Contributor

I would prefer, for example, a static function that returns an optional (or a null pointer on failure) and lets the caller deal with the error.

};

#endif
19 changes: 11 additions & 8 deletions clang/include/clang/Lex/LiteralSupport.h
@@ -17,12 +17,13 @@
#include "clang/Basic/CharInfo.h"
#include "clang/Basic/LLVM.h"
#include "clang/Basic/TokenKinds.h"
#include "clang/Lex/LiteralConverter.h"
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/DataTypes.h"

#include "llvm/Support/TextEncoding.h"
namespace clang {

class DiagnosticsEngine;
@@ -233,6 +234,7 @@ class StringLiteralParser {
const LangOptions &Features;
const TargetInfo &Target;
DiagnosticsEngine *Diags;
LiteralConverter *LiteralConv;

unsigned MaxTokenLength;
unsigned SizeBound;
@@ -246,18 +248,19 @@ class StringLiteralParser {
StringLiteralEvalMethod EvalMethod;

public:
StringLiteralParser(ArrayRef<Token> StringToks, Preprocessor &PP,
StringLiteralEvalMethod StringMethod =
StringLiteralEvalMethod::Evaluated);
StringLiteralParser(
ArrayRef<Token> StringToks, Preprocessor &PP,
StringLiteralEvalMethod StringMethod = StringLiteralEvalMethod::Evaluated,
ConversionAction Action = ToExecCharset);
Contributor

Why do we need Conversion at all?
I would expect that any ordinary, non-unevaluated literal would be encoded, and the
LiteralConverter should be the same for all strings, so it can live in Preprocessor.

Contributor Author

I put up a table in my old RFC, https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512, which shows the different encodings needed depending on the context, so not all strings will be translated to the execution encoding. A separate PR after this one will be needed to ensure we use the correct encoding for each context.

StringLiteralParser(ArrayRef<Token> StringToks, const SourceManager &sm,
const LangOptions &features, const TargetInfo &target,
DiagnosticsEngine *diags = nullptr)
: SM(sm), Features(features), Target(target), Diags(diags),
MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),
ResultPtr(ResultBuf.data()),
LiteralConv(nullptr), MaxTokenLength(0), SizeBound(0), CharByteWidth(0),
Kind(tok::unknown), ResultPtr(ResultBuf.data()),
EvalMethod(StringLiteralEvalMethod::Evaluated), hadError(false),
Pascal(false) {
init(StringToks);
init(StringToks, NoConversion);
}

bool hadError;
@@ -305,7 +308,7 @@ class StringLiteralParser {
static bool isValidUDSuffix(const LangOptions &LangOpts, StringRef Suffix);

private:
void init(ArrayRef<Token> StringToks);
void init(ArrayRef<Token> StringToks, ConversionAction Action);
bool CopyStringFragment(const Token &Tok, const char *TokBegin,
StringRef Fragment);
void DiagnoseLexingError(SourceLocation Loc);
3 changes: 3 additions & 0 deletions clang/include/clang/Lex/Preprocessor.h
@@ -25,6 +25,7 @@
#include "clang/Basic/TokenKinds.h"
#include "clang/Lex/HeaderSearch.h"
#include "clang/Lex/Lexer.h"
#include "clang/Lex/LiteralConverter.h"
#include "clang/Lex/MacroInfo.h"
#include "clang/Lex/ModuleLoader.h"
#include "clang/Lex/ModuleMap.h"
@@ -162,6 +163,7 @@ class Preprocessor {
std::unique_ptr<ScratchBuffer> ScratchBuf;
HeaderSearch &HeaderInfo;
ModuleLoader &TheModuleLoader;
LiteralConverter LiteralConv;

/// External source of macros.
ExternalPreprocessorSource *ExternalSource;
@@ -1224,6 +1226,7 @@ class Preprocessor {
SelectorTable &getSelectorTable() { return Selectors; }
Builtin::Context &getBuiltinInfo() { return *BuiltinInfo; }
llvm::BumpPtrAllocator &getPreprocessorAllocator() { return BP; }
LiteralConverter &getLiteralConverter() { return LiteralConv; }

void setExternalSource(ExternalPreprocessorSource *Source) {
ExternalSource = Source;
17 changes: 13 additions & 4 deletions clang/lib/Driver/ToolChains/Clang.cpp
@@ -47,6 +47,7 @@
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/Path.h"
#include "llvm/Support/Process.h"
#include "llvm/Support/TextEncoding.h"
#include "llvm/Support/YAMLParser.h"
#include "llvm/TargetParser/AArch64TargetParser.h"
#include "llvm/TargetParser/ARMTargetParserCommon.h"
@@ -7589,12 +7590,20 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
<< value;
}

// -fexec_charset=UTF-8 is default. Reject others
// Set the default fexec-charset as the system charset.
CmdArgs.push_back("-fexec-charset");
CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
if (Arg *execCharset = Args.getLastArg(options::OPT_fexec_charset_EQ)) {
StringRef value = execCharset->getValue();
if (!value.equals_insensitive("utf-8"))
D.Diag(diag::err_drv_invalid_value) << execCharset->getAsString(Args)
<< value;
llvm::ErrorOr<llvm::TextEncodingConverter> ErrorOrConverter =
llvm::TextEncodingConverter::create("UTF-8", value.data());
if (ErrorOrConverter) {
CmdArgs.push_back("-fexec-charset");
CmdArgs.push_back(Args.MakeArgString(value));
} else {
D.Diag(diag::err_drv_invalid_value)
<< execCharset->getAsString(Args) << value;
}
}

RenderDiagnosticsOptions(D, Args, CmdArgs);
4 changes: 4 additions & 0 deletions clang/lib/Frontend/CompilerInstance.cpp
@@ -32,6 +32,7 @@
#include "clang/Frontend/Utils.h"
#include "clang/Frontend/VerifyDiagnosticConsumer.h"
#include "clang/Lex/HeaderSearch.h"
#include "clang/Lex/LiteralConverter.h"
#include "clang/Lex/Preprocessor.h"
#include "clang/Lex/PreprocessorOptions.h"
#include "clang/Sema/CodeCompleteConsumer.h"
@@ -535,6 +536,9 @@ void CompilerInstance::createPreprocessor(TranslationUnitKind TUKind) {

if (GetDependencyDirectives)
PP->setDependencyDirectivesGetter(*GetDependencyDirectives);

PP->getLiteralConverter().setConvertersFromOptions(getLangOpts(), getTarget(),
getDiagnostics());
}

std::string CompilerInstance::getSpecificModuleCachePath(StringRef ModuleHash) {
12 changes: 8 additions & 4 deletions clang/lib/Frontend/InitPreprocessor.cpp
@@ -1057,10 +1057,14 @@ static void InitializePredefinedMacros(const TargetInfo &TI,
}
}

// Macros to help identify the narrow and wide character sets
// FIXME: clang currently ignores -fexec-charset=. If this changes,
// then this may need to be updated.
Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
// Macros to help identify the narrow and wide character sets. This is set
// to fexec-charset. If fexec-charset is not specified, the default is the
// system charset.
if (!LangOpts.ExecCharset.empty())
Builder.defineMacro("__clang_literal_encoding__", LangOpts.ExecCharset);
else
Builder.defineMacro("__clang_literal_encoding__",
TI.getTriple().getSystemCharset());
if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
// FIXME: 32-bit wchar_t signals UTF-32. This may change
// if -fwide-exec-charset= is ever supported.
1 change: 1 addition & 0 deletions clang/lib/Lex/CMakeLists.txt
@@ -12,6 +12,7 @@ add_clang_library(clangLex
InitHeaderSearch.cpp
Lexer.cpp
LexHLSLRootSignature.cpp
LiteralConverter.cpp
LiteralSupport.cpp
MacroArgs.cpp
MacroInfo.cpp
69 changes: 69 additions & 0 deletions clang/lib/Lex/LiteralConverter.cpp
@@ -0,0 +1,69 @@
//===--- LiteralConverter.cpp - Translator for String Literals -----------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "clang/Lex/LiteralConverter.h"
#include "clang/Basic/DiagnosticDriver.h"

using namespace llvm;

llvm::TextEncodingConverter *
LiteralConverter::getConverter(const char *Codepage) {
auto Iter = TextEncodingConverters.find(Codepage);
if (Iter != TextEncodingConverters.end())
return &Iter->second;
return nullptr;
}

llvm::TextEncodingConverter *
LiteralConverter::getConverter(ConversionAction Action) {
StringRef CodePage;
if (Action == ToSystemCharset)
CodePage = SystemCharset;
else if (Action == ToExecCharset)
CodePage = ExecCharset;
else
CodePage = InternalCharset;
return getConverter(CodePage.data());
}

llvm::TextEncodingConverter *
LiteralConverter::createAndInsertCharConverter(const char *To) {
const char *From = InternalCharset.data();
llvm::TextEncodingConverter *Converter = getConverter(To);
if (Converter)
return Converter;

ErrorOr<TextEncodingConverter> ErrorOrConverter =
llvm::TextEncodingConverter::create(From, To);
if (!ErrorOrConverter)
return nullptr;
TextEncodingConverters.insert_or_assign(StringRef(To),
std::move(*ErrorOrConverter));
return getConverter(To);
}

void LiteralConverter::setConvertersFromOptions(
const clang::LangOptions &Opts, const clang::TargetInfo &TInfo,
clang::DiagnosticsEngine &Diags) {
using namespace llvm;
SystemCharset = TInfo.getTriple().getSystemCharset();
InternalCharset = "UTF-8";
ExecCharset = Opts.ExecCharset.empty() ? InternalCharset : Opts.ExecCharset;
// Create converter between internal and system charset
if (InternalCharset != SystemCharset)
createAndInsertCharConverter(SystemCharset.data());

// Create converter between internal and exec charset specified
// in fexec-charset option.
if (InternalCharset == ExecCharset)
return;
if (!createAndInsertCharConverter(ExecCharset.data())) {
Diags.Report(clang::diag::err_drv_invalid_value)
<< "-fexec-charset" << ExecCharset;
}
}